In Search For A Cure: Recommendation with Knowledge Graph on CORD-19
Tutors
- Le Zhang, Microsoft
- Iris Shen, Microsoft
- Jianxun Lian, Microsoft
- Chieh-Han Wu, Microsoft
- Miguel González-Fierro, Microsoft
- Andreas Argyriou, Microsoft
Introduction
The coronavirus disease (COVID-19) outbreak has for weeks pressingly called for research that helps contain the pandemic from academia’s perspective. The COVID-19 Open Research Dataset (CORD-19) [8] was launched in March 2020 in response to a request from the White House Office of Science and Technology Policy. The collaboration among AI2, Microsoft, the NLM at the NIH, and other prestigious research institutes aims to empower the world’s AI researchers with text and data mining tools that help accelerate COVID-19 related research. CORD-19 is a free resource of over 45,000 scholarly articles about the coronavirus, including over 33,000 with full text, for use by the global research community. Over 90% of the publications in CORD-19 are mapped to articles in the Microsoft Academic Graph (MAG) [9][12][13], an academic domain knowledge graph containing various types of scientific publication records, citation relationships, as well as authors, institutions, journals, and fields of study.
This hands-on tutorial introduces the CORD-19 dataset together with an associated academic knowledge graph: a subgraph of MAG that includes 6 types of entities, with one million nodes and over one million edges, covering more than 100K biomedical academic concepts. The general aim of the tutorial is to empower the data science and machine learning research community, especially the attendees of the tutorial session, to build their own knowledge graph-aware applications with the help of MAG and the CORD-19 dataset. The lab sessions illustrate, with hands-on examples and code, a particular use case: finding relevant articles on COVID-19 to help scientists in the healthcare domain perform efficient academic research.
The demonstrated knowledge graph-based recommender model and the industry-grade operationalization architecture build on academic research and development outcomes from Microsoft Research Asia and the Microsoft Azure Global Industry team, most of which are covered in the Microsoft/Recommenders GitHub repository [2][5].
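To make the idea of mapping CORD-19 publications to MAG articles more concrete, the following is a minimal sketch of such a link step. It is not the tutorial's lab code: the records are toy data, and the field names (`cord_uid`, `mag_id`, `fields_of_study`, `citation_count`) are assumptions chosen for illustration rather than the actual CORD-19 or MAG schema.

```python
# Toy stand-in for CORD-19 metadata rows; all field names are assumptions.
cord19_rows = [
    {"cord_uid": "a1", "title": "Paper A", "mag_id": "101"},
    {"cord_uid": "b2", "title": "Paper B", "mag_id": "102"},
    {"cord_uid": "c3", "title": "Paper C", "mag_id": ""},  # no MAG mapping
]

# Toy stand-in for MAG paper records keyed by paper ID (hypothetical schema).
mag_papers = {
    "101": {"fields_of_study": ["Virology"], "citation_count": 12},
    "102": {"fields_of_study": ["Epidemiology"], "citation_count": 3},
    "103": {"fields_of_study": ["Immunology"], "citation_count": 40},
}


def link_cord19_to_mag(rows, mag_index):
    """Attach MAG attributes to every CORD-19 row whose ID is found in MAG."""
    linked = []
    for row in rows:
        mag_record = mag_index.get(row["mag_id"])
        if mag_record is not None:
            # Merge the two records into one enriched article record.
            linked.append({**row, **mag_record})
    return linked


linked = link_cord19_to_mag(cord19_rows, mag_papers)
coverage = len(linked) / len(cord19_rows)
print(f"linked {len(linked)} of {len(cord19_rows)} articles ({coverage:.0%})")
```

Once articles carry graph attributes such as fields of study and citation counts, they can feed the analytics and recommendation steps; the coverage ratio is the toy analogue of the mapping rate quoted above.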
Outline
The general objective of the tutorial session is to help attendees ramp up with hands-on experience in knowledge graphs, recommendation, and the application of these technologies to the CORD-19 dataset. The tutorial takes 3 hours in total. Detailed session information is given below.
- Opening (15 mins)
  - General introduction of context (3 mins)
    - Briefly discuss the COVID-19 outbreak and its impact on the economy, society, and people.
    - Discuss joint efforts from industry and academia to fight COVID-19, including the CORD-19 dataset for building knowledge-aware AI systems that help prevent further spread of the virus.
  - Lab 0: Environment setup (12 mins)
    - Install and configure the required software and tools.
    - Test the packages, APIs, and scripts for the later lab sessions.
- Module I: Basics of the CORD-19 dataset and the Knowledge Graph (25 mins)
  - Introduction to the CORD-19 dataset and the academic knowledge graph, Microsoft Academic Graph (MAG) (10 mins)
  - Lab 1: Link CORD-19 and MAG (15 mins)
- Module II: Understanding contents with the aid of the Knowledge Graph (40 mins)
  - Introduction to content understanding with knowledge graph (25 mins)
    - Knowledge graph (CORD-19 MAG sub-graph) size, schema, and how to classify the topics of a publication
    - Use LanguageSimilarityAPIs to help understand topics of academic text
    - Content understanding / Analytics examples
  - Lab 2: Analytics of CORD-19 Sub-graph (15 mins)
    - Topic distribution
    - Temporal, geo, venue, and team size distribution
- Module III: Applications of Knowledge Graph: Academic Recommender Systems (45 mins)
  - Introduction to academic recommendation with knowledge graph (5 mins)
    - Introduce user2item recommendation, which recommends related papers to researchers based on their citation history.
    - Introduce item2item recommendation, which, given an article, recommends a list of related papers.
  - Introduction to Microsoft/Recommenders (5 mins)
  - Introduction to knowledge-aware recommender algorithms (5 mins)
    - Discuss the knowledge graph data in the scenario of academic recommendations.
    - Discuss the Deep Knowledge-aware Network model (DKN [1]) that injects knowledge entities into document content for better recommendations.
  - Lab 3 – 1: Data preparation (10 mins)
    - Manipulate data from raw datasets.
    - Construct training, validation, and test data.
  - Lab 3 – 2: Pretraining embeddings (10 mins)
    - Pretrain word embeddings.
    - Pretrain knowledge graph embeddings.
  - Lab 3 – 3: Building knowledge-aware recommendation models (15 mins)
    - Use DKN to build a User2Item recommendation model.
    - Use DKN to build an Item2Item recommendation model.
  - Lab 3 – 4 (optional): Model evaluation and comparison (5 mins)
    - Compare variants of DKN: with and without the knowledge entities.
    - Compare DKN with a graph-based model: LightGCN [14].
- Module IV: Operationalization of KG-based recommender system (40 mins)
  - Lab 4: Explanation of the knowledge graph-based analytics on the CORD-19 dataset (20 mins)
    - Walk through the basic tools and software (e.g., Python matplotlib) used for visualizing graph-structured data.
    - Illustrate the usage of the visualization tools for the built knowledge graph-based paper recommender.
  - Introduction to globally distributed graph data storage for scalable queries via Gremlin (10 mins)
  - Introduction to real-time scoring and ranking for recommending papers to users with Kubernetes (10 mins)
- Closing (15 mins)
  - Present the summary of the tutorial
  - Discuss future work
  - Question and answer
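As an illustration of the item2item scenario listed in Module III (given an article, recommend a list of related papers), the sketch below ranks papers by cosine similarity between embedding vectors. This is a deliberately simple nearest-neighbor stand-in, not the DKN model used in the labs, and the paper IDs and embedding values are made up for the example; in practice the vectors would come from a pretraining step such as the one in Lab 3 – 2.

```python
import math

# Hypothetical paper embeddings (toy values, not learned from CORD-19).
paper_embeddings = {
    "paper_a": [1.0, 0.0, 0.0],
    "paper_b": [0.9, 0.1, 0.0],
    "paper_c": [0.0, 1.0, 0.0],
    "paper_d": [0.7, 0.7, 0.0],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat zero vectors as dissimilar to everything
    return dot / (norm_u * norm_v)


def recommend_similar(query_id, embeddings, top_k=2):
    """Return the top_k (paper_id, score) pairs most similar to query_id."""
    query = embeddings[query_id]
    scored = [
        (paper_id, cosine_similarity(query, vector))
        for paper_id, vector in embeddings.items()
        if paper_id != query_id  # never recommend the query paper itself
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


recommendations = recommend_similar("paper_a", paper_embeddings, top_k=2)
for paper_id, score in recommendations:
    print(f"{paper_id}: {score:.3f}")
```

The same ranking function could serve a user2item scenario if the query vector were, for example, an average of the embeddings of papers a researcher has cited; that aggregation choice is also an assumption, not the tutorial's method.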
Link to the code
Link to the code used in the tutorial.
Instructions for Modules I and II.
Link to the slides
Link to the slides used in the tutorial
Other useful links
References
- [1] Hongwei Wang et al., “DKN: Deep Knowledge-Aware Network for News Recommendation”, ACM WWW 2018.
- [2] Andreas Argyriou, Miguel González-Fierro, and Le Zhang, “Microsoft Recommenders: Best Practices for Production-Ready Recommendation Systems”, ACM WWW 2020.
- [3] Shen, Zhihong, and Dong, Yuxiao. From Graph to Knowledge Graph Algorithms and Applications. edX, 2019. https://www.edx.org/course/graphsto- knowledgegraphs-3
- [4] Tang, Jie, and Dong, Yuxiao. Representation Learning on Networks: Theories, Algorithms, and Applications. WWW, 2019. https://www2019.thewebconf.org/tutorials.
- [5] Le Zhang et al., “Building production-ready recommendation systems at scale”, KDD 2019.
- [6] Perozzi, Bryan, Rami Al-Rfou, and Steven Skiena. “DeepWalk: Online learning of social representations.” Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. 2014.
- [7] Hamilton, Will, Zhitao Ying, and Jure Leskovec. “Inductive representation learning on large graphs.” Advances in neural information processing systems. 2017.
- [8] CORD-19, url: https://pages.semanticscholar.org/coronavirus-research
- [9] Microsoft Academic Graph Documentation, url: https://docs.microsoft.com/en-us/academic-services/graph/
- [10] Microsoft Academic Graph Application on COVID-19 research, url: https://github.com/microsoft/mag-covid19-research-examples
- [11] Understanding Documents By Using Semantics, url: https://www.microsoft.com/en-us/research/project/academic/articles/understanding-documents-by-using-semantics/
- [12] Wang, Kuansan, et al. “Microsoft Academic Graph: When Experts Are Not Enough.” Quantitative Science Studies, vol. 1, no. 1, 2020, pp. 396–413.
- [13] Wang, Kuansan, et al. “A Review of Microsoft Academic Services for Science of Science Studies.” Frontiers in Big Data, vol. 2, 2019.
- [14] Xiangnan He, et al. “LightGCN: Simplifying and Powering Graph Convolution Network for Recommendation.” SIGIR, 2020.
- [15] Wang, Lucy Lu, et al. “CORD-19: The Covid-19 Open Research Dataset.” ArXiv Preprint ArXiv:2004.10706, 2020.
- [16] Sinha, Arnab, et al. “An Overview of Microsoft Academic Service (MAS) and Applications.” WWW, 2015, pp. 243–246.
- [17] Shen, Zhihong, et al. “A Web-Scale System for Scientific Knowledge Exploration.” ACL, 2018, pp. 87–92.