Searching COVID-19 Clinical Research Using Graph Queries: Algorithm Development and Validation (2310.04094v2)
Abstract:
Objective: This study aims to consider small graphs of concepts and use them to express graph searches over the existing COVID-19-related literature, leveraging the increasing use of graphs to represent and query scientific knowledge and providing a user-friendly search and exploration experience.
Methods: We considered the COVID-19 Open Research Dataset (CORD-19) corpus and summarized its content by annotating the publications' abstracts with terms selected from the Unified Medical Language System (UMLS) and the Ontology of Coronavirus Infectious Disease. We then built a co-occurrence network that includes all relevant concepts mentioned in the corpus, connecting two concepts when their mutual information is relevant. A graph query engine was built to identify the best matches of graph queries on the network. It also supports partial matches and suggests potential query completions using shortest paths.
Results: We built a large co-occurrence network consisting of 128,249 entities and 47,198,965 relationships. The GRAPH-SEARCH interface allows users to explore the network by formulating or adapting graph queries. It produces a globally ranked bibliography of publications, and each publication is further associated with the specific parts of the query that it explains, allowing the user to understand each aspect of the match.
Conclusions: Our approach supports query formulation and evidence search over a large text corpus; it can be reapplied to any scientific domain for which document corpora and curated ontologies are available.
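The Methods step can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation. The abstract does not specify the exact mutual-information measure, the threshold, or the path algorithm, so the sketch assumes normalized pointwise mutual information (NPMI), an arbitrary threshold of 0.1, and breadth-first search on an unweighted graph. The function names `build_cooccurrence_network` and `shortest_path` and the toy concept labels are hypothetical.

```python
import math
from collections import Counter, deque
from itertools import combinations


def build_cooccurrence_network(docs, threshold=0.1):
    """Link concepts that co-occur in abstracts more often than chance.

    docs: list of sets, each holding the concepts annotated in one abstract.
    threshold: assumed NPMI cut-off (not taken from the paper).
    Returns an adjacency dict {concept: set of neighbouring concepts}.
    """
    n = len(docs)
    single = Counter()  # number of abstracts mentioning each concept
    pair = Counter()    # number of abstracts mentioning both concepts
    for concepts in docs:
        single.update(concepts)
        pair.update(frozenset(p) for p in combinations(sorted(concepts), 2))

    graph = {c: set() for c in single}
    for key, n_xy in pair.items():
        x, y = sorted(key)
        if n_xy == n:
            npmi = 1.0  # both concepts appear in every abstract
        else:
            p_xy = n_xy / n
            pmi = math.log(p_xy / ((single[x] / n) * (single[y] / n)))
            npmi = pmi / -math.log(p_xy)  # normalized to [-1, 1]
        if npmi > threshold:
            graph[x].add(y)
            graph[y].add(x)
    return graph


def shortest_path(graph, source, target):
    """Breadth-first search; returns the node list of a shortest path or None."""
    if source not in graph or target not in graph:
        return None
    parent = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for neighbour in sorted(graph[node]):
            if neighbour not in parent:
                parent[neighbour] = node
                queue.append(neighbour)
    return None


# Toy corpus: each set stands for the concepts annotated in one abstract.
docs = [
    {"covid19", "glucose"},
    {"covid19", "glucose"},
    {"glucose", "insulin"},
    {"glucose", "insulin"},
    {"fever"},
    {"cough"},
]
graph = build_cooccurrence_network(docs)
completion = shortest_path(graph, "covid19", "insulin")
```

In this toy corpus, a query that mentions `covid19` and `insulin` without connecting them would be completed through the intermediate concept `glucose`, which is the kind of suggestion the abstract attributes to shortest paths. A production system at the scale reported in the Results would more plausibly rely on a graph database than on in-memory dictionaries.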