Evaluation of large language models for discovery of gene set function (2309.04019v2)

Published 7 Sep 2023 in q-bio.GN, cs.AI, cs.CL, and q-bio.MN

Abstract: Gene set analysis is a mainstay of functional genomics, but it relies on curated databases of gene functions that are incomplete. Here we evaluate five LLMs for their ability to discover the common biological functions represented by a gene set, substantiated by supporting rationale, citations and a confidence assessment. Benchmarking against canonical gene sets from the Gene Ontology, GPT-4 confidently recovered the curated name or a more general concept (73% of cases), while benchmarking against random gene sets correctly yielded zero confidence. Gemini-Pro and Mixtral-Instruct showed ability in naming but were falsely confident for random sets, whereas Llama2-70b had poor performance overall. In gene sets derived from 'omics data, GPT-4 identified novel functions not reported by classical functional enrichment (32% of cases), which independent review indicated were largely verifiable and not hallucinations. The ability to rapidly synthesize common gene functions positions LLMs as valuable 'omics assistants.

Summary

  • The paper demonstrates that LLMs can analyze gene sets, with GPT-4 recovering curated or generalized functions 73% of the time.
  • It employs a pipeline integrating semantic similarity measures and SapBERT validation to objectively assess gene set naming and confidence scores.
  • The study shows that GPT-4 can surface novel gene functions, identifying insights missed by classical enrichment analysis in 32% of cases.

Evaluation of LLMs for Discovery of Gene Set Function

The paper "Evaluation of LLMs for Discovery of Gene Set Function" presents a comprehensive exploration of the potential for LLMs to assist in functional genomics by offering automated analyses of gene set functions. The paper evaluates five LLMs—GPT-4, Gemini-Pro, Mixtral-Instruct, Llama2-70b, and GPT-3.5—in determining the biological functions represented by gene sets. Using a robust benchmarking framework, the research provides important insights into the application and capability of these models within the field of genomics.

Summary of Results

The research constructs a functional genomics pipeline in which LLMs analyze gene sets, generate descriptive names, and report confidence scores alongside supporting analyses. When benchmarked against canonical gene sets from the Gene Ontology (GO), GPT-4 performed well, recovering the curated name or a more general concept 73% of the time. Moreover, for gene sets derived from 'omics data, GPT-4 proposed functions absent from classical enrichment analysis in 32% of cases, suggesting real potential for novel functional discovery.
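To make the pipeline concrete, here is a minimal sketch of the gene-set naming query, assuming the OpenAI Python client; the prompt wording and the example gene set are illustrative, not the paper's exact prompt or data.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def name_gene_set(genes: list[str]) -> str:
    """Ask the model for a descriptive name plus a confidence score."""
    prompt = (
        "Propose a brief name for the most prominent biological process "
        "performed by the following set of human genes, then give a "
        "confidence score between 0 and 1, where 0 means the genes share "
        "no common function.\n"
        f"Genes: {', '.join(genes)}"
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic output makes benchmarking repeatable
    )
    return resp.choices[0].message.content

# Mitotic cell-cycle genes; a well-curated set should elicit high confidence.
print(name_gene_set(["CDK1", "CCNB1", "PLK1", "BUB1", "AURKB"]))
```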

Performance varied significantly across models: Gemini-Pro and Mixtral-Instruct named gene sets capably but reported unwarranted confidence for random gene sets, while Llama2-70b underperformed overall. The paper's evaluation of confidence calibration and annotation accuracy revealed GPT-4's superior ability to discern signal from noise, assigning zero confidence to 87% of random gene sets.

Methodological Insights

The evaluation process has the LLMs synthesize candidate functions from the biological knowledge embedded in their training data, then compares the LLM-generated names to those officially documented in the Gene Ontology using semantic similarity measures. A SapBERT model scored these similarities, providing an objective, external validity check on LLM output.
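A minimal sketch of such a similarity check follows, assuming the publicly released SapBERT checkpoint on Hugging Face; the paper's exact scoring code may differ.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "cambridgeltl/SapBERT-from-PubMedBERT-fulltext"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)

def embed(term: str) -> torch.Tensor:
    """Embed a term via the [CLS] token, as SapBERT does for entity names."""
    inputs = tokenizer(term, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state
    return hidden[:, 0, :]  # CLS embedding, shape (1, hidden_size)

llm_name = "mitotic cell cycle regulation"    # name proposed by the LLM
go_name = "regulation of mitotic cell cycle"  # curated GO term name
sim = F.cosine_similarity(embed(llm_name), embed(go_name)).item()
print(f"semantic similarity: {sim:.3f}")  # values near 1 indicate a match
```

A high cosine similarity indicates the LLM recovered the curated concept even when the wording differs; a threshold on this score separates matches from misses.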

Practical and Theoretical Implications

This paper underscores the potential utility of LLMs for rapidly interpreting gene sets and illuminating genomic functions that classical methods may overlook. However, given the nuances of biological data, additional reference-validation steps remain necessary to counteract occasional model hallucinations, a well-known failure mode of LLM output. Furthermore, the findings encourage hybrid strategies that merge traditional statistical enrichment with model-based reasoning, which might offer more holistic insight into gene function.
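One way such a hybrid could look, sketched under assumed thresholds and a hypothetical 20,000-gene background: a classical hypergeometric enrichment test screens gene sets for statistical signal, and only then is an LLM asked for a functional name.

```python
from scipy.stats import hypergeom

def enrichment_p(overlap: int, set_size: int, term_size: int,
                 background: int = 20_000) -> float:
    """P(X >= overlap) when drawing set_size genes from the background."""
    return hypergeom.sf(overlap - 1, background, term_size, set_size)

# Hypothetical numbers: 8 of 50 query genes fall in a 200-gene GO term.
p = enrichment_p(overlap=8, set_size=50, term_size=200)
if p < 0.05:
    print(f"classically enriched (p = {p:.2e}); ask the LLM to name the set")
else:
    print("no classical signal; treat any confident LLM name with caution")
```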

Future Developments

Moving forward, expanding the contextual depth of LLM queries by integrating disease-specific or experiment-specific metadata could enhance the specificity of model output, since experimental conditions and disease states may alter how genes interact. Researchers will also need to craft more sophisticated prompting strategies, and perhaps orchestrate LLM interactions with external data sources, to further refine the utility of these models in functional genomics.
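As a purely hypothetical illustration of such context injection (the function name, genes, and condition below are invented for the example, not taken from the paper):

```python
def contextual_prompt(genes: list[str], disease: str, condition: str) -> str:
    """Build a gene-set query that carries experimental context."""
    return (
        f"Context: genes measured in {disease} samples under {condition}.\n"
        "Given this context, propose the most likely shared biological "
        "function of these genes and a confidence score between 0 and 1.\n"
        f"Genes: {', '.join(genes)}"
    )

print(contextual_prompt(["TP53", "MDM2", "CDKN1A"],
                        disease="colorectal cancer",
                        condition="5-fluorouracil treatment"))
```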

In conclusion, this paper makes a timely contribution to the burgeoning field of computational biology by showing that LLMs can both recapitulate known gene set functions and uncover novel ones. Its rigorous approach provides a pathway for future research, advancing AI tools for genomic discovery while balancing innovation against the proven reliability of established scientific methodology.