
Abstract

In many scientific fields, LLMs have revolutionized the way text and other modalities of data (e.g., molecules and proteins) are handled, achieving superior performance in various applications and augmenting the scientific discovery process. Nevertheless, previous surveys on scientific LLMs often concentrate on one or two fields or a single modality. In this paper, we aim to provide a more holistic view of the research landscape by unveiling cross-field and cross-modal connections between scientific LLMs regarding their architectures and pre-training techniques. To this end, we comprehensively survey over 250 scientific LLMs, discuss their commonalities and differences, and summarize pre-training datasets and evaluation tasks for each field and modality. Moreover, we investigate how LLMs have been deployed to benefit scientific discovery. Resources related to this survey are available at https://github.com/yuzhimanhua/Awesome-Scientific-Language-Models.

Figure: Three scientific LLM pre-training techniques: masked language modeling, next token prediction, and contrastive learning.

Overview

  • The paper provides an extensive survey of over 250 scientific LLMs, detailing their architectures, training methodologies, and cross-field applications.

  • It categorizes pre-training techniques into three major strategies: Masked Language Modeling, Next Token Prediction, and Contrastive Learning for Multi-Modal Data.

  • The survey explores LLM applications across various scientific domains such as general science, mathematics, physics, chemistry, biology, and environmental science, detailing specific models and their evaluation tasks.

A Comprehensive Survey of Scientific LLMs and Their Applications in Scientific Discovery

The advancement of LLMs has significantly transformed numerous scientific domains, facilitating the manipulation and analysis of text and other forms of data (e.g., molecules and proteins). This paper presents an extensive survey of over 250 scientific LLMs, detailing their architectures, training methodologies, and applications across various scientific fields. The survey goes beyond previous reviews by bridging cross-field and cross-modal connections, thereby offering a holistic view of the landscape.

Overview of Pre-Training Techniques and Architectures

The paper categorizes scientific LLM pre-training techniques into three major strategies:

  1. Masked Language Modeling (MLM) for Encoder Models: Inspired by BERT and RoBERTa, this approach serializes the input, which can be text, academic graphs, molecular sequences, or biological sequences, into tokens; the model learns by predicting masked tokens within these sequences.
  2. Next Token Prediction for Decoder Models: Following models such as GPT and LLaMA, this technique is often combined with instruction tuning. The input, which can be text, tables, images, or crystal data, is converted into a sequence format that the model processes to predict subsequent tokens.
  3. Contrastive Learning for Multi-Modal Data: This strategy uses multiple encoders to map different but relevant data closer together in latent space. The technique applies to combinations such as text-text, text-protein, text-graph, and text-image. A combined code sketch of all three objectives follows this list.
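To make these objectives concrete, the following minimal PyTorch sketch computes toy versions of the three losses. All tensors, shapes, and hyperparameters (vocabulary size, masking rate, temperature) are illustrative placeholders rather than settings from any model in the survey; real scientific LLMs plug domain-specific tokenizers and encoders into the same loss structure.

```python
import torch
import torch.nn.functional as F

# Illustrative sketch of the three pre-training objectives discussed above.
# All shapes and hyperparameters are hypothetical stand-ins.
vocab_size, hidden, batch, seq_len = 1000, 64, 4, 16
tokens = torch.randint(5, vocab_size, (batch, seq_len))               # toy token ids
logits = torch.randn(batch, seq_len, vocab_size)                      # stand-in model output

# 1) Masked language modeling (encoder models, BERT-style):
#    randomly mask ~15% of positions and predict only those tokens.
mask = torch.rand(batch, seq_len) < 0.15
mlm_labels = tokens.masked_fill(~mask, -100)                          # -100 = ignored position
mlm_loss = F.cross_entropy(logits.reshape(-1, vocab_size),
                           mlm_labels.reshape(-1), ignore_index=-100)

# 2) Next token prediction (decoder models, GPT/LLaMA-style):
#    predict token t+1 from positions <= t, i.e. shift the labels by one.
ntp_loss = F.cross_entropy(logits[:, :-1].reshape(-1, vocab_size),
                           tokens[:, 1:].reshape(-1))

# 3) Contrastive learning (multi-modal, InfoNCE-style):
#    pull paired text/molecule (or text/image, text/protein) embeddings together.
text_emb = F.normalize(torch.randn(batch, hidden), dim=-1)
mol_emb = F.normalize(torch.randn(batch, hidden), dim=-1)
temperature = 0.07
sim = text_emb @ mol_emb.t() / temperature                            # (batch, batch) similarities
targets = torch.arange(batch)                                         # diagonal entries are the positives
contrastive_loss = (F.cross_entropy(sim, targets) +
                    F.cross_entropy(sim.t(), targets)) / 2

print(mlm_loss.item(), ntp_loss.item(), contrastive_loss.item())
```

Note that the first two objectives share the same cross-entropy machinery and differ only in which positions are predicted, which is why many scientific LLM families can move between encoder-style and decoder-style pre-training with relatively modest changes.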

Scientific Fields and Modalities Explored

The survey dives deep into several scientific domains, summarizing the prominent models, their training data, and evaluation tasks:

  • General Science: Models like SciBERT and Galactica use bibliographic databases for pre-training and are evaluated on tasks such as named entity recognition (NER), relation extraction (RE), and question answering (QA); a brief fine-tuning sketch for SciBERT-style encoders follows this list. Integration of graph data via models like SPECTER and SciPatton enhances tasks related to paper-paper relationships.
  • Mathematics: Mathematical LLMs leverage datasets like MathQA and GSM8K for pre-training. Models like Minerva and MetaMath specialize in answering mathematical questions, handling math word problems (MWP), and performing quantitative reasoning.
  • Physics: In the astronomy subdomain, models such as astroBERT and AstroLLaMA are trained on extensive arXiv corpora and evaluated on tasks such as NER and paper recommendation.
  • Chemistry and Materials Science: Here, models like ChemBERT and MatSciBERT are commonly pre-trained with journal articles and databases. Recent models incorporate graph and image modalities for tasks including molecule generation, reaction prediction, and retrosynthesis.
  • Biology and Medicine: Leveraging datasets from PubMed and other medical databases, models like BioBERT and Med-PaLM target numerous tasks from NER and RE to complex medical QA. The integration of vision and graph data further extends the applicability to tasks such as medical report generation and multi-hop reasoning.
  • Geography, Geology, and Environmental Science: Models such as ClimateBERT are pre-trained on climate-related text, while models such as UrbanCLIP integrate point-of-interest (POI) and satellite imagery data; together they cover tasks ranging from climate-related forecasting to urban planning.
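As a concrete illustration of how such domain encoders are reused downstream, the sketch below loads a SciBERT checkpoint from the Hugging Face Hub and attaches a token-classification head for NER fine-tuning. The checkpoint name (allenai/scibert_scivocab_uncased) and the entity label set are assumptions made for this example; the classification head is randomly initialized and would still need fine-tuning on labeled scientific NER data.

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Hypothetical NER setup on top of SciBERT; the label set below is illustrative.
labels = ["O", "B-METHOD", "I-METHOD", "B-TASK", "I-TASK"]

tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
model = AutoModelForTokenClassification.from_pretrained(
    "allenai/scibert_scivocab_uncased",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
)

# Tokenize a sample sentence; the freshly initialized head must be fine-tuned on
# labeled scientific NER data before its predictions mean anything.
inputs = tokenizer("SciBERT improves named entity recognition on scientific text.",
                   return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)  # (1, num_tokens, num_labels)
```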

Implications and Future Directions

The research underscores the transformative potential of LLMs in scientific discovery. These models can automate various stages of the research process, from hypothesis generation to experiment design and result analysis. For instance, math LLMs assist in theorem proving, and protein LLMs such as ESM-2 provide representations that support protein structure prediction, streamlining and accelerating the discovery process.
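As a small illustration of the protein case, the sketch below extracts per-residue ESM-2 representations via Hugging Face transformers; such embeddings are the kind of features that downstream structure and function predictors build on. The checkpoint name (facebook/esm2_t6_8M_UR50D, a small ESM-2 variant) is an assumption of this example, and full structure prediction would go through the separate ESMFold pipeline, which is omitted here.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Sketch only: extract per-residue ESM-2 embeddings.
# The checkpoint name is assumed for illustration.
tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t6_8M_UR50D")
model = AutoModel.from_pretrained("facebook/esm2_t6_8M_UR50D")
model.eval()

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # toy protein sequence
inputs = tokenizer(sequence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Shape: (1, sequence length + special tokens, hidden size); these per-residue
# features are what downstream structure/function predictors typically consume.
print(outputs.last_hidden_state.shape)
```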

Challenges and Future Directions:

  1. Handling Fine-Grained Themes: Many current models target broad scientific fields, potentially overlooking specialized knowledge areas. Future models could benefit from focusing on fine-grained themes and leveraging detailed knowledge graphs.
  2. Generalizing to Out-of-Distribution Data: Given the dynamic nature of scientific research, models need better generalization capabilities for new and unseen data. Techniques from invariant learning can be explored to address this challenge.
  3. Ensuring Trustworthy Predictions: Hallucination in LLMs, especially in high-stakes domains like biomedicine, remains a critical issue. Cross-modal retrieval-augmented generation, which grounds models in robust and reliable external information, could improve prediction accuracy; a minimal retrieval sketch follows this list.
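To illustrate the retrieval-augmentation idea named in the third point, here is a minimal, model-agnostic sketch of the retrieve-then-prompt loop. TF-IDF similarity stands in for a trained (possibly cross-modal) retriever, and the toy corpus and query are invented for this example; the point is only the pipeline shape: retrieve trusted evidence, then condition the LLM's prompt on it.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus of "trusted" external snippets; in a real system these could be
# curated biomedical abstracts, knowledge-graph facts, or cross-modal records.
corpus = [
    "Aspirin irreversibly inhibits cyclooxygenase enzymes.",
    "ESM-2 is a protein language model trained on protein sequences.",
    "GSM8K is a benchmark of grade-school math word problems.",
]
query = "Which enzyme does aspirin inhibit?"

# Retrieval step: TF-IDF + cosine similarity as a stand-in for a dense retriever.
vectorizer = TfidfVectorizer().fit(corpus + [query])
doc_vecs = vectorizer.transform(corpus)
query_vec = vectorizer.transform([query])
scores = cosine_similarity(query_vec, doc_vecs)[0]
top_k = scores.argsort()[::-1][:2]

# Augmentation step: prepend retrieved evidence so the LLM's answer can be
# grounded in (and checked against) external sources.
context = "\n".join(corpus[i] for i in top_k)
prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
print(prompt)
```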

Conclusion

This survey provides a comprehensive analysis of the current state of scientific LLMs, showcasing their applications and shared methodologies across different fields and modalities. By highlighting existing challenges and future directions, it aims to foster further work toward more efficient, accurate, and reliable models for scientific discovery.
