
MMSci: A Multimodal Multi-Discipline Dataset for PhD-Level Scientific Comprehension

(2407.04903)
Published Jul 6, 2024 in cs.CL, cs.AI, and cs.CV

Abstract

The rapid advancement of LLMs and Large Multimodal Models (LMMs) has heightened the demand for AI-based scientific assistants capable of understanding scientific articles and figures. Despite progress, there remains a significant gap in evaluating models' comprehension of professional, graduate-level, and even PhD-level scientific content. Current datasets and benchmarks primarily focus on relatively simple scientific tasks and figures, lacking comprehensive assessments across diverse advanced scientific disciplines. To bridge this gap, we collected a multimodal, multidisciplinary dataset from open-access scientific articles published in Nature Communications journals. This dataset spans 72 scientific disciplines, ensuring both diversity and quality. We created benchmarks with various tasks and settings to comprehensively evaluate LMMs' capabilities in understanding scientific figures and content. Our evaluation revealed that these tasks are highly challenging: many open-source models struggled significantly, and even GPT-4V and GPT-4o faced difficulties. We also explored using our dataset as training resources by constructing visual instruction-following data, enabling the 7B LLaVA model to achieve performance comparable to GPT-4V/o on our benchmark. Additionally, we investigated the use of our interleaved article texts and figure images for pre-training LMMs, resulting in improvements on the material generation task. The source dataset, including articles, figures, constructed benchmarks, and visual instruction-following data, is open-sourced.

Figure: Benchmark and visual instruction-following data construction example in MMSci.

Overview

  • The MMSci paper introduces a dataset designed to evaluate and enhance multimodal models in comprehending complex, PhD-level scientific literature across 72 disciplines.

  • Two main tasks, Scientific Figure Captioning and Visual Question Answering (VQA), were used to benchmark models rigorously, revealing that models given the full article context perform best.

  • The dataset also serves as a training resource: visual instruction tuning lifts a 7B LLaVA model to performance comparable to GPT-4V/GPT-4o, and a case study on generating novel crystal structures showcases the value of pre-training on the interleaved article text and figures.

Essay on "MMSci: A Multimodal Multi-Discipline Dataset for PhD-Level Scientific Comprehension"

This paper presents MMSci, a novel dataset meticulously curated to facilitate the evaluation and enhancement of Large Multimodal Models (LMMs) in comprehending advanced, multimodal scientific literature. This dataset encompasses peer-reviewed articles and figures from 72 distinct scientific disciplines, making it both diverse and robust for rigorous assessments of LMM capabilities.

The motivation for MMSci stems from the rapid advancements in LLMs and LMMs, which, while successful at elementary to undergraduate-level tasks, often falter when tasked with understanding PhD-level scientific content. MMSci addresses this gap by providing not only a challenging evaluation benchmark but also substantial training resources to enhance model performance.

Dataset and Benchmark Construction

The MMSci dataset was gathered from high-quality, open-access articles published in Nature Communications journals, ensuring authenticity and scholarly reliability. The dataset spans five major categories covering subjects such as materials science, ecology, and molecular biology. The collected data, which includes titles, abstracts, full article text, figures, and captions, was processed with regular-expression matching to segment sub-figures and their corresponding sub-captions from complex multi-panel figures.
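For illustration, the minimal sketch below shows how such sub-caption segmentation could be implemented; it assumes the common Nature-style convention of lettered panel labels ("a, ...", "b, ..."), and the actual patterns used by the authors may differ.

```python
import re

def split_subcaptions(caption: str) -> dict[str, str]:
    """Split a multi-panel figure caption into per-panel sub-captions.

    Assumes the Nature-style convention in which panels are introduced
    as "a, ...", "b, ...", etc. This is an illustrative sketch, not the
    authors' exact preprocessing code.
    """
    # Match a panel label such as "a,", "(a)", or "b." at a word boundary.
    pattern = re.compile(r"(?:^|\s)\(?([a-h])\)?[,.]\s+")
    matches = list(pattern.finditer(caption))
    subcaptions = {}
    for i, m in enumerate(matches):
        start = m.end()
        end = matches[i + 1].start() if i + 1 < len(matches) else len(caption)
        subcaptions[m.group(1)] = caption[start:end].strip()
    return subcaptions

# Invented example caption for demonstration only.
example = "a, SEM image of the as-grown film. b, Raman spectra at 300 K. c, XRD patterns."
print(split_subcaptions(example))
```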

In addition to the dataset, a comprehensive benchmark was constructed to evaluate LMMs rigorously. It comprises two primary tasks, Scientific Figure Captioning and Visual Question Answering (VQA), each with multiple settings that probe different aspects of model comprehension (a prompt-construction sketch follows the list):

  • Ungrounded, abstract-grounded, and full-content-grounded figure captioning: models generate captions given no context, the article abstract, or the full article text, respectively.
  • Multiple-choice VQA settings: models select the correct caption or sub-caption for a given figure, testing their understanding of both the figures and the surrounding context.
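The sketch below illustrates, under assumed prompt templates (not the benchmark's exact wording), how the three captioning settings and a multiple-choice VQA item could be assembled as text prompts.

```python
def captioning_prompt(setting: str, abstract: str = "", full_text: str = "") -> str:
    """Assemble a figure-captioning prompt for one of the three settings.

    The wording here is illustrative; the benchmark's actual templates may differ.
    """
    base = "Write a caption for the attached scientific figure."
    if setting == "ungrounded":
        return base
    if setting == "abstract":
        return f"Article abstract:\n{abstract}\n\n{base}"
    if setting == "full":
        return f"Full article text:\n{full_text}\n\n{base}"
    raise ValueError(f"unknown setting: {setting}")


def vqa_prompt(question: str, choices: list[str]) -> str:
    """Format a multiple-choice VQA item (e.g., pick the correct caption)."""
    letters = "ABCD"
    options = "\n".join(f"{letters[i]}. {c}" for i, c in enumerate(choices))
    return f"{question}\n{options}\nAnswer with a single letter."
```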

Evaluation Results

The evaluation of prevalent open-source and proprietary LMMs reveals significant insights:

  • Scientific Figure Captioning: GPT-4o, when given the full article context, achieved the best METEOR and ROUGE scores (a scoring sketch follows this list), highlighting the necessity of comprehensive context for accurate figure interpretation. Open-source models such as LLaVA-Next performed markedly worse, underscoring the difficulty of the task.
  • VQA Performance: Proprietary models (e.g., GPT-4V, GPT-4o) clearly outperformed their open-source counterparts, particularly when using Chain-of-Thought (CoT) reasoning, which improved accuracy by a substantial margin.
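As a rough illustration of the captioning metrics, the sketch below computes METEOR and ROUGE with the nltk and rouge-score packages; the example captions are invented, and the authors' exact evaluation code may differ.

```python
# pip install rouge-score nltk
import nltk
from nltk.translate.meteor_score import meteor_score
from rouge_score import rouge_scorer

nltk.download("wordnet", quiet=True)  # METEOR relies on WordNet

reference = "Schematic of the perovskite solar cell architecture."   # ground-truth caption (invented)
prediction = "Diagram of the solar cell device architecture."        # model-generated caption (invented)

# ROUGE-1 / ROUGE-L F1 between the generated and reference captions.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
rouge = scorer.score(reference, prediction)

# METEOR expects tokenized inputs: a list of references and one hypothesis.
meteor = meteor_score([reference.split()], prediction.split())

print({k: round(v.fmeasure, 3) for k, v in rouge.items()}, round(meteor, 3))
```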

Training Resources and Enhancements

To address identified deficiencies, the authors explored the MMSci dataset as a training resource:

  • Visual Instruction-Following Data: Single- and multi-turn conversations about figure content, reflecting realistic discussions of scientific figures (a sample record is sketched after this list). Fine-tuning the 7B LLaVA model on this data brought its benchmark performance close to that of GPT-4V/GPT-4o.
  • Interleaved Text and Image Data for Pre-training: Article text and figures are interleaved into a cohesive corpus for continued pre-training, which yields the material-generation improvements examined in the case study below.
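The sketch below shows what one such instruction-following record might look like if stored in the widely used LLaVA-style conversation format; the field names and example content are assumptions, not the dataset's guaranteed schema.

```python
import json

# One hypothetical multi-turn record in a LLaVA-style conversation format.
# Field names, file paths, and the dialogue content are illustrative only.
record = {
    "id": "ncomms-example-0001",
    "image": "figures/ncomms_example_fig2.png",
    "conversations": [
        {"from": "human", "value": "<image>\nWhat does panel b of this figure show?"},
        {"from": "gpt", "value": "Panel b shows the Raman spectra of the film measured at 300 K."},
        {"from": "human", "value": "How does it relate to the claim in the abstract?"},
        {"from": "gpt", "value": "It supports the claim that the film retains its phase at room temperature."},
    ],
}

with open("mmsci_visual_instruct_sample.json", "w") as f:
    json.dump([record], f, indent=2)
```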

Case Study on Material Generation

A highlight of the paper is the case study demonstrating the efficacy of continued pre-training on MMSci. With this approach, the LLaMA2-7B model generated novel crystal structures with improved validity and stability, a key task in materials science. This illustrates the benefit of scientifically enriched training data, which infuses the model with domain-specific knowledge and enhances its generative capabilities.
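As a hedged sketch of how validity might be checked in such a pipeline, the snippet below parses model output as a CIF string with pymatgen and applies a minimal interatomic-distance check; the paper's actual evaluation criteria may differ.

```python
# pip install pymatgen
from pymatgen.core import Structure

def is_valid_cif(cif_text: str, min_dist: float = 0.5) -> bool:
    """Return True if the generated text parses as a CIF and all interatomic
    distances exceed min_dist angstroms (a common minimal validity check)."""
    try:
        structure = Structure.from_str(cif_text, fmt="cif")
    except Exception:
        return False  # unparseable output counts as invalid
    return structure.is_valid(tol=min_dist)
```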

Implications and Future Directions

The implications of this research are manifold. Practically, MMSci enables the development of more capable and reliable AI assistants for scientific research, potentially automating parts of the research process such as literature review and data analysis. Theoretically, it provides insights into the integration of multimodal data within AI systems, furthering our understanding of how these systems can interpret and generate scientific content.

Future research directions could involve expanding the dataset to include more diverse forms of scientific content, such as supplementary materials and experimental datasets, or refining the evaluation metrics to capture nuanced aspects of model performance. The development of methods to seamlessly integrate multimodal pre-training with downstream task fine-tuning will also be pivotal.

Conclusion

MMSci stands as a significant contribution to the field of scientific AI, providing both a rigorous evaluation benchmark and valuable training resources. It bridges the gap in current model evaluations by focusing on PhD-level content and offers a path towards enhancing LMM capabilities in comprehending complex scientific literature. This work underscores the necessity of context-rich, diverse datasets in developing advanced AI solutions for academic and scientific endeavors.
