
MMSci: A Multimodal Multi-Discipline Dataset for PhD-Level Scientific Comprehension

(2407.04903)
Published Jul 6, 2024 in cs.CL, cs.AI, and cs.CV

Abstract

The rapid advancement of LLMs and Large Multimodal Models (LMMs) has heightened the demand for AI-based scientific assistants capable of understanding scientific articles and figures. Despite progress, there remains a significant gap in evaluating models' comprehension of professional, graduate-level, and even PhD-level scientific content. Current datasets and benchmarks primarily focus on relatively simple scientific tasks and figures, lacking comprehensive assessments across diverse advanced scientific disciplines. To bridge this gap, we collected a multimodal, multidisciplinary dataset from open-access scientific articles published in Nature Communications journals. This dataset spans 72 scientific disciplines, ensuring both diversity and quality. We created benchmarks with various tasks and settings to comprehensively evaluate LMMs' capabilities in understanding scientific figures and content. Our evaluation revealed that these tasks are highly challenging: many open-source models struggled significantly, and even GPT-4V and GPT-4o faced difficulties. We also explored using our dataset as training resources by constructing visual instruction-following data, enabling the 7B LLaVA model to achieve performance comparable to GPT-4V/o on our benchmark. Additionally, we investigated the use of our interleaved article texts and figure images for pre-training LMMs, resulting in improvements on the material generation task. The source dataset, including articles, figures, constructed benchmarks, and visual instruction-following data, is open-sourced.

Figure: Benchmark and visual instruction-following data construction example in MMSci.

Overview

  • The MMSci paper introduces a dataset designed to evaluate and enhance multimodal models in comprehending complex, PhD-level scientific literature across 72 disciplines.

  • Two main tasks, Scientific Figure Captioning and Visual Question Answering (VQA), were used to benchmark models rigorously, revealing that models given the full article context perform best.

  • The dataset also serves as a training resource: visual instruction tuning lifts a 7B LLaVA model to performance comparable to GPT-4V/GPT-4o, and a case study on generating novel crystal structures showcases the value of pre-training on the interleaved article text and figures.

Essay on "MMSci: A Multimodal Multi-Discipline Dataset for PhD-Level Scientific Comprehension"

This paper presents MMSci, a novel dataset meticulously curated to facilitate the evaluation and enhancement of Large Multimodal Models (LMMs) in comprehending advanced, multimodal scientific literature. This dataset encompasses peer-reviewed articles and figures from 72 distinct scientific disciplines, making it both diverse and robust for rigorous assessments of LMM capabilities.

The motivation for MMSci stems from the rapid advancements in LLMs and LMMs, which, while successful at elementary to undergraduate-level tasks, often falter when tasked with understanding PhD-level scientific content. MMSci addresses this gap by providing not only a challenging evaluation benchmark but also substantial training resources to enhance model performance.

Dataset and Benchmark Construction

The MMSci dataset was gathered from high-quality, open-access articles published in Nature Communications journals, ensuring authenticity and scholarly reliability. The dataset spans five major categories covering subjects such as materials science, ecology, and molecular biology. The collected data, which includes titles, abstracts, full article text, figures, and captions, was processed with regular-expression matching to segment sub-figures and their corresponding sub-captions from complex multi-panel figures.
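For illustration, the minimal sketch below shows how such sub-caption segmentation could be implemented; it assumes the common Nature-style convention of lettered panel labels ("a, ...", "b, ..."), and the actual patterns used by the authors may differ.

```python
import re

def split_subcaptions(caption: str) -> dict[str, str]:
    """Split a multi-panel figure caption into per-panel sub-captions.

    Assumes the Nature-style convention in which panels are introduced
    as "a, ...", "b, ...", etc. This is an illustrative sketch, not the
    authors' exact preprocessing code.
    """
    # Match a panel label such as "a,", "(a)", or "b." at a word boundary.
    pattern = re.compile(r"(?:^|\s)\(?([a-h])\)?[,.]\s+")
    matches = list(pattern.finditer(caption))
    subcaptions = {}
    for i, m in enumerate(matches):
        start = m.end()
        end = matches[i + 1].start() if i + 1 < len(matches) else len(caption)
        subcaptions[m.group(1)] = caption[start:end].strip()
    return subcaptions

# Invented example caption for demonstration only.
example = "a, SEM image of the as-grown film. b, Raman spectra at 300 K. c, XRD patterns."
print(split_subcaptions(example))
```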

In addition to the dataset, a comprehensive benchmark was constructed to evaluate LMMs rigorously. It comprises two primary tasks, Scientific Figure Captioning and Visual Question Answering (VQA), each with multiple settings that probe different aspects of model comprehension (a prompt-construction sketch follows the list):

  • Ungrounded, abstract-grounded, and full-content-grounded figure captioning: models generate captions given no context, the article abstract, or the full article text, respectively.
  • Multiple-choice VQA settings: models select the correct caption or sub-caption for a given figure, testing their understanding of both the figures and the surrounding context.
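The sketch below illustrates, under assumed prompt templates (not the benchmark's exact wording), how the three captioning settings and a multiple-choice VQA item could be assembled as text prompts.

```python
def captioning_prompt(setting: str, abstract: str = "", full_text: str = "") -> str:
    """Assemble a figure-captioning prompt for one of the three settings.

    The wording here is illustrative; the benchmark's actual templates may differ.
    """
    base = "Write a caption for the attached scientific figure."
    if setting == "ungrounded":
        return base
    if setting == "abstract":
        return f"Article abstract:\n{abstract}\n\n{base}"
    if setting == "full":
        return f"Full article text:\n{full_text}\n\n{base}"
    raise ValueError(f"unknown setting: {setting}")


def vqa_prompt(question: str, choices: list[str]) -> str:
    """Format a multiple-choice VQA item (e.g., pick the correct caption)."""
    letters = "ABCD"
    options = "\n".join(f"{letters[i]}. {c}" for i, c in enumerate(choices))
    return f"{question}\n{options}\nAnswer with a single letter."
```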

Evaluation Results

The evaluation of prevalent open-source and proprietary LMMs reveals significant insights:

  • Scientific Figure Captioning: GPT-4o, when given the full article context, achieved the best METEOR and ROUGE scores (a scoring sketch follows this list), highlighting the necessity of comprehensive context for accurate figure interpretation. Open-source models such as LLaVA-Next performed markedly worse, underscoring the difficulty of the task.
  • VQA Performance: Proprietary models (e.g., GPT-4V, GPT-4o) clearly outperformed their open-source counterparts, particularly when using Chain-of-Thought (CoT) reasoning, which improved accuracy by a substantial margin.
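As a rough illustration of the captioning metrics, the sketch below computes METEOR and ROUGE with the nltk and rouge-score packages; the example captions are invented, and the authors' exact evaluation code may differ.

```python
# pip install rouge-score nltk
import nltk
from nltk.translate.meteor_score import meteor_score
from rouge_score import rouge_scorer

nltk.download("wordnet", quiet=True)  # METEOR relies on WordNet

reference = "Schematic of the perovskite solar cell architecture."   # ground-truth caption (invented)
prediction = "Diagram of the solar cell device architecture."        # model-generated caption (invented)

# ROUGE-1 / ROUGE-L F1 between the generated and reference captions.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
rouge = scorer.score(reference, prediction)

# METEOR expects tokenized inputs: a list of references and one hypothesis.
meteor = meteor_score([reference.split()], prediction.split())

print({k: round(v.fmeasure, 3) for k, v in rouge.items()}, round(meteor, 3))
```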

Training Resources and Enhancements

To address identified deficiencies, the authors explored the MMSci dataset as a training resource:

  • Visual Instruction-Following Data: Single- and multi-turn conversations about figure content, reflecting realistic discussions of scientific figures (a sample record is sketched after this list). Fine-tuning the 7B LLaVA model on this data brought its benchmark performance close to that of GPT-4V/GPT-4o.
  • Interleaved Text and Image Data for Pre-training: Article text and figures are interleaved into a cohesive corpus for continued pre-training, which yields the material-generation improvements examined in the case study below.
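The sketch below shows what one such instruction-following record might look like if stored in the widely used LLaVA-style conversation format; the field names and example content are assumptions, not the dataset's guaranteed schema.

```python
import json

# One hypothetical multi-turn record in a LLaVA-style conversation format.
# Field names, file paths, and the dialogue content are illustrative only.
record = {
    "id": "ncomms-example-0001",
    "image": "figures/ncomms_example_fig2.png",
    "conversations": [
        {"from": "human", "value": "<image>\nWhat does panel b of this figure show?"},
        {"from": "gpt", "value": "Panel b shows the Raman spectra of the film measured at 300 K."},
        {"from": "human", "value": "How does it relate to the claim in the abstract?"},
        {"from": "gpt", "value": "It supports the claim that the film retains its phase at room temperature."},
    ],
}

with open("mmsci_visual_instruct_sample.json", "w") as f:
    json.dump([record], f, indent=2)
```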

Case Study on Material Generation

A highlight of the paper is the case study demonstrating the efficacy of continued pre-training on MMSci. With this approach, the LLaMA2-7B model generated novel crystal structures with improved validity and stability, a key task in materials science. This illustrates the benefit of scientifically enriched training data, which infuses the model with domain-specific knowledge and enhances its generative capabilities.
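As a hedged sketch of how validity might be checked in such a pipeline, the snippet below parses model output as a CIF string with pymatgen and applies a minimal interatomic-distance check; the paper's actual evaluation criteria may differ.

```python
# pip install pymatgen
from pymatgen.core import Structure

def is_valid_cif(cif_text: str, min_dist: float = 0.5) -> bool:
    """Return True if the generated text parses as a CIF and all interatomic
    distances exceed min_dist angstroms (a common minimal validity check)."""
    try:
        structure = Structure.from_str(cif_text, fmt="cif")
    except Exception:
        return False  # unparseable output counts as invalid
    return structure.is_valid(tol=min_dist)
```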

Implications and Future Directions

The implications of this research are manifold. Practically, MMSci enables the development of more capable and reliable AI assistants for scientific research, potentially automating parts of the research process such as literature review and data analysis. Theoretically, it provides insights into the integration of multimodal data within AI systems, furthering our understanding of how these systems can interpret and generate scientific content.

Future research directions could involve expanding the dataset to include more diverse forms of scientific content, such as supplementary materials and experimental datasets, or refining the evaluation metrics to capture nuanced aspects of model performance. The development of methods to seamlessly integrate multimodal pre-training with downstream task fine-tuning will also be pivotal.

Conclusion

MMSci stands as a significant contribution to the field of scientific AI, providing both a rigorous evaluation benchmark and valuable training resources. It bridges the gap in current model evaluations by focusing on PhD-level content and offers a path towards enhancing LMM capabilities in comprehending complex scientific literature. This work underscores the necessity of context-rich, diverse datasets in developing advanced AI solutions for academic and scientific endeavors.
