SPIQA: A Dataset for Multimodal Question Answering on Scientific Papers (2407.09413v3)

Published 12 Jul 2024 in cs.CL, cs.AI, and cs.CV

Abstract: Seeking answers to questions within long scientific research articles is a crucial area of study that aids readers in quickly addressing their inquiries. However, existing question-answering (QA) datasets based on scientific papers are limited in scale and focus solely on textual content. We introduce SPIQA (Scientific Paper Image Question Answering), the first large-scale QA dataset specifically designed to interpret complex figures and tables within the context of scientific research articles across various domains of computer science. Leveraging the breadth of expertise and ability of multimodal LLMs (MLLMs) to understand figures, we employ automatic and manual curation to create the dataset. We craft an information-seeking task on interleaved images and text that involves multiple images covering plots, charts, tables, schematic diagrams, and result visualizations. SPIQA comprises 270K questions divided into training, validation, and three different evaluation splits. Through extensive experiments with 12 prominent foundational models, we evaluate the ability of current multimodal systems to comprehend the nuanced aspects of research articles. Additionally, we propose a Chain-of-Thought (CoT) evaluation strategy with in-context retrieval that allows fine-grained, step-by-step assessment and improves model performance. We further explore the upper bounds of performance enhancement with additional textual information, highlighting its promising potential for future research and the dataset's impact on revolutionizing how we interact with scientific literature.

Summary

  • The paper introduces SPIQA, a novel dataset with 270K questions from 26K scientific papers, addressing the limitations of earlier QA datasets.
  • It defines three distinct tasks (direct QA with figures and tables, direct QA with the full paper, and CoT QA) to probe model comprehension and reasoning.
  • A new LLMLogScore (L3Score) metric is proposed to evaluate free-form answers, addressing the limitations of traditional evaluation metrics.

SPIQA: A Dataset for Multimodal Question Answering on Scientific Papers

The paper introduces SPIQA (Scientific Paper Image Question Answering), a novel large-scale QA dataset designed to evaluate MLLMs' ability to interpret figures and tables within scientific research articles. The dataset addresses the limitations of existing QA datasets, which often focus solely on textual content or are limited in scale. SPIQA comprises 270K questions generated from 26K papers across various computer science domains. The dataset is designed to assess the long-context capabilities of MLLMs through three distinct tasks: direct QA with figures and tables, direct QA with the full paper, and a CoT QA approach.

Dataset Construction and Tasks

The SPIQA dataset was constructed using a combination of automatic and manual curation techniques. The process involved collecting PDFs and TeX sources of papers published at 19 top-tier computer science conferences between 2018 and 2023 (Figure 1). Gemini 1.5 Pro was used to automatically generate question-answer pairs, which were then manually filtered to ensure quality and relevance. The collection guidelines emphasized the need for questions that require a robust understanding of figures, tables, and their captions.

Figure 1: Source of the collected papers and distribution of figures and tables per paper, highlighting the breadth of computer science domains covered.
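The paper states only that Gemini 1.5 Pro generated candidate question-answer pairs that were then manually filtered. The sketch below illustrates what such a generation step could look like using the google-generativeai Python SDK; the prompt, the `generate_qa` helper, and the filtering criteria are illustrative assumptions, not the authors' actual pipeline.

```python
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-pro")

# Hypothetical prompt; the paper's actual generation prompt is not reproduced here.
PROMPT = (
    "You are given a figure (or table) and its caption from a computer science paper. "
    "Write one information-seeking question that can only be answered by interpreting "
    "the figure together with its caption, followed by a concise free-form answer.\n"
    "Caption: {caption}"
)

def generate_qa(figure_path: str, caption: str) -> str:
    """Ask the MLLM for one question-answer pair grounded in a single figure or table."""
    figure = Image.open(figure_path)
    response = model.generate_content([PROMPT.format(caption=caption), figure])
    return response.text  # raw QA pair; manual filtering happens downstream
```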

The three tasks defined for SPIQA are:

  • Direct QA with Figures and Tables: Models are presented with all figures and tables from a paper and must answer questions based on them.
  • Direct QA with Full Paper: Models are given the entire paper, including text, figures, and tables, and must answer questions.
  • CoT QA: Models must first identify the most relevant figures and tables and then answer the question, allowing for a fine-grained evaluation of reasoning and grounding capabilities (Figure 2); a two-step prompting sketch follows the figure caption below.

    Figure 2: Illustration of the SPIQA tasks, showcasing the integration of information across figures, tables, and paper text for question answering.
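As a rough approximation of the CoT task, the sketch below runs two model calls: one to select the most relevant figure or table, and one to answer using only that element. The `query_mllm` helper and both prompts are hypothetical placeholders, not the paper's exact evaluation protocol.

```python
from typing import Callable, Dict, List, Tuple

# query_mllm is a hypothetical helper: it sends a text prompt plus a list of image
# file paths to the MLLM under evaluation and returns its text response.
QueryFn = Callable[[str, List[str]], str]

def cot_qa(question: str,
           figures: Dict[str, Tuple[str, str]],  # name -> (image_path, caption)
           query_mllm: QueryFn) -> Dict[str, str]:
    """Two-step CoT QA: first ground the question in a figure/table, then answer it."""
    caption_list = "\n".join(f"{name}: {cap}" for name, (_, cap) in figures.items())

    # Step 1: grounding -- the model names the most relevant figure or table.
    selected = query_mllm(
        f"Question: {question}\nFigures and tables:\n{caption_list}\n"
        "Name the single most relevant figure or table.",
        [path for path, _ in figures.values()],
    ).strip()

    # Step 2: answering -- the model answers using only the selected element.
    path, caption = figures.get(selected, next(iter(figures.values())))
    answer = query_mllm(
        f"Question: {question}\nCaption: {caption}\n"
        "Answer concisely using this figure or table.",
        [path],
    )
    return {"selected": selected, "answer": answer}
```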

LLMLogScore (L3Score): A Novel Evaluation Metric

The paper introduces LLMLogScore (L3Score), a new LLM-based evaluation metric for free-form QA. L3Score uses the log probabilities that a judge LLM assigns to the tokens 'yes' and 'no' when asked whether a candidate answer is semantically equivalent to the ground truth. This addresses the limitations of traditional metrics such as BLEU and ROUGE, which often fail to capture the nuances of descriptive answers generated by LLMs. L3Score is calculated as:

$$\text{L3Score} = \mathrm{softmax}(x)_{\text{yes}} = \frac{\exp(l_{\text{yes}})}{\exp(l_{\text{yes}}) + \exp(l_{\text{no}})}$$

where $l_{\text{yes}}$ and $l_{\text{no}}$ are the log probabilities the judge LLM assigns to the tokens 'yes' and 'no', respectively. The metric renormalizes these two probabilities to sum to 1, providing a confidence-based assessment of answer equivalence.
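A minimal sketch of this renormalization is below. It assumes the two log probabilities have already been obtained from a judge LLM (e.g., via an API that exposes token log probabilities when the judge is asked whether the candidate matches the reference); that querying step is not shown.

```python
import math

def l3score(logprob_yes: float, logprob_no: float) -> float:
    """Renormalize the judge LLM's log probabilities for 'yes' and 'no' into a score in [0, 1]."""
    # Subtract the max before exponentiating for numerical stability.
    m = max(logprob_yes, logprob_no)
    p_yes = math.exp(logprob_yes - m)
    p_no = math.exp(logprob_no - m)
    return p_yes / (p_yes + p_no)

# Example: the judge is fairly confident the candidate matches the reference.
print(l3score(logprob_yes=-0.2, logprob_no=-2.1))  # ~0.87
```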

Experimental Evaluation and Results

Extensive experiments were conducted using 12 foundational models, including closed-source models such as Gemini, GPT-4, and Claude-3, as well as open-source models like LLaVA 1.5, InstructBLIP, and CogVLM. The models were evaluated across the three SPIQA tasks using metrics such as METEOR, ROUGE-L, CIDEr, BERTScore F1, and L3Score. Fine-tuning InstructBLIP and LLaVA 1.5 on the SPIQA training set resulted in significant performance improvements compared to zero-shot evaluation. The experiments highlighted the importance of captions for figure and table comprehension (Figure 3).

Figure 3: Results on test-A, demonstrating the performance of various models in answering questions directly from figures and tables.

GPT-4o achieved state-of-the-art results on test-A and test-C, while Claude-3 performed well on test-B. The use of full paper text and CoT strategies generally improved model performance. The effectiveness of L3Score was demonstrated through its ability to accurately evaluate free-form answers, outperforming other LLM-based scores.
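To make the gap between overlap-based metrics and judge-based scoring concrete, the toy comparison below scores a semantically equivalent paraphrase with ROUGE-L and METEOR via the Hugging Face evaluate library. The example sentences are illustrative and not drawn from SPIQA.

```python
# pip install evaluate rouge_score nltk
import evaluate

rouge = evaluate.load("rouge")
meteor = evaluate.load("meteor")

# Toy example: a semantically correct paraphrase of a reference answer.
predictions = ["Accuracy drops sharply once the context exceeds 8K tokens."]
references = ["Performance degrades significantly beyond an 8K-token context."]

print(rouge.compute(predictions=predictions, references=references)["rougeL"])
print(meteor.compute(predictions=predictions, references=references)["meteor"])
# Overlap-based scores stay low despite the answers being equivalent in meaning,
# which is the gap a judge-based score such as L3Score is meant to close.
```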

Implications and Future Directions

The SPIQA dataset and the L3Score metric provide valuable resources for advancing research in multimodal QA and long-context understanding. The dataset's focus on scientific papers and complex figures and tables presents a unique challenge for MLLMs, driving the development of more sophisticated reasoning and grounding capabilities. The introduction of L3Score offers an improved method for evaluating free-form QA, addressing the limitations of traditional metrics. Future research directions include expanding SPIQA to other scientific domains, exploring different CoT strategies, and developing specialized systems for scientific QA.

Conclusion

The paper presents SPIQA, a novel dataset for multimodal question answering on scientific papers, along with the L3Score metric for evaluating free-form answers. The extensive experiments and analyses provide insights into the capabilities of current MLLMs and highlight areas for future research. The SPIQA dataset is expected to contribute to the development of more advanced QA systems that can effectively understand and analyze scientific documents.