SPIQA: A Dataset for Multimodal Question Answering on Scientific Papers

(arXiv:2407.09413)
Published Jul 12, 2024 in cs.CL, cs.AI, and cs.CV

Abstract

Seeking answers to questions within long scientific research articles is a crucial area of study that aids readers in quickly addressing their inquiries. However, existing question-answering (QA) datasets based on scientific papers are limited in scale and focus solely on textual content. To address this limitation, we introduce SPIQA (Scientific Paper Image Question Answering), the first large-scale QA dataset specifically designed to interpret complex figures and tables within the context of scientific research articles across various domains of computer science. Leveraging the breadth of expertise and ability of multimodal LLMs (MLLMs) to understand figures, we employ automatic and manual curation to create the dataset. We craft an information-seeking task involving multiple images that cover a wide variety of plots, charts, tables, schematic diagrams, and result visualizations. SPIQA comprises 270K questions divided into training, validation, and three different evaluation splits. Through extensive experiments with 12 prominent foundational models, we evaluate the ability of current multimodal systems to comprehend the nuanced aspects of research articles. Additionally, we propose a Chain-of-Thought (CoT) evaluation strategy with in-context retrieval that allows fine-grained, step-by-step assessment and improves model performance. We further explore the upper bounds of performance enhancement with additional textual information, highlighting its promising potential for future research and the dataset's impact on revolutionizing how we interact with scientific literature.

SPIQA tasks assess multimodal LLMs' ability to integrate and understand information from research paper figures, tables, and text.

Overview

  • The SPIQA dataset is a novel resource designed to evaluate and improve multimodal LLMs (MLLMs) by focusing on their ability to interpret figures and tables within computer science research papers, rather than relying on text-only analysis.

  • The dataset comprises over 25,000 papers, more than 150,000 figures, and 117,707 tables, collected from top-tier computer science conferences between 2018 and 2023. It includes a total of 270,194 question-answer pairs generated using the Gemini 1.5 Pro LLM and filtered for relevance and accuracy.

  • Comprehensive experiments on 12 foundational models revealed that closed-source models currently outperform open-source counterparts. However, significant improvements were seen with fine-tuned open-source models, highlighting the dataset's potential to enhance multimodal QA systems in scientific research.

SPIQA: A Dataset for Multimodal Question Answering on Scientific Papers

The increasing complexity and volume of scientific literature necessitate advanced tools to aid researchers in efficiently extracting pertinent information. Addressing this need, the paper "SPIQA: A Dataset for Multimodal Question Answering on Scientific Papers" introduces a novel dataset designed to evaluate and enhance the capabilities of multimodal LLMs (MLLMs) in understanding and interpreting figures and tables within computer science research articles. This strategic departure from text-only analysis leverages the rich, multidimensional data embedded in graphical elements which are crucial for a comprehensive understanding of research contributions.

Dataset Composition and Collection

SPIQA is built from a large-scale corpus of 25,859 papers published between 2018 and 2023 at 19 top-tier computer science conferences, spanning subfields such as AI/ML, NLP, and computer vision. The corpus contains 152,487 figures and 117,707 tables, categorized into schematics, plots and charts, visualizations, and other figures. The dataset is constructed with a combination of automatic and human curation, maintaining high quality while avoiding much of the labor typically required for fully manual annotation.
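For a concrete sense of what a SPIQA-style record might contain, the sketch below reads a hypothetical metadata file; the file name and the field names (question, answer, reference) are illustrative assumptions rather than the official schema of the released dataset.

```python
# Minimal sketch of inspecting SPIQA-style QA records, assuming a JSON metadata
# file keyed by paper ID. The file name and all field names are assumptions;
# consult the official SPIQA release for the real layout.
import json

with open("spiqa_train.json") as f:        # hypothetical metadata file
    papers = json.load(f)                  # {paper_id: {"qa": [...], ...}}

for paper_id, paper in list(papers.items())[:1]:
    for qa in paper.get("qa", []):
        print(paper_id, "|", qa.get("question"), "->", qa.get("answer"))
        print("  grounded in:", qa.get("reference"))   # e.g. a figure or table key
```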

Question Generation and Filtering

The dataset encompasses 270,194 question-answer pairs, derived through a cutting-edge LLM (Gemini 1.5 Pro), which was empirically validated as the most proficient model for this task during a pilot study. The questions are designed to foster a holistic understanding of the provided figures and tables within the context of the full research paper. Additionally, a robust filtering process was employed to refine the dataset, ensuring the relevance and correctness of the generated questions and answers. Further reliability is added through the inclusion of manually vetted questions for two test splits derived from existing datasets, QASA and QASPER, adapted to emphasize multimodal comprehension.
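The automatic curation step can be pictured roughly as follows. This is a minimal sketch, not the authors' pipeline: the call_llm helper stands in for whatever multimodal LLM API is used (the paper relies on Gemini 1.5 Pro), and the prompt wording and grounding check are assumptions.

```python
# Illustrative sketch of LLM-driven QA generation and filtering for one figure.
# call_llm is a placeholder for a multimodal LLM API; prompts are assumptions.
def generate_qa(figure_image, caption, paper_text, call_llm):
    prompt = (
        "You are given a figure from a research paper, its caption, and an excerpt "
        "of the paper. Write one question that requires understanding the figure, "
        "followed by a concise answer.\n"
        f"Caption: {caption}\n"
        f"Paper excerpt: {paper_text[:4000]}"
    )
    return call_llm(prompt, images=[figure_image])  # expected to return (question, answer)

def keep_pair(question, answer, caption, call_llm):
    # Simple grounding check, standing in for the paper's filtering process:
    # ask the LLM whether the answer is correct and supported by the figure.
    verdict = call_llm(
        f"Question: {question}\nAnswer: {answer}\nCaption: {caption}\n"
        "Is the answer correct and grounded in the figure? Reply Yes or No."
    )
    return verdict.strip().lower().startswith("yes")
```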

Evaluation and Results

The dataset is used to benchmark both closed-weight and open-weight LLMs on three task setups (a prompt-assembly sketch for each follows the list):

  1. Direct QA with Figures and Tables: The models must generate accurate answers using the figures and tables provided.
  2. Direct QA with Full Paper: This task assesses the models' ability to handle long-context inputs by providing the full text along with figures and tables.
  3. Chain-of-Thought (CoT) QA: Models must first identify the relevant figures or tables before answering the question, evaluating step-by-step reasoning capabilities.
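To make the setups concrete, the sketch below shows how prompts might be assembled from a question, the paper text, and figure/table captions; the wording and structure are illustrative assumptions, not the paper's exact prompts.

```python
# Rough sketch of prompt assembly for the three SPIQA task setups.
# Prompt wording is illustrative; it is not taken from the paper.
def direct_qa_prompt(question, captions):
    figs = "\n".join(f"[Figure {i}] {c}" for i, c in enumerate(captions, 1))
    return f"{figs}\n\nQuestion: {question}\nAnswer concisely."

def full_paper_prompt(question, paper_text, captions):
    figs = "\n".join(f"[Figure {i}] {c}" for i, c in enumerate(captions, 1))
    return f"Paper:\n{paper_text}\n\n{figs}\n\nQuestion: {question}\nAnswer concisely."

def cot_qa_prompts(question, captions):
    figs = "\n".join(f"[Figure {i}] {c}" for i, c in enumerate(captions, 1))
    step1 = (f"{figs}\n\nQuestion: {question}\n"
             "First, state which figure or table is needed to answer.")
    step2 = "Now answer the question using only the figure or table you selected."
    return step1, step2
```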

Performance evaluation combines traditional QA metrics (METEOR, ROUGE-L, CIDEr, BERTScore) with a novel metric, LLMLogScore (L3Score). L3Score asks an LLM judge whether a candidate answer is semantically equivalent to the reference and uses the log-probabilities of that verdict to produce a confidence-weighted score, mitigating the surface-matching limitations of traditional n-gram metrics.
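A minimal sketch of an L3Score-style computation is shown below, assuming the judge LLM is asked for a Yes/No equivalence verdict and exposes token log-probabilities; the exact prompt and normalization used in the paper may differ.

```python
# Sketch: convert a judge LLM's 'Yes'/'No' token log-probabilities into a
# confidence-weighted equivalence score in [0, 1]. Details are assumptions;
# see the paper for the official L3Score formulation.
import math

def l3_style_score(logprob_yes, logprob_no):
    if logprob_yes is None and logprob_no is None:
        return 0.0                           # judge produced neither verdict token
    if logprob_no is None:
        return math.exp(logprob_yes)         # only 'Yes' observed
    if logprob_yes is None:
        return 1.0 - math.exp(logprob_no)    # only 'No' observed
    p_yes, p_no = math.exp(logprob_yes), math.exp(logprob_no)
    return p_yes / (p_yes + p_no)            # renormalize over the two verdicts
```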

Findings and Implications

The evaluation features comprehensive experiments across 12 prominent foundational models. Notably, closed-source models like GPT-4o and Claude-3 delivered superior performance compared to open-source counterparts, highlighting the current edge of proprietary systems. However, open-source models fine-tuned on SPIQA (InstructBLIP-7B and LLaVA-1.5-7B) showed significant improvements, underscoring the dataset’s potential for developing dedicated scientific QA systems.

Theoretical and Practical Implications

The introduction of SPIQA represents a pivotal step towards the development of sophisticated multimodal QA systems capable of nuanced comprehension and reasoning over scientific documents. The insights gleaned from this study illustrate the potential for enhancing automated literature review processes, reducing the time researchers spend extracting relevant information from expansive documents, and promoting a more effective assimilation of scientific knowledge.

Future Directions

Future work can extend SPIQA to scientific domains beyond computer science, improving the generalizability of QA systems. Integrating techniques that better parse and reason over the semantic content of tables and graphs, especially in complex layouts, remains an important research direction. The continued enhancement and expansion of datasets like SPIQA will be instrumental in driving the next generation of intelligent research tools.
