
SCITUNE: Aligning Large Language Models with Scientific Multimodal Instructions (2307.01139v1)

Published 3 Jul 2023 in cs.CV, cs.AI, cs.CL, and cs.LG

Abstract: Instruction finetuning is a popular paradigm to align large language models (LLMs) with human intent. Despite its popularity, this idea is less explored in improving the LLMs to align existing foundation models with scientific disciplines, concepts and goals. In this work, we present SciTune as a tuning framework to improve the ability of LLMs to follow scientific multimodal instructions. To test our methodology, we use a human-generated scientific instruction tuning dataset and train a large multimodal model LLaMA-SciTune that connects a vision encoder and LLM for science-focused visual and language understanding. In comparison to the models that are finetuned with machine generated data only, LLaMA-SciTune surpasses human performance on average and in many sub-categories on the ScienceQA benchmark.


Summary

  • The paper introduces SciTune, a tuning framework that aligns LLMs with scientific multimodal instructions for science-focused visual and language understanding.
  • It employs scientific concept alignment and instruction tuning to integrate visual and textual scientific data, improving performance in tasks like figure captioning and classification.
  • The approach outperforms state-of-the-art models in benchmarks such as ScienceQA, demonstrating its potential to advance AI support in scientific research.

Overview of SCITUNE: Aligning LLMs with Scientific Multimodal Instructions

The paper presents a novel framework, SciTune, which enhances the alignment of LLMs with scientific multimodal instructions. The authors address the gap in aligning existing foundation models with scientific disciplines and aim to improve the ability of LLMs to process and understand multimodal scientific data.

Introduction

Instruction finetuning has been explored as a way to align LLMs with human preferences, but its application in scientific contexts remains limited. SciTune builds on this idea by employing a human-generated scientific instruction dataset to develop LLMs capable of understanding and processing scientific multimodal inputs. The result is LLaMA-SciTune, a model that connects a vision encoder with an LLM to enhance visual and language comprehension in scientific domains.
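
As a rough illustration of what "connecting a vision encoder with an LLM" can look like, the sketch below wires a frozen CLIP-style vision tower into a LLaMA-style decoder through a single linear projection, in the spirit of LLaVA-like architectures. The module names, feature dimensions, and forward signature are assumptions for exposition, not the authors' released implementation.

```python
# Minimal sketch of connecting a vision encoder to an LLM via a linear
# projection; names and dimensions are illustrative only.
import torch
import torch.nn as nn

class VisionLanguageConnector(nn.Module):
    def __init__(self, vision_encoder, language_model,
                 vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.vision_encoder = vision_encoder   # e.g. a CLIP-style ViT, kept frozen
        self.language_model = language_model   # e.g. a LLaMA-style decoder
        # One linear layer maps patch features into the LLM embedding space.
        self.projector = nn.Linear(vision_dim, llm_dim)

    def forward(self, pixel_values, text_embeds):
        # Assumes the vision tower returns patch features of shape (B, N, vision_dim).
        with torch.no_grad():                  # vision tower stays frozen
            patch_feats = self.vision_encoder(pixel_values)
        visual_tokens = self.projector(patch_feats)           # (B, N, llm_dim)
        # Prepend projected visual tokens to the text token embeddings and let
        # the decoder attend over both (HF-style `inputs_embeds` call assumed).
        fused = torch.cat([visual_tokens, text_embeds], dim=1)
        return self.language_model(inputs_embeds=fused)
```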

Methodology

The SciTune framework consists of two main stages:

  1. Scientific Concept Alignment: This stage focuses on learning from various scientific visual inputs like plots, charts, and diagrams, along with textual information including captions and OCR data.
  2. Scientific Instruction Tuning: This involves finetuning on a multimodal scientific reasoning task to ensure the generation of content aligned with scientific standards.

The framework is validated with LLaMA and LLaVA models, yielding LLaMA-SciTune, which surpasses human performance on average and in many sub-categories of the ScienceQA benchmark.
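
To make the two-stage recipe above concrete, the outline below sketches one common way such a schedule is implemented: a first pass that updates only the vision-to-language projector on figure and caption alignment data, then a second pass that also finetunes the language model on multimodal instructions. The freezing policy, data loaders, and hyperparameters are assumptions for illustration, not the paper's exact training recipe.

```python
# Illustrative two-stage schedule in the spirit of SciTune; freezing choices,
# loaders, and hyperparameters are assumptions, not the authors' recipe.
import torch

def run_stage(model, dataloader, trainable_modules, epochs, lr):
    """Train only the listed sub-modules; everything else stays frozen."""
    for p in model.parameters():
        p.requires_grad = False
    params = []
    for module in trainable_modules:
        for p in module.parameters():
            p.requires_grad = True
            params.append(p)
    optimizer = torch.optim.AdamW(params, lr=lr)
    for _ in range(epochs):
        for batch in dataloader:
            loss = model(**batch).loss        # assumes an HF-style loss output
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()

# Stage 1: scientific concept alignment on figures, captions, and OCR text;
# only the projector is updated.
# run_stage(model, figure_caption_loader, [model.projector], epochs=1, lr=2e-3)

# Stage 2: scientific instruction tuning on multimodal reasoning instructions
# (e.g. ScienceQA-style QA); the LLM is finetuned together with the projector.
# run_stage(model, instruction_loader,
#           [model.projector, model.language_model], epochs=3, lr=2e-5)
```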

Results and Evaluation

LLaMA-SciTune demonstrates strong visual understanding, outperforming state-of-the-art models such as CLIP in zero-shot figure type classification and BLIP in scientific figure captioning. Notably, on the ScienceQA benchmark it exceeds average human performance, highlighting its capability in scientific multimodal reasoning. Performance improves further when the model is trained with additional scientific modalities, illustrating the impact of enriched multimodal data.
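
For context on how such benchmark numbers are typically obtained, the snippet below shows a generic multiple-choice scorer that prompts a model with a question and lettered options and extracts the first answer letter from its generation. The prompt template and answer-parsing rule are assumptions for illustration, not the paper's exact evaluation protocol.

```python
import re

def multiple_choice_accuracy(generate_fn, examples):
    """Generic scorer for ScienceQA-style items.

    `generate_fn(prompt, image)` is assumed to return free-form text; each
    example carries `question`, `choices`, `image`, and a gold `answer` index.
    """
    correct = 0
    for ex in examples:
        options = "\n".join(
            f"({chr(65 + i)}) {choice}" for i, choice in enumerate(ex["choices"])
        )
        prompt = f"Question: {ex['question']}\nOptions:\n{options}\nAnswer:"
        output = generate_fn(prompt, ex["image"])
        match = re.search(r"\b([A-E])\b", output.upper())
        predicted = ord(match.group(1)) - ord("A") if match else -1
        correct += int(predicted == ex["answer"])
    return correct / len(examples)
```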

Implications and Future Directions

The implications of this research are twofold:

  • Practical Applications: The SciTune framework facilitates the development of AI systems that can better support scientific endeavors, including enhancing the accuracy of scientific data interpretation and synthesis in research settings.
  • Theoretical Progress: The work emphasizes the need for specialized instruction tuning datasets that reflect the complexity and diversity of scientific knowledge, paving the way for future studies on domain-specific model alignment.

In the broader context of AI advancements, SciTune represents a significant step towards domain-specific enhancements of LLMs. Future research may explore scaling these methods to other scientific domains and improving instruction datasets to further refine model performance in diverse scientific applications. The potential for combining SciTune with newer LLM architectures could also lead to more robust and effective AI systems tailored for scientific research and inquiry.