
SCITUNE: Aligning Large Language Models with Scientific Multimodal Instructions (2307.01139v1)

Published 3 Jul 2023 in cs.CV, cs.AI, cs.CL, and cs.LG

Abstract: Instruction finetuning is a popular paradigm to align large language models (LLMs) with human intent. Despite its popularity, this idea is less explored in improving the LLMs to align existing foundation models with scientific disciplines, concepts and goals. In this work, we present SciTune as a tuning framework to improve the ability of LLMs to follow scientific multimodal instructions. To test our methodology, we use a human-generated scientific instruction tuning dataset and train a large multimodal model LLaMA-SciTune that connects a vision encoder and LLM for science-focused visual and language understanding. In comparison to the models that are finetuned with machine generated data only, LLaMA-SciTune surpasses human performance on average and in many sub-categories on the ScienceQA benchmark.


Summary

  • The paper introduces SciTune, a tuning framework that aligns LLMs with scientific multimodal instructions for science-focused visual and language understanding.
  • It employs scientific concept alignment and instruction tuning to integrate visual and textual scientific data, improving performance in tasks like figure captioning and classification.
  • The approach outperforms state-of-the-art models in benchmarks such as ScienceQA, demonstrating its potential to advance AI support in scientific research.

Overview of SCITUNE: Aligning LLMs with Scientific Multimodal Instructions

The paper presents a novel framework, SciTune, which enhances the alignment of LLMs with scientific multimodal instructions. The authors address the gap in aligning existing foundation models with scientific disciplines and aim to improve the ability of LLMs to process and understand multimodal scientific data.

Introduction

Instruction finetuning has been explored as a way to align LLMs with human preferences, but its application in scientific contexts remains limited. SciTune builds on this idea by employing a human-generated scientific instruction dataset to develop LLMs capable of understanding and processing scientific multimodal inputs. The result is LLaMA-SciTune, a model that connects a vision encoder with an LLM to enhance visual and language comprehension in scientific domains.
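
As a rough illustration of what "connecting a vision encoder with an LLM" can look like, the sketch below wires a frozen CLIP-style vision tower into a LLaMA-style decoder through a single linear projection, in the spirit of LLaVA-like architectures. The module names, feature dimensions, and forward signature are assumptions for exposition, not the authors' released implementation.

```python
# Minimal sketch of connecting a vision encoder to an LLM via a linear
# projection; names and dimensions are illustrative only.
import torch
import torch.nn as nn

class VisionLanguageConnector(nn.Module):
    def __init__(self, vision_encoder, language_model,
                 vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.vision_encoder = vision_encoder   # e.g. a CLIP-style ViT, kept frozen
        self.language_model = language_model   # e.g. a LLaMA-style decoder
        # One linear layer maps patch features into the LLM embedding space.
        self.projector = nn.Linear(vision_dim, llm_dim)

    def forward(self, pixel_values, text_embeds):
        # Assumes the vision tower returns patch features of shape (B, N, vision_dim).
        with torch.no_grad():                  # vision tower stays frozen
            patch_feats = self.vision_encoder(pixel_values)
        visual_tokens = self.projector(patch_feats)           # (B, N, llm_dim)
        # Prepend projected visual tokens to the text token embeddings and let
        # the decoder attend over both (HF-style `inputs_embeds` call assumed).
        fused = torch.cat([visual_tokens, text_embeds], dim=1)
        return self.language_model(inputs_embeds=fused)
```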

Methodology

The SciTune framework consists of two main stages:

  1. Scientific Concept Alignment: This stage focuses on learning from various scientific visual inputs like plots, charts, and diagrams, along with textual information including captions and OCR data.
  2. Scientific Instruction Tuning: This involves finetuning on a multimodal scientific reasoning task to ensure the generation of content aligned with scientific standards.

The framework is validated with LLaMA and LLaVA models, yielding LLaMA-SciTune, which surpasses human performance on average and in many sub-categories of the ScienceQA benchmark.
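
To make the two-stage recipe above concrete, the outline below sketches one common way such a schedule is implemented: a first pass that updates only the vision-to-language projector on figure and caption alignment data, then a second pass that also finetunes the language model on multimodal instructions. The freezing policy, data loaders, and hyperparameters are assumptions for illustration, not the paper's exact training recipe.

```python
# Illustrative two-stage schedule in the spirit of SciTune; freezing choices,
# loaders, and hyperparameters are assumptions, not the authors' recipe.
import torch

def run_stage(model, dataloader, trainable_modules, epochs, lr):
    """Train only the listed sub-modules; everything else stays frozen."""
    for p in model.parameters():
        p.requires_grad = False
    params = []
    for module in trainable_modules:
        for p in module.parameters():
            p.requires_grad = True
            params.append(p)
    optimizer = torch.optim.AdamW(params, lr=lr)
    for _ in range(epochs):
        for batch in dataloader:
            loss = model(**batch).loss        # assumes an HF-style loss output
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()

# Stage 1: scientific concept alignment on figures, captions, and OCR text;
# only the projector is updated.
# run_stage(model, figure_caption_loader, [model.projector], epochs=1, lr=2e-3)

# Stage 2: scientific instruction tuning on multimodal reasoning instructions
# (e.g. ScienceQA-style QA); the LLM is finetuned together with the projector.
# run_stage(model, instruction_loader,
#           [model.projector, model.language_model], epochs=3, lr=2e-5)
```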

Results and Evaluation

LLaMA-SciTune demonstrates strong visual understanding, outperforming state-of-the-art models such as CLIP in zero-shot figure type classification and BLIP in scientific figure captioning. Notably, on the ScienceQA benchmark it exceeds average human performance, highlighting its capability in scientific multimodal reasoning. Performance improves further when the model is trained with additional scientific modalities, illustrating the impact of enriched multimodal data.
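
For context on how such benchmark numbers are typically obtained, the snippet below shows a generic multiple-choice scorer that prompts a model with a question and lettered options and extracts the first answer letter from its generation. The prompt template and answer-parsing rule are assumptions for illustration, not the paper's exact evaluation protocol.

```python
import re

def multiple_choice_accuracy(generate_fn, examples):
    """Generic scorer for ScienceQA-style items.

    `generate_fn(prompt, image)` is assumed to return free-form text; each
    example carries `question`, `choices`, `image`, and a gold `answer` index.
    """
    correct = 0
    for ex in examples:
        options = "\n".join(
            f"({chr(65 + i)}) {choice}" for i, choice in enumerate(ex["choices"])
        )
        prompt = f"Question: {ex['question']}\nOptions:\n{options}\nAnswer:"
        output = generate_fn(prompt, ex["image"])
        match = re.search(r"\b([A-E])\b", output.upper())
        predicted = ord(match.group(1)) - ord("A") if match else -1
        correct += int(predicted == ex["answer"])
    return correct / len(examples)
```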

Implications and Future Directions

The implications of this research are twofold:

  • Practical Applications: The SciTune framework facilitates the development of AI systems that can better support scientific endeavors, including enhancing the accuracy of scientific data interpretation and synthesis in research settings.
  • Theoretical Progress: The work emphasizes the need for specialized instruction tuning datasets that reflect the complexity and diversity of scientific knowledge, paving the way for future studies on domain-specific model alignment.

In the broader context of AI advancements, SciTune represents a significant step towards domain-specific enhancements of LLMs. Future research may explore scaling these methods to other scientific domains and improving instruction datasets to further refine model performance in diverse scientific applications. The potential for combining SciTune with newer LLM architectures could also lead to more robust and effective AI systems tailored for scientific research and inquiry.