SCITUNE: Aligning Large Language Models with Scientific Multimodal Instructions (2307.01139v1)
Abstract: Instruction finetuning is a popular paradigm for aligning large language models (LLMs) with human intent. Despite its popularity, this idea is less explored for aligning existing foundation models with scientific disciplines, concepts, and goals. In this work, we present SciTune, a tuning framework that improves the ability of LLMs to follow scientific multimodal instructions. To test our methodology, we use a human-generated scientific instruction tuning dataset and train LLaMA-SciTune, a large multimodal model that connects a vision encoder and an LLM for science-focused visual and language understanding. In comparison to models finetuned with machine-generated data only, LLaMA-SciTune surpasses human performance on average and in many sub-categories on the ScienceQA benchmark.
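The abstract's core architectural idea is a vision encoder whose features are projected into the LLM's token space so that scientific instructions can reference figures. Below is a minimal sketch of that wiring, assuming a LLaVA-style linear projector between a CLIP-like vision backbone and a LLaMA-style decoder; the class name, dimensions, and single-layer projector are illustrative assumptions, not the paper's exact implementation.

```python
# Illustrative sketch only: connects a frozen vision encoder to an LLM via a
# learned projection, LLaVA-style. Dimensions assume CLIP ViT-L/14 patch
# features (1024-d) and LLaMA-7B embeddings (4096-d); these are assumptions,
# not details taken from the SciTune paper.
import torch
import torch.nn as nn

class VisionLanguageConnector(nn.Module):
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        # Maps patch-level vision features into the LLM's embedding space.
        self.projector = nn.Linear(vision_dim, llm_dim)

    def forward(self, vision_feats: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        # vision_feats: (batch, num_patches, vision_dim) from the vision encoder
        # text_embeds:  (batch, seq_len, llm_dim) token embeddings of the instruction
        vision_tokens = self.projector(vision_feats)
        # Prepend projected visual tokens to the text tokens; the combined
        # sequence is fed to the decoder-only LLM during instruction finetuning.
        return torch.cat([vision_tokens, text_embeds], dim=1)

# Shape check with random tensors standing in for real encoder/LLM outputs.
connector = VisionLanguageConnector()
fused = connector(torch.randn(2, 256, 1024), torch.randn(2, 32, 4096))
print(fused.shape)  # torch.Size([2, 288, 4096])
```

With this wiring, the fused sequence is trained with the standard next-token objective on multimodal instruction data, which is the general recipe the abstract describes.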
- Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862.
- Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073.
- Sparks of artificial general intelligence: Early experiments with GPT-4. arXiv preprint arXiv:2303.12712.
- End-to-end object detection with transformers. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part I 16, pages 213–229. Springer.
- Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416.
- Yao Fu. 2023. Evaluation scripts for MMLU. https://github.com/FranxYao/chain-of-thought-hub/tree/main/MMLU.
- LLaMA-Adapter V2: Parameter-efficient visual instruction model.
- Improving alignment of dialogue agents via targeted human judgements. arXiv preprint arXiv:2209.14375.
- The false promise of imitating proprietary LLMs.
- Improving zero- and few-shot generalization in dialogue through instruction tuning. arXiv preprint arXiv:2205.12673.
- Measuring massive multitask language understanding. Proceedings of the International Conference on Learning Representations (ICLR).
- Unnatural instructions: Tuning language models with (almost) no human labor. arXiv preprint arXiv:2212.09689.
- Foundation models of scientific knowledge for chemistry: Opportunities, challenges and lessons learned. In Proceedings of BigScience Episode #5 – Workshop on Challenges & Perspectives in Creating Large Language Models, pages 160–172.
- SciCap: Generating captions for scientific figures. arXiv preprint arXiv:2110.11624.
- Summaries as captions: Generating figure captions for scientific documents with automated text summarization. arXiv preprint arXiv:2302.12324.
- OPT-IML: Scaling language model instruction meta learning through the lens of generalization. arXiv preprint arXiv:2212.12017.
- LLaVA-Med: Training a large language-and-vision assistant for biomedicine in one day. arXiv preprint arXiv:2306.00890.
- BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, pages 12888–12900. PMLR.
- Visual instruction tuning. arXiv preprint arXiv:2304.08485.
- Learn to explain: Multimodal reasoning via thought chains for science question answering. Advances in Neural Information Processing Systems, 35:2507–2521.
- Crosslingual generalization through multitask finetuning. arXiv preprint arXiv:2211.01786.
- WebGPT: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332.
- Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744.
- Instruction tuning with GPT-4. arXiv preprint arXiv:2304.03277.
- Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR.
- CoEdIT: Text editing by task-specific instruction tuning. arXiv preprint arXiv:2305.09857.
- Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2556–2565.
- Galactica: A large language model for science. arXiv preprint arXiv:2211.09085.
- LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
- How far can camels go? Exploring the state of instruction tuning on open resources.
- Self-Instruct: Aligning language models with self-generated instructions. arXiv preprint arXiv:2212.10560.
- Guess the instruction! Making language models stronger zero-shot learners. arXiv preprint arXiv:2210.02969.
- LLaMA-Adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:2303.16199.
- Multimodal chain-of-thought reasoning in language models. arXiv preprint arXiv:2302.00923.
- MiniGPT-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592.