InstructionGPT-4: A 200-Instruction Paradigm for Fine-Tuning MiniGPT-4 (2308.12067v2)
Abstract: Multimodal LLMs are typically trained in two stages: first pre-training on image-text pairs, and then fine-tuning using supervised vision-language instruction data. Recent studies have shown that LLMs can achieve satisfactory results even with a limited amount of high-quality instruction-following data. In this paper, we introduce InstructionGPT-4, which is fine-tuned on a small dataset comprising only 200 examples, amounting to approximately 6% of the instruction-following data used in the alignment dataset for MiniGPT-4. To achieve this, we first propose several metrics to assess the quality of multimodal instruction data. Based on these metrics, we present an effective and trainable data selector to automatically identify and filter low-quality vision-language data. With this method, InstructionGPT-4 outperforms the original MiniGPT-4 on various evaluations. Overall, our findings demonstrate that a smaller amount of high-quality instruction tuning data is sufficient to enable multimodal LLMs to generate better output. Our code is available at https://github.com/waltonfuture/InstructionGPT-4.
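The abstract describes the approach only at a high level: score each vision-language instruction example with several quality metrics, then use a data selector to keep a small, high-quality subset (about 200 examples, roughly 6% of MiniGPT-4's alignment data). The sketch below illustrates one way such score-then-select filtering could look; the clustering step, the metric names, and the toy sizes are assumptions for illustration, not the paper's actual trainable selector.

```python
# A minimal sketch of score-then-select data filtering, assuming the selector
# combines per-example quality scores (e.g., CLIP image-text similarity, a
# reward-model score) with clustering for diversity. Names and the toy data
# below are hypothetical; the paper's trainable selector may differ.
import numpy as np
from sklearn.cluster import KMeans

def select_subset(embeddings, quality_scores, budget=200, n_clusters=10, seed=0):
    """Pick `budget` examples: cluster for coverage, then keep the
    highest-scoring examples within each cluster."""
    labels = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit_predict(embeddings)
    per_cluster = budget // n_clusters
    chosen = []
    for c in range(n_clusters):
        idx = np.where(labels == c)[0]
        # Rank this cluster's examples by descending quality score.
        chosen.extend(idx[np.argsort(-quality_scores[idx])][:per_cluster].tolist())
    # Top up with the globally best remaining examples if rounding left a gap.
    for i in np.argsort(-quality_scores):
        if len(chosen) >= budget:
            break
        if i not in chosen:
            chosen.append(int(i))
    return np.array(chosen[:budget])

# Toy usage: random stand-ins for instruction embeddings and quality scores.
rng = np.random.default_rng(0)
emb = rng.normal(size=(3500, 512))    # stand-in for the alignment examples
scores = rng.uniform(size=3500)       # stand-in for combined quality metrics
subset = select_subset(emb, scores, budget=200)  # 200 examples ~ 6% of 3.5k
print(subset.shape)                   # (200,)
```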
- MiniGPT-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.
- OpenAI. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
- Visual instruction tuning. arXiv preprint arXiv:2304.08485, 2023a.
- LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a.
- Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality. https://lmsys.org/blog/2023-03-30-vicuna/, 2023.
- Learning transferable visual models from natural language supervision. In ICML, 2021.
- BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023a.
- LLaMA-Adapter V2: Parameter-efficient visual instruction model. arXiv preprint arXiv:2304.15010, 2023.
- InstructBLIP: Towards general-purpose vision-language models with instruction tuning. arXiv preprint arXiv:2305.06500, 2023.
- SVIT: Scaling up visual instruction tuning. arXiv preprint arXiv:2307.04087, 2023.
- LLaVAR: Enhanced visual instruction tuning for text-rich image understanding. arXiv preprint arXiv:2306.17107, 2023a.
- Aligning large multi-modal model with robust instruction tuning. arXiv preprint arXiv:2306.14565, 2023b.
- Otter: A multi-modal model with in-context instruction tuning. arXiv preprint arXiv:2305.03726, 2023b.
- LIMA: Less is more for alignment. arXiv preprint arXiv:2305.11206, 2023.
- AlpaGasus: Training a better Alpaca with fewer data. arXiv preprint arXiv:2307.08701, 2023.
- Instruction mining: High-quality instruction data selection for large language models. arXiv preprint arXiv:2307.06290, 2023.
- Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.
- OpenAssistant. Reward model trained from human feedback. https://huggingface.co/OpenAssistant/reward-model-deberta-v3-large-v2, 2023.
- On spectral clustering: Analysis and an algorithm. In NeurIPS, 2001.
- MME: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023.
- MMBench: Is your multi-modal model an all-around player? arXiv preprint arXiv:2307.06281, 2023c.
- LVLM-eHub: A comprehensive evaluation benchmark for large vision-language models. arXiv preprint arXiv:2306.09265, 2023.
- Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023b.
- Yutaka Sasaki et al. The truth of the F-measure. Teach Tutor Mater, 2007.
- OpenCLIP, 2021.
- Deep learning on a data diet: Finding important examples early in training. NeurIPS, 2021.
- K-means++: The advantages of careful seeding. In SODA, 2007.
- GQA: A new dataset for real-world visual reasoning and compositional question answering. In CVPR, 2019.
- IconQA: A new benchmark for abstract diagram understanding and visual language reasoning. arXiv preprint arXiv:2110.13214, 2021.
- Learn to explain: Multimodal reasoning via thought chains for science question answering. arXiv preprint arXiv:2209.09513, 2022.
- OK-VQA: A visual question answering benchmark requiring external knowledge. In CVPR, 2019.
- DocVQA: A dataset for VQA on document images. In WACV, 2021.
- Towards VQA models that can read. In CVPR, 2019.
- ICDAR 2019 competition on scene text visual question answering. In ICDAR. IEEE, 2019.
- VizWiz: Nearly real-time answers to visual questions. In UIST, 2010.
- Large language models are not fair evaluators. arXiv preprint arXiv:2305.17926, 2023.
- Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652, 2021.
- Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155, 2022.
- LLaMA-Adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:2303.16199, 2023b.