Multi-modal Preference Alignment Remedies Degradation of Visual Instruction Tuning on Language Models (2402.10884v2)

Published 16 Feb 2024 in cs.CL, cs.AI, cs.CV, and cs.LG

Abstract: Multi-modal LLMs (MLLMs) are expected to support multi-turn queries of interchanging image and text modalities in production. However, current MLLMs trained with visual-question-answering (VQA) datasets could suffer from degradation, as VQA datasets lack the diversity and complexity of the original text instruction datasets with which the underlying LLM was trained. To address this degradation, we first collect a lightweight, 5k-sample VQA preference dataset in which answers were annotated by Gemini for five quality metrics in a granular fashion, and investigate standard Supervised Fine-tuning, rejection sampling, Direct Preference Optimization (DPO), and SteerLM algorithms. Our findings indicate that with DPO, we can surpass the instruction-following capabilities of the LLM, achieving a 6.73 score on MT-Bench, compared to Vicuna's 6.57 and LLaVA's 5.99. This enhancement in textual instruction-following capability correlates with boosted visual instruction performance (+4.9% on MM-Vet, +6% on LLaVA-Bench), with minimal alignment tax on visual knowledge benchmarks compared to the previous RLHF approach. In conclusion, we propose a distillation-based multi-modal alignment model with fine-grained annotations on a small dataset that restores and boosts the MLLM's language capability after visual instruction tuning.
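
For readers unfamiliar with the DPO objective the abstract refers to, the following is a minimal PyTorch-style sketch of the standard DPO loss, not the authors' implementation. The argument names and the beta value are illustrative assumptions; each argument is a tensor of summed token log-probabilities for the preferred (chosen) or dispreferred (rejected) response under the trainable policy or a frozen reference model.

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Implicit rewards: beta-scaled log-ratio of the policy vs. the frozen reference
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Push the chosen response's implicit reward above the rejected one's
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

Here beta = 0.1 is a commonly used default that controls how far the policy may drift from the reference model; it is not necessarily the setting used in the paper.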
