To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning (2311.07574v2)
Abstract: Existing visual instruction tuning methods typically prompt LLMs with textual descriptions to generate instruction-following data. Despite the promising performance achieved, these descriptions are derived from image annotations, which are often coarse-grained. Moreover, without access to the full visual context, the generated instructions may even contradict the visual content. To address this challenge, we introduce a fine-grained visual instruction dataset, LVIS-Instruct4V, which contains 220K visually aligned and context-aware instructions produced by prompting the powerful GPT-4V with images from LVIS. Through experimental validation and case studies, we demonstrate that high-quality visual instruction data can improve the performance of LLaVA-1.5, a state-of-the-art large multimodal model, across a wide spectrum of benchmarks by clear margins. Notably, by simply replacing LLaVA-Instruct with our LVIS-Instruct4V, we achieve better results than LLaVA on the most challenging LMM benchmarks, e.g., LLaVA$^w$ (76.7 vs. 70.7) and MM-Vet (40.2 vs. 35.4). We release our data and model at https://github.com/X2FD/LVIS-INSTRUCT4V.
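To make the pipeline concrete: each LVIS image is shown directly to GPT-4V together with a prompt requesting fine-grained, visually grounded instruction-answer pairs, so the generated data is conditioned on the actual pixels rather than on coarse textual annotations. The snippet below is a minimal sketch of one such generation call, assuming the OpenAI chat completions API with a `gpt-4-vision-preview` model name; the system prompt, requested JSON format, and helper names are illustrative assumptions, not the prompts used to build LVIS-Instruct4V.

```python
# Minimal sketch: ask GPT-4V to write visually grounded Q&A pairs for one image.
# Model name, prompt wording, and output format are assumptions for illustration;
# the exact prompts used to build LVIS-Instruct4V are not reproduced here.
import base64
import json

from openai import OpenAI  # openai>=1.0 client

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = (
    "You are shown an image. Write diverse, fine-grained question-answer pairs "
    "that are grounded in the visual content of the image. "
    'Return a JSON list of objects with "question" and "answer" fields.'
)

def generate_instructions(image_path: str, num_pairs: int = 5) -> list[dict]:
    """Request `num_pairs` visually aligned Q&A pairs for a single LVIS image."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="gpt-4-vision-preview",  # assumed GPT-4V endpoint name
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {
                "role": "user",
                "content": [
                    {"type": "text",
                     "text": f"Write {num_pairs} question-answer pairs about this image."},
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
                ],
            },
        ],
        max_tokens=1024,
    )
    # The model is asked to return JSON; a real pipeline should validate the output.
    return json.loads(response.choices[0].message.content)
```

Running a call like this over the LVIS images and collecting the returned pairs is, at a high level, how such a visually aligned instruction set can be assembled; per the abstract, the resulting data simply replaces LLaVA-Instruct in LLaVA-1.5 training.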
- Flamingo: a visual language model for few-shot learning. arXiv preprint arXiv:2204.14198, 2022.
- VQA: Visual question answering. In ICCV, 2015.
- Qwen technical report. arXiv preprint arXiv:2309.16609, 2023.
- Qwen-VL: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966, 2023.
- Language models are few-shot learners. In NeurIPS, 2020.
- Conceptual 12M: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In CVPR, 2021.
- Shikra: Unleashing multimodal LLM’s referential dialogue magic. arXiv preprint arXiv:2306.15195, 2023.
- PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022.
- InstructBLIP: Towards general-purpose vision-language models with instruction tuning. arXiv preprint arXiv:2305.06500, 2023.
- PaLM-E: An embodied multimodal language model. arXiv preprint arXiv:2303.03378, 2023.
- MME: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023.
- Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In CVPR, 2017.
- LVIS: A dataset for large vocabulary instance segmentation. In CVPR, 2019.
- VizWiz grand challenge: Answering visual questions from blind people. In CVPR, 2018.
- GQA: A new dataset for real-world visual reasoning and compositional question answering. In CVPR, 2019.
- IDEFICS. Introducing IDEFICS: An open reproduction of state-of-the-art visual language model. https://huggingface.co/blog/idefics, 2023.
- MDETR - Modulated detection for end-to-end multi-modal understanding. In CVPR, 2021.
- ReferItGame: Referring to objects in photographs of natural scenes. In EMNLP, 2014.
- Visual genome: Connecting language and vision using crowdsourced dense image annotations. IJCV, 2017.
- LISA: Reasoning segmentation via large language model. arXiv preprint arXiv:2308.00692, 2023.
- Multimodal foundation models: From specialists to general-purpose assistants. arXiv preprint arXiv:2309.10020, 2023.
- LLaVA-Med: Training a large language-and-vision assistant for biomedicine in one day. arXiv preprint arXiv:2306.00890, 2023.
- BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023.
- Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744, 2023.
- Visual instruction tuning. In NeurIPS, 2023.
- OK-VQA: A visual question answering benchmark requiring external knowledge. In CVPR, 2019.
- OCR-VQA: Visual question answering by reading text in images. In ICDAR, 2019.
- OpenAI. ChatGPT. https://openai.com/blog/chatgpt/, 2023.
- OpenAI. GPT-4 technical report, 2023.
- OpenAI. GPT-4V(ision) system card. https://cdn.openai.com/papers/GPTV_System_Card.pdf, 2023.
- Training language models to follow instructions with human feedback. In NeurIPS, 2022.
- Kosmos-2: Grounding multimodal large language models to the world. arXiv preprint arXiv:2306.14824, 2023.
- Exploring the limits of transfer learning with a unified text-to-text transformer. JMLR, 2020.
- LAION-5B: An open large-scale dataset for training next generation image-text models. In NeurIPS, 2022.
- A-OKVQA: A benchmark for visual question answering using world knowledge. In ECCV, 2022.
- ShareGPT. https://sharegpt.com/, 2023.
- Where to look: Focus regions for visual question answering. In CVPR, 2016.
- Towards VQA models that can read. In CVPR, 2019.
- LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
- Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
- Multimodal few-shot learning with frozen language models. In NeurIPS, 2021.
- Vicuna. Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality. https://vicuna.lmsys.org/, 2023.
- Self-Instruct: Aligning language models with self-generated instructions. arXiv preprint arXiv:2212.10560, 2022.
- Benchmarking generalization via in-context instructions on 1,600+ language tasks. arXiv preprint arXiv:2204.07705, 2022.
- Boosting image captioning with attributes. In ICCV, 2017.
- Image captioning with semantic attention. In CVPR, 2016.
- Modeling context in referring expressions. In ECCV, 2016.
- MM-Vet: Evaluating large multimodal models for integrated capabilities. arXiv preprint arXiv:2308.02490, 2023.
- OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022.
- MiniGPT-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.