Genixer: Empowering Multimodal Large Language Models as a Powerful Data Generator (2312.06731v6)
Abstract: Multimodal Large Language Models (MLLMs) demonstrate exceptional problem-solving capabilities, but few studies examine their ability to generate visual instruction tuning data. This paper explores empowering MLLMs to generate such data independently, without relying on GPT-4. We introduce Genixer, a comprehensive data generation pipeline consisting of four key steps: (i) instruction data collection, (ii) instruction template design, (iii) empowering MLLMs, and (iv) data generation and filtering. We also outline two modes of data generation, task-agnostic and task-specific, which enable controllable output. We demonstrate that training LLaVA1.5 with a synthetic VQA-like dataset enhances performance on 10 out of 12 multimodal benchmarks, and that training the grounding MLLM Shikra with a REC-like synthetic dataset yields improvements on 7 out of 8 REC datasets. Through experiments and analysis of the synthetic data, we find that (1) current MLLMs can serve as robust data generators without assistance from GPT-4V; (2) MLLMs trained with task-specific datasets can surpass GPT-4V in generating complex instruction tuning data; and (3) synthetic datasets enhance performance across various multimodal benchmarks and help mitigate model hallucinations. The data, code, and models are available at https://github.com/zhaohengyuan1/Genixer.
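As a rough illustration of how the four pipeline steps and the two generation modes fit together, the sketch below wires them up in Python. Every name in it (Sample, collect_instruction_data, apply_template, empower_mllm, generate_and_filter, the task argument) is a hypothetical placeholder for exposition and does not reflect the released Genixer code.

```python
# Hypothetical sketch of the four-step Genixer pipeline; names are illustrative only.
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass
class Sample:
    image: str        # path or URL of the image
    instruction: str  # prompt shown to the MLLM
    response: str     # target text, e.g. a QA pair or a referring expression with a box


def collect_instruction_data(corpora: Iterable[List[Sample]]) -> List[Sample]:
    """Step (i): pool existing instruction-tuning corpora (VQA, REC, captioning, ...)."""
    return [sample for corpus in corpora for sample in corpus]


def apply_template(sample: Sample, task: Optional[str] = None) -> Sample:
    """Step (ii): wrap a sample in a data-generation template.

    task=None corresponds to the task-agnostic mode (the model decides the task type);
    a concrete name such as "vqa" or "rec" corresponds to the task-specific mode,
    which makes the generated output controllable.
    """
    if task is None:
        prompt = "Generate an instruction-response pair for this image."
    else:
        prompt = f"Generate a {task.upper()}-style instruction-response pair for this image."
    return Sample(sample.image, prompt, f"{sample.instruction}\n{sample.response}")


def empower_mllm(base_mllm, train_set: List[Sample]):
    """Step (iii): fine-tune the base MLLM on the templated data so that it learns to
    produce instruction-tuning samples rather than merely answer them."""
    raise NotImplementedError("stand-in for the actual fine-tuning loop")


def generate_and_filter(genixer, images: Iterable[str],
                        keep: Callable[[Sample], bool]) -> List[Sample]:
    """Step (iv): generate candidate samples for unlabeled images, then discard
    low-quality candidates with a filtering rule or model."""
    return [s for image in images for s in genixer.generate(image) if keep(s)]
```

In this reading, the task-specific mode is what produces the VQA-like data used to further train LLaVA1.5 and the REC-like data used to train Shikra, while the filtering step controls the quality of what enters those synthetic datasets.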
- GPT-4V(ision) system card. OpenAI, 2023.
- Improved baselines with visual instruction tuning, 2023.
- mPLUG-Owl: Modularization empowers large language models with multimodality. arXiv:2304.14178, 2023.
- Shikra: Unleashing multimodal LLM's referential dialogue magic. arXiv:2306.15195, 2023.
- InstructBLIP: Towards general-purpose vision-language models with instruction tuning, 2023.
- Microsoft COCO captions: Data collection and evaluation server. arXiv:1504.00325, 2015.
- Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In CVPR, 2017.
- OK-VQA: A visual question answering benchmark requiring external knowledge. In CVPR, 2019.
- A-OKVQA: A benchmark for visual question answering using world knowledge. arXiv, 2022.
- Visual instruction tuning. In NeurIPS, 2023.
- ReferItGame: Referring to objects in photographs of natural scenes. In EMNLP, 2014.
- Visual Genome: Connecting language and vision using crowdsourced dense image annotations. IJCV, 2017.
- Microsoft COCO: Common objects in context. In ECCV, 2014.
- LAION-5B: An open large-scale dataset for training next generation image-text models. arXiv:2210.08402, 2022.
- Conceptual 12M: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In CVPR, 2021.
- Im2Text: Describing images using 1 million captioned photographs. NeurIPS, 2011.
- Language models are few-shot learners. NeurIPS, 2020.
- GPT-4 technical report. OpenAI, 2023.
- PaLM 2 technical report. arXiv:2305.10403, 2023.
- LLaMA-Adapter V2: Parameter-efficient visual instruction model. arXiv:2304.15010, 2023.
- Language is not all you need: Aligning perception with language models. arXiv:2302.14045, 2023.
- Kosmos-2: Grounding multimodal large language models to the world. arXiv:2306.14824, 2023.
- LLaMA-Adapter: Efficient fine-tuning of language models with zero-init attention. arXiv:2303.16199, 2023.
- Otter: A multi-modal model with in-context instruction tuning. arXiv:2305.03726, 2023.
- PandaGPT: One model to instruction-follow them all. arXiv:2305.16355, 2023.
- MiniGPT-4: Enhancing vision-language understanding with advanced large language models. arXiv:2304.10592, 2023.
- Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv:2308.12966, 2023.
- GQA: A new dataset for real-world visual reasoning and compositional question answering. In CVPR, 2019.
- From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. TACL, 2014.
- Point and ask: Incorporating pointing into visual question answering. arXiv:2011.13681, 2020.
- Evaluating object hallucination in large vision-language models. arXiv:2305.10355, 2023.
- Scaling instruction-finetuned language models. arXiv:2210.11416, 2022.
- OPT: Open pre-trained transformer language models. arXiv:2205.01068, 2022.
- LLaMA: Open and efficient foundation language models. arXiv:2302.13971, 2023.
- Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality, March 2023.
- Llama 2: Open foundation and fine-tuned chat models. arXiv:2307.09288, 2023.
- BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv:2301.12597, 2023.
- CogVLM: Visual expert for pretrained language models, 2023.
- MiniGPT-v2: Large language model as a unified interface for vision-language multi-task learning. arXiv:2310.09478, 2023.
- Position-enhanced visual instruction tuning for multimodal large language models. arXiv:2308.13437, 2023.
- Cheap and quick: Efficient vision-language instruction tuning for large language models, 2023.
- VisionLLM: Large language model is also an open-ended decoder for vision-centric tasks. arXiv:2305.11175, 2023.
- AnyMAL: An efficient and scalable any-modality augmented language model. arXiv:2309.16058, 2023.
- nocaps: novel object captioning at scale. In ICCV, 2019.
- VizWiz grand challenge: Answering visual questions from blind people. In CVPR, 2018.
- Learning transferable visual models from natural language supervision. In ICML, 2021.
- EVA-CLIP: Improved training techniques for CLIP at scale. arXiv:2303.15389, 2023.
- BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805, 2018.
- Flamingo: a visual language model for few-shot learning. NeurIPS, 2022.
- Webly supervised concept expansion for general purpose vision models. In ECCV, 2022.
- OFA: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. In ICML, 2022.
- Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv:2206.08916, 2022.
- Position-guided text prompt for vision-language pre-training. In CVPR, 2023.
- Adam: A method for stochastic optimization. arXiv:1412.6980, 2014.
- SGDR: Stochastic gradient descent with warm restarts. arXiv:1608.03983, 2016.
- LAION-400M: Open dataset of CLIP-filtered 400 million image-text pairs. arXiv:2111.02114, 2021.
- Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In ACL, 2018.
- Learn to explain: Multimodal reasoning via thought chains for science question answering. NeurIPS, 2022.
- The Open Images Dataset V4: Unified image classification, object detection, and visual relationship detection at scale. IJCV, 2020.
- Flickr30k Entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. IJCV, 2017.
- Generation and comprehension of unambiguous object descriptions. In CVPR, 2016.
- MIMIC-IT: Multi-modal in-context instruction tuning, 2023.
- OCR-VQA: Visual question answering by reading text in images. In ICDAR, 2019.