Visual Program Distillation: Distilling Tools and Programmatic Reasoning into Vision-Language Models (2312.03052v2)
Abstract: Solving complex visual tasks such as "Who invented the musical instrument on the right?" involves a composition of skills: understanding spatial relations, recognizing instruments, and retrieving prior knowledge. Recent work shows promise by using an LLM to decompose such tasks into an executable program that invokes specialized vision models. However, generated programs are error-prone: they omit necessary steps, include spurious ones, and are unable to recover when the specialized models give incorrect outputs. Moreover, they require loading multiple models, incurring high latency and computation costs. We propose Visual Program Distillation (VPD), an instruction-tuning framework that produces a vision-language model (VLM) capable of solving complex visual tasks with a single forward pass. VPD distills the reasoning ability of LLMs by using them to sample multiple candidate programs, which are then executed and verified to identify a correct one. It translates each correct program into a language description of the reasoning steps, which is then distilled into a VLM. Extensive experiments show that VPD improves the VLM's ability to count, understand spatial relations, and reason compositionally. Our VPD-trained PaLI-X outperforms all prior VLMs, achieving state-of-the-art performance across complex vision tasks, including MMBench, OK-VQA, A-OKVQA, TallyQA, POPE, and Hateful Memes. An evaluation with human annotators also confirms that VPD improves model response factuality and consistency. Finally, experiments on content moderation demonstrate that VPD is helpful for adapting to real-world applications with limited data.
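The abstract describes a four-step data-generation loop: sample candidate programs with an LLM, execute them with specialized vision models, keep only programs whose outputs match the labeled answer, and rewrite the verified execution trace into a chain-of-thought rationale that is distilled into the VLM. The sketch below is a minimal illustration of that loop under these assumptions; all names (`sample_programs`, `tools.execute`, `trace_to_rationale`) are hypothetical placeholders, not the paper's actual interfaces.

```python
# Illustrative sketch of the VPD data-generation loop described in the abstract.
# The `llm` and `tools` objects and their methods are hypothetical placeholders.

def build_vpd_training_example(image, question, gold_answer, llm, tools, k=5):
    """Return a (question, rationale, answer) record for VLM instruction tuning,
    or None if no sampled program reproduces the labeled answer."""
    # 1. Sample multiple candidate programs from the LLM.
    programs = llm.sample_programs(question, num_samples=k)

    for program in programs:
        # 2. Execute each program with specialized vision tools
        #    (e.g., detector, depth estimator, knowledge retriever).
        result, trace = tools.execute(program, image)

        # 3. Verify: keep the program only if its output matches the label.
        if result == gold_answer:
            # 4. Rewrite the execution trace as a natural-language
            #    chain-of-thought rationale for distillation into the VLM.
            rationale = llm.trace_to_rationale(question, trace)
            return {"question": question,
                    "rationale": rationale,
                    "answer": gold_answer}

    # Discard examples where verification fails for every sampled program.
    return None
```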
- TallyQA: Answering complex counting questions. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 8076–8084, 2019.
- Flamingo: a visual language model for few-shot learning, 2022.
- PaLM 2 technical report, 2023.
- Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond, 2023.
- Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
- MiniGPT-v2: Large language model as a unified interface for vision-language multi-task learning, 2023.
- Shikra: Unleashing multimodal LLM’s referential dialogue magic, 2023.
- Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks, 2022.
- PaLI: A jointly-scaled multilingual language-image model, 2022.
- PaLI-X: On scaling up a multilingual vision and language model, 2023.
- PaLI-3 vision language models: Smaller, faster, stronger, 2023.
- Binding language models in symbolic languages. arXiv preprint arXiv:2210.02875, 2022.
- PaLM: Scaling language modeling with pathways, 2022.
- InstructBLIP: Towards general-purpose vision-language models with instruction tuning, 2023.
- Scaling vision transformers to 22 billion parameters, 2023.
- ViperGPT: Visual inference via Python execution for reasoning. arXiv preprint arXiv:2303.08128, 2023.
- DepthLab: Real-Time 3D Interaction With Depth Maps for Mobile Augmented Reality. In Proceedings of the 33rd Annual ACM Symposium on User Interface Software and Technology. ACM, 2020.
- LayoutGPT: Compositional visual planning and generation with large language models. arXiv preprint arXiv:2305.15393, 2023.
- Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering. In Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
- Visual programming: Compositional visual reasoning without training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
- Distilling step-by-step! Outperforming larger language models with less training data and smaller model sizes, 2023.
- LoRA: Low-rank adaptation of large language models, 2021.
- PromptCap: Prompt-guided task-aware image captioning. arXiv preprint arXiv:2211.09699, 2022.
- In-context learning for few-shot dialogue state tracking. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 2627–2643, Abu Dhabi, United Arab Emirates, 2022. Association for Computational Linguistics.
- AVIS: Autonomous visual information seeking with large language model agent, 2023.
- GQA: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6700–6709, 2019.
- In-datacenter performance analysis of a tensor processing unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture. ACM, 2017.
- Evaluating open-domain question answering in the era of large language models, 2023.
- The Hateful Memes Challenge: Detecting Hate Speech in Multimodal Memes, 2020.
- Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2015.
- Segment anything, 2023.
- Visual Genome: Connecting language and vision using crowdsourced dense image annotations, 2016.
- LISA: Reasoning segmentation via large language model, 2023.
- BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023.
- VisualBERT: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557, 2019.
- Evaluating object hallucination in large vision-language models, 2023.
- Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.
- Improved baselines with visual instruction tuning, 2023.
- Visual instruction tuning. arXiv preprint arXiv:2304.08485, 2023.
- MMBench: Is your multi-modal model an all-around player? 2023.
- Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916, 2022.
- Chameleon: Plug-and-play compositional reasoning with large language models, 2023.
- Faithful chain-of-thought reasoning, 2023.
- OK-VQA: A visual question answering benchmark requiring external knowledge. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3195–3204, 2019.
- Improving hateful memes detection via learning hatefulness-aware embedding space through retrieval-guided contrastive learning, 2023.
- Scaling open-vocabulary object detection, 2023.
- OCR-VQA: Visual question answering by reading text in images. In ICDAR, 2019.
- OpenAI. GPT-4 technical report, 2023.
- Kosmos-2: Grounding multimodal large language models to the world, 2023.
- Toolformer: Language models can teach themselves to use tools, 2023.
- A-OKVQA: A benchmark for visual question answering using world knowledge. arXiv preprint arXiv:2206.01718, 2022.
- HuggingGPT: Solving AI tasks with ChatGPT and its friends in Hugging Face, 2023.
- Towards VQA models that can read. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8317–8326, 2019.
- Modular visual question answering via code generation, 2023.
- UL2: Unifying Language Learning Paradigms, 2022.
- Llama 2: Open foundation and fine-tuned chat models, 2023.
- GIT: A generative image-to-text transformer for vision and language. arXiv preprint arXiv:2205.14100, 2022.
- OFA: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. arXiv preprint arXiv:2202.03052, 2022.
- SCOTT: Self-consistent chain-of-thought distillation, 2023.
- CogVLM: Visual expert for pretrained language models, 2023.
- Self-Instruct: Aligning language models with self-generated instructions, 2022.
- Non-intrusive adaptation: Input-centric parameter-efficient fine-tuning for versatile multimodal modeling, 2023.
- Chain-of-thought prompting elicits reasoning in large language models, 2022.
- AutoGen: Enabling next-gen LLM applications via multi-agent conversation, 2023.
- Lemur: Harmonizing natural language and code for language agents, 2023.
- An empirical study of GPT-3 for few-shot knowledge-based VQA. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 3081–3089, 2022.
- The dawn of LMMs: Preliminary explorations with GPT-4V(ision). arXiv preprint arXiv:2309.17421, 2023.
- MM-ReAct: Prompting ChatGPT for multimodal reasoning and action, 2023.
- ReAct: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629, 2022.
- mPLUG-Owl2: Revolutionizing multi-modal large language model with modality collaboration, 2023.
- The unreliability of explanations in few-shot prompting for textual reasoning. In Advances in Neural Information Processing Systems, pages 30378–30392. Curran Associates, Inc., 2022.
- MAmmoTH: Building math generalist models through hybrid instruction tuning. arXiv preprint arXiv:2309.05653, 2023.
- Socratic models: Composing zero-shot multimodal reasoning with language. arXiv preprint arXiv:2204.00598, 2022.
- Sigmoid loss for language image pre-training, 2023.
- GLIPv2: Unifying localization and vision-language understanding, 2022.
- MiniGPT-4: Enhancing vision-language understanding with advanced large language models, 2023.
- PaD: Program-aided distillation specializes large models in reasoning, 2023.