Abstract

Solving complex visual tasks such as "Who invented the musical instrument on the right?" involves a composition of skills: understanding space, recognizing instruments, and retrieving prior knowledge. Recent work shows promise by using an LLM to decompose such tasks into an executable program that invokes specialized vision models. However, generated programs are error-prone: they omit necessary steps, include spurious ones, and cannot recover when the specialized models give incorrect outputs. Moreover, they require loading multiple models, incurring high latency and computation costs. We propose Visual Program Distillation (VPD), an instruction-tuning framework that produces a vision-language model (VLM) capable of solving complex visual tasks with a single forward pass. VPD distills the reasoning ability of LLMs by using them to sample multiple candidate programs, which are then executed and verified to identify a correct one. It translates each correct program into a language description of the reasoning steps, which is then distilled into a VLM. Extensive experiments show that VPD improves the VLM's ability to count, understand spatial relations, and reason compositionally. Our VPD-trained PaLI-X outperforms all prior VLMs, achieving state-of-the-art performance across complex vision tasks, including MMBench, OK-VQA, A-OKVQA, TallyQA, POPE, and Hateful Memes. An evaluation with human annotators also confirms that VPD improves model response factuality and consistency. Finally, experiments on content moderation demonstrate that VPD also helps adapt models to real-world applications with limited data.

Overview

  • VPD is a framework for improving vision-language models: it uses an LLM to generate executable programs that invoke specialized vision modules to solve complex visual tasks.

  • The framework generates multiple candidate programs with a Large Language Model, which are then executed and verified using specialized vision tools.

  • The correct program's execution trace is translated into a natural-language chain-of-thought and distilled into the VLM to enhance its reasoning abilities.

  • The VPD-trained model, PaLI-X-VPD, shows state-of-the-art performance in vision tasks, outperforming existing VLMs and providing interpretable reasoning steps.

  • VPD has been shown to improve the factuality and consistency of model responses, and it adapts well to real-world applications with limited data, such as content moderation.

Introduction

Vision-language models (VLMs) have advanced considerably, yet complex visual tasks still present a challenge: they require not only object identification but also spatial understanding and the retrieval of contextual knowledge. Although LLMs have shown aptitude at generating executable code to tackle such tasks, the programs they produce are error-prone, often missing crucial steps or including unnecessary ones, and executing them requires loading several specialized models, which is computationally costly. Visual Program Distillation (VPD) is a framework proposed to overcome these obstacles while minimizing inference cost.

Program Generation and Verification

VPD begins by using an LLM to generate multiple candidate programs for a given task. These programs are executed with specialized vision modules, and a verification step identifies a correct one: for tasks with available labeled data, programs are filtered by whether their output matches the label. Each program's execution trace is recorded, detailing which vision tools were invoked during the process.
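This generate-and-verify loop can be summarized in a short sketch. The Python below is illustrative only: the helper names (`sample_program`, `run_with_trace`) are hypothetical stand-ins for an LLM client and a program executor wired to vision tools, not the paper's actual API.

```python
# Illustrative sketch of VPD's generate-and-verify loop (hypothetical API).

def find_verified_program(question, image, label, llm, executor, k=5):
    """Sample k candidate programs and return the first one whose
    execution result matches the ground-truth label, plus its trace."""
    for _ in range(k):
        # The LLM decomposes the task into an executable program.
        program = llm.sample_program(question)
        try:
            # Run the program; each step calls a specialized vision
            # tool, and the executor records a trace of those calls.
            result, trace = executor.run_with_trace(program, image)
        except Exception:
            continue  # discard programs that fail to execute
        if result == label:  # verification against labeled data
            return program, trace
    return None, None
```

Sampling several candidates rather than one matters here: any single generated program may omit a step or crash, so verification against labels selects a program that demonstrably produces the right answer.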

Distilling Step-by-Step

After identifying the correct program for a task, the next phase involves translating the program's execution trace into a natural language description of the reasoning steps, often referred to as a chain-of-thought (CoT). This CoT is then distilled into the VLM, with the aim of imbuing it with the same programmatic reasoning capabilities. This distillation process is crucial for improving the VLM’s abilities to count, decipher spatial relationships, and perform compositional reasoning.
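Concretely, each verified trace becomes one instruction-tuning example. The sketch below assumes a trace is a list of (tool, inputs, output) records and a hypothetical `rewrite_as_cot` call that has an LLM verbalize the trace; the field names are illustrative, not the paper's data format.

```python
# Illustrative sketch: turning a verified execution trace into a
# (prompt, target) pair for instruction tuning (hypothetical API).

def build_training_example(question, image, trace, answer, llm):
    # Verbalize the programmatic steps into a chain-of-thought, e.g.
    # "I locate the object on the right, recognize what it is, and
    # recall the relevant background knowledge."
    rationale = llm.rewrite_as_cot(question, trace)
    return {
        "image": image,
        "prompt": f"Question: {question}\nAnswer with reasoning.",
        "target": f"{rationale}\nFinal answer: {answer}",
    }
```

The VLM is then fine-tuned on such pairs with a standard instruction-tuning objective, so that at inference time a single forward pass reproduces reasoning that originally required orchestrating multiple tools.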

Empirical Evidence of Efficiency

The VPD-trained model, referred to as PaLI-X-VPD, achieves state-of-the-art performance across several complex vision tasks, surpassing previous VLMs while also providing human-readable reasoning steps. Human annotators confirm that VPD enhances the factuality and consistency of model responses. Separate experiments on content moderation showcase VPD's utility for real-world applications with limited data, suggesting that generating verified programs and distilling their reasoning into VLMs is a broadly applicable training recipe.
