
Abstract

Vision-Language Models (VLMs) have demonstrated broad viability thanks to extensive training in aligning visual instructions with answers. However, this conclusive alignment leads models to ignore essential visual reasoning, which in turn results in failures on detailed visual problems and unfaithful responses. In this paper, we propose Chain of Manipulations, a mechanism that enables VLMs to solve problems through a series of manipulations, where each manipulation is an operation on the visual input, drawn either from intrinsic abilities acquired through prior training (e.g., grounding) or from imitating human-like behaviors (e.g., zooming in). This mechanism encourages VLMs to generate faithful responses with evidential visual reasoning and allows users to trace error causes along interpretable reasoning paths. We thus train CogCoM, a general 17B VLM with a memory-based compatible architecture endowed with this reasoning mechanism. Experiments show that our model achieves state-of-the-art performance across 8 benchmarks from 3 categories, and that a limited number of training steps with our data quickly yields competitive performance. The code and data are publicly available at https://github.com/THUDM/CogCoM.

CogCoM enhances vision-language models by employing a chain of manipulations for evidential reasoning.

Overview

  • Introduces Chain of Manipulations (CoM) to promote deeper interaction between vision and language in Vision-Language Models (VLMs).

  • CogCoM, a 17B parameter VLM, uses CoM for advanced multimodal learning, improving capabilities in detailed visual reasoning.

  • Data synthesis algorithm creates reasoning chains for training, using linguistic/visual annotators and image-question-answer datasets.

  • CogCoM outperforms existing models on various benchmarks and shows resistance to hallucination alongside strong visual problem-solving.

Introduction

In AI research, the ability to align visual data with linguistic information is crucial, particularly for Vision-Language Models (VLMs) used in tasks such as visual question answering and image captioning. However, the conventional approach to training VLMs often yields models that overlook intricate visual reasoning or fail to capture fine-grained visual details. To address this issue, the study introduces a mechanism named Chain of Manipulations (CoM), which fosters a deeper interaction between visual data and linguistic tasks.

Chain of Manipulations Mechanism

The core idea behind CoM is to enable VLMs to interpret visual data through a series of operations, or "manipulations", which are either inherent abilities gained through prior training or acquired by imitating human cognitive behaviors. The mechanism guides VLMs through a step-by-step process of evidence collection and reasoning, drawing on details within the visual input. For instance, a model might first locate a particular object within an image before zooming in for finer detail or extracting its text.
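To make the idea concrete, here is a minimal sketch, in Python, of how a chain of manipulations could be represented and executed. The `Manipulation` class, the `run_chain` function, and the specific operation names are illustrative assumptions, not the paper's actual implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# A "visual state" is kept as a plain dict here (image handle, boxes, crops, text, ...).
State = Dict[str, object]

@dataclass
class Manipulation:
    """One step in a Chain of Manipulations: a named operation on the visual state."""
    name: str                          # e.g. "grounding", "zoom_in", "ocr" (illustrative names)
    apply: Callable[[State], State]    # returns an updated visual state

def run_chain(state: State, steps: List[Manipulation]) -> List[State]:
    """Apply each manipulation in order, keeping every intermediate state so the
    reasoning path stays inspectable and errors can be traced to a specific step."""
    trace = [state]
    for step in steps:
        trace.append(step.apply(trace[-1]))
    return trace

# Hypothetical chain: locate a sign, zoom into its box, then read the text.
chain = [
    Manipulation("grounding", lambda s: {**s, "box": (40, 60, 200, 120)}),
    Manipulation("zoom_in",   lambda s: {**s, "crop": ("crop of", s["box"])}),
    Manipulation("ocr",       lambda s: {**s, "text": "EXIT"}),  # placeholder recognizer output
]
trace = run_chain({"image": "path/to/image.jpg"}, chain)
```

Keeping the full trace, rather than only the final answer, is what makes the reasoning path interpretable to users.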

Data Synthesis and Model Training

To harness CoM, researchers devised a data synthesis algorithm using a mix of linguistic and visual annotators, such as powerful LLMs and cutting-edge recognition tools, to create chains of reasoning based on available image-question-answer datasets. After synthesizing these chains, a traversal process is applied to extract the most feasible paths leading to the correct answers. A model named CogCoM, a general 17B VLM, was developed using a compatible memory-based architecture that allows for such complex multimodal learning. The training incorporated these CoM chains to bolster the model's capabilities.
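As a rough illustration of the traversal step, the sketch below walks a hypothetical tree of synthesized reasoning steps and keeps only the root-to-leaf paths that end in the correct answer. The node format and function name are assumptions for illustration, not the authors' algorithm.

```python
from typing import Dict, List, Optional

# Hypothetical node format for a synthesized reasoning tree:
# {"step": str, "answer": Optional[str], "children": List[Node]}
Node = Dict[str, object]

def feasible_paths(node: Node, gold: str, prefix: Optional[List[str]] = None) -> List[List[str]]:
    """Depth-first traversal that keeps only root-to-leaf paths whose terminal
    answer matches the gold answer (a simplified stand-in for extracting the
    feasible reasoning chains from synthesized candidates)."""
    prefix = (prefix or []) + [str(node["step"])]
    children = node.get("children") or []
    if not children:  # leaf: keep the path only if it reaches the correct answer
        return [prefix] if node.get("answer") == gold else []
    paths: List[List[str]] = []
    for child in children:
        paths.extend(feasible_paths(child, gold, prefix))
    return paths
```

The retained paths then serve as training chains alongside the image-question-answer data.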

Experimental Outcomes

CogCoM's training involved a mix of instruction-following, grounding, and detailed-captioning datasets, together with the CoM chains. The model was evaluated on eight key benchmarks spanning three categories of capabilities and displayed state-of-the-art performance across the board. It also exhibited robustness against hallucination and reached competitive performance even with limited training steps. The study further introduces a testbed of detailed visual problems with a keypoint-aware metric for assessing the correctness of reasoning paths, on which CogCoM outperformed existing models.
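The paper defines its own keypoint-aware metric; as a loose sketch of the general idea (not the paper's formula), one could score a predicted reasoning path by combining final-answer correctness with coverage of annotated key evidence points:

```python
from typing import List

def keypoint_score(predicted_path: str, keypoints: List[str],
                   answer_correct: bool, alpha: float = 0.5) -> float:
    """Illustrative keypoint-aware score (an assumption, not the paper's metric):
    blend answer correctness with the fraction of annotated keypoints that
    appear in the predicted reasoning path."""
    text = predicted_path.lower()
    covered = sum(1 for kp in keypoints if kp.lower() in text)
    recall = covered / len(keypoints) if keypoints else 0.0
    return alpha * float(answer_correct) + (1 - alpha) * recall
```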

Concluding Thoughts

This study marks a significant step toward faithful visual reasoning in VLMs. The CoM mechanism shows promise in guiding VLMs through detailed, logical visual processing akin to human cognition. While the method still needs greater diversity in its linguistic solving steps and more accurate visual tools, it offers a compelling approach to visual data interpretation and reasoning in AI models.

