Image-of-Thought Prompting for Visual Reasoning Refinement in Multimodal Large Language Models

(arXiv:2405.13872)
Published May 22, 2024 in cs.AI, cs.CL, and cs.CV

Abstract

Recent advancements in Chain-of-Thought (CoT) and related rationale-based works have significantly improved the performance of LLMs in complex reasoning tasks. With the evolution of Multimodal LLMs (MLLMs), enhancing their capability to tackle complex multimodal reasoning problems is a crucial frontier. However, incorporating multimodal rationales in CoT has yet to be thoroughly investigated. We propose the Image-of-Thought (IoT) prompting method, which helps MLLMs to extract visual rationales step-by-step. Specifically, IoT prompting can automatically design critical visual information extraction operations based on the input images and questions. Each step of visual information refinement identifies specific visual rationales that support answers to complex visual reasoning questions. Beyond the textual CoT, IoT simultaneously utilizes visual and textual rationales to help MLLMs understand complex multimodal information. IoT prompting has improved zero-shot visual reasoning performance across various visual understanding tasks in different MLLMs. Moreover, the step-by-step visual feature explanations generated by IoT prompting elucidate the visual reasoning process, aiding in analyzing the cognitive processes of large multimodal models.

Figure: Multimodal rationale generation and step-by-step rationale refinement prompts, with highlighted action and hybrid elements.

Overview

  • The paper presents Image-of-Thought (IoT) prompting, a method to improve visual reasoning in Multimodal LLMs (MLLMs) by integrating visual and textual rationales.

  • The IoT method employs action planning to decompose complex visual tasks, generating hybrid rationales for each sub-goal, which are then combined into a comprehensive Multimodal Rationale Series (MRS) to refine the model's final answer.

  • Empirical evaluations on benchmarks like MMBench, MME, and MM-Vet show significant performance improvements, demonstrating the method's effectiveness in diverse visual reasoning scenarios without the need for fine-tuning.

Image-of-Thought Prompting for Visual Reasoning Refinement in Multimodal LLMs

The paper, "Image-of-Thought Prompting for Visual Reasoning Refinement in Multimodal LLMs" by Zhou et al., introduces a novel method termed Image-of-Thought (IoT) prompting to enhance the visual reasoning capabilities of Multimodal LLMs (MLLMs). This work addresses the challenge of integrating multimodal rationales within the framework of Chain-of-Thought (CoT) reasoning, which has proven effective in improving the reasoning performance of LLMs.

Introduction

Traditional CoT prompting techniques have advanced complex reasoning in LLMs, but they remain limited when dealing with multimodal data. The authors argue that relying solely on textual rationales is insufficient for tasks requiring a comprehensive understanding of multimodal inputs, such as combined visual and textual data. Human reasoning often constructs thought processes from visual and textual cues simultaneously. Inspired by this cognitive process, the paper proposes the IoT prompting method to extract and utilize visual rationales in a step-by-step manner, enhancing the model's reasoning capacity for complex visual tasks.

Methodology

The IoT prompting method structures visual reasoning in three stages (a minimal code sketch follows the list):

  1. Action Planning and Execution: The MLLM decomposes complex questions into a series of sub-goals and selects appropriate image processing tools to perform specific visual manipulations at each step. Actions such as segmentation, object detection, geometric transformations, and spatial ruler usage are integrated into the reasoning chain.
  2. Hybrid Rationales Generation: For each sub-goal, the model generates both textual and visual rationales. These hybrid rationales are then concatenated to form a Multimodal Rationale Series (MRS), providing a comprehensive explanation that anchors textual reasoning in visual evidence.
  3. Refinement of Final Answer: The MRS is fed back into the MLLM, which refines its final answer based on the integrated multimodal rationales.
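
The paper frames these stages as a prompting loop rather than a trained pipeline. Below is a minimal sketch of how such a loop could be wired up, assuming a generic `MLLM.generate(image, prompt)` interface, a registry of image-processing tools, and a simple "sub-goal -> tool" plan format; these names and prompt wordings are illustrative assumptions, not the authors' released implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Protocol, Tuple


class MLLM(Protocol):
    """Minimal interface this sketch assumes a multimodal model exposes."""
    def generate(self, image: bytes, prompt: str) -> str: ...


@dataclass
class RationaleStep:
    sub_goal: str   # textual sub-goal produced by action planning
    action: str     # image-processing tool chosen for this step
    visual: bytes   # visual rationale: the tool's manipulated image
    textual: str    # textual rationale explaining the visual evidence


def parse_plan(plan_text: str) -> List[Tuple[str, str]]:
    """Parse planner output lines of the (assumed) form 'sub-goal -> tool'."""
    steps = []
    for line in plan_text.strip().splitlines():
        if "->" in line:
            goal, tool = (p.strip() for p in line.split("->", 1))
            steps.append((goal, tool))
    return steps


def iot_prompt(mllm: MLLM,
               tools: Dict[str, Callable[[bytes, str], bytes]],
               image: bytes,
               question: str) -> str:
    # 1. Action planning and execution: decompose the question into
    #    sub-goals, each paired with an available image-processing tool.
    plan = mllm.generate(
        image,
        f"Decompose the question into sub-goals and choose one tool per "
        f"sub-goal from {sorted(tools)}, one 'sub-goal -> tool' per line. "
        f"Question: {question}",
    )

    # 2. Hybrid rationale generation: run each tool to obtain a visual
    #    rationale, then ask the model to describe it in text.
    mrs: List[RationaleStep] = []
    for sub_goal, action in parse_plan(plan):
        visual = tools[action](image, sub_goal)
        textual = mllm.generate(
            visual,
            f"For the sub-goal '{sub_goal}', describe what this "
            f"processed image shows.",
        )
        mrs.append(RationaleStep(sub_goal, action, visual, textual))

    # 3. Refinement: feed the concatenated Multimodal Rationale Series
    #    back to the model so the final answer is grounded in evidence.
    evidence = "\n".join(f"[{s.action}] {s.sub_goal}: {s.textual}" for s in mrs)
    return mllm.generate(
        image,
        f"Question: {question}\nStep-by-step evidence:\n{evidence}\n"
        f"Give the final answer, citing this evidence.",
    )
```

Note that each step contributes a manipulated image as well as text, so the refinement prompt in stage 3 is anchored in concrete visual evidence rather than a purely textual chain, which is the property the paper credits for improved grounding.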

Experimental Results

The empirical evaluations demonstrate the effectiveness of IoT prompting across three benchmark datasets: MMBench, MME, and MM-Vet. Key findings include:

  • MMBench: Significant improvements were observed in categories requiring spatial and physical reasoning. For instance, the IoT method enhanced the performance in the "Object Localization" and "Spatial Relationship" categories by notable margins for both GPT-4 and Gemini-Pro models.
  • MME: IoT prompting led to enhanced performance in cognitive tasks involving commonsense reasoning, numerical calculation, and code reasoning, highlighting the model's improved capability in processing and reasoning with multimodal data.
  • MM-Vet: The method demonstrated substantial improvements in OCR, knowledge-based reasoning, spatial awareness, and mathematical problems, indicating its robustness in diverse visual reasoning scenarios.

Implications and Future Directions

The results indicate that IoT prompting can significantly reduce the reasoning errors associated with traditional text-only CoT by grounding textual inferences in visual evidence, thus mitigating the risk of hallucinations, where models produce incorrect or unsupported content. The method follows a training-free paradigm, eliminating the need for expensive fine-tuning and making it a practical, scalable way to enhance MLLMs.

Future developments could focus on expanding the range of tools available for action planning and exploring the integration of IoT prompting in real-world applications such as robotics, where multimodal reasoning is crucial. Additionally, research could be conducted to address the limitations observed in specific categories, potentially by refining the visual rationale extraction processes to maintain high-resolution and contextually relevant information throughout the reasoning chain.

Conclusion

The IoT prompting method represents a substantial advancement in the field of multimodal reasoning for MLLMs, providing a systematic and integrated approach to combining visual and textual rationales. By aligning the reasoning process closely with how humans naturally incorporate visual and textual information, this method enhances both the accuracy and interpretability of model outputs. As MLLMs continue to evolve, techniques like IoT prompting will likely play a critical role in their ability to tackle increasingly complex and nuanced reasoning tasks.
