TouchStone: Evaluating Vision-Language Models by Language Models (2308.16890v2)

Published 31 Aug 2023 in cs.CV and cs.CL

Abstract: Large vision-language models (LVLMs) have recently witnessed rapid advancements, exhibiting a remarkable capacity for perceiving, understanding, and processing visual information by connecting visual receptor with LLMs. However, current assessments mainly focus on recognizing and reasoning abilities, lacking direct evaluation of conversational skills and neglecting visual storytelling abilities. In this paper, we propose an evaluation method that uses strong LLMs as judges to comprehensively evaluate the various abilities of LVLMs. Firstly, we construct a comprehensive visual dialogue dataset TouchStone, consisting of open-world images and questions, covering five major categories of abilities and 27 subtasks. This dataset not only covers fundamental recognition and comprehension but also extends to literary creation. Secondly, by integrating detailed image annotations we effectively transform the multimodal input content into a form understandable by LLMs. This enables us to employ advanced LLMs for directly evaluating the quality of the multimodal dialogue without requiring human intervention. Through validation, we demonstrate that powerful LVLMs, such as GPT-4, can effectively score dialogue quality by leveraging their textual capabilities alone, aligning with human preferences. We hope our work can serve as a touchstone for LVLMs' evaluation and pave the way for building stronger LVLMs. The evaluation code is available at https://github.com/OFA-Sys/TouchStone.

Summary

The paper introduces TouchStone, a new evaluation framework that uses GPT-4 as an automated judge to assess LVLM dialogue with minimal human intervention.
The novel dataset evaluates a range of capabilities from image recognition to storytelling, revealing performance gaps and aligning closely with human judgment.
The analysis highlights challenges like hallucination and suggests improvements such as high-resolution inputs and enhanced spatial understanding for future LVLM development.

Evaluating Vision-Language Models Using LLMs

The paper evaluates Large Vision-Language Models (LVLMs) by using LLMs as judges. The authors introduce an evaluation method, TouchStone, built around a visual dialogue dataset of open-world images and questions. The dataset spans five major ability categories and 27 subtasks, ranging from fundamental recognition to literary creation. The key idea is to employ an LLM, specifically GPT-4, as the judge of LVLM dialogue quality, without the need for human intervention.

Dataset Construction and Evaluation Framework

The TouchStone dataset pairs open-world images with questions that probe distinct capabilities: basic description, visual recognition, comprehension, storytelling, and multi-image analysis. Detailed image annotations convert the multimodal input into text that an LLM can read, so the LLM can act as an automated judge and score dialogue quality using its textual capabilities alone. The resulting judgments are then checked against human preferences.
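The annotation-to-text step can be illustrated with a minimal sketch. It is not the authors' code: the prompt wording, the 1 to 10 scale, and the function names `build_judge_prompt` and `parse_score` are assumptions made for illustration, and no real model API is called.

```python
import re
from typing import Optional


def build_judge_prompt(annotation: str, question: str, answer: str) -> str:
    """Assemble a text-only prompt for a judge that cannot see the image.

    The image is represented solely by its written annotation.
    The wording and the 1-10 scale are illustrative assumptions.
    """
    return "\n".join([
        "You are grading an assistant's answer about an image.",
        "You cannot see the image; rely only on the annotation below.",
        "",
        "[Image annotation]",
        annotation.strip(),
        "[Question]",
        question.strip(),
        "[Assistant answer]",
        answer.strip(),
        "",
        "Rate the answer from 1 to 10 for accuracy and helpfulness.",
        "Reply with the score only.",
    ])


def parse_score(reply: str) -> Optional[int]:
    """Extract the first standalone integer between 1 and 10, if any."""
    match = re.search(r"\b(10|[1-9])\b", reply)
    return int(match.group(1)) if match else None


prompt = build_judge_prompt(
    annotation="A brown dog is catching a red frisbee on a beach.",
    question="What is the dog doing?",
    answer="The dog is jumping to catch a frisbee.",
)
print(prompt)
```

The prompt would be sent to whichever text-only judge is in use, and `parse_score` applied to its reply; keeping both as pure functions makes them easy to test without a model.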

The evaluation pipeline of TouchStone is structured to obviate the need for traditional human evaluation, thereby enhancing the efficiency and scalability of LVLM assessment. The research provides a robust comparison between model judgments and human evaluations, demonstrating that GPT-4 maintains a high degree of consistency with human preferences.
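One simple way to quantify consistency between a judge and human raters is the fraction of comparisons on which they pick the same winner. The sketch below assumes pairwise preference labels ("A", "B", or "tie"); the label scheme and the function name `agreement_rate` are hypothetical, the data are made up, and the paper's own consistency measure may differ.

```python
from typing import Sequence


def agreement_rate(judge: Sequence[str], human: Sequence[str]) -> float:
    """Fraction of items where the judge and the human chose the same label."""
    if len(judge) != len(human):
        raise ValueError("judge and human label lists must have equal length")
    if not judge:
        raise ValueError("at least one comparison is required")
    matches = sum(j == h for j, h in zip(judge, human))
    return matches / len(judge)


# Made-up labels for four comparisons, for illustration only.
judge_labels = ["A", "B", "A", "tie"]
human_labels = ["A", "B", "B", "tie"]
print(agreement_rate(judge_labels, human_labels))  # 0.75
```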

Performance and Hallucination Analysis

The results show notable variation in LVLM performance across capabilities. Visual recognition and comprehension remain challenging, with significant room for improvement, especially in mathematical problem-solving, chart analysis, and multi-image assessment. Hallucination, where a model describes content that is not present in the visual input, also remains a prevalent issue. The paper assesses this phenomenon systematically and finds that hallucination tendencies differ across models.
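Per-capability differences of this kind are typically surfaced by averaging judge scores within each category. The following is a minimal sketch under that assumption; the category names and scores are invented placeholders, not results from the paper, and `mean_score_by_category` is a hypothetical helper.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Tuple


def mean_score_by_category(
    records: Iterable[Tuple[str, float]],
) -> Dict[str, float]:
    """Group (category, score) pairs and return the mean score per category."""
    buckets = defaultdict(list)
    for category, score in records:
        buckets[category].append(score)
    return {category: mean(scores) for category, scores in buckets.items()}


# Invented scores for illustration only.
records = [
    ("recognition", 6.0),
    ("recognition", 8.0),
    ("storytelling", 9.0),
    ("multi-image", 4.0),
    ("multi-image", 5.0),
]
print(mean_score_by_category(records))
```

Reporting the mean per category rather than a single overall number keeps weak areas, such as multi-image questions in this toy data, from being masked by strong ones.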

Notably, models that had undergone supervised fine-tuning or incorporated high-resolution inputs during training, such as Qwen-VL and mPLUG-Owl, showed enhanced performance in certain tasks, particularly in text recognition. Conversely, models relying primarily on image-text alignment, such as PandaGPT, exhibited higher hallucination scores, especially in scenarios where input quality was compromised.

Implications and Future Directions

The work has direct implications for how LVLMs are developed and evaluated. By using LLMs as evaluators, the authors propose a scalable and efficient framework that reduces reliance on extensive human benchmarking. A broad dataset such as TouchStone could also serve as a common standard for assessing the capabilities of multimodal models.

Future research directions may include enhancing LVLMs' spatial understanding, multi-image pre-training, and multi-task learning to improve model comprehension and reduce hallucinations. Additionally, exploring methods to bolster LLMs through multimodal content and address the underlying causes of hallucination could offer paths towards developing more robust and reliable models. Furthermore, increasing the resolution of input images and constructing models with explicit spatial and structural comprehension could also be promising areas of exploration.

Overall, the work offers an automated approach to assessing complex multimodal interactions. Because the method is automated and is validated against human preferences, it provides a practical basis for guiding LVLM development and deployment in real-world applications.