From Recognition to Cognition: Visual Commonsense Reasoning (1811.10830v2)

Published 27 Nov 2018 in cs.CV and cs.CL

Abstract: Visual understanding goes well beyond object recognition. With one glance at an image, we can effortlessly imagine the world beyond the pixels: for instance, we can infer people's actions, goals, and mental states. While this task is easy for humans, it is tremendously difficult for today's vision systems, requiring higher-order cognition and commonsense reasoning about the world. We formalize this task as Visual Commonsense Reasoning. Given a challenging question about an image, a machine must answer correctly and then provide a rationale justifying its answer. Next, we introduce a new dataset, VCR, consisting of 290k multiple choice QA problems derived from 110k movie scenes. The key recipe for generating non-trivial and high-quality problems at scale is Adversarial Matching, a new approach to transform rich annotations into multiple choice questions with minimal bias. Experimental results show that while humans find VCR easy (over 90% accuracy), state-of-the-art vision models struggle (~45%). To move towards cognition-level understanding, we present a new reasoning engine, Recognition to Cognition Networks (R2C), that models the necessary layered inferences for grounding, contextualization, and reasoning. R2C helps narrow the gap between humans and machines (~65%); still, the challenge is far from solved, and we provide analysis that suggests avenues for future work.

Citations (817)

View on Semantic Scholar

Summary

The paper introduces the Visual Commonsense Reasoning task that bridges image recognition and cognitive inference by requiring rationales for answers.
It presents a 290K-question dataset from 110K movie scenes, utilizing Adversarial Matching to minimize biases in answer choices.
The R2C model integrates grounding, contextualization, and reasoning, outperforming earlier VQA approaches while highlighting the gap to human-level performance.

From Recognition to Cognition: Visual Commonsense Reasoning

The paper "From Recognition to Cognition: Visual Commonsense Reasoning" introduces an innovative approach to advancing the capabilities of visual question-answering (VQA) systems. Developed by Rowan Zellers, Yonatan Bisk, Ali Farhadi, and Yejin Choi from the University of Washington and the Allen Institute for Artificial Intelligence, the work pivotally addresses the gap between simple recognition tasks and higher-order cognitive reasoning when interpreting images.

Key Contributions

Visual Commonsense Reasoning Task: The paper formalizes a new task—Visual Commonsense Reasoning (VCR)—that goes beyond merely identifying objects in an image. The task involves answering complex questions about an image and providing a rationale supporting the answer. This requires higher-level understanding such as inferring people’s actions, intentions, and social dynamics.
VCR Dataset: The authors introduce a large-scale dataset fashioned to challenge current VQA systems. The dataset comprises 290,000 multiple-choice questions derived from 110,000 movie scenes. Each question is paired with plausible answer choices and rationales that justify the chosen answers. The questions and choices are designed to mitigate biases using a method called Adversarial Matching.
Recognition to Cognition Networks (R2C): The paper presents the R2C model which encapsulates three key stages: grounding, contextualization, and reasoning. This framework ensures a thorough comprehension of image-related queries by attending to image regions, understanding the questions in context, and applying layered inferences to derive rational, explanatory answers.

Adversarial Matching

Adversarial Matching is notably crucial in the construction of the dataset. It minimizes biases by ensuring each correct answer is recycled as a negative choice in other questions. This method includes a constrained optimization process that uses relevance and entailment scores based on state-of-the-art natural language inference models. Consequently, it produces a dataset that is difficult for machines yet straightforward for humans, achieving human accuracy of over 90% while state-of-the-art models struggle.

Experimental Results

The experimental evaluations drew several significant observations:

Human vs. Machine Performance: While humans achieve impressive accuracy (>90%), current models, including R2C, show considerable difficulty, manifesting much lower performance (~65%).
Baseline Comparison: The R2C model markedly outperforms other vision and LLMs, such as BERT and earlier VQA models. However, even R2C exhibits significant headroom before approaching human-level cognition.
Ablation Study: The paper also performs comprehensive ablations to identify the contribution of various model components. Intriguingly, strong text representations (via BERT) were crucial to performance, more so than advanced reasoning modules.

Implications and Future Prospects

Practical Implications: The task of Visual Commonsense Reasoning has profound implications for developing AI systems that can interpret and reason about visual data in a human-like manner. Potential applications span from autonomous systems and human-computer interaction to advanced analytics in multimedia.

Theoretical Implications: The paper offers insights into the integral components necessary for bridging recognition and cognition. It foregrounds the utility of holistic reasoning systems that can integrate and reason over multiple data modalities—visual and textual.

Future Developments: The paper anticipates several avenues for future research. Enhancing the reasoning capabilities of existing models through more advanced attention mechanisms and leveraging larger, more diverse datasets could narrow the performance disparity between machines and humans. Additionally, fine-tuning and adapting models like BERT specifically for multimodal tasks, integrating extensive commonsense knowledge bases, and developing new baseline models will be critical in this progression.

Conclusion

"From Recognition to Cognition: Visual Commonsense Reasoning" significantly propels the field towards more sophisticated, reasoning-based AI systems. The introduction of the VCR task and dataset, along with the R2C model, offers a robust framework for future advancements. While current models display promising results, achieving human-level cognition remains an ongoing challenge, opening fertile grounds for future exploration in AI research.