Learning Visual Commonsense for Robust Scene Graph Generation

Published 17 Jun 2020 in cs.CV and cs.LG | (2006.09623v2)

Abstract: Scene graph generation models understand the scene through object and predicate recognition, but are prone to mistakes due to the challenges of perception in the wild. Perception errors often lead to nonsensical compositions in the output scene graph, which do not follow real-world rules and patterns, and can be corrected using commonsense knowledge. We propose the first method to acquire visual commonsense such as affordance and intuitive physics automatically from data, and use that to improve the robustness of scene understanding. To this end, we extend Transformer models to incorporate the structure of scene graphs, and train our Global-Local Attention Transformer on a scene graph corpus. Once trained, our model can be applied on any scene graph generation model and correct its obvious mistakes, resulting in more semantically plausible scene graphs. Through extensive experiments, we show our model learns commonsense better than any alternative, and improves the accuracy of state-of-the-art scene graph generation methods.

Abstract PDF Upgrade to Chat

Citations (311)

View on Semantic Scholar

Summary

The paper presents a Global-Local Attention Transformer (GLAT) that integrates visual commonsense to refine scene graph outputs.
It disentangles perceptual errors by training separate perception and commonsense models, then fuses their outputs for more plausible scene interpretations.
Experiments on the Visual Genome dataset show enhanced recall and robust performance over state-of-the-art scene graph models.

Learning Visual Commonsense for Robust Scene Graph Generation

The paper "Learning Visual Commonsense for Robust Scene Graph Generation" presents a novel approach to enhance the robustness of scene graph generation (SGG) by integrating visual commonsense. Traditional SGG systems recognize objects and their predicates in scenes but often produce nonsensical compositions due to perception errors. The authors introduce a pioneering method that automatically learns visual commonsense, leveraging affordance and intuitive physics to correct these errors, thus improving the semantic plausibility of scene graphs.

Methodology

The authors extend Transformer models to incorporate scene graph structures, proposing the Global-Local Attention Transformer (GLAT). The GLAT model is designed to learn visual commonsense from a corpus of annotated scene graphs. It leverages both global and local attention mechanisms to encode the structure and context of graphs effectively. Once trained, the model can refine the output of any existing scene graph generation system, correcting mistakes that defy common real-world patterns.

The authors also address the issue of data bias inherent in conventional methods that integrate commonsense into neural networks reliant on training data statistics. They propose an approach to disentangle perception and commonsense, training them as separate models. A fusion module is introduced to synthesize the outputs from both models based on their classification confidence, ensuring a robust integration of perceptual and commonsensical knowledge.

Results

Experiments conducted on the Visual Genome dataset demonstrate significant improvements in scene graph generation using the proposed method. GLAT outperformed existing transformer and graph-based models in learning commonsense to reconstruct perturbed scene graphs. The method shows consistent improvements across different state-of-the-art SGG models, such as Iterative Message Passing (IMP), Stacked Neural Motifs (SNM), and Knowledge-Embedded Routing Networks (KERN). Specifically, integrating GLAT with these perception models and employing the proposed fusion mechanism led to enhanced recall values in scene graph classification tasks, indicating the effectiveness of incorporating visual commonsense.

Implications and Future Directions

The integration of visual commonsense into scene understanding frameworks presents several promising implications. Practically, it enhances the accuracy and reliability of artificial systems tasked with interpreting and interacting with visual data. Theoretically, it provides a foundation for understanding how contextual and structural information can be leveraged to improve semantic inference in artificial intelligence.

Future directions could include expanding this framework to other domains within computer vision and AI, exploring the integration of commonsense in more complex scene understanding tasks or multi-modal analyses, and improving the scalability of the approach. Additionally, investigating techniques to further mitigate the potential biases present in the commonsense learning process and enhancing the interpretability of the decision-making process in such AI systems would be worthwhile pursuits.

In summary, this research contributes significantly to the field of computer vision by proposing a robust method for integrating visual commonsense into scene graph generation, thereby addressing limitations of traditional SGG systems and opening avenues for further advancements in AI scene understanding.

Markdown Report Issue