
Visual Relationship Detection with Language Priors (1608.00187v1)

Published 31 Jul 2016 in cs.CV

Abstract: Visual relationships capture a wide variety of interactions between pairs of objects in images (e.g. "man riding bicycle" and "man pushing bicycle"). Consequently, the set of possible relationships is extremely large and it is difficult to obtain sufficient training examples for all possible relationships. Because of this limitation, previous work on visual relationship detection has concentrated on predicting only a handful of relationships. Though most relationships are infrequent, their objects (e.g. "man" and "bicycle") and predicates (e.g. "riding" and "pushing") independently occur more frequently. We propose a model that uses this insight to train visual models for objects and predicates individually and later combines them together to predict multiple relationships per image. We improve on prior work by leveraging language priors from semantic word embeddings to finetune the likelihood of a predicted relationship. Our model can scale to predict thousands of types of relationships from a few examples. Additionally, we localize the objects in the predicted relationships as bounding boxes in the image. We further demonstrate that understanding relationships can improve content based image retrieval.

Authors (4)
  1. Cewu Lu (203 papers)
  2. Ranjay Krishna (116 papers)
  3. Michael Bernstein (23 papers)
  4. Li Fei-Fei (199 papers)
Citations (1,092)

Summary

  • The paper introduces a novel model that decouples object and predicate learning and leverages semantic word embeddings to predict visual relationships.
  • It applies CNN-based modules and bi-convex semantic embedding to accurately detect interactions, achieving high recall rates in predicate and relationship detection.
  • Practical evaluations show improved image retrieval and zero-shot learning performance, demonstrating the model’s scalability and effectiveness.

Visual Relationship Detection with Language Priors

Cewu Lu, Ranjay Krishna, Michael Bernstein, and Li Fei-Fei from Stanford University present an approach that enhances visual relationship detection by integrating language priors. Detecting relationships is difficult because the set of possible relationships between object pairs in images is extremely large. The paper addresses this challenge with a model that decouples the learning of objects and predicates and then merges them, using semantic word embeddings to refine the predicted relationships.

Background and Motivation

Visual relationships in images, such as "man riding bicycle" or "man pushing bicycle," consist of interactions between pairs of objects and are pivotal for the holistic understanding of images and accurate image retrieval. The vast diversity and infrequency of many visual relationships make it difficult to gather sufficient training examples for each one, which has traditionally limited relationship detection models to a small, manageable number of relationships.

Proposed Model

To overcome this limitation, the researchers introduce a model that separately trains visual models for objects and predicates and subsequently combines these models to predict relationships. The key components of their approach include:

  1. Visual Appearance Module: This module learns the appearance of objects and their predicates independently. It employs convolutional neural networks (CNNs) to classify objects and predicates based on bounding box annotations in the training set.
  2. Language Module: To leverage semantic similarities between predicates, the model uses pre-trained word embeddings (word2vec) to map objects and relationships into a vector space where semantically similar entities are close to each other. This process allows the model to draw inferences even when relationships are rare or unseen during training.
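
A minimal sketch of how such a language prior can be scored is shown below. The random vectors, the projection matrix `W`, and the bias `b` are illustrative stand-ins for the pre-trained word2vec embeddings and learned parameters described in the paper, not the authors' exact implementation.

```python
import numpy as np

# Hypothetical word embeddings standing in for pre-trained word2vec vectors;
# random vectors are used here purely so the sketch runs on its own.
rng = np.random.default_rng(0)
EMB_DIM = 300
vocab = ["man", "bicycle", "riding", "pushing"]
word_vec = {w: rng.normal(size=EMB_DIM) for w in vocab}

# Assumed learned projection: one row per predicate, applied to the
# concatenated subject and object embeddings.
predicates = ["riding", "pushing"]
W = rng.normal(size=(len(predicates), 2 * EMB_DIM))
b = np.zeros(len(predicates))

def language_prior(subject: str, obj: str) -> np.ndarray:
    """Score every predicate for an object pair from word embeddings alone."""
    pair = np.concatenate([word_vec[subject], word_vec[obj]])
    return W @ pair + b

print(dict(zip(predicates, language_prior("man", "bicycle"))))
```

Because semantically similar objects receive similar embeddings, pairs such as "man, bicycle" and "person, bike" produce similar predicate scores, which is what lets the model generalize to rare or unseen relationships.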

Methodology

The training algorithm for their model is bi-convex, optimizing both the visual appearance and semantic projections. Key steps in their method include:

  • Visual Appearance: Features extracted from the bounding boxes of objects and the union of pairs of bounding boxes are used to learn object and predicate classifiers.
  • Relationship Embedding: Relationships are projected into a vector space that captures the semantic similarities encoded in the word embeddings. This embedding refines predictions through a scoring mechanism and improves detection of both seen and unseen relationships.
  • Training Objective: The objective combines a visual appearance term with language terms that assign higher likelihood scores to more frequent relationships and keep semantically similar relationships close in the embedding space; the resulting optimization is bi-convex (a short sketch of the combined scoring and a rank-style loss follows this list).
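
The following sketch illustrates, under simplifying assumptions, how a candidate relationship's visual score and language prior might be multiplied at inference time and how a margin-based rank loss could push the annotated triple above other candidates; the paper's exact bi-convex objective and its optimization are more involved.

```python
def relationship_score(visual_score: float, language_score: float) -> float:
    """Final confidence for a candidate <subject, predicate, object>:
    the visual module's score modulated by the language prior."""
    return visual_score * language_score

def rank_loss(gt_score: float, neg_scores: list, margin: float = 1.0) -> float:
    """Hinge-style loss encouraging the ground-truth relationship to outrank
    every other candidate in the same image by at least `margin`."""
    return sum(max(0.0, margin - gt_score + s) for s in neg_scores)

# Toy example: the annotated triple should score above two distractors.
gt = relationship_score(visual_score=0.9, language_score=0.8)
negatives = [relationship_score(0.4, 0.7), relationship_score(0.6, 0.2)]
print(rank_loss(gt, negatives))
```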

Results and Evaluations

The performance of the proposed model is evaluated across multiple conditions, including predicate detection, phrase detection, and relationship detection. The results are benchmarked using metrics such as recall at K and mean average precision (mAP). Detailed comparisons with state-of-the-art methods like Visual Phrases and Joint CNN models highlight the superiority of the proposed model, particularly:

  • Predicate Detection: The model achieves significant improvement with a recall at 100 of 47.87, demonstrating effectiveness in identifying predicates between object pairs when objects are known.
  • Phrase Detection and Relationship Detection: Improved performance compared with the Visual Phrases and Joint CNN baselines, with a recall at 100 of up to 17.03, demonstrating the model's ability to both localize and label relationships.
  • Zero-shot Learning: The model also excels in detecting unseen relationships by leveraging semantic similarities from the embedding space, showcasing its generalization capabilities.
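
For reference, recall at K measures the fraction of ground-truth relationships recovered among an image's top-K scoring predictions. The sketch below illustrates that metric on hypothetical triples, ignoring the bounding-box overlap requirement that phrase and relationship detection additionally impose.

```python
def recall_at_k(predictions, ground_truth, k=100):
    """predictions: list of (score, (subject, predicate, object)) for one image;
    ground_truth: set of annotated (subject, predicate, object) triples.
    Returns the fraction of ground-truth triples recovered in the top-k."""
    top_k = {triple for _, triple in sorted(predictions, reverse=True)[:k]}
    hits = sum(1 for triple in ground_truth if triple in top_k)
    return hits / max(len(ground_truth), 1)

# Toy example with two annotated relationships in one image.
preds = [(0.9, ("man", "riding", "bicycle")), (0.3, ("man", "pushing", "bicycle"))]
gt = {("man", "riding", "bicycle"), ("dog", "next to", "man")}
print(recall_at_k(preds, gt, k=1))  # 0.5: one of two ground-truth triples recovered
```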

Practical Implications and Future Directions

This paper presents substantial implications both practically and theoretically:

  • Scalability: The model can efficiently scale to predict thousands of relationship types, addressing the data scarcity issue inherent in detailed visual relationship tasks.
  • Enhanced Image Retrieval: Incorporating relationship understanding into content-based image retrieval improves retrieval effectiveness, as evidenced by higher recall and better retrieval-quality metrics (a rough sketch of relationship-based ranking follows this list).
  • Further Exploration: The method's embedding space creates pathways for exploring other vision-language tasks, emphasizing the potential to generalize across other multi-modal AI challenges.
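
As a rough illustration (not the paper's exact retrieval procedure), gallery images could be ranked by the confidence of their best detection matching a query triple:

```python
def rank_images_by_relationship(query, image_detections):
    """query: a (subject, predicate, object) triple.
    image_detections: dict mapping image id -> list of (score, triple) detections.
    Returns image ids sorted so images containing the query relationship
    with high confidence come first."""
    def best_match(dets):
        return max((s for s, t in dets if t == query), default=0.0)
    return sorted(image_detections,
                  key=lambda img: best_match(image_detections[img]),
                  reverse=True)

gallery = {
    "img_1": [(0.8, ("man", "riding", "bicycle"))],
    "img_2": [(0.6, ("man", "pushing", "bicycle"))],
}
print(rank_images_by_relationship(("man", "riding", "bicycle"), gallery))
```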

Conclusion

The proposed visual relationship detection model effectively integrates visual and language priors to enhance its predictive capabilities. By decoupling the learning of objects and predicates and leveraging semantic similarities, the model overcomes the scarcity of training data and scales to a large number of relationships. The practical benefits in image retrieval and zero-shot learning further underline the model's robustness and utility in real-world applications. This work marks a significant step towards more nuanced and scalable relationship detection in computer vision.