Learning Graph Embeddings for Compositional Zero-shot Learning (2102.01987v3)

Published 3 Feb 2021 in cs.CV

Abstract: In compositional zero-shot learning, the goal is to recognize unseen compositions (e.g. old dog) of observed visual primitives states (e.g. old, cute) and objects (e.g. car, dog) in the training set. This is challenging because the same state can for example alter the visual appearance of a dog drastically differently from a car. As a solution, we propose a novel graph formulation called Compositional Graph Embedding (CGE) that learns image features, compositional classifiers, and latent representations of visual primitives in an end-to-end manner. The key to our approach is exploiting the dependency between states, objects, and their compositions within a graph structure to enforce the relevant knowledge transfer from seen to unseen compositions. By learning a joint compatibility that encodes semantics between concepts, our model allows for generalization to unseen compositions without relying on an external knowledge base like WordNet. We show that in the challenging generalized compositional zero-shot setting our CGE significantly outperforms the state of the art on MIT-States and UT-Zappos. We also propose a new benchmark for this task based on the recent GQA dataset. Code is available at: https://github.com/ExplainableML/czsl

Authors (4)

Muhammad Ferjad Naeem (21 papers)
Yongqin Xian (33 papers)
Federico Tombari (214 papers)
Zeynep Akata (144 papers)

Citations (128)

Summary

The paper proposes the Compositional Graph Embedding (CGE) method to effectively learn dependencies between states, objects, and their compositions.
The approach applies a Graph Convolutional Network over a dependency-aware graph of states, objects, and compositions, without relying on external resources.
Empirical results show significant performance gains on benchmarks like MIT-States, UT-Zappos, and C-GQA, highlighting robust generalization to novel compositions.

An Analysis of "Learning Graph Embeddings for Compositional Zero-shot Learning"

The paper "Learning Graph Embeddings for Compositional Zero-shot Learning" addresses Compositional Zero-shot Learning (CZSL), where the goal is to recognize novel compositions of visual primitives that were each observed during training. The primitives are states (e.g. old, cute) and objects (e.g. car, dog). The authors propose Compositional Graph Embedding (CGE), which exploits the dependencies between states, objects, and their compositions, both seen and unseen, through a graph structure. According to the paper, this improves generalization to unseen compositions and outperforms existing baselines on multiple benchmarks.

Contributions and Methodology

The method is built around a graph that encodes the relations among states, objects, and their compositions. This graph does not depend on external resources such as WordNet, so it can be constructed from the label set of the dataset alone. Key contributions of the work include:

Compositional Graph Embedding (CGE): This novel formulation applies Graph Convolutional Networks (GCNs) to learn the dependencies and embed the relationships between visual primitives and compositional classes effectively.
Dependency Structure Utilization: The graph exploits the intrinsic dependencies between compositional elements, which allows the system to regularize its learning process, producing a globally consistent embedding space that generalizes well to new unseen compositions.
New Benchmark - C-GQA: The authors also introduce a new dataset, C-GQA, derived from the GQA dataset. This benchmark is designed to test CZSL models more effectively with diverse compositional classes and cleaner annotations.
Empirical Evaluation: CGE improves on the state of the art on MIT-States, UT-Zappos, and the newly curated C-GQA dataset. Performance is evaluated in the generalized CZSL setting with AUC, best seen accuracy, best unseen accuracy, and best harmonic mean.
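
The graph described above can be illustrated with a small dependency-free sketch. It is not the authors' implementation (their PyTorch code is in the linked repository). The function names `build_graph` and `propagate`, the toy 2-d features, the choice to initialize a composition node as the average of its state and object features, and the plain mean aggregation are all assumptions made for illustration. A real GCN layer would also apply a learned weight matrix and a nonlinearity, which are omitted here.

```python
def build_graph(states, objects, pairs):
    """Return node names, a name->index map, and a dense adjacency matrix.

    Nodes are states, objects, and every composition in `pairs` (seen or
    unseen). Each composition links its state, its object, and itself.
    """
    nodes = list(states) + list(objects) + [f"{s} {o}" for s, o in pairs]
    index = {name: i for i, name in enumerate(nodes)}
    n = len(nodes)
    # Start from the identity so every node has a self-loop.
    adj = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

    def connect(a, b):
        adj[index[a]][index[b]] = 1.0
        adj[index[b]][index[a]] = 1.0

    for s, o in pairs:
        y = f"{s} {o}"
        connect(s, o)
        connect(s, y)
        connect(o, y)
    return nodes, index, adj


def propagate(adj, feats):
    """One mean-aggregation step: each node averages itself and its neighbours."""
    dim = len(feats[0])
    out = []
    for row in adj:
        degree = sum(row)
        out.append([sum(w * f[d] for w, f in zip(row, feats)) / degree
                    for d in range(dim)])
    return out


states, objects = ["old", "cute"], ["car", "dog"]
# "old dog" is treated as unseen at training time but is still a graph node.
pairs = [("old", "car"), ("cute", "dog"), ("old", "dog")]
nodes, index, adj = build_graph(states, objects, pairs)

# Toy primitive features (hypothetical stand-ins for word embeddings).
primitive = {"old": [1.0, 0.0], "cute": [0.0, 1.0],
             "car": [2.0, 0.0], "dog": [0.0, 2.0]}
feats = [primitive[name] for name in states + objects]
feats += [[(a + b) / 2 for a, b in zip(primitive[s], primitive[o])]
          for s, o in pairs]

updated = propagate(adj, feats)
```

Because the unseen composition "old dog" shares edges with "old" and "dog", its representation is updated from primitives that also appear in seen compositions. This is the route by which knowledge can transfer from seen to unseen classes.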

Numerical Results and Implications

The paper reports a test AUC of 6.5% on MIT-States, 33.5% on UT-Zappos, and 3.6% on C-GQA for CGE, outperforming the previous models it is compared against. The authors attribute this performance to the model's ability to transfer knowledge from seen to unseen compositions through its graph structure.
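
To make the reported metrics concrete, the following sketch computes AUC, best seen accuracy, best unseen accuracy, and best harmonic mean from a seen/unseen accuracy curve. It assumes the commonly used generalized CZSL protocol, in which a scalar calibration bias is swept over the scores of unseen compositions and each bias value yields one (seen accuracy, unseen accuracy) point. The function names and the toy curve are hypothetical and are not taken from the paper.

```python
def harmonic_mean(seen_acc, unseen_acc):
    """Harmonic mean of seen and unseen accuracy (0 if both are 0)."""
    total = seen_acc + unseen_acc
    return 0.0 if total == 0 else 2 * seen_acc * unseen_acc / total


def summarize_curve(points):
    """Summarize (seen_acc, unseen_acc) points, one per calibration bias."""
    pts = sorted(points)  # order by seen accuracy for the trapezoid rule
    auc = sum((x1 - x0) * (y0 + y1) / 2
              for (x0, y0), (x1, y1) in zip(pts, pts[1:]))
    return {
        "auc": auc,
        "best_seen": max(s for s, _ in pts),
        "best_unseen": max(u for _, u in pts),
        "best_hm": max(harmonic_mean(s, u) for s, u in pts),
    }


# Toy curve: accuracies are fractions in [0, 1], not results from the paper.
curve = [(0.0, 0.4), (0.2, 0.3), (0.4, 0.0)]
summary = summarize_curve(curve)
```

Under this definition the AUC is an area spanned by two accuracies, so its value is much smaller than either accuracy on its own. This is one reason single-digit AUC percentages can still represent competitive results.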

These results have both practical and theoretical implications. Practically, better CZSL models can benefit visual recognition systems in which class combinations are not exhaustively labeled, such as autonomous driving and robotics, where novel object-state combinations are routinely encountered. Theoretically, the paper supports the value of dependency-aware embeddings and of graph-based learning for compositional recognition.

Future Directions

Future work might explore deeper GCN architectures that draw on broader graph context without over-smoothing, as well as real-time applications in which the learned embeddings are tested in dynamic environments. Extending the graph definition to capture richer semantic relations could further improve performance on more complex datasets.

In conclusion, the paper shows that graph-based formulations can lead to more robust CZSL systems. The compositional graph, combined with the CGE architecture, is a step forward for visual recognition tasks in which unseen compositions must be predicted accurately.

GitHub

GitHub - ExplainableML/czsl: PyTorch CZSL framework containing GQA, the open-world setting, and the CGE and CompCos methods. (107 stars)