- The paper introduces a dual-path model that uses a Graph Convolutional Network (GCN) to model text and capture complex word relationships.
- It constructs text graphs by connecting words to their k nearest neighbors under word2vec cosine similarity and aligns the two modalities with a pairwise similarity loss.
- Experiments demonstrate up to 35% improvement in text-query-image retrieval, outperforming traditional CNN-based models.
Modeling Text with Graph Convolutional Network for Cross-Modal Information Retrieval
This essay explores the paper "Modeling Text with Graph Convolutional Network for Cross-Modal Information Retrieval" (arXiv 1802.00985). The research aims to improve cross-modal information retrieval with a model that applies Graph Convolutional Networks (GCNs) to textual data while using a conventional neural network for image data.
Cross-modal information retrieval is the task of retrieving data of one modality using a query from another modality—here, images and text. The central challenge is learning representations that allow different modalities to be compared in a common semantic space. Traditional approaches typically rely on Convolutional Neural Networks (CNNs) for image feature extraction while representing text with word-level features such as word2vec or bag-of-words.

Figure 1: Comparison of classical cross-modal retrieval models with the proposed model. (a) Classical models use feature vectors to represent grid-structured multimodal data; (b) the proposed model handles both irregular graph-structured data and regular grid-structured data simultaneously.
Methodology
The proposed model is designed as a dual-path neural network aimed at learning representations from both textual and visual data to achieve cross-modal retrieval.
Text Modeling with GCN
Traditional models represent word semantics with vector-space methods such as word2vec, which treat words independently and ignore the relationships between them. This paper instead adopts a graph representation of text: each word is a vertex, and edges are constructed explicitly by connecting each word to its k nearest neighbors under word2vec cosine similarity, so that both structure and semantics are encoded in the graph.
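As a concrete illustration, the sketch below builds such a k-nearest-neighbor word graph from a word2vec embedding matrix. The function name, neighborhood size, and binary edge weights are illustrative choices, not details taken from the paper.

```python
import numpy as np

def build_knn_graph(embeddings: np.ndarray, k: int = 8) -> np.ndarray:
    """Build a symmetric k-NN adjacency matrix from word2vec vectors.

    embeddings: (V, d) array, one row per vocabulary word.
    Returns a (V, V) binary adjacency matrix without self-loops.
    """
    # Cosine similarity between every pair of word vectors.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)
    sim = unit @ unit.T
    np.fill_diagonal(sim, -np.inf)  # exclude self-matches

    # Connect each word to its k most similar words.
    adj = np.zeros_like(sim)
    neighbours = np.argsort(-sim, axis=1)[:, :k]
    rows = np.repeat(np.arange(sim.shape[0]), k)
    adj[rows, neighbours.ravel()] = 1.0

    # Symmetrize: keep an edge if either endpoint selected the other.
    return np.maximum(adj, adj.T)
```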
GCNs extract meaningful representations from these graphs by performing convolution-like operations in the spectral domain. Specifically, the text path applies two layers of graph convolution, each followed by a non-linear activation, and ultimately maps the text graph features into a semantic space shared with the image features.
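The following is a minimal sketch of a two-layer graph convolution in the spirit of the text path, using the widely adopted renormalized propagation rule H' = σ(ÂHW). The layer sizes, ReLU activations, and mean pooling over words are assumptions made for illustration; the paper's exact spectral filter may differ.

```python
import torch
import torch.nn as nn

class TextGCN(nn.Module):
    """Two graph-convolution layers mapping a text graph into the shared space."""

    def __init__(self, in_dim: int, hidden_dim: int, out_dim: int):
        super().__init__()
        self.w1 = nn.Linear(in_dim, hidden_dim, bias=False)
        self.w2 = nn.Linear(hidden_dim, out_dim, bias=False)

    @staticmethod
    def normalize(adj: torch.Tensor) -> torch.Tensor:
        # Renormalization trick: A_hat = D^{-1/2} (A + I) D^{-1/2}.
        a = adj + torch.eye(adj.size(0), device=adj.device)
        d_inv_sqrt = a.sum(dim=1).pow(-0.5)
        return d_inv_sqrt.unsqueeze(1) * a * d_inv_sqrt.unsqueeze(0)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x: (V, in_dim) per-word features; adj: (V, V) word-graph adjacency.
        a_hat = self.normalize(adj)
        h = torch.relu(self.w1(a_hat @ x))   # first graph convolution
        h = torch.relu(self.w2(a_hat @ h))   # second graph convolution
        return h.mean(dim=0)                 # pool word vertices to one text vector
```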
Image Representation with Neural Networks
For image data, the pipeline starts from pre-trained, off-the-shelf CNN features. These are passed through fully connected layers whose output dimension mirrors that of the text representation, ensuring that both modalities are comparable in the final semantic space.
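A minimal sketch of this image path is shown below, assuming 4096-dimensional off-the-shelf VGG-style features as input; the backbone, dimensions, and layer count are illustrative assumptions rather than the paper's exact configuration.

```python
import torch.nn as nn

class ImageNN(nn.Module):
    """Fully connected layers projecting pre-extracted image features
    into the same semantic space as the text GCN output."""

    def __init__(self, feat_dim: int = 4096, out_dim: int = 1024):
        super().__init__()
        self.project = nn.Sequential(
            nn.Linear(feat_dim, out_dim),
            nn.ReLU(),
            nn.Linear(out_dim, out_dim),  # matches the text path's output size
        )

    def forward(self, features):
        # features: (batch, feat_dim) activations from a pre-trained CNN.
        return self.project(features)
```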
Figure 2: The structure of the proposed model, a dual-path neural network: a text Graph Convolutional Network (text GCN, top) and an image neural network (image NN, bottom).
Objective Function and Training
The training objective maximizes the similarity of matching text-image pairs and minimizes that of non-matching pairs through a pairwise similarity loss that balances the mean similarity score and its variance for both groups of pairs. Balancing the mean against the variance encourages robust representations that generalize well across datasets.
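A hedged sketch of such a loss is given below: it rewards a large gap between the mean similarities of matching and non-matching pairs while penalizing the variance within each group. The use of cosine similarity, the variance weight, and the batch layout are assumptions for illustration, not the paper's published formulation.

```python
import torch

def pairwise_similarity_loss(text_emb: torch.Tensor,
                             img_emb: torch.Tensor,
                             match: torch.Tensor,
                             var_weight: float = 0.5) -> torch.Tensor:
    """Mean/variance-balanced pairwise loss (illustrative sketch).

    text_emb, img_emb: (N, d) embeddings of N text-image pairs.
    match: (N,) tensor, 1.0 for matching pairs, 0.0 for non-matching.
    Assumes the batch contains at least two pairs of each kind.
    """
    sim = torch.cosine_similarity(text_emb, img_emb, dim=1)
    pos, neg = sim[match > 0.5], sim[match <= 0.5]

    # Push matching similarities up and non-matching ones down...
    mean_term = neg.mean() - pos.mean()
    # ...while keeping both score distributions tight (low variance).
    var_term = pos.var(unbiased=False) + neg.var(unbiased=False)
    return mean_term + var_weight * var_term
```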
Experimental Results
The model was evaluated on five datasets, including English Wikipedia and NUS-WIDE. It substantially outperformed existing models across tasks, with up to 35% improvement over the second-best model on text-query-image retrieval.
Figure 3: Precision-recall curves on the five datasets.
Practical and Theoretical Implications
This research establishes a new direction for cross-modal retrieval by integrating graph structures into text modeling, which had traditionally been vector-based. The approach provides a robust framework for handling irregular graph-structured data such as text and extends retrieval models into non-Euclidean data domains. Future work could explore end-to-end training of the feature extraction components to further improve retrieval.
Conclusion
By modeling text as graphs and leveraging GCNs, the proposed dual-path approach enables more effective cross-modal retrieval than traditional methodologies. Its ability to handle irregular data structures has direct implications for semantic understanding and information retrieval across diverse multimodal datasets.