Constructing Phrase-level Semantic Labels to Form Multi-Grained Supervision for Image-Text Retrieval (2109.05523v1)

Published 12 Sep 2021 in cs.CV and cs.CL

Abstract: Existing research on image-text retrieval mainly relies on sentence-level supervision to distinguish matched from mismatched sentences for a query image. However, semantic mismatch between an image and a sentence usually occurs at a finer grain, i.e., at the phrase level. In this paper, we explore introducing additional phrase-level supervision to better identify mismatched units in the text. In practice, multi-grained semantic labels are automatically constructed for a query image at both the sentence level and the phrase level. We construct text scene graphs for the matched sentences and extract entities and triples as the phrase-level labels. To integrate sentence-level and phrase-level supervision, we propose the Semantic Structure Aware Multimodal Transformer (SSAMT) for multimodal representation learning. Inside SSAMT, we use different kinds of attention mechanisms to enforce interactions among multi-grained semantic units on both the vision and language sides. For training, we propose multi-scale matching losses from both global and local perspectives and penalize mismatched phrases. Experimental results on MS-COCO and Flickr30K show the effectiveness of our approach compared with some state-of-the-art models.
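
The label construction described in the abstract can be pictured with a small sketch. This is an illustration only, not the authors' implementation: the abstract does not name the scene graph parser, so the graphs below are written by hand, and the function name build_labels and the dictionary layout are hypothetical. It collects the matched captions as sentence-level labels, and the entities and triples from their scene graphs as phrase-level labels.

```python
def build_labels(sentences, scene_graphs):
    """Return sentence-level and phrase-level labels for one query image.

    sentences: captions matched to the image.
    scene_graphs: one dict per caption with "entities" (list of str) and
    "triples" (list of (subject, relation, object) tuples).
    This input format is an assumption made for the sketch.
    """
    entities = []
    triples = []
    for graph in scene_graphs:
        # Keep each entity once, in first-seen order.
        for entity in graph["entities"]:
            if entity not in entities:
                entities.append(entity)
        # Flatten each triple into a short phrase and deduplicate.
        for subj, rel, obj in graph["triples"]:
            phrase = f"{subj} {rel} {obj}"
            if phrase not in triples:
                triples.append(phrase)
    return {"sentence": list(sentences), "entity": entities, "triple": triples}


# Hand-written toy input standing in for parser output.
captions = ["a man rides a brown horse", "a man sits on a horse"]
graphs = [
    {"entities": ["man", "horse"], "triples": [("man", "rides", "horse")]},
    {"entities": ["man", "horse"], "triples": [("man", "sits on", "horse")]},
]
labels = build_labels(captions, graphs)
print(labels)
```

In a setup like this, the entity and triple phrases would serve as the finer-grained positives for the image, alongside the full captions.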

Authors (7)

Zhihao Fan (28 papers)
Zhongyu Wei (98 papers)
Zejun Li (18 papers)
Siyuan Wang (74 papers)
Haijun Shan (8 papers)
Xuanjing Huang (288 papers)
Jianqing Fan (165 papers)

Citations (11)

