
Bi-directional Training for Composed Image Retrieval via Text Prompt Learning

(2303.16604)
Published Mar 29, 2023 in cs.CV, cs.IR, and cs.LG

Abstract

Composed image retrieval searches for a target image given a multi-modal user query comprising a reference image and modification text that describes the desired changes. Existing approaches to this challenging task learn a mapping from the (reference image, modification text) pair to an image embedding that is then matched against a large image corpus. One direction that has not yet been explored is the reverse query: what reference image, when modified as described by the text, would produce the given target image? In this work we propose a bi-directional training scheme that leverages such reversed queries and can be applied to existing composed image retrieval architectures with minimal changes, improving their performance. To encode the direction of the query we prepend a learnable token to the modification text and then finetune the parameters of the text embedding module; we make no other changes to the network architecture. Experiments on two standard datasets show that our approach improves over a baseline BLIP-based model that itself already achieves competitive performance. Our code is released at https://github.com/Cuberick-Orion/Bi-Blip4CIR.
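The core mechanism described in the abstract, a learnable prompt token prepended to the modification text to mark whether a query runs forward (reference to target) or reversed (target to reference), can be sketched roughly as follows. This is a minimal PyTorch illustration, not the authors' released implementation: the class name, the choice of one token per direction, and the 768-dimensional embedding size are assumptions made for the example.

```python
import torch
import torch.nn as nn

class DirectionalTextPrompt(nn.Module):
    """Hypothetical sketch: prepend a learnable direction token to the
    token embeddings of the modification text. Names and dimensions are
    illustrative, not taken from the Bi-Blip4CIR codebase."""

    def __init__(self, embed_dim: int = 768):
        super().__init__()
        # One learnable token per query direction:
        # forward (reference -> target) and reversed (target -> reference).
        self.forward_token = nn.Parameter(torch.randn(1, 1, embed_dim) * 0.02)
        self.reverse_token = nn.Parameter(torch.randn(1, 1, embed_dim) * 0.02)

    def forward(self, text_embeds: torch.Tensor, reversed_query: bool) -> torch.Tensor:
        # text_embeds: (batch, seq_len, embed_dim) embeddings of the
        # modification text produced by the text embedding module.
        batch = text_embeds.size(0)
        token = self.reverse_token if reversed_query else self.forward_token
        token = token.expand(batch, -1, -1)
        # Prepend the direction token; downstream layers see seq_len + 1 tokens.
        return torch.cat([token, text_embeds], dim=1)


if __name__ == "__main__":
    prompt = DirectionalTextPrompt(embed_dim=768)
    dummy = torch.randn(4, 20, 768)            # 4 queries, 20 text tokens each
    fwd = prompt(dummy, reversed_query=False)
    rev = prompt(dummy, reversed_query=True)
    print(fwd.shape, rev.shape)                # torch.Size([4, 21, 768]) twice
```

In such a setup, only the prompt parameters and the text embedding module would be updated during finetuning, which is consistent with the abstract's statement that no other changes are made to the network architecture.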
