
Language-only Efficient Training of Zero-shot Composed Image Retrieval

(2312.01998)
Published Dec 4, 2023 in cs.CV and cs.IR

Abstract

The composed image retrieval (CIR) task takes a composed query of an image and text, aiming to retrieve images relevant to both conditions. Conventional CIR approaches need a training dataset composed of triplets of query image, query text, and target image, which is very expensive to collect. Several recent works have explored the zero-shot (ZS) CIR paradigm to tackle this issue without pre-collected triplets. However, existing ZS-CIR methods show limited backbone scalability and generalizability due to the lack of diversity in the input texts during training. We propose a novel CIR framework that uses only language for its training. Our LinCIR (Language-only training for CIR) can be trained with text datasets alone via a novel self-supervision named self-masking projection (SMP). We project the text latent embedding into the token embedding space and construct a new text by replacing the keyword tokens of the original text with that projection. Then, we train the new and original texts to have the same latent embedding vector. With this simple strategy, LinCIR is surprisingly efficient and highly effective; LinCIR with a CLIP ViT-G backbone is trained in 48 minutes and shows the best ZS-CIR performance on four different CIR benchmarks (CIRCO, GeneCIS, FashionIQ, and CIRR), even outperforming a supervised method on FashionIQ. Code is available at https://github.com/navervision/lincir

Figure: Training time (hours) vs. zero-shot composed image retrieval (ZS-CIR) performance.

Overview

  • The paper introduces LinCIR, a novel framework for zero-shot composed image retrieval (ZS-CIR) that utilizes only language data for training, thereby eliminating the need for labor-intensive triplet datasets.

  • LinCIR leverages a self-supervision technique called Self-Masking Projection (SMP): a caption's latent embedding is projected into the token embedding space and used to replace the caption's keyword tokens, and the original and modified texts are trained to share the same latent embedding vector.

  • The method's superior performance, efficiency, and scalability are demonstrated across multiple CIR benchmarks, outperforming existing ZS-CIR strategies and even surpassing some supervised methods.


Introduction

The task of Composed Image Retrieval (CIR) aims to retrieve images that satisfy query conditions composed of both image and text inputs. Traditional CIR methods require a training dataset of (query image, query text, target image) triplets, which is expensive and labor-intensive to collect. Recent work has investigated the zero-shot composed image retrieval (ZS-CIR) paradigm, which eliminates the need for pre-collected triplets. These ZS-CIR approaches, however, have shown limited scalability and generalizability due to the lack of diversity in the texts seen during training.

LinCIR (Language-only training for CIR) is a novel framework that uses only language data for training. It relies on a self-supervision technique called Self-Masking Projection (SMP): a caption's latent embedding is projected into the token embedding space, the caption's keyword tokens are replaced with that projection, and the new and original texts are trained to have the same latent embedding vector. This simple strategy makes LinCIR remarkably efficient and effective, yielding state-of-the-art zero-shot results on various CIR benchmarks.

Methodology

Self-Masking Projection (SMP)

LinCIR is trained with a language-only self-supervision technique called Self-Masking Projection (SMP). Instead of the common practice of projecting image embeddings, SMP projects the text latent embedding into the token embedding space. During training, the keyword tokens of the input text are replaced by this projected embedding, and the loss minimizes the mean squared error (MSE) between the latent embeddings of the original and keyword-masked texts. Keywords are defined as runs of consecutive adjectives and nouns; because the projected token stands in for these semantically central words, the modified text can retain the original's primary semantic content. A minimal sketch of both steps follows.
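First, a hedged sketch of keyword selection. The paper only specifies "consecutive adjectives and nouns"; the spaCy tagger and span logic here are illustrative assumptions:

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # any POS tagger would do

def keyword_spans(text):
    """Character spans of maximal runs of consecutive adjectives/nouns."""
    spans, run = [], []
    for tok in nlp(text):
        if tok.pos_ in {"ADJ", "NOUN"}:
            run.append(tok)
        else:
            if run:
                spans.append((run[0].idx, run[-1].idx + len(run[-1].text)))
            run = []
    if run:
        spans.append((run[0].idx, run[-1].idx + len(run[-1].text)))
    return spans

# e.g. keyword_spans("a cute cat sitting on a wooden chair")
# -> spans covering "cute cat" and "wooden chair"
```

And a minimal sketch of one SMP training step in PyTorch. `TextEncoderStub` and `phi` are hypothetical stand-ins for a frozen CLIP text tower (split into its token-embedding layer and the rest of the encoder) and the learnable projection module; they are not the authors' code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextEncoderStub(nn.Module):
    """Stand-in for a frozen CLIP text tower, split into a token-embedding
    layer and a backbone that pools token embeddings into one latent vector."""
    def __init__(self, vocab_size=49408, tok_dim=512, lat_dim=768):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, tok_dim)
        self.backbone = nn.Linear(tok_dim, lat_dim)  # placeholder for the transformer

    def embed_tokens(self, ids):           # (B, T) -> (B, T, tok_dim)
        return self.tok_emb(ids)

    def encode_from_embeds(self, embeds):  # (B, T, tok_dim) -> (B, lat_dim)
        return self.backbone(embeds).mean(dim=1)

encoder = TextEncoderStub().eval()         # CLIP stays frozen during SMP training
for p in encoder.parameters():
    p.requires_grad_(False)

# phi: the learnable projection from latent space to token-embedding space.
phi = nn.Sequential(nn.Linear(768, 512), nn.GELU(), nn.Linear(512, 512))
optimizer = torch.optim.AdamW(phi.parameters(), lr=1e-4)

def smp_step(ids, keyword_mask):
    """One SMP step. ids: (B, T) caption token ids; keyword_mask: (B, T) bool
    tensor, True at keyword-token positions (from the POS tagger above)."""
    with torch.no_grad():
        z = encoder.encode_from_embeds(encoder.embed_tokens(ids))  # original latent
    pseudo = phi(z)                                    # (B, tok_dim) projected token
    embeds = encoder.embed_tokens(ids)
    # Build the keyword-masked text: every keyword token becomes the pseudo token.
    embeds = torch.where(keyword_mask.unsqueeze(-1),
                         pseudo.unsqueeze(1).expand_as(embeds), embeds)
    z_masked = encoder.encode_from_embeds(embeds)      # latent of the masked text
    loss = F.mse_loss(z_masked, z)                     # pull the two latents together
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```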

Random Noise Addition Strategy

To mitigate the modality gap between textual and visual embeddings, LinCIR introduces a noise addition strategy. Unlike simpler approaches that add plain Gaussian noise, LinCIR samples noise as the elementwise product $\mathcal{N}(0, 1) \times \text{Unif}(0, 1)$, which produces a diverse range of noise norms. This addresses the dimensionality issues and the insufficient diversity of simple Gaussian noise, helping the projection module generalize to visual embeddings; a sketch follows.
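A minimal sketch of the noise sampling. The placement (perturbing the text latent before it is fed to the projection module during training) is an assumption consistent with the stated goal of bridging the modality gap:

```python
import torch

def add_training_noise(z: torch.Tensor) -> torch.Tensor:
    """z: (B, D) text latent embeddings. Elementwise N(0,1) * Unif(0,1) noise
    gives a wider spread of noise norms than plain Gaussian noise."""
    return z + torch.randn_like(z) * torch.rand_like(z)
```

Under this assumption, an SMP step would compute `phi(add_training_noise(z))` instead of `phi(z)`, so that at test time the projection is less sensitive to the text-vs-image offset in CLIP's embedding space.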

Results

LinCIR demonstrates exceptional performance across multiple benchmarks, outperforming other ZS-CIR strategies such as Pic2Word and SEARLE. Key numerical results include:

  • On the CIRCO benchmark, LinCIR achieved leading scores across all metrics (e.g., mAP@5 of 19.71 with the ViT-G backbone).
  • On GeneCIS, LinCIR outperformed other methods in R@K metrics, especially in tasks focused on attributes.
  • For FashionIQ, LinCIR even surpassed a state-of-the-art supervised method.
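For context, these benchmark numbers come from composed retrieval at test time, where the projection module is applied to visual embeddings. A hedged sketch of the standard textual-inversion ZS-CIR inference pipeline (as used by Pic2Word- and SEARLE-style methods; LinCIR's exact prompt template may differ), reusing the stand-ins from the Methodology sketches:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def compose_query(image_latent, cond_ids, pseudo_pos, encoder, phi):
    """image_latent: (1, lat_dim) CLIP embedding of the query image.
    cond_ids: (1, T) token ids of a prompt containing a placeholder token,
    e.g. "a photo of [$] that is red"; pseudo_pos: index of that placeholder."""
    embeds = encoder.embed_tokens(cond_ids).clone()
    embeds[:, pseudo_pos] = phi(image_latent)      # inject the projected image token
    return encoder.encode_from_embeds(embeds)      # composed query embedding

@torch.no_grad()
def retrieve_topk(query_emb, gallery_embs, k=5):
    """Rank precomputed CLIP image embeddings of the gallery by cosine similarity."""
    sims = F.cosine_similarity(query_emb, gallery_embs)  # (N,)
    return sims.topk(k).indices
```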

Implications and Future Directions

LinCIR's framework has significant implications for image retrieval and vision-language models. By training on language only, LinCIR shrinks the required training data and training time (48 minutes with a CLIP ViT-G backbone), markedly improving efficiency and scalability. It also addresses the limitations of previous ZS-CIR models, showing superior adaptability to diverse and complex textual queries.

Future developments could explore further optimizations in noise addition strategies or investigate additional self-supervision techniques. Moreover, given the versatility exhibited by LinCIR, integrating this framework with other vision-language models (such as BLIP) could further enhance cross-modal retrieval capabilities and expand research into more diverse applications and domains.

Conclusion

LinCIR establishes an efficient approach to zero-shot composed image retrieval by employing self-supervision through language-only data and introducing an innovative random noise addition strategy. The method's scalability and marked performance improvements across various benchmarks underline its potential as a robust framework for vision-language tasks.
