Abstract

Data selection has emerged as a core issue for large-scale visual-language model pretraining (e.g., CLIP), particularly with noisy web-curated datasets. Three main data selection approaches are: (1) leveraging external non-CLIP models to aid data selection, (2) training new CLIP-style embedding models that are more effective at selecting high-quality data than the original OpenAI CLIP model, and (3) designing better metrics or strategies universally applicable to any CLIP embedding without requiring specific model properties (e.g., CLIPScore is one popular metric). While the first two approaches have been extensively studied, the third remains under-explored. In this paper, we advance the third approach by proposing two new methods. Firstly, instead of classical CLIP scores that only consider the alignment between two modalities from a single sample, we introduce negCLIPLoss, a CLIP loss-inspired method that adds the alignment between one sample and its contrastive pairs as an extra normalization term for better quality measurement. Secondly, when downstream tasks are known, we propose a new norm-based metric, NormSim, to measure the similarity between pretraining data and target data. We test our methods on the data selection benchmark, DataComp~\cite{gadre2023datacomp}. Compared to the best baseline using only OpenAI's CLIP-L/14, our methods achieve a 5.3\% improvement on ImageNet-1k and a 2.8\% improvement on 38 downstream evaluation tasks. Moreover, both negCLIPLoss and NormSim are compatible with existing techniques. By combining our methods with the current best methods DFN~\cite{fang2023data} and HYPE~\cite{kim2024hype}, we can boost average performance on downstream tasks by 0.9\%, achieving a new state-of-the-art.

negCLIPLoss improves on CLIPScore by adding a normalization term over contrastive pairs, computed with a teacher CLIP model.

Overview

  • The paper addresses improving data selection during training of large-scale visual-language models, focusing on quality and relevance of data, using novel metrics universally applicable to any CLIP embedding.

  • It introduces two innovative methods: negCLIPLoss, which refines the CLIP loss function to reduce bias, and NormSim, a norm-based similarity metric enhancing data filtering performance for relevant downstream tasks.

  • Comprehensive experiments using the DataComp benchmark demonstrate significant improvements in data quality estimation and overall performance, showcasing the effectiveness of these methods in multimodal contrastive learning.

Overview of "CLIPLoss and Norm-Based Data Selection Methods for Multimodal Contrastive Learning"

The paper titled "CLIPLoss and Norm-Based Data Selection Methods for Multimodal Contrastive Learning" addresses the critical issue of data selection during the training of large-scale visual-language models, specifically in the context of the CLIP model. This work is significant because the quality and relevance of data used for training can substantially impact the performance of such models, especially when dealing with noisy, web-curated datasets.

Introduction

The authors identify three primary approaches for data selection in the context of large-scale visual-language models:

  1. Using external, non-CLIP models to aid in data selection.
  2. Training new CLIP-style embedding models that improve data selection efficacy compared to the original CLIP model by OpenAI.
  3. Designing improved metrics or strategies that are universally applicable to any CLIP embedding without needing specific model properties.

While the first two approaches have been extensively studied, this paper focuses on the third approach, which has been relatively under-explored.

Methodology

The paper introduces two novel methods: negCLIPLoss and NormSim.

negCLIPLoss:

NegCLIPLoss is derived from the standard CLIP loss function. The key idea is to refine the traditional CLIPScore metric, which measures the cosine similarity between visual and language embeddings of the same sample. The negCLIPLoss incorporates an additional normalization term to account for consistency across contrastive pairs. This refinement aims to mitigate biases present in the CLIP scores, resulting in a more accurate measure of data quality.

The computation of negCLIPLoss is detailed as follows:
\[
\text{negCLIPLoss}(x_i^{vl}) = -\frac{\tau}{K}\sum_{k=1}^{K}\ell_{B_k}(x_i^{vl}),
\]
where \(\ell_{B_k}\) represents the standard CLIP loss of sample \(x_i^{vl}\) computed over a batch \(B_k\) sampled from the training data, \(\tau\) is the temperature, and \(K\) is the number of sampled batches. The normalization term reduces bias by measuring the sample's alignment relative to its contrastive pairs rather than only to its own caption.
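
To make the formula concrete, here is a minimal sketch of how such a score could be computed with PyTorch over precomputed, L2-normalized teacher embeddings. The function names and the default values for `K`, `batch_size`, and `tau` are illustrative assumptions, not the authors' released implementation.

```python
# Hedged sketch of a negCLIPLoss-style score over precomputed embeddings.
# Assumptions: img_emb and txt_emb are L2-normalized teacher embeddings of
# shape (n, d); tau approximates the teacher model's temperature.
import torch

def clip_loss_for_sample(i, img_emb, txt_emb, batch_idx, tau):
    """Symmetric CLIP loss of sample i against one contrastive batch B_k (i included)."""
    pos = (img_emb[i] @ txt_emb[i]) / tau            # aligned image-text pair
    i2t = (img_emb[i] @ txt_emb[batch_idx].T) / tau  # image i vs. all batch texts
    t2i = (img_emb[batch_idx] @ txt_emb[i]) / tau    # all batch images vs. text i
    return 0.5 * ((torch.logsumexp(i2t, 0) - pos) + (torch.logsumexp(t2i, 0) - pos))

def neg_clip_loss(i, img_emb, txt_emb, K=10, batch_size=512, tau=0.01):
    """Negative CLIP loss averaged over K random batches that contain sample i;
    higher values indicate better-quality pairs, as with CLIPScore."""
    n = img_emb.shape[0]
    losses = []
    for _ in range(K):
        batch_idx = torch.randperm(n)[: batch_size - 1]
        batch_idx = torch.unique(torch.cat([batch_idx, torch.tensor([i])]))
        losses.append(clip_loss_for_sample(i, img_emb, txt_emb, batch_idx, tau))
    return -tau * torch.stack(losses).mean()

if __name__ == "__main__":
    torch.manual_seed(0)
    img = torch.nn.functional.normalize(torch.randn(2048, 512), dim=-1)  # stand-in embeddings
    txt = torch.nn.functional.normalize(torch.randn(2048, 512), dim=-1)
    print(neg_clip_loss(0, img, txt))
```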

Experiments show that negCLIPLoss consistently outperforms the traditional CLIPScore across various dataset sizes and evaluation metrics.

NormSim:

NormSim is a norm-based similarity metric designed to measure the relevance of training data with respect to known downstream tasks. This metric is particularly useful when the downstream task distribution is accessible. NormSim evaluates the vision-only similarity between a sample and the target data distribution, defined as:
\[
\text{NormSim}_p(X_\text{target}, x) := \left\| \bar{f}_v(X_\text{target}^{v})\, \bar{f}_v(x^{v}) \right\|_p,
\]
where \(\bar{f}_v\) is the vision encoder, \(\bar{f}_v(X_\text{target}^{v})\) stacks the embeddings of the target images, \(x^{v}\) is the image part of the sample, and \(\|\cdot\|_p\) denotes the \(p\)-norm.
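
As a concrete illustration, a NormSim-style score can be computed from precomputed, L2-normalized vision embeddings as in the sketch below; the function name, the default value of p, and the random stand-in embeddings in the demo are assumptions for illustration only.

```python
# Hedged sketch of a NormSim-style relevance score.
# Assumptions: rows of target_img_emb and the vector x_img_emb are L2-normalized
# vision embeddings produced by the same encoder f_v.
import torch

def norm_sim(target_img_emb, x_img_emb, p=2.0):
    """p-norm of the inner products between a candidate's vision embedding and
    every target-set vision embedding (p=float('inf') recovers NormSim_inf)."""
    sims = target_img_emb @ x_img_emb            # (n_target,) cosine similarities
    return torch.linalg.vector_norm(sims, ord=p)

if __name__ == "__main__":
    torch.manual_seed(0)
    target = torch.nn.functional.normalize(torch.randn(1000, 512), dim=-1)  # stand-in target set
    cand = torch.nn.functional.normalize(torch.randn(512), dim=-1)           # stand-in candidate
    print(norm_sim(target, cand, p=2.0), norm_sim(target, cand, p=float("inf")))
```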

The experiments incorporate different downstream tasks for target data, such as the ImageNet-1K training set and the combined training data from 24 downstream tasks. NormSim, particularly when combined with negCLIPLoss, significantly enhances data filtering performance compared to other state-of-the-art methods.
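One simple way to combine the two signals is to filter the pool by the quality score first and then rank the survivors by target relevance; the sketch below illustrates this idea with assumed keep-fractions, which need not match the thresholds tuned in the paper.

```python
# Hedged sketch of a two-stage selection: quality filter, then relevance ranking.
# The keep fractions below are illustrative assumptions, not the paper's settings.
import numpy as np

def select_subset(quality_scores, relevance_scores, quality_frac=0.3, final_frac=0.15):
    """quality_scores: negCLIPLoss-style scores; relevance_scores: NormSim-style scores."""
    n = len(quality_scores)
    # Stage 1: keep the highest-quality fraction of the pool.
    by_quality = np.argsort(quality_scores)[::-1][: int(quality_frac * n)]
    # Stage 2: among the survivors, keep the samples most similar to the target distribution.
    by_relevance = by_quality[np.argsort(relevance_scores[by_quality])[::-1]]
    return by_relevance[: int(final_frac * n)]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q, r = rng.normal(size=10_000), rng.normal(size=10_000)
    print(select_subset(q, r).shape)  # indices of the selected subset
```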

Experimental Results

The paper presents a comprehensive evaluation using the DataComp benchmark. Key findings include:

  • negCLIPLoss enhances data quality estimation over traditional CLIPScore; together with NormSim, it outperforms the best baseline using only OpenAI's CLIP-L/14 by 5.3% on ImageNet-1K and by 2.8% on average across 38 downstream tasks.
  • Combining negCLIPLoss and NormSim yields superior performance, demonstrating their complementary strengths.
  • NegCLIPLoss can be universally applied across different CLIP models, such as OAI CLIP-L/14, OAI CLIP-B/32, and DFN-P.

Implications and Future Perspective

This research highlights the versatility and effectiveness of optimized metrics like negCLIPLoss and NormSim in improving data selection for multimodal contrastive learning models. Such universal and resource-efficient strategies are crucial, given the exponentially growing scale of training datasets and computational costs.

A notable insight is that models trained exclusively with CLIP embeddings (D1 category) can achieve performance metrics comparable to those employing external models (D3 category). This suggests that future work might focus on further refining CLIP-based selection methods, potentially reducing dependence on external data or models.

Future research could explore:

  • Incorporating dynamic sampling strategies, such as NormSim-D, when downstream task information is incomplete.
  • Investigating whether the proposed methods synergize with other advanced filtering techniques, such as utilizing state-of-the-art pre-trained embeddings for calculating normalization in negCLIPLoss.
  • Extending these methods to even larger datasets and evaluating their generalizability across more diverse downstream tasks.

In conclusion, the proposed methods—negCLIPLoss and NormSim—offer a robust framework for enhancing data selection in multimodal contrastive learning, paving the way for more efficient and scalable training of large-scale visual-language models. Their universal applicability makes them valuable tools for the growing field of multimodal AI.
