Finetuned Multimodal Language Models Are High-Quality Image-Text Data Filters

(2403.02677)
Published Mar 5, 2024 in cs.CV and cs.CL

Abstract

We propose a novel framework for filtering image-text data by leveraging fine-tuned Multimodal Language Models (MLMs). Our approach outperforms predominant filtering methods (e.g., CLIPScore) by integrating the recent advances in MLMs. We design four distinct yet complementary metrics to holistically measure the quality of image-text data. A new pipeline is established to construct high-quality instruction data for fine-tuning MLMs as data filters. Compared with CLIPScore, our MLM filters produce more precise and comprehensive scores that directly improve the quality of filtered data and boost the performance of pre-trained models. We achieve significant improvements over CLIPScore on popular foundation models (i.e., CLIP and BLIP2) and various downstream tasks. Our MLM filter can generalize to different models and tasks, and be used as a drop-in replacement for CLIPScore. An additional ablation study is provided to verify our design choices for the MLM filter.

Overview

  • The paper introduces a novel method for filtering high-quality image-text pairs for Vision-Language Model (VLM) training using fine-tuned Multimodal Language Models (MLMs).

  • This technique outperforms existing methods such as CLIPScore by generating more precise and comprehensive quality scores for data filtering.

  • Fine-tuning MLMs as data filters involves constructing instruction data with proprietary models such as GPT-4 and GPT-4V, tailored to specific quality scoring tasks.

  • The effectiveness of the approach is demonstrated through significant performance improvements on the DataComp benchmark, showcasing the superior data filtering capability of fine-tuned MLMs.

Fine-Tuning Multimodal Language Models for High-Quality Image-Text Data Filtering

The performance of Vision-Language Models (VLMs) and Text-to-Image generation models largely depends on the quality of the image-text data they are trained on. However, web-crawled image-text data often contain noise, such as low-quality captions or images that do not match the corresponding text, creating a pressing need for effective data filtering techniques. To address this need, we introduce a novel approach that leverages fine-tuned Multimodal Language Models (MLMs) as data filters to select high-quality image-text pairs for VLM training.

Multimodal Language Models as Data Filters

In contrast to CLIPScore, which uses the CLIP model to estimate the cosine similarity between image and text embeddings for data quality assessment, our method leverages recent advances in MLMs for filtering. Our fine-tuned MLM filters generate precise and comprehensive quality scores, outperforming CLIPScore in identifying high-quality data that improves VLM performance.
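
For reference, the sketch below computes a CLIPScore-style quality signal: the cosine similarity between CLIP image and text embeddings, here via Hugging Face Transformers. The checkpoint name is illustrative, and the published CLIPScore additionally rescales the raw similarity.

```python
# Minimal sketch of the CLIPScore-style baseline: cosine similarity between
# CLIP image and text embeddings. The checkpoint is illustrative.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image: Image.Image, caption: str) -> float:
    """Return the cosine similarity between the image and caption embeddings."""
    inputs = processor(text=[caption], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                          attention_mask=inputs["attention_mask"])
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    return (img_emb @ txt_emb.T).item()
```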

Constructing High-Quality Instruction Data

To enable MLMs to accurately generate quality scores, we fine-tune them on specific quality scoring tasks. To construct the required instruction data for these tasks, we leverage proprietary models such as GPT-4 and GPT-4V, combined with state-of-the-art image captioning models such as LLaVA or ShareGPT4V, to create detailed text descriptions of images. This approach enables evaluating image-text pairs along four quality metrics: Image-Text Matching (ITM), Object Detail Fulfillment (ODF), Caption Text Quality (CTQ), and Semantic Understanding (SU).
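
The snippet below is a hedged sketch of how one such instruction example might be assembled: a detailed description from an open captioner stands in for the image, and the prompt asks a stronger model for a per-metric quality score. The prompt wording and the 0-100 scale are illustrative, not the paper's exact templates.

```python
# Hypothetical prompt templates for the four quality metrics; the wording
# and the scoring scale are assumptions for illustration only.
METRIC_PROMPTS = {
    "ITM": "Rate how well the caption matches the image content.",
    "ODF": "Rate how completely the caption covers fine-grained object details.",
    "CTQ": "Rate the grammatical and stylistic quality of the caption text.",
    "SU":  "Rate how much semantic understanding of the scene the caption conveys.",
}

def build_scoring_instruction(detailed_description: str, raw_caption: str, metric: str) -> str:
    """Format one instruction example asking for an integer quality score."""
    return (
        f"Image (described in text): {detailed_description}\n"
        f"Caption: {raw_caption}\n"
        f"Task ({metric}): {METRIC_PROMPTS[metric]} "
        f"Answer with an integer score from 0 to 100."
    )
```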

Fine-Tuning MLMs for Data Filtering

Through comprehensive ablation studies, we optimized the fine-tuning process for MLMs on multimodal instruction data tailored for scoring tasks. By combining scoring-task instructions with a mixture of instructions from other multimodal tasks, we ensure a diverse and rich training dataset. The MLMs are instruction-tuned on this mixed dataset, enhancing their ability to function as effective data filters.
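
As a rough illustration, the sketch below blends scoring instructions with general multimodal instructions before instruction tuning; the mixing ratio and sampling scheme are assumptions rather than the ablated values from the paper.

```python
import random

def mix_instruction_data(scoring_examples, general_examples, general_ratio=1.0, seed=0):
    """Blend quality-scoring instructions with general multimodal instructions.

    general_ratio sets how many general examples are drawn per scoring example;
    the default here is a placeholder, not the paper's tuned mixture.
    """
    rng = random.Random(seed)
    n_general = min(int(len(scoring_examples) * general_ratio), len(general_examples))
    mixed = list(scoring_examples) + rng.sample(general_examples, n_general)
    rng.shuffle(mixed)
    return mixed
```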

Evaluation on DataComp Benchmark

We evaluated our MLM filters using the DataComp benchmark, which involves pre-training VLMs on filtered datasets and assessing their performance across a suite of downstream tasks. The results demonstrate significant improvements over existing data filtering techniques, including CLIPScore, illustrating the efficacy of our proposed MLM filters in selecting high-quality image-text data for training VLMs.
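
In a DataComp-style setup, the filter assigns a score to every candidate pair and only the top-scoring fraction of the pool is kept for pre-training. The sketch below shows that selection step; the keep fraction is a placeholder, not the paper's chosen threshold.

```python
def filter_by_score(samples, score_fn, keep_fraction=0.3):
    """Keep the highest-scoring image-text pairs from the candidate pool.

    samples: iterable of (image, caption) pairs.
    score_fn: callable returning a quality score for one pair (e.g., an MLM
    filter, or clip_score above). keep_fraction is illustrative.
    """
    scored = sorted(samples, key=lambda pair: score_fn(*pair), reverse=True)
    cutoff = int(len(scored) * keep_fraction)
    return scored[:cutoff]
```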

Conclusion and Future Directions

Our work represents a significant step forward in the realm of data filtering for VLM training. By harnessing the power of fine-tuned MLMs, we offer a novel and effective solution for selecting high-quality, comprehensive image-text pairs. The success of our MLM filters on the DataComp benchmark highlights their potential as superior alternatives to existing data filtering methods. As the field continues to evolve, further research is encouraged to explore and expand upon the capabilities of MLMs in data quality assessment and filtering tasks.

The capability of our MLM filters to accurately evaluate the quality of image-text data from various perspectives and improve the performance of VLMs suggests a promising direction for future research in enhancing the robustness and effectiveness of pre-trained models.
