Abstract

Recent years have witnessed the rapid development of LLMs. Building on powerful LLMs, multi-modal LLMs (MLLMs) extend the modality from text to a broader spectrum of domains, attracting widespread attention due to their wider range of application scenarios. Because LLMs and MLLMs rely on vast amounts of model parameters and data to achieve emergent capabilities, the importance of data is receiving increasingly broad attention and recognition. Tracing and analyzing recent data-oriented works for MLLMs, we find that the development of models and data is not two separate paths but is instead interconnected. On the one hand, vaster and higher-quality data contribute to better performance of MLLMs; on the other hand, MLLMs can facilitate the development of data. The co-development of multi-modal data and MLLMs requires a clear view of 1) at which development stages of MLLMs specific data-centric approaches can be employed to enhance which capabilities, and 2) which capabilities models can leverage, and which roles they can play, to contribute to multi-modal data. To promote data-model co-development for the MLLM community, we systematically review existing works related to MLLMs from the data-model co-development perspective. A regularly maintained project associated with this survey is accessible at https://github.com/modelscope/data-juicer/blob/main/docs/awesome_llm_data.md.

Overview

  • This survey examines the interdependent development of multi-modal data and Multi-Modal LLMs (MLLMs), advocating for a co-development approach to enhance AI capabilities.

  • It details data-centric methods for scaling MLLMs and improving their usability, and discusses how MLLMs can reciprocally enhance data curation, covering strategies for data acquisition, augmentation, condensation, and ethical considerations.

  • The paper concludes with future research directions and emphasizes the synergistic relationship between multi-modal data and MLLMs, offering a visionary outlook for the field of AI.

The Synergy between Data and Multi-Modal LLMs: A Survey from Co-Development Perspective

This survey paper, titled "The Synergy between Data and Multi-Modal LLMs: A Survey from Co-Development Perspective," systematically examines the interplay between multi-modal data and Multi-Modal LLMs (MLLMs). It posits that the development trajectories of data and MLLMs are intertwined rather than parallel, emphasizing the need for a co-development approach. The paper provides a thorough review of recent works illustrating how data-centric approaches can enhance MLLM capabilities and how MLLMs can reciprocally enrich data curation processes. The survey is detailed and structured, covering both theoretical insights and practical implications, reflecting the authors' extensive research in the field.

Data Contributions for Scaling MLLMs

The initial part of the paper deals with the scalability of MLLMs, focusing on data acquisition, augmentation, and diversity. As MLLMs scale, their requirement for vast data volumes increases exponentially. The survey covers various data acquisition strategies, including web scraping, merging existing datasets, manual curation, and employing well-trained models like GPT-4V for automatic data generation. These approaches cater to different training stages of MLLMs, such as pretraining encoders and projectors, and fine-tuning for specific tasks.
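One of the acquisition strategies mentioned above, merging existing datasets, typically requires mapping heterogeneous source schemas into a unified one. The sketch below illustrates the idea; the field names (`image`, `caption`, `img_path`, `description`) and dataset names are illustrative assumptions, not taken from any specific dataset.

```python
def normalize_record(record, source):
    """Map a source-specific record onto a shared (image_path, text, source) schema."""
    # Hypothetical per-source field mappings; real datasets each have their own.
    field_map = {
        "dataset_a": ("image", "caption"),
        "dataset_b": ("img_path", "description"),
    }
    img_key, txt_key = field_map[source]
    return {"image_path": record[img_key], "text": record[txt_key], "source": source}


def merge_datasets(named_datasets):
    """Concatenate several datasets after normalizing each record."""
    merged = []
    for source, records in named_datasets.items():
        merged.extend(normalize_record(r, source) for r in records)
    return merged


merged = merge_datasets({
    "dataset_a": [{"image": "a/0.jpg", "caption": "a dog"}],
    "dataset_b": [{"img_path": "b/0.jpg", "description": "two cats"}],
})
print(len(merged))  # 2
```

In practice this normalization step is where license terms, modality coverage, and annotation formats of the source datasets must be reconciled.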

Data augmentation techniques, particularly those leveraging LLMs and MLLMs, are highlighted for their ability to enhance dataset diversity and balance. For example, using LLMs to rewrite text descriptions can significantly improve text diversity while maintaining semantic integrity. Additionally, specific data-centric methods for imbalanced datasets, such as generating negative samples to balance classes, are discussed in detail.
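The caption-rewriting idea above can be sketched as follows. The actual LLM call is stubbed out by `rewrite_with_llm` (a hypothetical placeholder using fixed templates); a real implementation would prompt an LLM to paraphrase while preserving meaning.

```python
def rewrite_with_llm(caption, n_variants=2):
    # Placeholder: a real implementation would query an LLM here with a
    # prompt such as "Paraphrase this caption, preserving its meaning."
    templates = ["A photo of {c}", "An image showing {c}"]
    return [t.format(c=caption) for t in templates[:n_variants]]


def augment_captions(dataset):
    """Attach paraphrased variants of each caption, keeping the original."""
    augmented = []
    for item in dataset:
        for variant in [item["text"], *rewrite_with_llm(item["text"])]:
            augmented.append({"image_path": item["image_path"], "text": variant})
    return augmented


data = [{"image_path": "0.jpg", "text": "a dog on grass"}]
print(len(augment_captions(data)))  # 3 (original + 2 variants)
```

The same loop structure applies to the class-balancing case: instead of paraphrases, the stubbed call would generate negative or minority-class samples.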

The survey also explores strategies for improving the scaling effectiveness of datasets. These include data condensation methods—data deduplication, filtering low-quality data, and constructing kernel sets—that reduce data redundancy and enhance information density. Effective data mixture approaches are outlined to mitigate or leverage distribution biases, thereby optimizing data proportions at both dataset and batch levels. The importance of data packing techniques for long-context support and better pretraining convergence is also emphasized. Finally, cross-modal alignment techniques, predominantly CLIP score-based methods and text-centric anchoring, are discussed to ensure correct matching between different modalities in datasets.
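Two of the condensation steps above, exact deduplication and CLIP score-based cross-modal filtering, can be sketched together. The embeddings here are toy vectors standing in for real image/text embeddings from a model such as CLIP; the 0.5 threshold is an illustrative assumption.

```python
import hashlib
import math


def dedup(pairs):
    """Drop pairs whose (image bytes, text) content hash was already seen."""
    seen, kept = set(), []
    for img_bytes, text, img_emb, txt_emb in pairs:
        key = hashlib.sha256(img_bytes + text.encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append((img_bytes, text, img_emb, txt_emb))
    return kept


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def filter_by_alignment(pairs, threshold=0.5):
    """Keep pairs whose image and text embeddings are sufficiently aligned."""
    return [p for p in pairs if cosine(p[2], p[3]) >= threshold]


pairs = [
    (b"img0", "a dog", [1.0, 0.0], [0.9, 0.1]),  # well aligned
    (b"img0", "a dog", [1.0, 0.0], [0.9, 0.1]),  # exact duplicate
    (b"img1", "a cat", [1.0, 0.0], [0.0, 1.0]),  # misaligned
]
kept = filter_by_alignment(dedup(pairs))
print(len(kept))  # 1
```

Real pipelines use near-duplicate detection (e.g. embedding similarity or perceptual hashing) rather than exact hashes, but the keep/drop structure is the same.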

Data Contributions for Usability Enhancement

Beyond scaling, the usability of MLLMs is crucial for practical applications. The paper categorizes enhancement techniques for instruction responsiveness, reasoning abilities, ethical considerations, and evaluation benchmarks. Instruction responsiveness can be improved through prompt design, high-quality ICL data, and human-behavior alignment datasets. These methods guide MLLMs to better understand and follow human instructions.
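The prompt-design and ICL-data ideas above amount to assembling a few high-quality demonstrations ahead of the query. The sketch below shows one such assembly; the `<image>` token and the overall template are illustrative assumptions, not a format prescribed by the survey.

```python
def build_icl_prompt(demos, query, image_token="<image>"):
    """Interleave (instruction, answer) demonstrations before the final query."""
    lines = []
    for instruction, answer in demos:
        lines.append(f"{image_token} {instruction}\nAnswer: {answer}")
    # The final block ends at "Answer:" so the model completes it.
    lines.append(f"{image_token} {query}\nAnswer:")
    return "\n\n".join(lines)


demos = [
    ("Describe the image.", "A dog running on grass."),
    ("How many animals are shown?", "One."),
]
prompt = build_icl_prompt(demos, "What is the dog doing?")
print(prompt.count("<image>"))  # 3
```

The data-centric question is then which demonstrations to select; curated high-quality ICL datasets supply exactly these (instruction, answer) pairs.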

The survey identifies data-centric approaches to fortify MLLMs' reasoning abilities, covering single-hop and multi-hop reasoning with a particular focus on Chain-of-Thought (CoT) techniques. Ethical considerations, such as data toxicity and privacy, are extensively discussed. The paper reviews data-centric attack and defense strategies against toxic content and outlines privacy-preserving techniques like differentially private training and federated learning.
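Among the privacy-preserving techniques mentioned, the core differential-privacy idea can be illustrated with the Laplace mechanism on a simple dataset statistic. This is a minimal sketch of the principle only; differentially private training (e.g. DP-SGD) instead adds calibrated noise to gradients. The records and predicate below are hypothetical.

```python
import math
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) noise via inverse-CDF sampling."""
    u = rng.random() - 0.5
    sign = 1 if u >= 0 else -1
    return -scale * sign * math.log(1 - 2 * abs(u))


def private_count(records, predicate, epsilon=1.0, rng=None):
    """Counting query with sensitivity 1, released under epsilon-DP."""
    rng = rng or random.Random(0)
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)


records = [{"age": 30}, {"age": 45}, {"age": 70}]
noisy = private_count(records, lambda r: r["age"] > 40, epsilon=1.0)
```

Smaller `epsilon` means larger noise and stronger privacy; the true count here is 2, and the released value fluctuates around it.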

Comprehensive benchmarks are crucial for evaluating MLLMs' performance across various dimensions—understanding, generation, retrieval, and reasoning. The survey lists numerous benchmarks and provides insights into their contributions to assessing and improving MLLMs systematically.

Model Contributions for Multi-Modal Data

In addition to how data enhances MLLMs, the paper explores the reverse direction: how models can contribute to data. This includes roles such as data creator, mapper, filter, and evaluator. MLLMs can generate data, refine existing data through transformations such as summarization and annotation, filter data based on quality assessments, and evaluate data to provide feedback on quality and ethics.
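The model-as-filter role above can be sketched as a score-and-threshold loop. `score_with_model` is a hypothetical placeholder for a real model call (e.g. prompting an MLLM to rate caption quality on a 0-1 scale); here a trivial length heuristic stands in so the sketch runs.

```python
def score_with_model(sample):
    # Placeholder heuristic standing in for an MLLM quality judgment:
    # longer, more descriptive captions score higher (capped at 1.0).
    return min(len(sample["text"].split()) / 10.0, 1.0)


def model_filter(dataset, threshold=0.3):
    """Keep only samples the model rates above the quality threshold."""
    return [s for s in dataset if score_with_model(s) >= threshold]


data = [
    {"image_path": "0.jpg", "text": "dog"},
    {"image_path": "1.jpg", "text": "a brown dog chasing a red ball"},
]
print(len(model_filter(data)))  # 1
```

The same loop implements the evaluator role if the scores are reported as feedback rather than used to drop samples.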

The paper highlights the potential of models to serve as data scientists, automating tasks like navigation, extraction, analysis, and visualization of multi-modal data. These capabilities reduce labor-intensive efforts and provide new insights for dataset curation and analysis.

Future Directions

The authors outline a roadmap for future research, emphasizing the need for infrastructural advancements to support data-model co-development and proposing several promising directions. These include enhancing automated data discovery, modality-compatibility detection, and knowledge transfer among models. The paper also discusses the potential of self-boosted development paradigms, where MLLMs iteratively improve both themselves and their training data in an autonomous cycle.

Implications and Conclusion

The survey highlights the symbiotic relationship between multi-modal data and MLLMs, showing that advances in one can significantly propel the other. The comprehensive compilation of methods and future directions provides valuable guidance for researchers and developers in the field of AI. This paper is essential reading for anyone interested in the cutting-edge development of large-scale multi-modal AI models, presenting a meticulous review and a visionary outlook on the future of AI.
