An Empirical Study of Multimodal Model Merging

(2304.14933)
Published Apr 28, 2023 in cs.CV, cs.AI, cs.CL, and cs.LG

Abstract

Model merging (e.g., via interpolation or task arithmetic) fuses multiple models trained on different tasks to generate a multi-task solution. The technique has been proven successful in previous studies, where the models are trained on similar tasks and with the same initialization. In this paper, we expand on this concept to a multimodal setup by merging transformers trained on different modalities. Furthermore, we conduct our study for a novel goal where we can merge vision, language, and cross-modal transformers of a modality-specific architecture to create a parameter-efficient modality-agnostic architecture. Through comprehensive experiments, we systematically investigate the key factors impacting model performance after merging, including initialization, merging mechanisms, and model architectures. We also propose two metrics that assess the distance between weights to be merged and can serve as an indicator of the merging outcomes. Our analysis leads to an effective training recipe for matching the performance of the modality-agnostic baseline (i.e., pre-trained from scratch) via model merging. Our method also outperforms naive merging significantly on various tasks, with improvements of 3% on VQA, 7% on COCO retrieval, 25% on NLVR2, 14% on Flickr30k and 3% on ADE20k. Our code is available at https://github.com/ylsung/vl-merging
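
To illustrate the interpolation operation named in the abstract, here is a minimal PyTorch sketch of weight-space interpolation between two checkpoints that share an initialization; the function and variable names are illustrative and not taken from the paper's repository.

```python
import torch

def interpolate(state_a, state_b, alpha=0.5):
    """Element-wise interpolation of two state dicts with identical keys and
    shapes: alpha * A + (1 - alpha) * B."""
    return {k: alpha * state_a[k] + (1 - alpha) * state_b[k] for k in state_a}

# Example: merge a vision and a language transformer that share one seed init.
vision_sd = {"w": torch.randn(4, 4)}
language_sd = {"w": torch.randn(4, 4)}
merged_sd = interpolate(vision_sd, language_sd, alpha=0.6)  # bias toward vision
```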

Figure: Framework for multimodal merging, with crucial factors highlighted by red boundaries and the vision, language, and VL modalities color-coded.

Overview

  • The paper explores multimodal model merging by integrating transformers trained on different modalities, such as vision and language, to create a parameter-efficient and modality-agnostic architecture.

  • It systematically analyzes factors impacting model performance post-merging, introduces novel evaluation metrics, and conducts extensive experiments demonstrating significant performance improvements.

  • Key findings include the importance of initialization and pre-training, the efficacy of different merging mechanisms, and the superior performance of architectural variants with independent modality-specific modules.

An Empirical Study of Multimodal Model Merging

The paper "An Empirical Study of Multimodal Model Merging" explore the integration of transformers trained on distinct modalities, such as vision and language, using model merging techniques. This research extends the existing concept of model merging, traditionally applied to models trained on similar tasks, to a multimodal framework. The principal objective is to develop a parameter-efficient, modality-agnostic model by merging modality-specific architectures—thereby substantially enhancing computational efficiency and effectiveness across multiple tasks.

Core Contributions

The primary contributions of this research can be distilled into several key areas:

  1. Expansion to Multimodal Merging: The paper extends model merging techniques to combine vision, language, and cross-modal transformers. This is approached with the goal of forming a single modality-agnostic architecture that can process diverse inputs efficiently.
  2. Systematic Analysis: The investigation meticulously evaluates the key factors that impact model performance post-merging, such as initialization methods, specific merging mechanisms, and the architectural setup of the models.
  3. Evaluation Metrics: The authors introduce two novel metrics designed to evaluate the distance between model weights, which serve as predictors for the success of the merging process.
  4. Empirical Results: Extensive experiments are conducted across several tasks, evidencing significant performance improvements when employing the proposed multimodal model merging techniques compared to naive merging methods.

Key Findings

Initialization and Seed Pre-training: One of the pivotal findings is the importance of initialization. The study shows that seed pre-training the models on a common vision-language (VL) corpus helps align their weights, which is crucial for effective merging. The authors find that an equal number of iterations for seed pre-training and subsequent VL pre-training (100k each) optimally balances merging performance against unimodal model performance.
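
A hedged outline of this two-stage recipe with equal 100k-step budgets; the schedule keys and the `step_fn` callback below are hypothetical placeholders, not the paper's training code.

```python
# Hypothetical outline of the two-stage recipe; names are illustrative only.
SCHEDULE = {"seed_pretraining": 100_000, "vl_pretraining": 100_000}

def run_recipe(step_fn):
    """Run equal budgets of seed pre-training (all modalities on a shared VL
    corpus, aligning the weights) and then modality-specific VL pre-training."""
    for stage, steps in SCHEDULE.items():
        for step in range(steps):
            step_fn(stage, step)

# Usage: run_recipe(lambda stage, step: None)  # plug in a real training step
```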

Merging Mechanisms: The paper compares three primary merging techniques: interpolation, modality arithmetic, and RegMean. Interpolation, particularly with the mixing ratio biased toward the vision weights, provides competitive and computationally efficient results. RegMean, while computationally heavier, consistently offers robust performance.
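
To make the mechanisms concrete, here is a sketch of modality arithmetic under the assumption that it mirrors task arithmetic (adding scaled differences between modality-specific weights and the shared seed weights back onto the seed); RegMean is omitted because it additionally solves a per-layer least-squares problem using input Gram matrices. All names are illustrative, not from the paper's code.

```python
import torch

def modality_arithmetic(seed, vision, language, lam=0.5):
    """Sketch assuming modality arithmetic mirrors task arithmetic: add
    scaled modality vectors (modality weights minus the shared seed weights)
    back onto the seed initialization."""
    return {
        k: seed[k] + lam * (vision[k] - seed[k]) + lam * (language[k] - seed[k])
        for k in seed
    }

# Toy usage with matching state dicts.
seed_sd = {"w": torch.zeros(2, 2)}
merged = modality_arithmetic(seed_sd, {"w": torch.ones(2, 2)}, {"w": -torch.ones(2, 2)})
```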

Architectural Variants: The research also scrutinizes various architectural adaptations for shared-weight models. Surprisingly, the architecture with completely independent modality-specific modules before merging yields the best post-merging performance, closely matching that of a modality-agnostic baseline pre-trained from scratch.
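
As a rough illustration of this architectural axis, the sketch below gives each modality its own feed-forward module within a layer; the module layout and dimensions are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class ModalityRoutedLayer(nn.Module):
    """Illustrative layer with fully independent per-modality FFN modules
    (the variant reported to merge best); merging would collapse the
    ModuleDict into a single shared FFN."""

    def __init__(self, dim=768, hidden=3072):
        super().__init__()
        self.ffn = nn.ModuleDict({
            name: nn.Sequential(
                nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim)
            )
            for name in ("vision", "language", "vl")
        })

    def forward(self, x, modality="vision"):
        return self.ffn[modality](x)

# Usage: ModalityRoutedLayer()(torch.randn(1, 10, 768), modality="language")
```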

Performance Across Tasks: The proposed method significantly improves task performance over naive merging, by margins of up to 25% on NLVR2, 14% on Flickr30k, and 7% on COCO retrieval. These gains highlight the practical utility of the model merging strategy.

Implications and Future Directions

Practical Implications: The improvements in performance underscore the potential of multimodal model merging in developing versatile, parameter-efficient architectures. This could lead to more efficient deployment of comprehensive AI models in real-world applications, encompassing tasks like visual question answering (VQA), image-text retrieval, and semantic segmentation, among others.

Theoretical Implications: The proposal and validation of metrics to predict merging outcomes present a new theoretical perspective in the field of model merging. These metrics, particularly the truncated soft sign dissimilarity (TSSD), could serve as foundational tools for future studies aiming to merge diverse pre-trained models efficiently.
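
This summary does not spell out TSSD's formula, so the sketch below uses a generic cosine-based weight distance purely as an illustrative proxy for this family of weight-distance indicators; it is not the paper's metric.

```python
import torch
import torch.nn.functional as F

def weight_distance(state_a, state_b):
    """Illustrative proxy only: 1 minus the cosine similarity between the
    flattened weights of two checkpoints. The paper's TSSD metric is defined
    differently; this is not its formula."""
    a = torch.cat([v.flatten().float() for v in state_a.values()])
    b = torch.cat([v.flatten().float() for v in state_b.values()])
    return 1.0 - F.cosine_similarity(a, b, dim=0).item()
```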

Future Work: Moving forward, further investigation is warranted to extend model merging to models fine-tuned on downstream tasks. Additionally, merging transformers initialized from unimodal pre-trained weights could provide insights into leveraging specialized domain knowledge. Another avenue is mitigating domain shifts between pre-training and fine-tuning datasets to stabilize merging performance across tasks.

Conclusion

This study effectively bridges the gap between theoretical exploration and practical application of model merging in a multimodal setup. By providing both comprehensive experimental evidence and a clear methodological framework, it opens up new pathways for developing sophisticated AI architectures capable of versatile, efficient multimodal understanding. The nuanced insights on the importance of initialization, merging mechanisms, and architecture types provide a solid foundation for future advances in this domain.
