Abstract

Instruction tuning of Large Vision-language Models (LVLMs) has revolutionized the development of versatile models with zero-shot generalization across a wide range of downstream vision-language tasks. However, the diversity of training tasks from different sources and formats leads to inevitable task conflicts, where different tasks compete for the same set of model parameters, resulting in sub-optimal instruction-following abilities. To address that, we propose the Mixture of Cluster-conditional LoRA Experts (MoCLE), a novel Mixture of Experts (MoE) architecture designed to activate task-customized model parameters based on instruction clusters. A separate universal expert is further incorporated to improve the generalization capabilities of MoCLE for novel instructions. Extensive experiments on 11 zero-shot tasks demonstrate the effectiveness of MoCLE.

Overview

  • Introduction of MoCLE architecture to improve instruction-following abilities in LVLMs by reducing task conflicts via a mixture of task-specific and universal experts.

  • MoCLE applies a Mixture of Experts approach to decrease negative transfer in multi-task instruction tuning, enhancing performance on clustered instruction tasks.

  • The architecture pairs specialized task experts with a universal expert, preserving generalization capabilities while still allowing task-specific focus.

  • Empirical results demonstrate MoCLE's superior performance over InstructBLIP on unseen tasks, including image captioning and visual question answering.

  • The study confirms MoCLE's effectiveness in striking a balance between task specialization and generalization, advancing the field of vision-language instruction tuning.

Understanding LVLM Instruction Tuning with MoCLE

Instruction tuning for Large Vision-language Models (LVLMs) has shown considerable promise in developing models capable of zero-shot generalization across various vision-language tasks. However, the diversity of training tasks can create conflicts, with competing tasks vying for the same set of model parameters and compromising instruction-following abilities.

To address this issue, a new architecture called Mixture of Cluster-conditional LoRA Experts (MoCLE) is introduced. MoCLE employs a Mixture of Experts (MoE) strategy whereby clusters of similar instruction tasks activate dedicated task-specific parameters, enhancing performance on those tasks and reducing conflicts. A universal expert is included as well to maintain and improve the model's generalization abilities for novel instructions. This unique structure allows the model to balance specialization in the tasks it has been trained on while retaining the flexibility to generalize to new tasks.

Task Conflicts in Multi-Task Instruction Tuning

Multi-task instruction tuning aims to utilize diverse collections of tasks to enhance a model's ability to understand and follow instructions. However, this complexity can lead to negative transfer, where simultaneous optimization for multiple conflicting tasks results in suboptimal outcomes.

Previous approaches to mitigating negative transfer segmented training tasks into subsets based on predetermined categories and trained a specialized "expert" for each category. This method scales poorly and impairs the model's ability to generalize across multiple tasks, especially unseen ones, limiting the effectiveness of LVLMs at following diverse instructions.

Introducing MoCLE

The MoCLE framework mitigates negative transfer with a two-part design. It clusters training instructions into multiple groups by similarity and then, at each layer, routes input through a specialized task expert or a universal expert. This structure lets the model specialize in certain tasks while still learning generalized representations.
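
To make the clustering step concrete, the sketch below groups instruction embeddings with k-means. The sentence encoder, the toy instruction list, and the number of clusters are illustrative assumptions rather than details reported in the summary above.

```python
# Minimal sketch of the offline step of cluster-conditional routing:
# embed training instructions and group them by similarity.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

# A handful of toy instructions stand in for the full instruction-tuning corpus.
instructions = [
    "Describe the image in one sentence.",
    "What color is the car in the photo?",
    "Answer the question using the chart shown in the image.",
    "Write a detailed caption for this picture.",
]

# Embed each instruction; the choice of encoder is an assumption.
encoder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = encoder.encode(instructions, normalize_embeddings=True)

# Group instructions into clusters. Only 2 clusters here because the toy list
# is tiny; the real cluster count is a hyperparameter not given in this summary.
kmeans = KMeans(n_clusters=2, random_state=0, n_init="auto").fit(embeddings)
cluster_ids = kmeans.labels_  # one cluster id per training instruction

# A new instruction is embedded the same way and assigned to its nearest
# centroid, yielding the cluster id that conditions the routers at each layer.
new_id = kmeans.predict(encoder.encode(["How many dogs are in the image?"]))
```

Because the clustering runs offline, each training example carries a precomputed cluster id, so routing during fine-tuning presumably adds little overhead.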

The experts are activated dynamically, chosen by routers that dispatch input based on instruction clusters. The key innovation lies in these routers, which are trained to associate each input with the expert best suited to process it. This design resolves the tension between specializing in particular tasks and generalizing across a range of instructions.
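
To make the routing concrete, here is a minimal PyTorch sketch of one cluster-conditional LoRA layer wrapped around a frozen linear projection. The soft gating over task experts, the additive combination with an always-on universal expert, and all dimensions and ranks are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn as nn


class LoRAExpert(nn.Module):
    """Low-rank adapter: x -> B(A(x)) * (alpha / r)."""

    def __init__(self, d_in, d_out, r=8, alpha=16):
        super().__init__()
        self.A = nn.Linear(d_in, r, bias=False)
        self.B = nn.Linear(r, d_out, bias=False)
        nn.init.zeros_(self.B.weight)  # adapters start as a no-op
        self.scale = alpha / r

    def forward(self, x):
        return self.B(self.A(x)) * self.scale


class ClusterConditionalLoRALayer(nn.Module):
    """Frozen base projection, cluster-routed task experts, and a universal expert."""

    def __init__(self, base: nn.Linear, num_clusters=64, num_experts=4, r=8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # only the adapters and the router are trained
        d_in, d_out = base.in_features, base.out_features
        self.task_experts = nn.ModuleList(
            LoRAExpert(d_in, d_out, r) for _ in range(num_experts)
        )
        self.universal_expert = LoRAExpert(d_in, d_out, r)
        # Router: learned logits over task experts, conditioned on the cluster id.
        self.router = nn.Embedding(num_clusters, num_experts)

    def forward(self, x, cluster_id):
        # x: (batch, d_in); cluster_id: (batch,) long tensor from the clustering step
        weights = self.router(cluster_id).softmax(dim=-1)                    # (batch, E)
        expert_outs = torch.stack([e(x) for e in self.task_experts], dim=1)  # (batch, E, d_out)
        task_out = (weights.unsqueeze(-1) * expert_outs).sum(dim=1)
        # The universal expert is applied to every input alongside the routed experts.
        return self.base(x) + task_out + self.universal_expert(x)


# Toy usage: wrap a 768-dim projection and route two examples by their cluster ids.
layer = ClusterConditionalLoRALayer(nn.Linear(768, 768))
out = layer(torch.randn(2, 768), cluster_id=torch.tensor([3, 17]))  # shape (2, 768)
```

Routing at the level of instruction clusters rather than individual tokens keeps the gating decision tied to the task, while the always-on universal expert gives novel instructions a path that does not depend on matching a training cluster.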

Empirical Evaluation and Results

Extensive experiments on 11 zero-shot tasks demonstrate the efficacy of MoCLE. Compared with the strong InstructBLIP baseline, MoCLE attains substantial improvements on unseen tasks, including image captioning and various forms of visual question answering. Performance on tasks such as IconQA is notably better than with previous architectures, illustrating the strength of the MoCLE approach.

Furthermore, ablation studies isolating individual components of the framework confirm that both the mixture of cluster-conditional experts and the universal expert contribute positively to the model's zero-shot generalization.

Conclusion

The MoCLE framework marks a significant advance in addressing task conflicts in multi-task instruction tuning for LVLMs. By integrating specialized experts activated by instruction clusters with a universal expert, MoCLE balances task specialization with generalization to novel instructions. The tangible improvements across a suite of tasks demonstrate both the practicality and the importance of the proposed method in the ongoing evolution of large vision-language models.
