Abstract

Instruction tuning of Large Vision-language Models (LVLMs) has revolutionized the development of versatile models with zero-shot generalization across a wide range of downstream vision-language tasks. However, the diversity of training tasks from different sources and formats leads to inevitable task conflicts, where different tasks compete for the same set of model parameters, resulting in sub-optimal instruction-following abilities. To address that, we propose the Mixture of Cluster-conditional LoRA Experts (MoCLE), a novel Mixture of Experts (MoE) architecture designed to activate task-customized model parameters based on instruction clusters. A separate universal expert is further incorporated to improve the generalization capabilities of MoCLE for novel instructions. Extensive experiments on 11 zero-shot tasks demonstrate the effectiveness of MoCLE.

Overview

  • Introduction of MoCLE architecture to improve instruction-following abilities in LVLMs by reducing task conflicts via a mixture of task-specific and universal experts.

  • MoCLE applies a Mixture of Experts approach to decrease negative transfer in multi-task instruction tuning, enhancing performance on clustered instruction tasks.

  • The architecture pairs specialized task experts with a universal expert, preserving generalization capabilities while still allowing task-specific focus.

  • Empirical results demonstrate MoCLE's superior performance over InstructBLIP on unseen tasks, including image captioning and visual question answering.

  • The study confirms MoCLE's effectiveness in striking a balance between task specialization and generalization, advancing the field of vision-language instruction tuning.

Understanding LVLM Instruction Tuning with MoCLE

Instruction tuning for Large Vision-language Models (LVLMs) has shown considerable promise in developing models capable of zero-shot generalization across various vision-language tasks. However, the diversity of training tasks can create conflicts, with competing tasks vying for the same set of model parameters and compromising instruction-following abilities.

To address this issue, a new architecture called Mixture of Cluster-conditional LoRA Experts (MoCLE) is introduced. MoCLE employs a Mixture of Experts (MoE) strategy whereby clusters of similar instruction tasks activate dedicated task-specific parameters, enhancing performance on those tasks and reducing conflicts. A universal expert is included as well to maintain and improve the model's generalization abilities for novel instructions. This unique structure allows the model to balance specialization in the tasks it has been trained on while retaining the flexibility to generalize to new tasks.

Task Conflicts in Multi-Task Instruction Tuning

Multi-task instruction tuning aims to utilize diverse collections of tasks to enhance a model's ability to understand and follow instructions. However, this complexity can lead to negative transfer, where simultaneous optimization for multiple conflicting tasks results in suboptimal outcomes.

Previous approaches to mitigating negative transfer segmented training tasks into subsets based on predetermined categories and trained a specialized "expert" for each category. This method scales poorly and impairs the model's ability to generalize across multiple tasks, especially unseen ones, limiting the effectiveness of LVLMs at following diverse instructions.

Introducing MoCLE

The MoCLE framework mitigates negative transfer with a two-part design. It clusters training instructions into multiple groups by similarity and then, at each layer, routes input through a specialized task expert or a universal expert. This structure lets the model specialize in certain tasks while still learning generalized representations.
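
To make the clustering step concrete, the sketch below groups instruction embeddings with k-means. The sentence encoder, the toy instruction list, and the number of clusters are illustrative assumptions rather than details reported in the summary above.

```python
# Minimal sketch of the offline step of cluster-conditional routing:
# embed training instructions and group them by similarity.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

# A handful of toy instructions stand in for the full instruction-tuning corpus.
instructions = [
    "Describe the image in one sentence.",
    "What color is the car in the photo?",
    "Answer the question using the chart shown in the image.",
    "Write a detailed caption for this picture.",
]

# Embed each instruction; the choice of encoder is an assumption.
encoder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = encoder.encode(instructions, normalize_embeddings=True)

# Group instructions into clusters. Only 2 clusters here because the toy list
# is tiny; the real cluster count is a hyperparameter not given in this summary.
kmeans = KMeans(n_clusters=2, random_state=0, n_init="auto").fit(embeddings)
cluster_ids = kmeans.labels_  # one cluster id per training instruction

# A new instruction is embedded the same way and assigned to its nearest
# centroid, yielding the cluster id that conditions the routers at each layer.
new_id = kmeans.predict(encoder.encode(["How many dogs are in the image?"]))
```

Because the clustering runs offline, each training example carries a precomputed cluster id, so routing during fine-tuning presumably adds little overhead.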

The experts are activated dynamically, chosen by routers that dispatch input based on instruction clusters. The key innovation lies in these routers, which are trained to associate each input with the expert best suited to process it. This design resolves the tension between specializing in particular tasks and generalizing across a range of instructions.
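
To make the routing concrete, here is a minimal PyTorch sketch of one cluster-conditional LoRA layer wrapped around a frozen linear projection. The soft gating over task experts, the additive combination with an always-on universal expert, and all dimensions and ranks are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn as nn


class LoRAExpert(nn.Module):
    """Low-rank adapter: x -> B(A(x)) * (alpha / r)."""

    def __init__(self, d_in, d_out, r=8, alpha=16):
        super().__init__()
        self.A = nn.Linear(d_in, r, bias=False)
        self.B = nn.Linear(r, d_out, bias=False)
        nn.init.zeros_(self.B.weight)  # adapters start as a no-op
        self.scale = alpha / r

    def forward(self, x):
        return self.B(self.A(x)) * self.scale


class ClusterConditionalLoRALayer(nn.Module):
    """Frozen base projection, cluster-routed task experts, and a universal expert."""

    def __init__(self, base: nn.Linear, num_clusters=64, num_experts=4, r=8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # only the adapters and the router are trained
        d_in, d_out = base.in_features, base.out_features
        self.task_experts = nn.ModuleList(
            LoRAExpert(d_in, d_out, r) for _ in range(num_experts)
        )
        self.universal_expert = LoRAExpert(d_in, d_out, r)
        # Router: learned logits over task experts, conditioned on the cluster id.
        self.router = nn.Embedding(num_clusters, num_experts)

    def forward(self, x, cluster_id):
        # x: (batch, d_in); cluster_id: (batch,) long tensor from the clustering step
        weights = self.router(cluster_id).softmax(dim=-1)                    # (batch, E)
        expert_outs = torch.stack([e(x) for e in self.task_experts], dim=1)  # (batch, E, d_out)
        task_out = (weights.unsqueeze(-1) * expert_outs).sum(dim=1)
        # The universal expert is applied to every input alongside the routed experts.
        return self.base(x) + task_out + self.universal_expert(x)


# Toy usage: wrap a 768-dim projection and route two examples by their cluster ids.
layer = ClusterConditionalLoRALayer(nn.Linear(768, 768))
out = layer(torch.randn(2, 768), cluster_id=torch.tensor([3, 17]))  # shape (2, 768)
```

Routing at the level of instruction clusters rather than individual tokens keeps the gating decision tied to the task, while the always-on universal expert gives novel instructions a path that does not depend on matching a training cluster.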

Empirical Evaluation and Results

Extensive experiments on 11 zero-shot tasks demonstrate the efficacy of MoCLE. Compared with the strong InstructBLIP baseline, MoCLE attains substantial improvements on unseen tasks, including image captioning and various forms of visual question answering. Performance on tasks such as IconQA is notably better than with previous architectures, illustrating the strength of the MoCLE approach.

Furthermore, ablation studies isolating individual components of the framework confirm that both the mixture of cluster-conditional experts and the universal expert contribute positively to the model's zero-shot generalization.

Conclusion

The MoCLE framework marks a significant advance in addressing task conflicts in multi-task instruction tuning for LVLMs. By integrating specialized experts activated by instruction clusters with a universal expert, MoCLE balances task specialization with generalization to novel instructions. The tangible improvements across a suite of tasks demonstrate both the practicality and the importance of the proposed method in the ongoing evolution of large vision-language models.
