
CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts

(arXiv:2405.05949)
Published May 9, 2024 in cs.CV

Abstract

Recent advancements in Multimodal LLMs have focused primarily on scaling by increasing text-image pair data and enhancing LLMs to improve performance on multimodal tasks. However, these scaling approaches are computationally expensive and overlook the significance of improving model capabilities from the vision side. Inspired by the successful applications of Mixture-of-Experts (MoE) in LLMs, which improves model scalability during training while keeping inference costs similar to those of smaller models, we propose CuMo. CuMo incorporates Co-upcycled Top-K sparsely-gated Mixture-of-experts blocks into both the vision encoder and the MLP connector, thereby enhancing the multimodal LLMs with minimal additional activated parameters during inference. CuMo first pre-trains the MLP blocks and then initializes each expert in the MoE block from the pre-trained MLP block during the visual instruction tuning stage. Auxiliary losses are used to ensure a balanced loading of experts. CuMo outperforms state-of-the-art multimodal LLMs across various VQA and visual-instruction-following benchmarks using models within each model size group, all while training exclusively on open-sourced datasets. The code and model weights for CuMo are open-sourced at https://github.com/SHI-Labs/CuMo.

Figure: CuMo's architecture integrates sparse Top-K MoE blocks into the CLIP vision encoder and the MLP connector, enhancing the multimodal LLM.

Overview

  • CuMo is a novel multimodal Large Language Model (LLM) that incorporates a Mixture-of-Experts (MoE) framework to manage the complexity of integrating text and visual data while reducing computational demands.

  • The architecture of CuMo includes sparsely-gated MoE blocks within the vision encoder and the MLP connector, promoting efficient scaling by activating only necessary 'experts' during inference.

  • CuMo demonstrates superior performance on various benchmarks, including tasks like Visual Question Answering, through its unique method of integrating and tuning MoE systems during training.

Understanding CuMo: Enhancing Multimodal LLMs with Mixture-of-Experts

Overview

Multimodal LLMs are evolving. As these models incorporate both text and visual data, they become increasingly complex and resource-intensive. Recent research by Jiachen Li and his colleagues proposes a novel model, CuMo, which utilizes a Mixture-of-Experts (MoE) framework to efficiently enhance multimodal LLMs, specifically focusing on improving the vision encoder side. Their method shows promise in reducing the additional computational cost while maintaining competitive performance on various benchmarks.

What is CuMo?

CuMo takes its name from its use of Co-Upcycled Mixture-of-Experts. It incorporates sparsely-gated MoE blocks into both the vision encoder and the MLP (Multi-Layer Perceptron) connector that links the vision encoder to the LLM. This architecture dynamically selects a subset of parameters (or "experts") for each input during inference, which keeps the computational load manageable.
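To make this concrete, below is a minimal PyTorch sketch of a Top-K sparsely-gated MoE block of the kind described. It is an illustrative reconstruction rather than the authors' code; the number of experts, Top-K value, and layer sizes are placeholder assumptions.

```python
import torch
import torch.nn as nn

class MoEBlock(nn.Module):
    """Top-K sparsely-gated MoE: each token is routed to K of N expert MLPs."""
    def __init__(self, dim, hidden_dim, num_experts=4, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(dim, num_experts, bias=False)   # gating network
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden_dim), nn.GELU(), nn.Linear(hidden_dim, dim))
            for _ in range(num_experts)
        )

    def forward(self, x):                       # x: (batch, tokens, dim)
        logits = self.router(x)                 # (batch, tokens, num_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)
        weights = weights.softmax(dim=-1)       # renormalize over the selected experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = (idx[..., k] == e)       # tokens routed to expert e at slot k
                if mask.any():
                    out[mask] += weights[..., k][mask].unsqueeze(-1) * expert(x[mask])
        return out, logits                      # logits can feed auxiliary balancing losses
```

The router scores every expert for every token, and only the K highest-scoring experts run on that token, so per-token compute grows with K rather than with the total number of experts.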

Key Features and Methodology of CuMo

Mixture-of-Experts (MoE)

  • Sparsely-Gated MoE Blocks: CuMo uses a routing mechanism that selects the Top-K experts for each input token. Only these selected experts are activated during inference, reducing the computation required.
  • Efficient Scaling: By integrating MoE blocks into the model, CuMo scales the capacity of multimodal LLMs without the computational overhead typically associated with scaling up (see the sketch below for how such a block can slot into a vision transformer layer).
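Continuing the sketch above (still an assumption-laden illustration, not the paper's implementation), a CLIP-style vision transformer block could swap its dense FFN for the MoEBlock defined earlier, which is roughly how MoE is integrated on the vision side:

```python
import torch.nn as nn

class VisionBlockWithMoE(nn.Module):
    """A CLIP-style transformer block whose dense FFN is replaced by a sparse MoE block."""
    def __init__(self, dim, num_heads, num_experts=4, top_k=2):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.moe_ffn = MoEBlock(dim, hidden_dim=4 * dim,           # MoEBlock from the sketch above
                                num_experts=num_experts, top_k=top_k)

    def forward(self, x):                                          # x: (batch, tokens, dim)
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]          # pre-norm self-attention
        ffn_out, router_logits = self.moe_ffn(self.norm2(x))
        return x + ffn_out, router_logits                          # logits kept for auxiliary losses
```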

Training Methodology

The training of CuMo is broken down into several strategic stages:

  1. Pre-training of MLP Connectors: Before introducing MoE, the MLP connector is pre-trained to align vision and language representations effectively.
  2. Co-upcycling for Initialization: Pre-trained MLP blocks are used to initialize the experts in each MoE block, fostering stability during the training transition (see the sketch after this list).
  3. Visual Instruction Tuning with MoE: After initialization from the co-upcycled weights, the full system is fine-tuned to optimize performance.
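A rough sketch of the co-upcycling step, under the assumption that each expert shares the architecture of the pre-trained dense MLP (illustrative only, not the released code): the pre-trained weights are copied into every expert, while the router starts from a fresh initialization and is learned during visual instruction tuning.

```python
import torch.nn as nn

def upcycle_mlp_into_moe(pretrained_mlp: nn.Module, moe_block: "MoEBlock") -> "MoEBlock":
    """Co-upcycling: initialize every expert of an MoE block from a pre-trained dense MLP."""
    mlp_weights = pretrained_mlp.state_dict()
    for expert in moe_block.experts:
        expert.load_state_dict(mlp_weights)     # all experts start from the same dense MLP
    return moe_block                            # router weights keep their fresh initialization
```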

Adopting these stages helps stabilize training, which is crucial given the complexity of multimodal inputs and the dynamic nature of MoE routing.
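The abstract also notes that auxiliary losses are used to keep the experts' load balanced. One standard formulation is the Switch-Transformer-style load-balancing loss sketched below; it is offered as an example of this family of losses, not as the exact objective CuMo uses.

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits: torch.Tensor, top_k: int) -> torch.Tensor:
    """Switch-Transformer-style auxiliary loss that encourages balanced expert usage.

    router_logits: (num_tokens, num_experts) raw gating scores, flattened over batch and tokens.
    """
    num_experts = router_logits.shape[-1]
    probs = router_logits.softmax(dim=-1)                             # routing probabilities
    _, topk_idx = router_logits.topk(top_k, dim=-1)                   # (num_tokens, top_k)
    dispatch = F.one_hot(topk_idx, num_experts).float().sum(dim=1)    # (num_tokens, num_experts)
    tokens_per_expert = dispatch.mean(dim=0) / top_k                  # fraction of tokens per expert
    prob_per_expert = probs.mean(dim=0)                               # mean routing prob per expert
    # Minimized when both distributions are uniform (1 / num_experts each).
    return num_experts * torch.sum(tokens_per_expert * prob_per_expert)
```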

Performance and Benchmarking

Results

CuMo has demonstrated superior performance over existing state-of-the-art multimodal LLMs across multiple benchmarks, including Visual Question Answering (VQA) and visual-instruction-following tasks. Notably, it outperforms comparable models within each model size group while training exclusively on open-sourced datasets.

Comparative Advantage

One clear advantage of CuMo is its low inference cost: the MoE architecture activates only the selected experts' parameters during inference. Despite this sparse activation, CuMo does not compromise its ability to understand and process complex multimodal tasks.
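A back-of-the-envelope illustration of why this matters (the layer sizes and expert counts below are hypothetical placeholders, not CuMo's actual configuration): total capacity grows with the number of experts N, while per-token compute grows only with the Top-K value.

```python
# Hypothetical illustration: parameters in one MoE block vs. how many are activated per token.
dim, hidden = 1024, 4096            # placeholder layer sizes, not CuMo's real dimensions
num_experts, top_k = 4, 2

params_per_expert = dim * hidden + hidden * dim          # two linear layers (biases ignored)
router_params = dim * num_experts

total_params = num_experts * params_per_expert + router_params
activated_params = top_k * params_per_expert + router_params

print(f"total: {total_params/1e6:.1f}M, activated per token: {activated_params/1e6:.1f}M")
# total: 33.6M, activated per token: 16.8M  -> capacity scales with N, per-token cost with K
```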

Future Implications

The use of MoE in enhancing multimodal LLMs opens numerous possibilities:

  • Efficient Scaling: As the demand for more sophisticated multimodal models grows, efficient scaling methods like those offered by CuMo will likely become standard in the industry.
  • Customizability and Flexibility: The ability to choose which 'experts' to activate could allow for more customizable models tailored to specific tasks or industries.
  • Broader Adoption in AI Tasks: Models like CuMo could enhance the performance of AI systems in areas like autonomous vehicles, healthcare (for medical imaging), and others where multimodal data interpretation is crucial.

In conclusion, CuMo, with its innovative use of MoE and a strategic training regimen, showcases how targeted improvements in model architecture can yield significant performance benefits. This research not only advances the capabilities of multimodal LLMs but also paves the way for more efficient and scalable AI models in an increasingly data-driven world.
