MoE-LLaVA: Mixture of Experts for Large Vision-Language Models

(2401.15947)
Published Jan 29, 2024 in cs.CV

Abstract

Recent advances demonstrate that scaling Large Vision-Language Models (LVLMs) effectively improves downstream task performance. However, existing scaling methods keep all model parameters active for each token in the computation, which incurs massive training and inference costs. In this work, we propose a simple yet effective training strategy, MoE-Tuning, for LVLMs. This strategy addresses the common issue of performance degradation in multi-modal sparsity learning, thereby constructing a sparse model with an extremely large number of parameters but a constant computational cost. Furthermore, we present MoE-LLaVA, an MoE-based sparse LVLM architecture that activates only the top-k experts through routers during deployment, keeping the remaining experts inactive. Extensive experiments show the significant performance of MoE-LLaVA on a variety of visual understanding and object hallucination benchmarks. Remarkably, with only approximately 3B sparsely activated parameters, MoE-LLaVA performs comparably to LLaVA-1.5-7B on various visual understanding datasets and even surpasses LLaVA-1.5-13B on the object hallucination benchmark. Through MoE-LLaVA, we aim to establish a baseline for sparse LVLMs and provide valuable insights for future research in developing more efficient and effective multi-modal learning systems. Code is released at https://github.com/PKU-YuanGroup/MoE-LLaVA.

MoE-LLaVA's three-stage training strategy for adapting LLMs to visual inputs and enhancing multi-modal understanding.

Overview

  • The paper presents MoE-LLaVA, a framework for efficient Large Vision-Language Models using a Mixture of Experts approach.

  • MoE-LLaVA utilizes sparsity in model parameters to maintain computational efficiency while expanding model capability.

  • The novel MoE-Tuning process is introduced to fine-tune MoE models for LVLMs without performance loss.

  • Experiments show that MoE-LLaVA matches the performance of denser models while using fewer parameters and computational resources.

  • The paper's findings suggest significant potential for scalable and efficient model development in the field of AI.

Introduction

In the landscape of Large Vision-Language Models (LVLMs), expanding model parameters is a common approach to augmenting model capabilities, but it comes with an increased computational burden during training and deployment. Dense models, where each token's computation engages all model parameters, exacerbate this issue. Conversely, the Mixture of Experts (MoE) approach has proven successful at scaling model capacity with fixed computational cost, particularly in NLP.

Methodology: MoE-LLaVA and MoE-Tuning

The paper introduces MoE-LLaVA, a framework for sparse LVLMs that leverages an MoE architecture with learned routers to selectively activate only the top-k experts per token. This configuration keeps the computational cost constant while significantly expanding the model's parameter count. The framework consists of a vision encoder, a visual projection layer, a word embedding layer, LLM blocks, and sparse MoE blocks. The MoE-Tuning strategy employs a three-stage training process to adapt MoE to LVLMs without the performance degradation typically caused by model sparsity.
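To make the routing concrete, here is a minimal, self-contained sketch of top-k expert routing in a sparse MoE block. The class name, dimensions, and GELU expert FFNs are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseMoEBlock(nn.Module):
    """A dense FFN replaced by several expert FFNs; a linear router
    sends each token to its top-k experts and mixes their outputs."""

    def __init__(self, d_model: int, d_ff: int, num_experts: int = 4, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList(
            [
                nn.Sequential(
                    nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
                )
                for _ in range(num_experts)
            ]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        tokens = x.reshape(-1, x.shape[-1])                     # (num_tokens, d_model)
        router_probs = F.softmax(self.router(tokens), dim=-1)   # (num_tokens, num_experts)
        weights, expert_idx = router_probs.topk(self.top_k, dim=-1)
        weights = weights / weights.sum(dim=-1, keepdim=True)   # renormalize over chosen experts

        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            token_ids, slot = (expert_idx == e).nonzero(as_tuple=True)
            if token_ids.numel() == 0:                          # expert receives no tokens
                continue
            out[token_ids] += weights[token_ids, slot].unsqueeze(-1) * expert(tokens[token_ids])
        return out.reshape_as(x)


if __name__ == "__main__":
    moe = SparseMoEBlock(d_model=512, d_ff=2048, num_experts=4, top_k=2)
    x = torch.randn(2, 16, 512)      # (batch, sequence, hidden)
    print(moe(x).shape)              # torch.Size([2, 16, 512])
```

Only the selected experts run a forward pass for a given token, which is what keeps per-token compute roughly constant as more experts (and thus parameters) are added.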

Experimental Results

Extensive experimentation validates the efficacy of MoE-LLaVA. Benchmarked on multiple visual understanding datasets, the model rivals the performance of LLaVA models with 7 billion parameters while using only about 3 billion sparsely activated parameters. The authors establish that MoE-LLaVA delivers performance comparable to dense LVLMs while requiring fewer computational resources, marking a significant contribution toward efficient multi-modal learning.

Contributions and Implications

The primary contributions are threefold:

  1. The MoE-Tuning methodology for adapting MoE to LVLMs, which prevents the performance degradation caused by sparsity (a sketch of the staged schedule follows this list).
  2. MoE-LLaVA, a pioneering framework for sparse LVLMs that allows substantial model size without a proportional increase in computational demands.
  3. Experimental evidence that MoE-LLaVA offers strong multi-modal understanding and notably reduced hallucination: with only 3 billion sparsely activated parameters, it surpasses 13-billion-parameter models on the object hallucination benchmark.
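As a rough illustration of the staged adaptation in contribution 1, the sketch below toggles which modules are trainable at each stage, in the spirit of MoE-Tuning. The module names (vision_encoder, projector, llm, llm.moe_layers) and the exact per-stage assignments are assumptions for illustration; consult the paper and released code for the authoritative schedule.

```python
def set_trainable(module, flag: bool) -> None:
    """Freeze or unfreeze all parameters of a module."""
    for p in module.parameters():
        p.requires_grad = flag


def configure_stage(model, stage: int) -> None:
    """Hypothetical three-stage freeze/unfreeze schedule for MoE-Tuning."""
    # Start from a fully frozen model, then unfreeze per stage.
    for part in (model.vision_encoder, model.projector, model.llm):
        set_trainable(part, False)

    if stage == 1:
        # Stage I: train only the visual projection layer so image tokens
        # can be consumed by the frozen LLM.
        set_trainable(model.projector, True)
    elif stage == 2:
        # Stage II: multi-modal instruction tuning of the (still dense) LLM
        # together with the projector.
        set_trainable(model.projector, True)
        set_trainable(model.llm, True)
    elif stage == 3:
        # Stage III: sparsify; experts are initialized from the dense FFN
        # weights, and only the MoE layers (experts and routers) are trained.
        for moe_block in model.llm.moe_layers:
            set_trainable(moe_block, True)
    else:
        raise ValueError(f"Unknown training stage: {stage}")
```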

MoE-LLaVA sets a precedent for developing scalable and efficient LVLMs. The results suggest that the paper's contributions could reshape model-scaling practice by navigating the trade-off between size, performance, and computational cost, which remains a critical challenge in AI research. Future research could extend these findings to a wider array of multi-modal tasks and to larger MoE-based LVLMs, provided that adequate data pipelines are established.
