Abstract

Large multi-modal models (LMMs) exhibit remarkable performance across numerous tasks. However, generalist LMMs often suffer from performance degradation when tuned over a large collection of tasks. Recent research suggests that Mixture of Experts (MoE) architectures are useful for instruction tuning, but for LMMs of parameter size around O(50-100B), the prohibitive cost of replicating and storing the expert models severely limits the number of experts we can use. We propose Omni-SMoLA, an architecture that uses the Soft MoE approach to (softly) mix many multimodal low-rank experts, and avoids introducing a significant number of new parameters compared to conventional MoE models. The core intuition here is that the large model provides a foundational backbone, while different lightweight experts residually learn specialized knowledge, either per-modality or multimodally. Extensive experiments demonstrate that the SMoLA approach helps improve the generalist performance across a broad range of generative vision-and-language tasks, achieving new SoTA generalist performance that often matches or outperforms single specialized LMM baselines, as well as new SoTA specialist performance.

Overview

  • Omni-SMoLA introduces a scalable Soft Mixture of Low-rank Experts (SMoLA) approach to improve multimodal models without a large parameter increase.

  • By using lightweight low-rank experts, the architecture enhances models' ability to specialize in different tasks, such as image captioning or visual question answering.

  • The architecture achieves state-of-the-art (SoTA) performance on vision-and-language tasks, surpassing standard fine-tuning and some specialized large multimodal model baselines.

  • Omni-SMoLA maintains efficiency with minimal inference time slowdown despite the incorporation of additional experts.

  • It is adaptable and can be scaled with an increasing number of experts as needed, without extensive parameter changes.

Omni-SMoLA: A Scalable Approach for Enhancing Vision-and-Language Models with Soft Mixture of Low-rank Experts

One of the challenges facing large multimodal models (LMMs), which process and generate content spanning different forms of data such as images and text, is maintaining performance while being adapted to a wide range of tasks. Fine-tuning these models across too many tasks often degrades their effectiveness. This is where Mixture of Experts (MoE) architectures come into play, particularly for instruction tuning, where a model is adapted to respond to specific instructions or tasks.

However, applying MoE architectures to large-scale models in the range of 50 to 100 billion parameters carries a significant computational cost. The sheer volume of parameters required to replicate and store multiple full expert models limits how many experts can practically be used.

The paper presents an architecture called Omni-SMoLA, which uses a Soft Mixture of Low-rank Experts (SMoLA) approach to mix many multimodal low-rank experts without introducing a significant number of new parameters compared to conventional MoE models. The key idea is to add lightweight experts to an existing base model so that they residually learn specialized knowledge, whether modality-specific or multimodal. These experts handle different tasks by focusing on particular types of tokens, such as text or visual tokens, enabling the model to adapt to varied requirements without a significant increase in parameters.
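To make the mechanism concrete, here is a minimal sketch of one way a soft mixture of low-rank experts can augment a frozen projection layer: each token's output is the base projection plus a router-weighted sum of rank-r adapter outputs. The class name, shapes, and initialization below (SMoLALinear, d_in, rank, and so on) are illustrative assumptions, not the paper's exact implementation.

import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

class SMoLALinear:
    """Frozen base linear layer augmented with a soft mixture of low-rank experts.

    Each expert is a rank-r adapter (A_e, B_e); a per-token router produces soft
    weights, so every token receives a convex combination of all expert outputs.
    This is a sketch of the general idea, not the paper's implementation.
    """

    def __init__(self, d_in, d_out, num_experts=4, rank=8, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(0, 0.02, (d_out, d_in))                # frozen backbone weight
        self.A = rng.normal(0, 0.02, (num_experts, rank, d_in))    # expert down-projections
        self.B = np.zeros((num_experts, d_out, rank))              # expert up-projections (init to 0)
        self.router = rng.normal(0, 0.02, (num_experts, d_in))     # per-token routing weights

    def __call__(self, x):
        # x: (num_tokens, d_in) -- could be text tokens, visual tokens, or both
        base = x @ self.W.T                                        # (tokens, d_out)
        weights = softmax(x @ self.router.T, axis=-1)              # (tokens, experts), soft routing
        low_rank = np.einsum('erd,td->ter', self.A, x)             # (tokens, experts, rank)
        expert_out = np.einsum('eor,ter->teo', self.B, low_rank)   # (tokens, experts, d_out)
        return base + np.einsum('te,teo->to', weights, expert_out)

# Tiny usage example: 6 tokens with hidden size 16.
layer = SMoLALinear(d_in=16, d_out=16)
tokens = np.random.default_rng(1).normal(size=(6, 16))
print(layer(tokens).shape)   # (6, 16)

Because the up-projections start at zero, the augmented layer initially reproduces the frozen backbone exactly; separate expert groups could be instantiated for text tokens and visual tokens to realize per-modality specialization.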

Omni-SMoLA demonstrates substantial improvements over standard fine-tuning in experiments on vision-and-language tasks such as image captioning and visual question answering. When applied to the PaLI-3 and PaLI-X base models, Omni-SMoLA achieved state-of-the-art (SoTA) performance across a broad range of generative tasks, often matching or outstripping individual specialized LMM baselines and setting new SoTA results on specific tasks.

A further advantage of the Omni-SMoLA design is its parameter efficiency and low inference overhead. Despite the additional low-rank experts, inference is only slightly slower than with the base models, underlining the efficiency of the design. The architecture is also adaptable: more experts can be added as demands evolve without an extensive parameter overhaul, a notable departure from traditional scaling methods.
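A back-of-envelope calculation shows why low-rank experts add so few parameters compared with replicating full expert weights. The dimensions below are hypothetical, chosen only to illustrate the scaling, not taken from the paper.

# Added parameters per projection layer: full expert copies vs. rank-r adapters.
# All dimensions here are hypothetical, for illustration only.
d_in, d_out, rank, num_experts = 8192, 8192, 8, 16

full_moe_params = num_experts * d_in * d_out              # replicate the whole weight matrix per expert
low_rank_params = num_experts * rank * (d_in + d_out)     # A_e (rank x d_in) + B_e (d_out x rank) per expert

print(f"full experts:     {full_moe_params / 1e6:.1f}M parameters")   # ~1073.7M
print(f"low-rank experts: {low_rank_params / 1e6:.1f}M parameters")   # ~2.1M

Under these assumptions, sixteen low-rank experts cost roughly 0.2% of the parameters of sixteen full expert copies of the same layer, which is why the number of experts can grow without a large memory or storage penalty.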

Omni-SMoLA was evaluated in multiple settings, varying the number of experts, the configuration of per-modality token experts, and the base model used for initialization. The findings show that Omni-SMoLA outperforms the conventional full-model fine-tuned baselines for both PaLI-3 and PaLI-X, setting new SoTA results on multiple benchmarks in both generalist and specialist settings.

The architecture's ability to maintain high performance for a diverse range of tasks without a significant penalty to efficiency or scalability addresses a critical issue in the development and deployment of versatile and potent LMMs. In summary, Omni-SMoLA provides a solution to adapt large models to specialized tasks efficiently while enhancing their capacity to handle a wide array of applications.
