Fast Inference of Mixture-of-Experts Language Models with Offloading

(2312.17238)
Published Dec 28, 2023 in cs.LG, cs.AI, and cs.DC

Abstract

With the widespread adoption of LLMs, many deep learning practitioners are looking for strategies for running these models more efficiently. One such strategy is the sparse Mixture-of-Experts (MoE) architecture, in which only a fraction of the model's layers are active for any given input. This property allows MoE-based language models to generate tokens faster than their dense counterparts, but it also increases model size because each MoE layer holds multiple experts. Unfortunately, this makes state-of-the-art MoE language models difficult to run without high-end GPUs. In this work, we study the problem of running large MoE language models on consumer hardware with limited accelerator memory. We build upon parameter offloading algorithms and propose a novel strategy that accelerates offloading by taking advantage of innate properties of MoE LLMs. Using this strategy, we can run Mixtral-8x7B with mixed quantization on desktop hardware and free-tier Google Colab instances.

Figure: expert loading patterns in Mixtral-8x7B-Instruct; deeper blue indicates a higher gating weight, gray indicates a cached expert.

Overview

  • The paper addresses the challenge of deploying large Mixture-of-Experts (MoE) Language Models on systems with limited GPU memory.

  • It introduces MoE-specific offloading and mixed quantization techniques to reduce the model size and improve efficiency on consumer-grade hardware.

  • The offloading strategy improves caching and speculatively loads experts, minimizing the need for constant data transfer between RAM and GPU.

  • Experiments show that these methods significantly increase token generation speed for the MoE model Mixtral-8x7B across different hardware setups.

  • The work makes powerful MoE models more accessible and paves the way for future research on improving performance on limited hardware.

Introduction

LLMs have revolutionized natural language processing, but deploying them can be resource-intensive due to their massive size. They frequently require several high-end GPUs for operation, which can be a barrier for those without access to such hardware. This challenge is particularly acute with a subclass of LLMs known as Mixture-of-Experts (MoE) models, which offer efficient token generation but have larger model sizes that make them difficult to run on consumer-grade machines.

Addressing the MoE Challenge

The paper focuses on enabling the use of MoE LLMs on hardware with limited GPU memory, which is critical for making these powerful models more accessible. The research builds on parameter offloading techniques to cope with the limited memory in consumer accelerators. The authors developed tactics to effectively run a large MoE model known as Mixtral-8x7B on standard desktop computers and even free compute instances like Google Colab.

Offloading Strategy and Mixed Quantization

Two strategies are introduced: MoE-specific offloading and mixed quantization. The offloading approach exploits regularities in how MoE models activate their experts, which informed an improved caching scheme that keeps recently used experts on the GPU and reduces GPU-RAM data transfer, thereby accelerating token generation. In addition, the method speculatively loads experts ahead of time by exploiting predictable patterns in expert usage across layers.
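To make the idea concrete, here is a minimal sketch in PyTorch of an expert cache with speculative prefetching. The names and details (an ExpertCache class, an LRU-style eviction policy, a guess_next_experts helper that applies an upcoming layer's router to the current hidden state) are illustrative assumptions, not the authors' actual implementation.

```python
# Minimal sketch of MoE expert offloading with an LRU-style cache and
# speculative prefetch. All names (ExpertCache, guess_next_experts, ...) are
# illustrative assumptions, not the paper's actual API.
from collections import OrderedDict
import torch


class ExpertCache:
    """Keeps up to `capacity` experts resident on the GPU; the remaining
    experts stay offloaded in CPU memory and are copied in on demand."""

    def __init__(self, cpu_experts, capacity, device="cuda"):
        # cpu_experts: dict mapping (layer, expert_id) -> expert weights on CPU
        self.cpu_experts = cpu_experts
        self.capacity = capacity
        self.device = device
        self.gpu_cache = OrderedDict()  # (layer, expert_id) -> weights on GPU

    def get(self, layer, expert_id):
        key = (layer, expert_id)
        if key in self.gpu_cache:
            self.gpu_cache.move_to_end(key)        # mark as most recently used
            return self.gpu_cache[key]
        return self._load(key)                     # cache miss: copy from CPU

    def prefetch(self, layer, expert_ids):
        """Speculatively copy experts we guess an upcoming layer will need."""
        for expert_id in expert_ids:
            key = (layer, expert_id)
            if key not in self.gpu_cache:
                self._load(key, non_blocking=True)  # overlap copy with compute

    def _load(self, key, non_blocking=False):
        if len(self.gpu_cache) >= self.capacity:
            self.gpu_cache.popitem(last=False)      # evict least recently used
        weights = self.cpu_experts[key].to(self.device, non_blocking=non_blocking)
        self.gpu_cache[key] = weights
        return weights


def guess_next_experts(hidden_state, next_layer_router, top_k=2):
    """Apply an upcoming layer's router to the current hidden state to guess
    which experts it will select; the guess is only used for prefetching."""
    logits = next_layer_router(hidden_state)
    return torch.topk(logits, top_k, dim=-1).indices.flatten().tolist()
```

The point of the prefetch path is that expert transfers can overlap with computation of the current layer, so a correct guess hides most of the loading latency; a wrong guess costs no more than a regular cache miss.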

Mixed quantization compresses the model parameters to reduce their size, making them cheaper to store in RAM and faster to transfer to the GPU. The paper lays out a system design that combines the offloading strategy with a mixed MoE quantization scheme, tailoring the quantization level to different parts of the model. This reduces loading times without severely compromising model quality.
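The sketch below illustrates the general idea with a simple round-to-nearest scheme: expert weights are quantized more aggressively than the shared layers. The bit-widths in BIT_CONFIG, the quantize/dequantize helpers, and the omission of real sub-byte packing are simplifying assumptions for illustration, not the exact scheme used in the paper.

```python
# Illustrative sketch of mixed quantization: expert weights are compressed
# more aggressively than the shared (attention) layers so they are cheaper
# to keep in RAM and faster to stream to the GPU. The bit-widths and the
# round-to-nearest scheme are assumptions, not the paper's exact method.
import torch


def quantize(weight: torch.Tensor, bits: int):
    """Symmetric round-to-nearest quantization to `bits` bits per weight.
    Values are kept in int8 here; a real implementation would pack sub-byte
    values to actually realize the memory savings."""
    qmax = 2 ** (bits - 1) - 1
    scale = weight.abs().max() / qmax
    q = torch.clamp(torch.round(weight / scale), -qmax - 1, qmax).to(torch.int8)
    return q, scale


def dequantize(q: torch.Tensor, scale: torch.Tensor):
    return q.float() * scale


# Hypothetical per-module configuration: heavier compression for experts.
BIT_CONFIG = {
    "attention": 4,  # shared layers: gentler quantization
    "expert": 3,     # per-expert FFN weights: more aggressive quantization
}

if __name__ == "__main__":
    w = torch.randn(512, 2048)  # small example weight matrix
    q, scale = quantize(w, BIT_CONFIG["expert"])
    err = (dequantize(q, scale) - w).abs().mean()
    print(f"mean absolute quantization error: {err:.4f}")
```

Storing experts at lower precision both shrinks the RAM footprint of the offloaded weights and cuts the time spent on each CPU-to-GPU transfer, which is where an offloaded model spends most of its time.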

Experimental Results and Conclusion

Comprehensive experiments confirm the efficacy of the caching and offloading techniques. Applied to the Mixtral-8x7B MoE model, they yield substantial improvements in token generation speed across multiple hardware configurations. The authors' implementation generates 2-3 tokens per second, depending on the hardware, a clear advantage over naive offloading.

This study offers a significant advancement in the practical deployment of large MoE models, broadening their accessibility. Future work will focus on refining these offloading strategies further and possibly exploring new approaches for speculative expert prediction to enhance performance even on more restricted hardware setups. The source code for this implementation has been made available, encouraging further research and development in this space.
