EdgeMoE: Fast On-Device Inference of MoE-based Large Language Models

(2308.14352)
Published Aug 28, 2023 in cs.LG, cs.AI, and cs.CL

Abstract

LLMs such as GPTs and LLaMa have ushered in a revolution in machine intelligence, owing to their exceptional capabilities in a wide range of machine learning tasks. However, the transition of LLMs from data centers to edge devices presents a set of challenges and opportunities. While this shift can enhance privacy and availability, it is hampered by the enormous parameter sizes of these models, leading to impractical runtime costs. In light of these considerations, we introduce EdgeMoE, the first on-device inference engine tailored for mixture-of-expert (MoE) LLMs, a popular variant of sparse LLMs that exhibit nearly constant computational complexity as their parameter size scales. EdgeMoE achieves both memory and computational efficiency by strategically partitioning the model across the storage hierarchy. Specifically, non-expert weights are stored in the device's memory, while expert weights are kept in external storage and are fetched into memory only when they are activated. This design is underpinned by a crucial insight that expert weights, though voluminous, are infrequently accessed due to sparse activation patterns. To further mitigate the overhead associated with expert I/O swapping, EdgeMoE incorporates two innovative techniques: (1) Expert-wise bitwidth adaptation: This method reduces the size of expert weights with an acceptable level of accuracy loss. (2) Expert management: It predicts the experts that will be activated in advance and preloads them into the compute-I/O pipeline, thus further optimizing the process. In empirical evaluations conducted on well-established MoE LLMs and various edge devices, EdgeMoE demonstrates substantial memory savings and performance improvements when compared to competitive baseline solutions.

Overview

  • Introduces EdgeMoE, an on-device inference engine designed for MoE-based LLMs, aimed at optimizing memory and computational efficiency.

  • Describes the novel strategies of EdgeMoE that include storing non-expert weights in memory and expert weights in external storage for efficient handling of large model parameters.

  • Outlines innovative techniques such as expert-wise bitwidth adaptation for reduced I/O data volume and in-memory expert management for minimizing unnecessary I/O operations.

  • Shows that EdgeMoE achieves significant improvements in memory efficiency and inference speed for MoE-based LLMs on edge devices, suggesting potential for broad on-device AI applications.

EdgeMoE: On-Device Inference Engine for MoE-based LLMs

Introduction

LLMs such as GPT and LLaMa have demonstrated significant capabilities across a wide range of machine learning tasks. Deploying such models on edge devices, however, is challenging, primarily because their vast parameter sizes lead to impractical runtime and memory costs. In response, this paper introduces EdgeMoE, a pioneering on-device inference engine tailored to mixture-of-expert (MoE) based LLMs. EdgeMoE aims to optimize both memory and computational efficiency by partitioning the model across the device's storage hierarchy and by introducing novel strategies for expert management.

EdgeMoE Design and Key Innovations

EdgeMoE's architecture rests on a crucial insight about the sparse activation patterns of MoE models: a significant portion of expert weights, despite their volume, is infrequently accessed. Based on this, EdgeMoE stores non-expert weights in device memory while keeping expert weights in external storage, fetching them into memory only upon activation, as sketched below. This design directly addresses the memory pressure that the parameter sizes of MoE LLMs impose on edge devices.
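As a concrete illustration, the partitioning might look like the following sketch, where non-expert weights (such as the router) stay resident in RAM and each expert is read from storage only when the router selects it. The class names, file layout, and top-1 routing shown here are assumptions for illustration, not the paper's implementation.

```python
# Illustrative sketch (not the paper's code): non-expert weights stay resident
# in memory, expert weights live on flash/disk and are loaded only when the
# router activates them.
import numpy as np


class ExpertStore:
    """Holds expert weights in external storage; loads one expert on demand."""

    def __init__(self, weight_dir, num_layers, num_experts):
        self.weight_dir = weight_dir
        self.num_layers = num_layers
        self.num_experts = num_experts

    def load(self, layer, expert):
        # Each expert is assumed to be saved as its own .npy file, so a single
        # activation only reads that expert's bytes from storage.
        path = f"{self.weight_dir}/layer{layer}_expert{expert}.npy"
        return np.load(path)


class MoELayer:
    def __init__(self, router_weights, expert_store, layer_idx):
        self.router_weights = router_weights   # non-expert weights: always in RAM
        self.expert_store = expert_store       # expert weights: fetched on demand
        self.layer_idx = layer_idx

    def forward(self, x):
        # The router (a small dense projection) picks the top-1 expert per token.
        logits = x @ self.router_weights
        expert_id = int(np.argmax(logits))
        # Only the activated expert is brought into memory for this token.
        w = self.expert_store.load(self.layer_idx, expert_id)
        return x @ w
```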

To address the overhead associated with I/O operations during expert swapping, EdgeMoE introduces two innovative techniques:

  1. Expert-wise bitwidth adaptation: This method compresses expert weights to reduce I/O data volume while preserving an acceptable level of accuracy. Unlike applying a uniform bitwidth to every expert, this approach adapts each expert's bitwidth to its sensitivity to quantization, balancing expert size against inference accuracy (see the first sketch after this list).
  2. In-memory expert management: By predicting expert activations ahead of time, EdgeMoE preloads the experts most likely to be activated into a compute-I/O pipeline, overlapping expert loading with computation. This predictive loading, coupled with an expert cache eviction policy that accounts for both activation frequency and layer-wise execution order, maximizes cache hit ratios and thereby reduces unnecessary I/O operations (see the second sketch after this list).
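A minimal sketch of how expert-wise bitwidth adaptation could be realized, assuming per-expert sensitivity estimates (for example, accuracy drop measured on a small calibration set) are available: greedily lower the bitwidth of the least sensitive expert until the total expert size fits a target budget. The function name, the greedy strategy, and the candidate bitwidths are illustrative assumptions rather than the paper's exact algorithm.

```python
# Illustrative sketch of expert-wise bitwidth adaptation (not the paper's
# exact algorithm): start every expert at a high bitwidth, then repeatedly
# lower the bitwidth of the expert whose further quantization costs the
# least estimated accuracy, until the experts fit a target size budget.

def adapt_bitwidths(sensitivity, expert_bytes_fp16, budget_bytes,
                    bitwidths=(16, 8, 4, 2)):
    """sensitivity[(expert, bits)] -> estimated accuracy drop at that
    bitwidth (smaller is better); expert_bytes_fp16[e] -> size of expert e
    at 16-bit precision."""
    num_experts = len(expert_bytes_fp16)
    assignment = {e: bitwidths[0] for e in range(num_experts)}

    def total_size():
        # Scale each expert's 16-bit size by its currently assigned bitwidth.
        return sum(expert_bytes_fp16[e] * assignment[e] / 16
                   for e in range(num_experts))

    while total_size() > budget_bytes:
        # Among experts that can still be compressed further, pick the one
        # whose next-lower bitwidth hurts estimated accuracy the least.
        candidates = []
        for e in range(num_experts):
            idx = bitwidths.index(assignment[e])
            if idx + 1 < len(bitwidths):
                nxt = bitwidths[idx + 1]
                candidates.append((sensitivity[(e, nxt)], e, nxt))
        if not candidates:
            break  # budget unreachable even at the lowest bitwidth
        _, best_expert, best_bits = min(candidates)
        assignment[best_expert] = best_bits
    return assignment
```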
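The in-memory expert management side can be pictured as a small cache plus a preloading hook that runs ahead of the current layer. The eviction score below, which combines activation frequency with how soon an expert's layer will execute again, is an assumed approximation of the policy described above, not the paper's exact implementation.

```python
# Illustrative sketch of in-memory expert management (names and scoring are
# assumptions): a fixed-size cache keyed by (layer, expert) that prefers to
# evict rarely activated experts whose layer will not run again for a while,
# plus a preload hook that overlaps expert I/O with ongoing computation.

class ExpertCache:
    def __init__(self, capacity, num_layers):
        self.capacity = capacity
        self.num_layers = num_layers
        self.cache = {}   # (layer, expert) -> weights
        self.freq = {}    # (layer, expert) -> activation count

    def _evict(self, current_layer):
        def score(key):
            layer, _ = key
            # Number of layers until this expert's layer executes again.
            distance = (layer - current_layer) % self.num_layers
            # Rarely activated experts, and those needed again latest,
            # are evicted first.
            return (self.freq.get(key, 0), -distance)
        victim = min(self.cache, key=score)
        del self.cache[victim]

    def get(self, layer, expert, loader):
        key = (layer, expert)
        self.freq[key] = self.freq.get(key, 0) + 1
        if key not in self.cache:          # cache miss: fetch from storage
            if len(self.cache) >= self.capacity:
                self._evict(layer)
            self.cache[key] = loader(layer, expert)
        return self.cache[key]

    def preload(self, layer, predicted_expert, loader):
        # Fetch the expert predicted for an upcoming layer while the current
        # layer is still computing, hiding I/O latency behind compute.
        self.get(layer, predicted_expert, loader)
```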

Empirical Evaluation

EdgeMoE was assessed on several MoE-based LLMs across multiple edge devices, demonstrating substantial improvements in memory efficiency and inference speed over competitive baselines. Compared with baselines that either load the full model into memory or dynamically load experts on demand, EdgeMoE achieved significant reductions in memory footprint and significant inference speedups. This makes inference practical for models with over 10 billion parameters on commercially available edge devices, a feat not previously attainable.

Implications and Future Directions

The research presented in this paper not only addresses the immediate challenges of deploying MoE-based LLMs on edge devices but also opens up avenues for further investigation into efficient model partitioning and execution strategies. The success of EdgeMoE suggests that similar principles could be applied to other types of large models, potentially broadening the scope of on-device AI applications. Future work might explore the adaptation of these techniques to different model architectures or delve deeper into optimizing the trade-offs between accuracy, memory use, and inference speed.

Conclusion

EdgeMoE presents a significant step forward in the deployment of MoE-based LLMs on edge devices, making it possible to leverage the capabilities of these models within the constraints of device memory and processing power. By innovatively addressing the challenges of expert management and model quantization, EdgeMoE enables efficient on-device inference, unlocking new possibilities for personalization, privacy, and real-time AI applications directly on user devices.
