EdgeMoE: Empowering Sparse Large Language Models on Mobile Devices (2308.14352v2)

Published 28 Aug 2023 in cs.LG, cs.AI, and cs.CL

Abstract: LLMs such as GPTs and Mixtral-8x7B have revolutionized machine intelligence due to their exceptional abilities in generic ML tasks. Transiting LLMs from datacenters to edge devices brings benefits like better privacy and availability, but is challenged by their massive parameter size and thus unbearable runtime costs. To this end, we present EdgeMoE, an on-device inference engine for mixture-of-expert (MoE) LLMs -- a popular form of sparse LLM that scales its parameter size with almost constant computing complexity. EdgeMoE achieves both memory- and compute-efficiency by partitioning the model into the storage hierarchy: non-expert weights are held in device memory; while expert weights are held on external storage and fetched to memory only when activated. This design is motivated by a key observation that expert weights are bulky but infrequently used due to sparse activation. To further reduce the expert I/O swapping overhead, EdgeMoE incorporates two novel techniques: (1) expert-wise bitwidth adaptation that reduces the expert sizes with tolerable accuracy loss; (2) expert preloading that predicts the activated experts ahead of time and preloads them with the compute-I/O pipeline. On popular MoE LLMs and edge devices, EdgeMoE showcases significant memory savings and speedup over competitive baselines. The code is available at https://github.com/UbiquitousLearning/mLLM.

Citations (34)

Summary

  • The paper introduces EdgeMoE, an on-device inference engine that partitions model storage by keeping non-expert weights in memory and loading expert weights on demand.
  • It employs expert-wise bitwidth adaptation and predictive expert preloading with in-memory caching to reduce I/O overhead while maintaining inference accuracy.
  • Empirical evaluations demonstrate significant memory efficiency and speedup, enabling practical execution of MoE LLMs with over 10 billion parameters on edge devices.

EdgeMoE: On-Device Inference Engine for MoE-based LLMs

Introduction

LLMs such as GPT and LLaMa have demonstrated strong capabilities across a wide range of machine learning tasks. Deploying them on edge devices, however, is challenging: their vast parameter counts lead to impractical memory footprints and runtime costs. In response, this paper introduces EdgeMoE, an on-device inference engine tailored to mixture-of-expert (MoE) LLMs. EdgeMoE targets both memory and compute efficiency by partitioning the model across the device's storage hierarchy and by introducing new strategies for managing experts at runtime.

EdgeMoE Design and Key Innovations

EdgeMoE's architecture rests on a key observation about the sparse activation pattern of MoE models: expert weights account for most of a model's size, yet any individual expert is activated only infrequently. Based on this, EdgeMoE keeps non-expert weights resident in device memory while holding expert weights on external storage, fetching an expert into memory only when the router activates it. This partitioning directly addresses the mismatch between the parameter size of MoE LLMs and the limited memory of edge devices.
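
The sketch below illustrates this partitioning; the class, the per-expert file layout, and the NumPy-based expert format are illustrative assumptions for clarity, not the actual EdgeMoE/mLLM implementation:

```python
import numpy as np

class PartitionedMoELayer:
    """Sketch of storage-hierarchy partitioning (hypothetical API).

    Non-expert weights (attention, router, embeddings) stay resident in device
    memory; each expert's weights live in their own file on external storage
    and are read into memory only when the router activates that expert.
    """

    def __init__(self, resident_weights, expert_paths):
        self.resident = resident_weights   # held in RAM for the whole run
        self.expert_paths = expert_paths   # expert_id -> .npy file on flash storage

    def _load_expert(self, expert_id):
        # On-demand fetch: the only time an expert's weights enter device memory.
        return np.load(self.expert_paths[expert_id])

    def forward(self, hidden, router_logits, top_k=2):
        # Sparse activation: only the top-k scoring experts are ever loaded.
        chosen = np.argsort(router_logits)[-top_k:]
        out = np.zeros_like(hidden)
        for expert_id in chosen:
            weights = self._load_expert(expert_id)   # I/O only for activated experts
            out += hidden @ weights                  # toy stand-in for the expert FFN
        return out / top_k
```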

To address the overhead associated with I/O operations during expert swapping, EdgeMoE introduces two innovative techniques:

  1. Expert-wise bitwidth adaptation: This technique compresses expert weights to shrink the volume of expert I/O while keeping the accuracy loss tolerable. Unlike uniform quantization schemes, it adjusts each expert's bitwidth according to that expert's sensitivity to quantization, striking a balance between expert size and inference accuracy (see the first sketch after this list).
  2. In-memory expert management: EdgeMoE predicts which experts are likely to be activated before their layer executes and preloads them, overlapping expert I/O with computation in a compute-I/O pipeline. This predictive loading, combined with a cache eviction policy that accounts for both activation frequency and layer-wise execution order, raises the cache hit ratio and avoids unnecessary I/O (see the second sketch after this list).
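
A hedged sketch of the first technique, expert-wise bitwidth adaptation. The greedy loop, the candidate bitwidths, and the `accuracy_drop` callback below are illustrative assumptions rather than the paper's exact algorithm; they only convey the idea of raising precision for the experts most sensitive to quantization until the accuracy loss falls under a tolerance:

```python
def assign_expert_bitwidths(experts, accuracy_drop, tol=0.01, bitwidths=(2, 4, 8)):
    """Greedy per-expert bitwidth assignment (illustrative sketch).

    accuracy_drop(assignment) -> float is an assumed callback that measures the
    model's accuracy loss (e.g. on a small calibration set) under a given
    {expert_id: bitwidth} assignment.
    """
    assignment = {e: bitwidths[0] for e in experts}      # start as small as possible
    while accuracy_drop(assignment) > tol:
        # Promote the single expert whose upgrade recovers the most accuracy.
        best_expert, best_drop = None, float("inf")
        for e in experts:
            if assignment[e] == bitwidths[-1]:
                continue                                 # already at highest precision
            trial = dict(assignment)
            trial[e] = bitwidths[bitwidths.index(assignment[e]) + 1]
            drop = accuracy_drop(trial)
            if drop < best_drop:
                best_expert, best_drop = e, drop
        if best_expert is None:                          # everything already at max bitwidth
            break
        assignment[best_expert] = bitwidths[bitwidths.index(assignment[best_expert]) + 1]
    return assignment
```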
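
And a sketch of the second technique, predictive preloading with an in-memory expert cache. The class, its eviction score, and the external predictor supplying `predicted_expert` are assumptions for illustration; in a real engine the preload would run on a background thread so the disk read overlaps with the previous layer's computation:

```python
class ExpertCache:
    """Sketch of an in-memory expert cache with predictive preloading.

    The eviction score below is an assumed policy: experts whose layer has
    already executed in the current forward pass and which are rarely
    activated are dropped first, loosely mirroring the frequency- and
    order-aware eviction described in the paper.
    """

    def __init__(self, capacity, activation_counts, load_fn):
        self.capacity = capacity            # max number of experts held in memory
        self.counts = activation_counts     # (layer, expert) -> historical activation count
        self.load_fn = load_fn              # (layer, expert) -> weights; performs the disk I/O
        self.cache = {}                     # (layer, expert) -> weights currently in memory

    def _evict(self, current_layer):
        # Prefer victims from layers that already ran this pass, then the
        # least frequently activated among them.
        victim = min(
            self.cache,
            key=lambda k: (k[0] > current_layer, self.counts.get(k, 0)),
        )
        del self.cache[victim]

    def get(self, layer, expert):
        key = (layer, expert)
        if key not in self.cache:           # cache miss: fetch from external storage
            if len(self.cache) >= self.capacity:
                self._evict(layer)
            self.cache[key] = self.load_fn(layer, expert)
        return self.cache[key]

    def preload(self, layer, predicted_expert):
        # Issued while earlier layers are still computing, so the expert's
        # disk read overlaps with compute instead of stalling the pipeline.
        self.get(layer, predicted_expert)
```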

Empirical Evaluation

EdgeMoE was evaluated on several MoE-based LLMs across multiple edge devices, delivering substantial improvements in memory efficiency and inference speed over competitive baselines. Compared with baselines that either hold the entire model in memory or load experts dynamically without prediction, EdgeMoE cuts the memory footprint and accelerates inference by significant margins. This makes inference practical for models with more than 10 billion parameters on commercially available edge devices, a feat not previously attainable.

Implications and Future Directions

The research presented in this paper not only addresses the immediate challenges of deploying MoE-based LLMs on edge devices but also opens avenues for further work on efficient model partitioning and execution strategies. The success of EdgeMoE suggests that similar principles could apply to other kinds of large models, broadening the scope of on-device AI applications. Future work might adapt these techniques to different model architectures or further optimize the trade-offs between accuracy, memory use, and inference speed.

Conclusion

EdgeMoE presents a significant step forward in the deployment of MoE-based LLMs on edge devices, making it possible to leverage the capabilities of these models within the constraints of device memory and processing power. By innovatively addressing the challenges of expert management and model quantization, EdgeMoE enables efficient on-device inference, unlocking new possibilities for personalization, privacy, and real-time AI applications directly on user devices.