Abstract

Large Language Models (LLMs) are widely employed on mobile phones for tasks such as intelligent assistants, text summarization, translation, and multi-modal interaction. However, current methods for on-device LLM deployment suffer from slow inference speed, which leads to a poor user experience. To facilitate high-efficiency LLM deployment on device GPUs, we propose four optimization techniques: (a) a symbolic expression-based approach to support dynamic shape model inference; (b) operator optimizations and execution priority setting to enhance inference speed and reduce phone lagging; (c) an FP4 quantization method termed M0E4 to reduce dequantization overhead; (d) a sub-tensor-based technique to eliminate the need for copying the KV cache after LLM inference. Furthermore, we implement these methods in our mobile inference engine, Transformer-Lite, which is compatible with both Qualcomm and MTK processors. We evaluated Transformer-Lite's performance using LLMs with varied architectures and parameter counts ranging from 2B to 14B. Specifically, we achieved prefill and decoding speeds of 121 token/s and 14 token/s for ChatGLM2 6B, and 330 token/s and 30 token/s for the smaller Gemma 2B, respectively. Compared with CPU-based FastLLM and GPU-based MLC-LLM, our engine attains over 10x speedup in prefill speed and 2~3x speedup in decoding speed.

Overview

  • The paper introduces Transformer-Lite, an optimization engine for deploying LLMs on mobile GPUs, addressing computational and memory constraints.

  • It proposes optimization techniques such as symbolic expression-based dynamic shape inference, operator optimization, FP4 quantization (M0E4), and sub-tensor-based KV cache optimization.

  • Experimental evaluations show that Transformer-Lite significantly outperforms existing solutions, enabling the deployment of up to 14B parameter models on mobile devices with remarkable speed improvements.

  • The study suggests that these optimizations could revolutionize mobile AI by enabling high-efficiency, real-time applications, reducing reliance on cloud models, and enhancing privacy.

High-Efficiency Deployment of LLMs on Mobile Phone GPUs

Introduction

The paper introduces methodologies for efficiently deploying LLMs on mobile device GPUs. Given the computational and memory bandwidth constraints inherent in mobile phones, existing methods yield slow inference speeds, adversely affecting user experience. The authors propose a suite of optimization techniques aimed at addressing these challenges: a symbolic expression-based approach for dynamic shape model inference, operator optimization and execution priority setting, an FP4 quantization method named M0E4, and a sub-tensor-based technique for the KV cache. These optimizations are implemented in a new mobile inference engine, Transformer-Lite, which the paper shows can deploy LLMs on mobile platforms with substantial speed improvements over existing solutions.

Key Optimizations

The paper outlines four primary optimization strategies aimed at enhancing LLM deployment on device GPUs:

  • Symbolic Expression-Based Dynamic Shape Inference: Addresses the dynamic input shapes encountered during LLM deployment (prompt and cache lengths vary per request) by representing tensor dimensions as symbolic expressions that are propagated through operators at build time and bound to concrete values at runtime (a sketch follows this list).
  • Operator and Lagging Optimizations: Operators are optimized and execution priorities are set so that inference runs faster and the phone lags less, with attention to the characteristics of LLM operations such as matrix multiplication.
  • M0E4 FP4 Quantization Method: Introduces a 4-bit weight quantization format that minimizes dequantization overhead, enabling efficient matrix multiplication between half-precision activations and 4-bit quantized weights (a generic sketch follows this list).
  • Sub-Tensor-Based KV Cache Optimization: Eliminates redundant copying of the KV cache after inference by operating on sub-tensors of a preallocated cache buffer, thereby reducing memory consumption and inference time (a sketch follows this list).
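
The first bullet can be made concrete with a small sketch of symbolic shape propagation: dimensions are either integers or named symbols, operator shape rules carry the symbols through, and the symbols are bound to real lengths per request. This is a hedged illustration of the general idea, not the engine's actual implementation; the `Dim` encoding, the operator set, and the `seq_len`/`past_len` names are assumptions made for the example.

```python
from typing import Union

Dim = Union[int, str]  # a dimension is either a constant or a symbolic expression string

def sym_add(a: Dim, b: Dim) -> Dim:
    """Add two dims, folding constants and otherwise keeping a symbolic '+' expression."""
    if isinstance(a, int) and isinstance(b, int):
        return a + b
    return f"{a}+{b}"

def matmul_shape(a: list, b: list) -> list:
    """Output shape of A @ B; symbolic dims are simply carried through."""
    return a[:-1] + [b[-1]]

def concat_shape(a: list, b: list, axis: int) -> list:
    """Output shape of concat(A, B) along `axis`."""
    out = list(a)
    out[axis] = sym_add(a[axis], b[axis])
    return out

def bind(shape: list, values: dict) -> list:
    """Resolve symbolic dims to integers once the real lengths are known at runtime."""
    def resolve(d: Dim) -> int:
        if isinstance(d, int):
            return d
        return sum(int(t) if t.isdigit() else values[t] for t in d.split("+"))
    return [resolve(d) for d in shape]

# Build time: attention over cached keys plus this step's keys, lengths still unknown.
k_cache = ["past_len", 128]                                # cached keys: [tokens, head_dim]
k_new = ["seq_len", 128]                                   # this step's keys
k_all = concat_shape(k_cache, k_new, axis=0)               # ["past_len+seq_len", 128]
scores = matmul_shape(["seq_len", 128], [128, k_all[0]])   # ["seq_len", "past_len+seq_len"]

# Runtime: bind the symbols for a 64-token prompt during prefill (no past tokens yet).
print(bind(scores, {"seq_len": 64, "past_len": 0}))        # [64, 64]
```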
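For the quantization bullet, the exact M0E4 bit layout from the paper is not reproduced here; the sketch below instead shows generic group-wise 4-bit weight quantization with half-precision activations, to make the "dequantize, then multiply" flow concrete. The group size of 64, the symmetric scaling, and the int8 storage (a real engine would pack two 4-bit values per byte) are assumptions for illustration.

```python
import numpy as np

GROUP = 64  # number of weights sharing one FP16 scale (assumption)

def quantize_4bit(w: np.ndarray):
    """Quantize an FP16 weight matrix to signed 4-bit values with per-group scales."""
    groups = w.astype(np.float32).reshape(-1, GROUP)
    scale = np.abs(groups).max(axis=1, keepdims=True) / 7.0          # target range [-7, 7]
    q = np.clip(np.round(groups / scale), -7, 7).astype(np.int8)     # int8 here; packed 4-bit on device
    return q, scale.astype(np.float16)

def dequant_matmul(x: np.ndarray, q: np.ndarray, scale: np.ndarray, w_shape):
    """Dequantize the weights back to FP16 and multiply with FP16 activations."""
    w = (q.astype(np.float16) * scale).reshape(w_shape)               # per-group rescale
    return x.astype(np.float16) @ w

# Round-trip example on a random layer.
w = np.random.randn(256, 256).astype(np.float16)
q, s = quantize_4bit(w)
x = np.random.randn(1, 256).astype(np.float16)
y = dequant_matmul(x, q, s, w.shape)
print(np.abs(y - x @ w).max())  # small quantization error
```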
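The KV cache bullet can likewise be sketched: the cache is allocated once at the maximum context length, and every step reads and writes views (sub-tensors) of that buffer, so the newly generated keys and values never need to be copied back after inference. The class name, shapes, and 2048-token limit below are illustrative assumptions, not the engine's API.

```python
import numpy as np

class KVCache:
    def __init__(self, max_ctx: int, n_heads: int, head_dim: int):
        # One persistent buffer each for K and V; the filled length is tracked separately.
        self.k = np.zeros((max_ctx, n_heads, head_dim), dtype=np.float16)
        self.v = np.zeros_like(self.k)
        self.length = 0

    def append(self, new_k: np.ndarray, new_v: np.ndarray):
        """Write this step's K/V directly into the preallocated buffer (no copy afterwards)."""
        n = new_k.shape[0]
        self.k[self.length:self.length + n] = new_k
        self.v[self.length:self.length + n] = new_v
        self.length += n

    def view(self):
        """Return sub-tensor views over the valid prefix for the attention operator."""
        return self.k[:self.length], self.v[:self.length]

# Usage: prefill writes the whole prompt, then each decoding step appends one token.
cache = KVCache(max_ctx=2048, n_heads=32, head_dim=128)
cache.append(np.zeros((16, 32, 128), np.float16), np.zeros((16, 32, 128), np.float16))  # 16-token prompt
cache.append(np.zeros((1, 32, 128), np.float16), np.zeros((1, 32, 128), np.float16))    # one new token
k, v = cache.view()
print(k.shape)  # (17, 32, 128)
```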

Experimental Evaluation

The empirical evaluation demonstrates Transformer-Lite's superior performance compared with the CPU-based FastLLM and GPU-based MLC-LLM engines. Speedups of over 10x in prefill and 2~3x in decoding speed affirm the efficacy of the proposed optimizations across LLM architectures with parameter counts ranging from 2B to 14B. Furthermore, the engine's success in deploying models of up to 14B parameters on mobile devices underscores the potential to bring advanced AI applications directly to end users without compromising performance.

Implications and Future Work

These optimizations have notable implications for the deployment of LLMs on mobile devices, offering a pathway to achieving high-efficiency, real-time AI applications directly on user devices. The advancements not only promise improved user experiences by enabling faster inference times but also hint at a significant reduction in reliance on cloud-based models, thus enhancing privacy and accessibility of AI technologies. Looking forward, the exploration of more efficient matrix multiplication implementations, the incorporation of additional acceleration techniques, and the refinement of model structures represent potential areas for further improvement in deploying LLMs on mobile GPUs.

Summary

This study presents Transformer-Lite, a new mobile inference engine integrating a suite of optimization techniques for efficient deployment of LLMs on mobile devices. The proposed methods demonstrate remarkable speed improvements, establishing a solid foundation for the future development of mobile-based AI applications. The exploratory findings hint at significant potential for advancements in on-device AI processing, suggesting an exciting trajectory for research and development in mobile AI technologies.
