
PowerInfer-2: Fast Large Language Model Inference on a Smartphone

(2406.06282)
Published Jun 10, 2024 in cs.LG

Abstract

This paper introduces PowerInfer-2, a framework designed for high-speed inference of LLMs on smartphones, particularly effective for models whose sizes exceed the device's memory capacity. The key insight of PowerInfer-2 is to utilize the heterogeneous computation, memory, and I/O resources in smartphones by decomposing traditional matrix computations into fine-grained neuron cluster computations. Specifically, PowerInfer-2 features a polymorphic neuron engine that adapts computational strategies for various stages of LLM inference. Additionally, it introduces segmented neuron caching and fine-grained neuron-cluster-level pipelining, which effectively minimize and conceal the overhead caused by I/O operations. The implementation and evaluation of PowerInfer-2 demonstrate its capability to support a wide array of LLM models on two smartphones, achieving up to a 29.2x speed increase compared with state-of-the-art frameworks. Notably, PowerInfer-2 is the first system to serve the TurboSparse-Mixtral-47B model with a generation rate of 11.68 tokens per second on a smartphone. For models that fit entirely within the memory, PowerInfer-2 can achieve approximately a 40% reduction in memory usage while maintaining inference speeds comparable to llama.cpp and MLC-LLM. For more details, including a demonstration video, please visit the project site at www.powerinfer.ai/v2.


Overview

  • PowerInfer-2 introduces a framework that enables high-speed inference of LLMs on smartphones, even for models that exceed the device's memory capacity.

  • Key techniques include heterogeneous resource utilization, a polymorphic neuron engine, segmented neuron caching, and fine-grained pipelining, which collectively optimize computational efficiency and reduce latency.

  • PowerInfer-2 demonstrates significant performance improvements, achieving up to a 29.2x speedup over existing frameworks and generating 11.68 tokens per second with the 47B-parameter TurboSparse-Mixtral model on a smartphone.

Overview of "PowerInfer-2: Fast Large Language Model Inference on a Smartphone"

The paper titled "PowerInfer-2: Fast Large Language Model Inference on a Smartphone" presents PowerInfer-2, a framework designed to achieve high-speed inference of LLMs on smartphones. This work is particularly significant for models whose sizes exceed the device's memory capacity.

Key Contributions

PowerInfer-2 integrates several novel techniques to address the computational and memory limitations inherent in smartphones:

  1. Heterogeneous Resource Utilization:

    • The framework leverages the heterogeneous computation, memory, and I/O resources of modern smartphones by decomposing traditional matrix computations into fine-grained neuron-cluster computations. Working at cluster granularity lets each piece of work be routed to the processor best suited for it and lets weights be loaded on demand, making fuller use of the device's resources (a minimal sketch of the decomposition appears after this list).
  2. Polymorphic Neuron Engine:

    • The proposed polymorphic neuron engine adapts its computational strategy to the stage of LLM inference. In the prefill stage, where all prompt tokens are processed at once, it forms large neuron clusters to exploit the NPU's strength in dense matrix computation. The decoding stage, which generates one token at a time, instead uses small neuron clusters on CPU cores, whose flexibility suits this phase's lighter, sparser computation (see the dispatch sketch after this list).
  3. Segmented Neuron Caching:

    • To cope with limited device memory, PowerInfer-2 keeps an in-memory neuron cache that is segmented by weight type and prioritizes frequently activated ("hot") neurons, so most cluster requests are served from DRAM rather than flash storage. Remaining misses are fetched from storage and overlapped with computation, as described in the next item (a combined cache-and-pipeline sketch follows this list).
  4. Fine-Grained Pipelining:

    • The framework pipelines work at the granularity of neuron clusters, overlapping storage I/O with computation: clusters already resident in memory are processed while missing clusters are still being fetched. This removes much of the idle time otherwise spent waiting on data and yields a noticeable speedup in end-to-end inference.
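
The neuron-cluster decomposition in item 1 can be pictured with a short NumPy sketch. This is an illustrative approximation, not PowerInfer-2's implementation: the function names (predict_active_neurons, clustered_matvec), the cluster size, and the random predictor standing in for the paper's learned activation predictor are all hypothetical.

```python
import numpy as np

def predict_active_neurons(x, num_neurons, keep_ratio=0.2, rng=None):
    """Hypothetical stand-in for a sparsity predictor: returns indices of
    neurons expected to be active for input x (random here, learned in the paper)."""
    rng = rng or np.random.default_rng(0)
    k = max(1, int(num_neurons * keep_ratio))
    return np.sort(rng.choice(num_neurons, size=k, replace=False))

def clustered_matvec(W, x, active, cluster_size=8):
    """Compute y = W @ x using only the rows (neurons) listed in `active`,
    processed as fixed-size neuron clusters; inactive rows remain zero."""
    y = np.zeros(W.shape[0], dtype=x.dtype)
    for start in range(0, len(active), cluster_size):
        cluster = active[start:start + cluster_size]  # one neuron cluster
        y[cluster] = W[cluster] @ x                   # small dense matvec
    return y

# Toy usage: a 4096x4096 FFN-style layer with ~20% of neurons predicted active.
W = np.random.randn(4096, 4096).astype(np.float32)
x = np.random.randn(4096).astype(np.float32)
active = predict_active_neurons(x, W.shape[0])
y = clustered_matvec(W, x, active)
```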
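
The polymorphic neuron engine of item 2 amounts to choosing a computation granularity per inference stage. Below is a minimal, hypothetical dispatcher in that spirit: a batched dense path for prefill (the kind of regular workload an NPU favors) and a sparse, small-cluster path for decoding (CPU-friendly). Names such as Stage and run_layer are illustrative and not taken from the paper or its code; the sketch only distinguishes the computation patterns, not the actual NPU/CPU offloading.

```python
import numpy as np
from enum import Enum

class Stage(Enum):
    PREFILL = "prefill"   # all prompt tokens processed at once
    DECODE = "decode"     # one token generated at a time

def run_layer(W, X, stage, active_neurons=None):
    """Illustrative dispatch in the spirit of a polymorphic neuron engine."""
    if stage is Stage.PREFILL:
        # Large-granularity path: the whole weight matrix against a token batch,
        # the kind of regular dense workload that suits an NPU.
        return X @ W.T
    # Small-granularity path for decoding: a single token vector multiplied
    # only against the neuron rows predicted to be active (CPU-friendly).
    y = np.zeros(W.shape[0], dtype=X.dtype)
    y[active_neurons] = W[active_neurons] @ X
    return y

# Toy usage.
W = np.random.randn(1024, 512).astype(np.float32)
prompt = np.random.randn(16, 512).astype(np.float32)   # 16 prompt tokens
token = np.random.randn(512).astype(np.float32)        # one decode-step token
hidden_prefill = run_layer(W, prompt, Stage.PREFILL)   # shape (16, 1024)
hidden_decode = run_layer(W, token, Stage.DECODE,
                          active_neurons=np.arange(0, 1024, 5))
```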
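
Items 3 and 4 work together: a neuron cache keeps hot clusters in memory, and a pipeline overlaps flash reads for missing clusters with computation on clusters already available. The sketch below captures that interplay under simplifying assumptions: a plain LRU cache stands in for PowerInfer-2's segmented cache policy, a Python dict plays the role of flash storage, and one I/O thread plus one compute loop approximate the fine-grained neuron-cluster pipeline.

```python
import queue
import threading
from collections import OrderedDict

import numpy as np

# Toy "flash storage": every neuron cluster's weights live here.
N_CLUSTERS, CLUSTER_ROWS, DIM = 64, 8, 256
flash = {c: np.random.randn(CLUSTER_ROWS, DIM).astype(np.float32)
         for c in range(N_CLUSTERS)}

class NeuronCache:
    """A plain LRU cache over neuron clusters. PowerInfer-2's cache is
    segmented by weight type and activation frequency; LRU is only a stand-in."""
    def __init__(self, capacity=16):
        self.capacity, self.data = capacity, OrderedDict()
    def get(self, cid):
        if cid in self.data:
            self.data.move_to_end(cid)           # mark as recently used
            return self.data[cid]
        return None
    def put(self, cid, weights):
        self.data[cid] = weights
        self.data.move_to_end(cid)
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)        # evict least recently used

def pipelined_decode_step(active_clusters, x, cache):
    """Overlap I/O with compute: an I/O thread resolves each cluster (cache hit
    or flash read) and hands it to the compute loop as soon as it is ready."""
    ready = queue.Queue()

    def io_worker():
        for cid in active_clusters:
            w = cache.get(cid)
            if w is None:                        # cache miss: read from "flash"
                w = flash[cid]
                cache.put(cid, w)
            ready.put((cid, w))
        ready.put(None)                          # end-of-stream marker

    threading.Thread(target=io_worker, daemon=True).start()

    partial = {}
    while (item := ready.get()) is not None:     # compute overlaps pending I/O
        cid, w = item
        partial[cid] = w @ x                     # per-cluster matvec
    return partial

cache = NeuronCache(capacity=16)
x = np.random.randn(DIM).astype(np.float32)
outputs = pipelined_decode_step([3, 7, 12, 42, 19], x, cache)
```

The real system additionally tailors its I/O sizes and concurrency to the smartphone's UFS storage characteristics; the sketch is only meant to show how fetching and computing can overlap at neuron-cluster granularity.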

Performance and Evaluation

Implementation: PowerInfer-2 was implemented by extending the PowerInfer framework with an additional 12K lines of code and deployed on two smartphones: OnePlus 12 and OnePlus Ace 2.

Models and Techniques: The system supports various LLMs, including the Llama-2 (7B, 13B), TurboSparse-Mistral (7B), and TurboSparse-Mixtral (47B) models. Evaluations demonstrate that PowerInfer-2 achieves:

  • Up to a 29.2x speedup over state-of-the-art frameworks such as llama.cpp and LLM in a Flash.
  • The capability to generate 11.68 tokens per second when running the TurboSparse-Mixtral-47B model on a smartphone.

For smaller models that fit entirely within device memory, PowerInfer-2 achieves approximately a 40% reduction in memory usage while maintaining inference speeds comparable to llama.cpp and MLC-LLM.

Practical and Theoretical Implications

Practical Implications:

  • The ability to run large LLMs on smartphones enhances the utility of personal devices in performing sophisticated tasks such as real-time language translation, natural language understanding, and conversational AI without relying on cloud-based services, thus improving user privacy.

Theoretical Implications:

  • The decomposition of matrix computations into neuron cluster computations exemplifies a novel approach to exploiting hardware heterogeneity in neural network inference. This lays a theoretical groundwork for future research in optimizing neural network architectures for diverse and resource-constrained environments.

Future Directions

PowerInfer-2 opens several avenues for future research and development:

  • Enhanced Sparsity Techniques: Further optimization can be achieved by exploring more sophisticated sparsity-aware mechanisms that can predict and utilize inactive neuron patterns more efficiently.
  • Extended Generalization: Adapting and scaling the framework for use with other types of neural networks (e.g., convolutional neural networks) and different heterogeneous hardware configurations (e.g., iOS devices).
  • Speculative Decoding Integration: Combining PowerInfer-2 with speculative decoding to further reduce latency, especially for highly parallelizable tasks, while managing the I/O bottlenecks intrinsic to extensive offloading of model weights.

Conclusion

PowerInfer-2 represents a significant step forward in enabling efficient LLM inference on smartphones, combining innovative techniques in neuron cluster computation, caching, and fine-grained pipelining to achieve substantial performance improvements. Its ability to handle models exceeding device memory size without significant performance degradation shows promise for future developments in mobile AI applications.
