
PowerInfer-2: Fast Large Language Model Inference on a Smartphone

(2406.06282)
Published Jun 10, 2024 in cs.LG

Abstract

This paper introduces PowerInfer-2, a framework designed for high-speed inference of LLMs on smartphones, particularly effective for models whose sizes exceed the device's memory capacity. The key insight of PowerInfer-2 is to utilize the heterogeneous computation, memory, and I/O resources in smartphones by decomposing traditional matrix computations into fine-grained neuron cluster computations. Specifically, PowerInfer-2 features a polymorphic neuron engine that adapts computational strategies for various stages of LLM inference. Additionally, it introduces segmented neuron caching and fine-grained neuron-cluster-level pipelining, which effectively minimize and conceal the overhead caused by I/O operations. The implementation and evaluation of PowerInfer-2 demonstrate its capability to support a wide array of LLM models on two smartphones, achieving up to a 29.2x speed increase compared with state-of-the-art frameworks. Notably, PowerInfer-2 is the first system to serve the TurboSparse-Mixtral-47B model with a generation rate of 11.68 tokens per second on a smartphone. For models that fit entirely within the memory, PowerInfer-2 can achieve approximately a 40% reduction in memory usage while maintaining inference speeds comparable to llama.cpp and MLC-LLM. For more details, including a demonstration video, please visit the project site at www.powerinfer.ai/v2.


Overview

  • PowerInfer-2 introduces a framework that enables high-speed inference of LLMs on smartphones, even for models that exceed the device's memory capacity.

  • Key techniques include heterogeneous resource utilization, a polymorphic neuron engine, segmented neuron caching, and fine-grained pipelining, which collectively optimize computational efficiency and reduce latency.

  • PowerInfer-2 demonstrates significant performance improvements, achieving up to a 29.2x speedup over existing frameworks and generating 11.68 tokens per second with the 47B-parameter TurboSparse-Mixtral model on a smartphone.

Overview of "PowerInfer-2: Fast Large Language Model Inference on a Smartphone"

The paper titled "PowerInfer-2: Fast Large Language Model Inference on a Smartphone" presents PowerInfer-2, a framework designed to achieve high-speed inference of LLMs on smartphones. This work is particularly significant for models whose sizes exceed the device's memory capacity.

Key Contributions

PowerInfer-2 integrates several novel techniques to address the computational and memory limitations inherent in smartphones:

  1. Heterogeneous Resource Utilization:

    • The framework leverages the heterogeneous computation, memory, and I/O resources of modern smartphones by decomposing traditional matrix computations into fine-grained neuron-cluster computations. Working at cluster granularity lets each piece of work be routed to the processor best suited for it and lets weights be loaded on demand, making fuller use of the device's resources (a minimal sketch of the decomposition appears after this list).
  2. Polymorphic Neuron Engine:

    • The proposed polymorphic neuron engine adapts its computational strategy to the stage of LLM inference. In the prefill stage, where all prompt tokens are processed at once, it forms large neuron clusters to exploit the NPU's strength in dense matrix computation. The decoding stage, which generates one token at a time, instead uses small neuron clusters on CPU cores, whose flexibility suits this phase's lighter, sparser computation (see the dispatch sketch after this list).
  3. Segmented Neuron Caching:

    • To cope with limited device memory, PowerInfer-2 keeps an in-memory neuron cache that is segmented by weight type and prioritizes frequently activated ("hot") neurons, so most cluster requests are served from DRAM rather than flash storage. Remaining misses are fetched from storage and overlapped with computation, as described in the next item (a combined cache-and-pipeline sketch follows this list).
  4. Fine-Grained Pipelining:

    • The framework pipelines work at the granularity of neuron clusters, overlapping storage I/O with computation: clusters already resident in memory are processed while missing clusters are still being fetched. This removes much of the idle time otherwise spent waiting on data and yields a noticeable speedup in end-to-end inference.
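
The neuron-cluster decomposition in item 1 can be pictured with a short NumPy sketch. This is an illustrative approximation, not PowerInfer-2's implementation: the function names (predict_active_neurons, clustered_matvec), the cluster size, and the random predictor standing in for the paper's learned activation predictor are all hypothetical.

```python
import numpy as np

def predict_active_neurons(x, num_neurons, keep_ratio=0.2, rng=None):
    """Hypothetical stand-in for a sparsity predictor: returns indices of
    neurons expected to be active for input x (random here, learned in the paper)."""
    rng = rng or np.random.default_rng(0)
    k = max(1, int(num_neurons * keep_ratio))
    return np.sort(rng.choice(num_neurons, size=k, replace=False))

def clustered_matvec(W, x, active, cluster_size=8):
    """Compute y = W @ x using only the rows (neurons) listed in `active`,
    processed as fixed-size neuron clusters; inactive rows remain zero."""
    y = np.zeros(W.shape[0], dtype=x.dtype)
    for start in range(0, len(active), cluster_size):
        cluster = active[start:start + cluster_size]  # one neuron cluster
        y[cluster] = W[cluster] @ x                   # small dense matvec
    return y

# Toy usage: a 4096x4096 FFN-style layer with ~20% of neurons predicted active.
W = np.random.randn(4096, 4096).astype(np.float32)
x = np.random.randn(4096).astype(np.float32)
active = predict_active_neurons(x, W.shape[0])
y = clustered_matvec(W, x, active)
```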
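
The polymorphic neuron engine of item 2 amounts to choosing a computation granularity per inference stage. Below is a minimal, hypothetical dispatcher in that spirit: a batched dense path for prefill (the kind of regular workload an NPU favors) and a sparse, small-cluster path for decoding (CPU-friendly). Names such as Stage and run_layer are illustrative and not taken from the paper or its code; the sketch only distinguishes the computation patterns, not the actual NPU/CPU offloading.

```python
import numpy as np
from enum import Enum

class Stage(Enum):
    PREFILL = "prefill"   # all prompt tokens processed at once
    DECODE = "decode"     # one token generated at a time

def run_layer(W, X, stage, active_neurons=None):
    """Illustrative dispatch in the spirit of a polymorphic neuron engine."""
    if stage is Stage.PREFILL:
        # Large-granularity path: the whole weight matrix against a token batch,
        # the kind of regular dense workload that suits an NPU.
        return X @ W.T
    # Small-granularity path for decoding: a single token vector multiplied
    # only against the neuron rows predicted to be active (CPU-friendly).
    y = np.zeros(W.shape[0], dtype=X.dtype)
    y[active_neurons] = W[active_neurons] @ X
    return y

# Toy usage.
W = np.random.randn(1024, 512).astype(np.float32)
prompt = np.random.randn(16, 512).astype(np.float32)   # 16 prompt tokens
token = np.random.randn(512).astype(np.float32)        # one decode-step token
hidden_prefill = run_layer(W, prompt, Stage.PREFILL)   # shape (16, 1024)
hidden_decode = run_layer(W, token, Stage.DECODE,
                          active_neurons=np.arange(0, 1024, 5))
```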
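
Items 3 and 4 work together: a neuron cache keeps hot clusters in memory, and a pipeline overlaps flash reads for missing clusters with computation on clusters already available. The sketch below captures that interplay under simplifying assumptions: a plain LRU cache stands in for PowerInfer-2's segmented cache policy, a Python dict plays the role of flash storage, and one I/O thread plus one compute loop approximate the fine-grained neuron-cluster pipeline.

```python
import queue
import threading
from collections import OrderedDict

import numpy as np

# Toy "flash storage": every neuron cluster's weights live here.
N_CLUSTERS, CLUSTER_ROWS, DIM = 64, 8, 256
flash = {c: np.random.randn(CLUSTER_ROWS, DIM).astype(np.float32)
         for c in range(N_CLUSTERS)}

class NeuronCache:
    """A plain LRU cache over neuron clusters. PowerInfer-2's cache is
    segmented by weight type and activation frequency; LRU is only a stand-in."""
    def __init__(self, capacity=16):
        self.capacity, self.data = capacity, OrderedDict()
    def get(self, cid):
        if cid in self.data:
            self.data.move_to_end(cid)           # mark as recently used
            return self.data[cid]
        return None
    def put(self, cid, weights):
        self.data[cid] = weights
        self.data.move_to_end(cid)
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)        # evict least recently used

def pipelined_decode_step(active_clusters, x, cache):
    """Overlap I/O with compute: an I/O thread resolves each cluster (cache hit
    or flash read) and hands it to the compute loop as soon as it is ready."""
    ready = queue.Queue()

    def io_worker():
        for cid in active_clusters:
            w = cache.get(cid)
            if w is None:                        # cache miss: read from "flash"
                w = flash[cid]
                cache.put(cid, w)
            ready.put((cid, w))
        ready.put(None)                          # end-of-stream marker

    threading.Thread(target=io_worker, daemon=True).start()

    partial = {}
    while (item := ready.get()) is not None:     # compute overlaps pending I/O
        cid, w = item
        partial[cid] = w @ x                     # per-cluster matvec
    return partial

cache = NeuronCache(capacity=16)
x = np.random.randn(DIM).astype(np.float32)
outputs = pipelined_decode_step([3, 7, 12, 42, 19], x, cache)
```

The real system additionally tailors its I/O sizes and concurrency to the smartphone's UFS storage characteristics; the sketch is only meant to show how fetching and computing can overlap at neuron-cluster granularity.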

Performance and Evaluation

Implementation: PowerInfer-2 was implemented by extending the PowerInfer framework with an additional 12K lines of code and deployed on two smartphones: OnePlus 12 and OnePlus Ace 2.

Models and Techniques: The system supports various LLMs, including the Llama-2 (7B, 13B), TurboSparse-Mistral (7B), and TurboSparse-Mixtral (47B) models. Evaluations demonstrate that PowerInfer-2 achieves:

  • Up to a 29.2x speedup over state-of-the-art frameworks such as llama.cpp and LLM in a Flash.
  • The capability to generate 11.68 tokens per second when running the TurboSparse-Mixtral-47B model on a smartphone.

For smaller models that fit entirely within device memory, PowerInfer-2 achieves approximately a 40% reduction in memory usage while maintaining inference speeds comparable to llama.cpp and MLC-LLM.

Practical and Theoretical Implications

Practical Implications:

  • The ability to run large LLMs on smartphones enhances the utility of personal devices in performing sophisticated tasks such as real-time language translation, natural language understanding, and conversational AI without relying on cloud-based services, thus improving user privacy.

Theoretical Implications:

  • The decomposition of matrix computations into neuron cluster computations exemplifies a novel approach to exploiting hardware heterogeneity in neural network inference. This lays a theoretical groundwork for future research in optimizing neural network architectures for diverse and resource-constrained environments.

Future Directions

PowerInfer-2 opens several avenues for future research and development:

  • Enhanced Sparsity Techniques: Further optimization can be achieved by exploring more sophisticated sparsity-aware mechanisms that can predict and utilize inactive neuron patterns more efficiently.
  • Extended Generalization: Adapting and scaling the framework for use with other types of neural networks (e.g., convolutional neural networks) and different heterogeneous hardware configurations (e.g., iOS devices).
  • Speculative Decoding Integration: Combining PowerInfer-2 with speculative decoding to further reduce latency, especially for highly parallelizable tasks, while managing the I/O bottlenecks intrinsic to extensive offloading of model weights.

Conclusion

PowerInfer-2 represents a significant step forward in enabling efficient LLM inference on smartphones, combining innovative techniques in neuron cluster computation, caching, and fine-grained pipelining to achieve substantial performance improvements. Its ability to handle models exceeding device memory size without significant performance degradation shows promise for future developments in mobile AI applications.
