PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU

(arXiv:2312.12456)
Published Dec 16, 2023 in cs.LG and cs.OS

Abstract

This paper introduces PowerInfer, a high-speed Large Language Model (LLM) inference engine on a personal computer (PC) equipped with a single consumer-grade GPU. The key insight underlying the design of PowerInfer is exploiting the high locality inherent in LLM inference, characterized by a power-law distribution in neuron activation. This distribution indicates that a small subset of neurons, termed hot neurons, are consistently activated across inputs, while the majority, cold neurons, vary based on specific inputs. PowerInfer exploits such an insight to design a GPU-CPU hybrid inference engine: hot-activated neurons are preloaded onto the GPU for fast access, while cold-activated neurons are computed on the CPU, thus significantly reducing GPU memory demands and CPU-GPU data transfers. PowerInfer further integrates adaptive predictors and neuron-aware sparse operators, optimizing the efficiency of neuron activation and computational sparsity. Evaluation shows that PowerInfer attains an average token generation rate of 13.20 tokens/s, with a peak of 29.08 tokens/s, across various LLMs (including OPT-175B) on a single NVIDIA RTX 4090 GPU, only 18% lower than that achieved by a top-tier server-grade A100 GPU. This significantly outperforms llama.cpp by up to 11.69x while retaining model accuracy.

Overview

  • PowerInfer is an inference engine optimizing LLMs for consumer-grade GPUs, balancing workload between GPU and CPU.

  • The engine preloads frequently activated 'hot' neurons onto the GPU, cutting costly CPU-GPU data transfers and increasing generation speed (see the sketch after this list).

  • PowerInfer pairs adaptive predictors with neuron-aware sparse operators to exploit activation sparsity efficiently while maintaining model accuracy.

  • The system is compatible with various GPUs and LLM families, bringing server-grade performance to personal computers.

  • Evaluations show that PowerInfer outperforms existing solutions in speed without compromising accuracy across a range of LLM tasks.
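As a quick illustration of why the hot/cold split pays off, here is a minimal Python sketch assuming an idealized power-law over per-neuron activation frequencies. The neuron count, the 1/rank frequency law, and the 80% coverage threshold are all arbitrary choices for this example, not values from the paper:

```python
import numpy as np

# Simulate per-neuron activation counts following an idealized
# power-law (Zipf-like) distribution, as the paper reports for
# neuron activations in LLM inference.
num_neurons = 10_000
ranks = np.arange(1, num_neurons + 1)
activation_freq = 1.0 / ranks            # illustrative power law
activation_freq /= activation_freq.sum()

# "Hot" neurons are the small head of the distribution that accounts
# for most activations and is therefore worth pinning in GPU memory.
cumulative = np.cumsum(np.sort(activation_freq)[::-1])
hot_count = int(np.searchsorted(cumulative, 0.80)) + 1
print(f"{hot_count} of {num_neurons} neurons "
      f"({100 * hot_count / num_neurons:.1f}%) cover 80% of activations")
```

Under this toy distribution, roughly a seventh of the neurons cover 80% of activations, which is the kind of skew that makes pinning a small hot set on the GPU worthwhile.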

Background and Current Challenges

LLMs have become critical tools in various applications, from creative writing to natural language processing. While LLMs have traditionally been run on powerful server-grade GPUs, the trend is shifting towards running them on personal computers with consumer-grade GPUs. The motivation behind this shift includes enhanced data privacy, the potential for model customization, and reduced costs. However, consumer GPUs face significant memory constraints when it comes to hosting the substantial parameter sets required by LLMs, making efficient local LLM inference an important yet challenging task.

PowerInfer: A Novel Inference Engine

PowerInfer introduces a novel GPU-CPU hybrid inference engine that embraces the locality of neuron activations in LLM inference. By differentiating between frequently activated 'hot' neurons and input-dependent 'cold' neurons, PowerInfer can preload hot neurons onto the GPU for fast access. The design incorporates adaptive predictors to forecast which neurons will activate for a given input, and employs neuron-aware sparse operators that work directly on individual neurons, skipping unnecessary computation over entire matrices. This design makes far better use of available resources, minimizing expensive data transfers between the GPU and CPU and enabling significantly faster inference without loss of model accuracy.
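To make the split concrete, below is a NumPy sketch of one neuron-aware sparse step under a hot/cold partition. Everything here is illustrative: the predictor is a placeholder boolean mask, the "GPU" and "CPU" sides are plain arrays, and all function and variable names are invented for this example; PowerInfer's actual predictors are learned and its operators run in C++/CUDA.

```python
import numpy as np

def hybrid_sparse_step(x, W_hot, W_cold, hot_idx, cold_idx, predicted_active):
    """One neuron-aware sparse matvec split across a 'GPU' (hot) and
    'CPU' (cold) weight shard.

    x                : input vector, shape (d_in,)
    W_hot, W_cold    : weight shards, shapes (n_hot, d_in), (n_cold, d_in)
    hot_idx, cold_idx: global neuron ids of each shard's rows
    predicted_active : boolean mask over all neurons, from the predictor
    """
    out = np.zeros(len(hot_idx) + len(cold_idx))

    # "GPU" side: compute only the hot neurons predicted to fire.
    hot_mask = predicted_active[hot_idx]
    out[hot_idx[hot_mask]] = W_hot[hot_mask] @ x

    # "CPU" side: compute predicted-active cold neurons in place, so
    # only their few outputs cross the bus instead of their weights.
    cold_mask = predicted_active[cold_idx]
    out[cold_idx[cold_mask]] = W_cold[cold_mask] @ x
    return out

# Toy usage with a random stand-in predictor.
rng = np.random.default_rng(1)
d_in, n_hot, n_cold = 64, 96, 160
x = rng.standard_normal(d_in)
perm = rng.permutation(n_hot + n_cold)
hot_idx, cold_idx = perm[:n_hot], perm[n_hot:]
W_hot = rng.standard_normal((n_hot, d_in))
W_cold = rng.standard_normal((n_cold, d_in))
predicted_active = rng.random(n_hot + n_cold) < 0.2  # ~20% of neurons fire
y = hybrid_sparse_step(x, W_hot, W_cold, hot_idx, cold_idx, predicted_active)
```

The key design point the sketch captures is that both sides skip rows the predictor rules out, and the cold shard is computed where it already resides rather than being shipped to the GPU.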

Implementation and Compatibility

The online inference engine is realized by extending existing LLM frameworks with additional C++ and CUDA implementations, while the offline component uses a Python-based profiler and solver to categorize neurons and construct a neuron placement policy. PowerInfer's flexible configuration supports a wide range of LLM families and GPU types, from the high-end NVIDIA RTX 4090 down to the older RTX 2080 Ti. Notably, even on consumer-grade GPUs, PowerInfer achieves performance close to that of server-grade GPUs without sacrificing accuracy.
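As a hedged sketch of what that offline phase might look like, the snippet below tallies activation frequencies over sample prompts and then greedily pins the hottest neurons to the GPU under a memory budget. The `model_runner` callback, the uniform per-neuron byte size, and the greedy policy are all assumptions made for illustration; the paper's actual solver computes a more sophisticated placement.

```python
from collections import Counter

def profile_activations(model_runner, prompts):
    """Offline profiler: count how often each neuron activates across a
    sample workload. `model_runner` is an assumed callback returning the
    set of activated neuron ids for one prompt."""
    counts = Counter()
    for prompt in prompts:
        counts.update(model_runner(prompt))
    return counts

def place_neurons(counts, neuron_bytes, gpu_budget_bytes):
    """Greedy placement: pin the most frequently activated neurons on
    the GPU until the memory budget runs out; the rest stay on the CPU."""
    gpu, cpu, used = set(), set(), 0
    for neuron, _freq in counts.most_common():
        if used + neuron_bytes <= gpu_budget_bytes:
            gpu.add(neuron)
            used += neuron_bytes
        else:
            cpu.add(neuron)
    return gpu, cpu
```

A policy like this would be computed once per model and GPU configuration offline and then consumed by the online engine, which is why the runtime itself never has to reason about placement.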

Evaluations and Insights

Performance evaluations show that PowerInfer outpaces existing alternatives, averaging 13.20 tokens/s and peaking at 29.08 tokens/s across quantized and non-quantized models on a single RTX 4090, up to 11.69x faster than llama.cpp. Moreover, PowerInfer maintains near-identical accuracy across various LLM models and tasks, ensuring that the efficiency gains do not come at the expense of output quality.

Conclusion

The paper presents PowerInfer, an inference system that harnesses the power-law distribution of neuron activations to make local LLM deployment efficient. By strategically splitting the workload between GPU and CPU and exploiting computational locality, PowerInfer demonstrates a practical way to run LLMs effectively on personal computers.
