PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU

(arXiv:2312.12456)
Published Dec 16, 2023 in cs.LG and cs.OS

Abstract

This paper introduces PowerInfer, a high-speed Large Language Model (LLM) inference engine on a personal computer (PC) equipped with a single consumer-grade GPU. The key insight underlying the design of PowerInfer is exploiting the high locality inherent in LLM inference, characterized by a power-law distribution in neuron activation. This distribution indicates that a small subset of neurons, termed hot neurons, are consistently activated across inputs, while the majority, cold neurons, vary based on specific inputs. PowerInfer exploits such an insight to design a GPU-CPU hybrid inference engine: hot-activated neurons are preloaded onto the GPU for fast access, while cold-activated neurons are computed on the CPU, thus significantly reducing GPU memory demands and CPU-GPU data transfers. PowerInfer further integrates adaptive predictors and neuron-aware sparse operators, optimizing the efficiency of neuron activation and computational sparsity. Evaluation shows that PowerInfer attains an average token generation rate of 13.20 tokens/s, with a peak of 29.08 tokens/s, across various LLMs (including OPT-175B) on a single NVIDIA RTX 4090 GPU, only 18% lower than that achieved by a top-tier server-grade A100 GPU. This significantly outperforms llama.cpp by up to 11.69x while retaining model accuracy.

Overview

  • PowerInfer is an inference engine optimizing LLMs for consumer-grade GPUs, balancing workload between GPU and CPU.

  • The engine preloads frequently activated 'hot' neurons onto the GPU, cutting costly CPU-GPU data transfers and increasing generation speed (see the sketch after this list).

  • PowerInfer pairs adaptive predictors with neuron-aware sparse operators to exploit activation sparsity efficiently while maintaining model accuracy.

  • The system is compatible with various GPUs and LLM families, bringing server-grade performance to personal computers.

  • Evaluations show that PowerInfer outperforms existing solutions in speed without compromising accuracy across a range of LLM tasks.
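As a quick illustration of why the hot/cold split pays off, here is a minimal Python sketch assuming an idealized power-law over per-neuron activation frequencies. The neuron count, the 1/rank frequency law, and the 80% coverage threshold are all arbitrary choices for this example, not values from the paper:

```python
import numpy as np

# Simulate per-neuron activation counts following an idealized
# power-law (Zipf-like) distribution, as the paper reports for
# neuron activations in LLM inference.
num_neurons = 10_000
ranks = np.arange(1, num_neurons + 1)
activation_freq = 1.0 / ranks            # illustrative power law
activation_freq /= activation_freq.sum()

# "Hot" neurons are the small head of the distribution that accounts
# for most activations and is therefore worth pinning in GPU memory.
cumulative = np.cumsum(np.sort(activation_freq)[::-1])
hot_count = int(np.searchsorted(cumulative, 0.80)) + 1
print(f"{hot_count} of {num_neurons} neurons "
      f"({100 * hot_count / num_neurons:.1f}%) cover 80% of activations")
```

Under this toy distribution, roughly a seventh of the neurons cover 80% of activations, which is the kind of skew that makes pinning a small hot set on the GPU worthwhile.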

Background and Current Challenges

LLMs have become critical tools in various applications, from creative writing to natural language processing. While LLMs have traditionally been run on powerful server-grade GPUs, the trend is shifting towards running them on personal computers with consumer-grade GPUs. The motivation behind this shift includes enhanced data privacy, the potential for model customization, and reduced costs. However, consumer GPUs face significant memory constraints when it comes to hosting the substantial parameter sets required by LLMs, making efficient local LLM inference an important yet challenging task.

PowerInfer: A Novel Inference Engine

PowerInfer introduces a novel GPU-CPU hybrid inference engine that embraces the locality of neuron activations in LLM inference. By differentiating between frequently activated 'hot' neurons and input-dependent 'cold' neurons, PowerInfer can preload hot neurons onto the GPU for fast access. The design incorporates adaptive predictors to forecast which neurons will activate for a given input, and employs neuron-aware sparse operators that work directly on individual neurons, skipping unnecessary computation over entire matrices. This design makes far better use of available resources, minimizing expensive data transfers between the GPU and CPU and enabling significantly faster inference without loss of model accuracy.
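To make the split concrete, below is a NumPy sketch of one neuron-aware sparse step under a hot/cold partition. Everything here is illustrative: the predictor is a placeholder boolean mask, the "GPU" and "CPU" sides are plain arrays, and all function and variable names are invented for this example; PowerInfer's actual predictors are learned and its operators run in C++/CUDA.

```python
import numpy as np

def hybrid_sparse_step(x, W_hot, W_cold, hot_idx, cold_idx, predicted_active):
    """One neuron-aware sparse matvec split across a 'GPU' (hot) and
    'CPU' (cold) weight shard.

    x                : input vector, shape (d_in,)
    W_hot, W_cold    : weight shards, shapes (n_hot, d_in), (n_cold, d_in)
    hot_idx, cold_idx: global neuron ids of each shard's rows
    predicted_active : boolean mask over all neurons, from the predictor
    """
    out = np.zeros(len(hot_idx) + len(cold_idx))

    # "GPU" side: compute only the hot neurons predicted to fire.
    hot_mask = predicted_active[hot_idx]
    out[hot_idx[hot_mask]] = W_hot[hot_mask] @ x

    # "CPU" side: compute predicted-active cold neurons in place, so
    # only their few outputs cross the bus instead of their weights.
    cold_mask = predicted_active[cold_idx]
    out[cold_idx[cold_mask]] = W_cold[cold_mask] @ x
    return out

# Toy usage with a random stand-in predictor.
rng = np.random.default_rng(1)
d_in, n_hot, n_cold = 64, 96, 160
x = rng.standard_normal(d_in)
perm = rng.permutation(n_hot + n_cold)
hot_idx, cold_idx = perm[:n_hot], perm[n_hot:]
W_hot = rng.standard_normal((n_hot, d_in))
W_cold = rng.standard_normal((n_cold, d_in))
predicted_active = rng.random(n_hot + n_cold) < 0.2  # ~20% of neurons fire
y = hybrid_sparse_step(x, W_hot, W_cold, hot_idx, cold_idx, predicted_active)
```

The key design point the sketch captures is that both sides skip rows the predictor rules out, and the cold shard is computed where it already resides rather than being shipped to the GPU.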

Implementation and Compatibility

The online inference engine is realized by extending existing LLM frameworks with additional C++ and CUDA implementations, while the offline component uses a Python-based profiler and solver to categorize neurons and construct a neuron placement policy. PowerInfer's flexible configuration supports a wide range of LLM families and GPU types, from the high-end NVIDIA RTX 4090 down to the older RTX 2080 Ti. Notably, even on consumer-grade GPUs, PowerInfer achieves performance close to that of server-grade GPUs without sacrificing accuracy.
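As a hedged sketch of what that offline phase might look like, the snippet below tallies activation frequencies over sample prompts and then greedily pins the hottest neurons to the GPU under a memory budget. The `model_runner` callback, the uniform per-neuron byte size, and the greedy policy are all assumptions made for illustration; the paper's actual solver computes a more sophisticated placement.

```python
from collections import Counter

def profile_activations(model_runner, prompts):
    """Offline profiler: count how often each neuron activates across a
    sample workload. `model_runner` is an assumed callback returning the
    set of activated neuron ids for one prompt."""
    counts = Counter()
    for prompt in prompts:
        counts.update(model_runner(prompt))
    return counts

def place_neurons(counts, neuron_bytes, gpu_budget_bytes):
    """Greedy placement: pin the most frequently activated neurons on
    the GPU until the memory budget runs out; the rest stay on the CPU."""
    gpu, cpu, used = set(), set(), 0
    for neuron, _freq in counts.most_common():
        if used + neuron_bytes <= gpu_budget_bytes:
            gpu.add(neuron)
            used += neuron_bytes
        else:
            cpu.add(neuron)
    return gpu, cpu
```

A policy like this would be computed once per model and GPU configuration offline and then consumed by the online engine, which is why the runtime itself never has to reason about placement.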

Evaluations and Insights

Performance evaluations show that PowerInfer outpaces existing alternatives, averaging 13.20 tokens/s and peaking at 29.08 tokens/s across quantized and non-quantized models on a single RTX 4090, up to 11.69x faster than llama.cpp. Moreover, PowerInfer maintains near-identical accuracy across various LLM models and tasks, ensuring that the efficiency gains do not come at the expense of output quality.

Conclusion

The paper presents PowerInfer, an inference system that harnesses the power-law distribution of neuron activations to make local LLM deployment efficient. By strategically splitting the workload between GPU and CPU and exploiting computational locality, PowerInfer demonstrates a practical way to run LLMs effectively on personal computers.
