
Abstract

Large language models (LLMs) have shown exceptional performance and vast potential across diverse tasks. However, deploying high-performance LLMs in low-resource environments has garnered significant attention in the industry. When GPU hardware resources are limited, we can explore alternative options on CPUs. To mitigate the financial burden and alleviate the constraints imposed by hardware resources, optimizing inference performance is necessary. In this paper, we introduce an easily deployable inference performance optimization solution aimed at accelerating LLMs on CPUs. In this solution, we implement an effective way to reduce the KV cache size while ensuring precision. We propose a distributed inference optimization approach and implement it based on the oneAPI Collective Communications Library. Furthermore, we propose optimization approaches for LLMs on CPUs and conduct tailored optimizations for the most commonly used models. The code is open-sourced at https://github.com/intel/xFasterTransformer.

Figure: Distributed inference using oneCCL.

Overview

  • The paper proposes optimization strategies for Large Language Model (LLM) inference on CPUs, addressing computational constraints in environments with limited GPU availability.

  • Key contributions include the SlimAttention mechanism, INT8 KV cache optimization, and a distributed inference optimization framework to reduce latency and enhance throughput.

  • Experimental results show considerable performance improvements, suggesting that CPUs can effectively support real-time LLM applications traditionally requiring GPUs.

Inference Performance Optimization for LLMs on CPUs

The paper titled "Inference Performance Optimization for LLMs on CPUs" by Pujiang He et al. addresses a critical concern in the deployment of LLMs: the optimization of inference performance on CPU hardware. Given the significant computational resources typically required for LLM inference, this work offers substantial practical contributions, particularly in environments constrained by limited GPU availability. The proposed solutions encompass various facets of performance optimization, namely reduction of KV cache size, introduction of a distributed inference optimization framework, and specific model-dependent optimizations.

Key Components of the Proposed Solution

1. SlimAttention Mechanism

One of the noteworthy contributions is the introduction of the SlimAttention mechanism. FlashAttention decomposes the score between the query and key matrices in two dimensions and therefore requires iterative corrections to partial softmax results. In contrast, SlimAttention decomposes the score along a single dimension, simplifying the computation. This method keeps memory usage low by reusing a smaller score buffer, thereby enhancing computational efficiency. The paper's benchmarks show that SlimAttention significantly outperforms FlashAttention for large input sequences, with marked reductions in computation time (e.g., from 540.14 ms to 392.80 ms for an input size of 4096).
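To make the contrast concrete, the NumPy sketch below implements attention with a one-dimensional split over the query rows. It is an illustrative reading of the SlimAttention idea rather than the paper's kernel: the names `slim_attention` and `block_rows` are assumptions, and the actual xFasterTransformer implementation is written in C++ with AVX512 intrinsics over packed buffers.

```python
import numpy as np

def slim_attention(q, k, v, block_rows=64):
    """Attention with a one-dimensional split over query rows (illustrative
    sketch, not the paper's kernel): each block materializes its *full*
    score rows, so the softmax needs no iterative re-scaling, and the
    scratch buffer is block_rows x seq_len instead of the full n x n."""
    n, d = q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.empty_like(q)
    buf = np.empty((block_rows, k.shape[0]), dtype=q.dtype)  # reused scratch buffer
    for start in range(0, n, block_rows):
        end = min(start + block_rows, n)
        rows = end - start
        np.matmul(q[start:end], k.T, out=buf[:rows])   # full score rows for this block
        s = buf[:rows] * scale
        s -= s.max(axis=1, keepdims=True)              # numerically stable softmax
        p = np.exp(s)
        p /= p.sum(axis=1, keepdims=True)
        out[start:end] = p @ v                         # weighted sum of values
    return out

# q, k, v: [seq_len, head_dim] for a single attention head
q = np.random.rand(4096, 128).astype(np.float32)
k = np.random.rand(4096, 128).astype(np.float32)
v = np.random.rand(4096, 128).astype(np.float32)
y = slim_attention(q, k, v)
```

Because each block holds its complete score rows, the softmax is computed once per row; FlashAttention's two-dimensional tiling instead has to track running maxima and re-scale partial sums as new key/value tiles arrive.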

2. Effective KV Cache Optimization

The memory required to maintain the KV cache for LLMs is substantial, often surpassing the size of the model's weight data. The authors propose an INT8 KV cache optimization to mitigate this issue. By quantizing the cache to INT8 while preserving precision through a separate scale for each token and head, they achieve a significant memory reduction. A custom kernel supports the resulting hybrid data types, converting INT8 values to FP32 on the fly during execution so that AVX512 FMA instructions can be used efficiently. This approach maintains inference performance without compromising model output quality.
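The following NumPy sketch shows one way a per-token, per-head INT8 scheme can work. The shapes, function names, and the symmetric-scaling choice are assumptions for illustration, not the paper's exact recipe; the actual kernel up-converts INT8 to FP32 inside an AVX512 FMA-based routine rather than materializing a full FP32 copy.

```python
import numpy as np

def quantize_kv(kv_fp32):
    """Quantize a [tokens, heads, head_dim] key or value cache to INT8 with
    one scale per (token, head). Symmetric scaling is an assumption here."""
    amax = np.abs(kv_fp32).max(axis=-1, keepdims=True)               # [T, H, 1]
    scale = np.where(amax > 0, amax / 127.0, 1.0).astype(np.float32)
    kv_int8 = np.clip(np.round(kv_fp32 / scale), -127, 127).astype(np.int8)
    return kv_int8, scale

def dequantize_kv(kv_int8, scale):
    """Up-convert to FP32 at use time, mirroring a hybrid-dtype kernel that
    converts INT8 to FP32 on the fly before the FMA-based matmul."""
    return kv_int8.astype(np.float32) * scale

# toy check: 128 cached tokens, 32 heads, head_dim 128
kv = np.random.randn(128, 32, 128).astype(np.float32)
q8, s = quantize_kv(kv)
err = np.abs(dequantize_kv(q8, s) - kv).max()   # small round-off error
```

Storing one FP32 scale per (token, head) adds negligible overhead next to the roughly 4x reduction from keeping the cached values in INT8 instead of FP32.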

3. Distributed Inference Optimization

To address scalability and latency concerns when deploying LLMs on CPUs, the authors implement a distributed inference optimization solution using the oneAPI Collective Communications Library (oneCCL). Their approach broadcasts token IDs rather than embedding values, reducing communication overhead. An aggressive zero-copy optimization is also employed: computation results are written directly into the communication buffer, eliminating intermediate data copies. The experimental results show a nearly 2.85x reduction in latency (from 249.7 ms to 87.7 ms), which is substantial for real-time applications.
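The snippet below illustrates the token-ID broadcast in Python, using MPI (via mpi4py) purely as a stand-in for oneCCL; the paper's solution is implemented in C++ on top of oneCCL, and the sizes here are toy placeholders. The point it demonstrates is that when every rank holds the embedding table, only a few integer IDs need to cross the wire instead of full FP32 embedding vectors.

```python
import numpy as np
from mpi4py import MPI  # stand-in for oneCCL in this sketch

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# toy sizes; real models use e.g. a 32000 x 4096 embedding table
vocab_size, hidden, batch = 1000, 64, 4
rng = np.random.default_rng(0)
# every rank holds the embedding table, so only token IDs need to travel
embedding_table = rng.standard_normal((vocab_size, hidden), dtype=np.float32)

# rank 0 decides the next token IDs (e.g. after sampling)
token_ids = np.empty(batch, dtype=np.int64)
if rank == 0:
    token_ids[:] = [17, 204, 9, 999]

# broadcasting batch int64 IDs (batch * 8 bytes) instead of the corresponding
# FP32 embeddings (batch * hidden * 4 bytes) shrinks the payload substantially
comm.Bcast(token_ids, root=0)

hidden_states = embedding_table[token_ids]  # local lookup on every rank
```

The zero-copy optimization described in the paper is complementary to this: compute kernels write their outputs directly into the buffer the communication library operates on, removing the staging copy that would otherwise precede each collective.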

Experimental Results and Practical Implications

The experiments conducted on the Intel Xeon CPU 8563C provide compelling evidence of the proposed solutions' efficacy. Notably, the SlimAttention mechanism consistently outperforms FlashAttention across varying sequence lengths. The distributed inference optimization framework significantly reduces latency, and the effective KV cache optimization enhances throughput markedly for large batch sizes.

Practically, these improvements imply that LLM inference can be deployed more effectively on CPU hardware, expanding accessibility and reducing dependency on expensive GPU resources. This can particularly benefit low-resource environments and applications needing extensive multitasking capabilities. The scalability and performance efficiency demonstrated suggest that CPUs can be leveraged for real-time LLM applications previously thought viable only on GPU platforms.

Theoretical Implications and Future Directions

Theoretically, this work underscores the potential of CPU-based inference through refined optimization strategies. The SlimAttention approach and INT8 quantization methods pave the way for further research into resource-efficient LLM architectures. Future developments could explore optimization techniques for even larger models and more extensive batch sizes, possibly incorporating elements from this paper's proposed methods.

An intriguing direction for future work involves adapting these solutions for emerging LLM architectures such as Mixture of Experts (MoE) models. These models dynamically select a subset of model parameters during inference, which could benefit immensely from the efficiency techniques discussed.

Conclusion

In summary, this paper by He et al. provides a comprehensive suite of optimization strategies for deploying LLMs on CPUs, demonstrating substantial improvements in performance metrics. The practical implications are significant, offering a viable alternative to GPU-based deployments and fostering broader accessibility to advanced LLM applications. Future research should continue to build on these foundational optimizations, exploring their applicability to a wider variety of models and deployment scenarios. The methodologies and results discussed here represent a pivotal advancement in the field of LLM inference optimization on CPU hardware.
