LLM Inference Serving: Survey of Recent Advances and Opportunities

(2407.12391)
Published Jul 17, 2024 in cs.DC and cs.AI

Abstract

This survey offers a comprehensive overview of recent advancements in Large Language Model (LLM) serving systems, focusing on research since the year 2023. We specifically examine system-level enhancements that improve performance and efficiency without altering the core LLM decoding mechanisms. By selecting and reviewing high-quality papers from prestigious ML and system venues, we highlight key innovations and practical considerations for deploying and scaling LLMs in real-world production environments. This survey serves as a valuable resource for LLM practitioners seeking to stay abreast of the latest developments in this rapidly evolving field.

Figure: Prefill and decoding phases during large language model inference.

Overview

  • This survey paper reviews recent advancements in Large Language Model (LLM) inference serving systems, with a focus on techniques that enhance performance and scalability post-2023.

  • Key innovations discussed include efficient memory management, advanced task scheduling, cost-effective cloud deployment, and promising emerging research fields such as Retrieval Augmented Generation and Mixture-of-Experts inference.

  • The paper aims to provide practical insights for deploying and scaling LLMs in production environments while addressing both computational and memory challenges.

Advances in Large Language Model Inference Serving Systems

The survey paper "LLM Inference Serving: Survey of Recent Advances and Opportunities" offers a detailed examination of recent advancements in Large Language Model (LLM) inference serving systems, focusing on system-level research published since 2023. By selecting a wide range of high-quality papers from leading ML and systems venues, the authors provide a thorough overview of the state-of-the-art techniques used to improve the performance and scalability of LLM inference in production environments. This overview examines the key areas of innovation covered in the paper and discusses the implications and future directions of this research.

Introduction

LLMs have seen widespread adoption since the introduction of models such as ChatGPT. However, serving these models in production environments poses significant challenges due to their substantial computational and memory demands. This paper systematically categorizes and reviews recent system-level advancements in LLM inference serving, providing crucial insights for practitioners aiming to deploy and scale LLMs efficiently.

Memory Management and Caching

Efficient Management of KV Cache

Key-value (KV) cache management is essential to handle the dynamically growing attention states during LLM inference. Techniques such as PagedAttention and vAttention offer innovative solutions. PagedAttention utilizes non-contiguous memory blocks to reduce memory waste, while vAttention simplifies memory management by retaining KV cache in contiguous virtual memory. Application-specific methods like Prompt Cache and AttentionStore further optimize KV cache efficiency by reusing pre-defined prompt schemas and employing intelligent pre-fetching and eviction strategies.
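
The block-table idea behind paged KV caching can be illustrated with a short sketch. The following is a minimal, simplified allocator, assuming a fixed block size and made-up class and method names; it is not vLLM's or any paper's actual implementation:

```python
# Minimal sketch of the block-table idea behind PagedAttention.
# Block size, class names, and methods are illustrative assumptions.
BLOCK_SIZE = 16  # tokens stored per fixed-size physical KV block

class PagedKVCache:
    def __init__(self, num_physical_blocks: int):
        self.free_blocks = list(range(num_physical_blocks))
        self.block_tables = {}   # request_id -> list of physical block ids
        self.token_counts = {}   # request_id -> number of tokens cached so far

    def append_token(self, request_id: str) -> tuple[int, int]:
        """Reserve a (block, offset) slot for the next token's K/V vectors."""
        count = self.token_counts.get(request_id, 0)
        table = self.block_tables.setdefault(request_id, [])
        if count % BLOCK_SIZE == 0:               # current block is full (or none yet)
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted; evict or preempt a request")
            table.append(self.free_blocks.pop())  # grab any free block: no contiguity needed
        self.token_counts[request_id] = count + 1
        return table[-1], count % BLOCK_SIZE      # physical block id, slot within block

    def release(self, request_id: str) -> None:
        """Return all of a finished request's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.token_counts.pop(request_id, None)

# Usage: requests share one physical pool without reserving worst-case contiguous memory.
cache = PagedKVCache(num_physical_blocks=1024)
for _ in range(40):
    cache.append_token("req-A")
cache.release("req-A")
```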

Support for Long-Context Applications

Handling long-context LLM applications remains challenging due to their substantial memory requirements. Innovations like Ring Attention and Infinite-LLM propose distributed approaches to manage longer sequences efficiently. Other solutions like InfiniGen and LoongServe leverage GPU-CPU memory offloading and dynamic sequence parallelism to optimize resource usage and reduce overhead.
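
To make the offloading idea concrete, the sketch below spills older KV entries to CPU memory and gathers them back on demand. The token budget, tensor shapes, and function names are assumptions for illustration, not the InfiniGen or LoongServe design:

```python
# Illustrative sketch of GPU-CPU KV offloading for long contexts; budgets and
# function names are assumptions, not any specific system's algorithm.
import torch

GPU_BUDGET = 4096  # assumed number of most-recent tokens whose KV stays on the GPU

def spill_old_kv(kv_gpu: torch.Tensor, cpu_store: list[torch.Tensor]) -> torch.Tensor:
    """Move all but the most recent GPU_BUDGET tokens' KV to CPU memory."""
    if kv_gpu.shape[0] > GPU_BUDGET:
        cpu_store.append(kv_gpu[:-GPU_BUDGET].to("cpu"))
        kv_gpu = kv_gpu[-GPU_BUDGET:].contiguous()
    return kv_gpu

def gather_kv(kv_gpu: torch.Tensor, cpu_store: list[torch.Tensor], device) -> torch.Tensor:
    """Reassemble the full KV history (CPU chunks first) for an attention pass;
    a real system would fetch selectively instead of copying everything back."""
    chunks = [c.to(device) for c in cpu_store] + [kv_gpu]
    return torch.cat(chunks, dim=0)

# Usage with a toy per-layer K (or V) tensor of shape [tokens, heads, head_dim].
cpu_store: list[torch.Tensor] = []
kv = torch.randn(6000, 8, 64)
kv = spill_old_kv(kv, cpu_store)               # 1904 tokens spilled, 4096 stay resident
full_kv = gather_kv(kv, cpu_store, kv.device)
```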

Compression of KV Cache

Compression techniques are explored to mitigate the large memory footprint of LLM serving. Solutions such as FlexGen, KIVI, and GEAR employ various quantization and low-rank approximation methods to compress the KV cache. MiniCache leverages the high similarity between adjacent layers' KV caches to merge them, reducing redundancy without compromising performance.
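
A rough sketch of post-hoc KV quantization, in the spirit of this line of work but using a generic int8 scheme and made-up function names rather than any specific paper's recipe:

```python
# Minimal sketch of per-channel int8 quantization of a KV tensor; the grouping,
# bit width, and API here are illustrative assumptions, not KIVI or GEAR.
import torch

def quantize_kv_int8(kv: torch.Tensor):
    """Asymmetric int8 quantization over channels of a KV tensor [tokens, heads, head_dim]."""
    kv_min = kv.amin(dim=0, keepdim=True)
    kv_max = kv.amax(dim=0, keepdim=True)
    scale = (kv_max - kv_min).clamp(min=1e-8) / 255.0
    q = ((kv - kv_min) / scale).round().clamp(0, 255).to(torch.uint8)
    return q, scale, kv_min            # ~4x smaller than fp32 plus small metadata

def dequantize_kv(q: torch.Tensor, scale: torch.Tensor, kv_min: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float32) * scale + kv_min

# Usage: quantize a toy cache and check the reconstruction error.
kv = torch.randn(1024, 8, 64)
q, scale, zero_point = quantize_kv_int8(kv)
err = (dequantize_kv(q, scale, zero_point) - kv).abs().max().item()
print(f"max abs error: {err:.4f}")
```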

Computation Task Scheduling

Request Batching

Batching requests is a crucial strategy for improving GPU utilization during LLM inference. Techniques such as Response Length Perception and Sequence Scheduling predict response lengths to batch similar requests together. Alternatively, continuous batching at the token level, as implemented in Orca and DeepSpeed-FastGen, dynamically schedules new requests to maximize throughput without relying heavily on predictors.
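
The following toy scheduler loop sketches token-level continuous batching, with an assumed batch cap and a stand-in decode step; it is a simplification of what systems like Orca and DeepSpeed-FastGen actually implement:

```python
# Sketch of token-level continuous batching; queue policy and limits are assumptions.
from collections import deque
from dataclasses import dataclass

MAX_BATCH = 8  # assumed cap on concurrently decoding requests

@dataclass
class Request:
    rid: str
    max_new_tokens: int
    generated: int = 0

def decode_step(batch: list[Request]) -> None:
    """Stand-in for one fused forward pass that emits one token per running request."""
    for r in batch:
        r.generated += 1

def serve(waiting: deque[Request]) -> None:
    running: list[Request] = []
    while waiting or running:
        # Admit new requests at token granularity, not only when the whole batch drains.
        while waiting and len(running) < MAX_BATCH:
            running.append(waiting.popleft())
        decode_step(running)
        # Retire finished requests immediately so their slots are reusable next step.
        running = [r for r in running if r.generated < r.max_new_tokens]

serve(deque(Request(f"req-{i}", max_new_tokens=4 + i) for i in range(20)))
```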

Disaggregated Inference

Disaggregated inference separates the prefill and decoding stages of LLM inference to prevent interference between batch-like jobs and latency-critical tasks. Solutions like TetriInfer and Splitwise optimize resource allocation by scheduling tasks independently, leveraging specialized hardware for each phase.
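
A toy sketch of the disaggregated pattern, with separate prefill and decode worker pools connected by a queue; the "KV handle" and threading setup are illustrative stand-ins rather than a real system design:

```python
# Toy prefill/decode disaggregation sketch; queues and the fake KV handle are
# stand-ins for the KV-cache transfer a real disaggregated system performs.
import queue
import threading
import time

prefill_q: "queue.Queue[str]" = queue.Queue()
decode_q: "queue.Queue[tuple[str, bytes]]" = queue.Queue()

def prefill_worker() -> None:
    """Compute-bound phase: process the whole prompt once, then hand off its KV cache."""
    while True:
        prompt = prefill_q.get()
        kv_handle = f"kv:{hash(prompt)}".encode()   # stand-in for a KV-cache transfer
        decode_q.put((prompt, kv_handle))

def decode_worker() -> None:
    """Memory-bandwidth-bound phase: generate tokens without contending with prefill jobs."""
    while True:
        prompt, kv_handle = decode_q.get()
        print(f"decoding {prompt[:30]!r} using {kv_handle!r}")

threading.Thread(target=prefill_worker, daemon=True).start()
threading.Thread(target=decode_worker, daemon=True).start()
prefill_q.put("Explain disaggregated LLM inference in one sentence.")
time.sleep(0.5)   # give the daemon workers a moment before the script exits
```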

Model Parallelism

Model parallelism is essential for handling the large parameter sizes of LLMs. Techniques such as the analytical model by Pope et al., HeteGen's heterogeneous parallel computing algorithm, and ExeGPT's optimal schedule control variable selection collectively enhance parallel execution across multiple GPUs. Helix's max-flow problem formulation further improves model partitioning in heterogeneous environments.
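
As a minimal illustration of tensor-level model parallelism, the sketch below splits a linear layer column-wise across simulated shards; real systems add communication, activation sharding, and pipeline stages, and this code does not reflect the specific algorithms cited above:

```python
# Column-parallel linear layer: each (simulated) device holds a slice of the weight,
# computes a local matmul, and the partial outputs are concatenated. Illustrative only.
import torch

def column_parallel_linear(x: torch.Tensor, w: torch.Tensor, num_shards: int) -> torch.Tensor:
    """Compute y = x @ w as independent shards, one per simulated device."""
    shards = torch.chunk(w, num_shards, dim=1)   # each device holds a column slice of w
    partial = [x @ w_i for w_i in shards]        # local matmuls, no synchronization needed
    return torch.cat(partial, dim=-1)            # all-gather of the per-shard outputs

x = torch.randn(4, 512)        # [batch, hidden]
w = torch.randn(512, 2048)     # e.g. an MLP up-projection
assert torch.allclose(column_parallel_linear(x, w, num_shards=4), x @ w, atol=1e-5)
```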

LLMs in the Cloud

Cloud Deployment Cost

Spot and serverless instances offer cost-effective deployment options. SpotServe introduces mechanisms to handle preemptions and resume interrupted requests efficiently. ServerlessLLM optimizes model loading and cold-start latency. Mélange and Llumnix provide frameworks for cost-efficient resource allocation and dynamic scheduling that balance performance targets against cost.
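
A back-of-the-envelope sketch of cost-aware instance selection, a much-simplified view of the allocation problem such frameworks address; all prices and capacities below are made up:

```python
# Pick the GPU type that minimizes dollars per request at a given load.
# Numbers and names are fabricated for illustration; real allocators model SLOs,
# request mixes, and autoscaling far more carefully.
candidates = {
    # name: (hourly_price_usd, sustainable_requests_per_hour_within_SLO)
    "gpu-small": (1.10, 900),
    "gpu-large": (3.50, 4000),
}

def cheapest_per_request(requests_per_hour: int) -> str:
    """Return the instance type with the lowest cost per served request."""
    best_name, best_cost = None, float("inf")
    for name, (price, capacity) in candidates.items():
        replicas = -(-requests_per_hour // capacity)           # ceiling division
        cost_per_req = replicas * price / requests_per_hour
        if cost_per_req < best_cost:
            best_name, best_cost = name, cost_per_req
    return best_name

print(cheapest_per_request(2500))   # modest load can favor several small GPUs over one large one
```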

Cloud Efficiency

With power becoming a bottleneck in cloud datacenters, solutions like POLCA manage power consumption dynamically. PerLLM integrates edge and cloud computing to optimize resource usage and minimize energy costs. FlexLLM and Andes address co-serving and user experience metrics to enhance cloud efficiency and optimize GPU resource allocation.

Emerging Research Fields

Retrieval Augmented Generation

RAG techniques enhance LLMs by incorporating external information sources. Sparse RAG and RAGCache reduce the computational overhead by selectively computing and caching relevant knowledge, respectively.
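
A schematic sketch of a RAG serving loop that caches per-document KV states so repeated retrieval hits do not trigger repeated prefills; the retriever, cache keys, and model calls are placeholders rather than any paper's API:

```python
# Schematic RAG serving loop with a cache of precomputed document states; loosely in the
# spirit of caching-based RAG serving, but every function below is a placeholder.
doc_kv_cache: dict[str, object] = {}   # doc_id -> precomputed KV state (opaque here)

def retrieve(query: str, k: int = 3) -> list[str]:
    """Placeholder retriever returning document ids; a real system queries a vector index."""
    return [f"doc-{i}" for i in range(k)]

def prefill_document(doc_id: str) -> object:
    """Placeholder for running the LLM prefill over a document's text."""
    return f"kv-state-for-{doc_id}"

def answer(query: str) -> str:
    kv_states = []
    for doc_id in retrieve(query):
        if doc_id not in doc_kv_cache:                        # compute each document's KV once,
            doc_kv_cache[doc_id] = prefill_document(doc_id)   # then reuse it across queries
        kv_states.append(doc_kv_cache[doc_id])
    # A real system would splice these KV states ahead of the query tokens and decode.
    return f"answer to '{query}' grounded in {len(kv_states)} cached documents"

print(answer("What is disaggregated inference?"))
```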

Mixture-of-Experts Inference

MoE models improve efficiency by activating only a subset of experts for each input. Innovations like Lina, ExFlow, SiDA-MoE, and MoE-Infinity address communication bottlenecks, expert offloading, and efficiency optimization to enhance MoE inference performance.
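
A minimal top-k routing sketch illustrating why MoE inference touches only a few experts per token; the gating below is generic and does not reflect the specific systems named above:

```python
# Generic top-k MoE routing: each token's hidden state is sent to only k experts,
# weighted by the router's probabilities. Illustrative, not any cited system's design.
import torch

def moe_layer(x: torch.Tensor, gate_w: torch.Tensor, experts: list, k: int = 2) -> torch.Tensor:
    """x: [tokens, hidden]; gate_w: [hidden, num_experts]; experts: per-expert modules."""
    scores = torch.softmax(x @ gate_w, dim=-1)        # router probabilities per token
    topk_p, topk_idx = scores.topk(k, dim=-1)         # each token selects k experts
    out = torch.zeros_like(x)
    for e, expert in enumerate(experts):              # only the selected experts do work
        rows, slots = (topk_idx == e).nonzero(as_tuple=True)
        if rows.numel():
            out[rows] += topk_p[rows, slots, None] * expert(x[rows])
    return out

hidden, n_experts = 64, 8
experts = [torch.nn.Linear(hidden, hidden) for _ in range(n_experts)]
y = moe_layer(torch.randn(16, hidden), torch.randn(hidden, n_experts), experts)
```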

Miscellaneous Fields

Ethical considerations and environmental sustainability are addressed by solutions like Virtual Token Counter and Sprout. Inference pipeline optimization and frugal inference techniques such as FlashDecoding++, Parrot, and RouteLLM further enhance the performance and cost-effectiveness of LLM serving.

Conclusion

This survey highlights the significant advancements in LLM serving systems, covering memory management, computation scheduling, cloud deployment, and emerging research fields. These innovations collectively pave the way for more efficient and scalable LLM deployments, addressing both practical and theoretical challenges in the field. The ongoing research and future developments promise to further enhance the capabilities and efficiency of LLM serving systems, making them more accessible and sustainable for a wide range of applications.
