
Efficient Memory Management for Large Language Model Serving with PagedAttention

(2309.06180)
Published Sep 12, 2023 in cs.LG and cs.DC

Abstract

High throughput serving of LLMs requires batching sufficiently many requests at a time. However, existing systems struggle because the key-value cache (KV cache) memory for each request is huge and grows and shrinks dynamically. When managed inefficiently, this memory can be significantly wasted by fragmentation and redundant duplication, limiting the batch size. To address this problem, we propose PagedAttention, an attention algorithm inspired by the classical virtual memory and paging techniques in operating systems. On top of it, we build vLLM, an LLM serving system that achieves (1) near-zero waste in KV cache memory and (2) flexible sharing of KV cache within and across requests to further reduce memory usage. Our evaluations show that vLLM improves the throughput of popular LLMs by 2-4x with the same level of latency compared to the state-of-the-art systems, such as FasterTransformer and Orca. The improvement is more pronounced with longer sequences, larger models, and more complex decoding algorithms. vLLM's source code is publicly available at https://github.com/vllm-project/vllm.

Figure: Memory management in serving a 13B-parameter LLM on an NVIDIA A100, highlighting vLLM's efficiency.

Overview

  • PagedAttention is a novel algorithm inspired by virtual memory techniques from operating systems, optimized for serving LLMs.

  • The PagedAttention method allows for non-contiguous KV cache allocation, reducing waste and improving efficiency.

  • vLLM is a system that leverages PagedAttention to manage memory with nearly zero wasted KV cache, enhancing throughput and model serving flexibility.

  • vLLM's architecture includes a centralized scheduler and distributed GPU workers, and it shows a 2-4x throughput improvement over state-of-the-art systems in evaluations.

  • The system offers significant memory savings and supports a variety of decoding algorithms, making it suitable for real-world LLM applications.

Efficient Memory Management for LLM Serving

Introduction to PagedAttention and vLLM

Serving LLMs efficiently is challenging because the key-value (KV) cache of each request is large and grows and shrinks dynamically during generation. Conventional systems store this cache in contiguous memory, so fragmentation and redundant duplication waste capacity and limit the batch size, and with it, throughput. This paper introduces PagedAttention, an attention algorithm inspired by virtual memory and paging techniques from operating systems and designed specifically for LLM serving. Building on it, the authors present vLLM, a serving system that manages KV cache memory with near-zero waste and enables flexible sharing within and across requests.

PagedAttention Mechanism

PagedAttention rethinks how the KV cache is stored: each sequence's cache is partitioned into fixed-size blocks that need not be contiguous in memory. This lets vLLM manage the cache much as an operating system manages virtual memory pages, allocating blocks on demand and achieving near-zero waste. The fine-grained allocation largely eliminates the internal and external fragmentation observed in earlier LLM serving systems, and because the mapping from logical to physical blocks is explicit, blocks can be shared across requests and sequences, a capability contiguous storage cannot offer (see the sketch below).
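As a rough illustration only (the class, method names, and block size below are hypothetical, not vLLM's actual API), the following Python sketch shows the core bookkeeping idea: each sequence keeps a block table mapping its logical KV blocks to physical blocks that are allocated on demand from a shared pool.

```python
# Illustrative sketch of paged KV-cache bookkeeping (not vLLM's real code).
# Each sequence maps its logical KV blocks to physical blocks via a block table,
# so physical blocks need not be contiguous and are allocated on demand.

BLOCK_SIZE = 16  # tokens per KV block (hypothetical value)

class PagedKVCache:
    def __init__(self, num_physical_blocks: int):
        self.free_blocks = list(range(num_physical_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids

    def append_token(self, seq_id: int, num_tokens_so_far: int) -> int:
        """Return the physical block that will hold the new token's KV entries,
        allocating a fresh block only when the last one is full."""
        table = self.block_tables.setdefault(seq_id, [])
        if num_tokens_so_far % BLOCK_SIZE == 0:  # last block full, or first token
            if not self.free_blocks:
                raise MemoryError("no free KV blocks; the scheduler must preempt a sequence")
            table.append(self.free_blocks.pop())
        return table[-1]

    def free_sequence(self, seq_id: int) -> None:
        """Release all physical blocks of a finished sequence back to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
```

During attention, a kernel can follow the block table to gather a sequence's keys and values block by block, so the blocks never need to be physically contiguous and at most one partially filled block per sequence is left unused.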

vLLM System Architecture

vLLM pairs a centralized scheduler with distributed GPU workers and a KV cache manager that pages cache blocks flexibly. The centralized design lets vLLM dynamically allocate, share, and schedule the memory of many concurrent requests. Custom GPU kernels implement PagedAttention's block-wise memory access pattern, and the system supports a variety of decoding algorithms, such as parallel sampling and beam search, while adapting to the large variability in input and output lengths seen in real-world applications. A sketch of the block-sharing idea follows.
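To illustrate the sharing aspect, here is a hedged Python sketch of reference-counted blocks with copy-on-write, in the spirit of (but not copied from) vLLM's KV cache manager; all names and structures are illustrative assumptions.

```python
# Hedged sketch of KV-block sharing with copy-on-write (illustrative only).

class SharedBlockManager:
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))
        self.ref_count = {}  # physical block id -> number of sequences using it

    def fork(self, parent_blocks: list[int]) -> list[int]:
        """Share the parent's blocks with a child sequence (e.g. parallel samples
        of one prompt) by bumping reference counts instead of copying data."""
        for b in parent_blocks:
            self.ref_count[b] = self.ref_count.get(b, 1) + 1
        return list(parent_blocks)

    def write_block(self, blocks: list[int], idx: int) -> int:
        """Copy-on-write: if the block is shared, remap the writer to a fresh
        block before it appends new KV entries."""
        b = blocks[idx]
        if self.ref_count.get(b, 1) > 1:
            new_b = self.free.pop()          # raises if the pool is exhausted
            self.ref_count[b] -= 1
            self.ref_count[new_b] = 1
            blocks[idx] = new_b              # the KV data copy itself happens on the GPU
            return new_b
        return b
```

With a scheme like this, parallel samples or beam candidates that share a common prompt can reference the same physical blocks, and a private copy is made only when one sequence writes into a shared block; this kind of sharing is what the paper credits for the extra memory savings under complex decoding algorithms.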

Experimental Results

Evaluations show that vLLM improves throughput by 2-4x over existing state-of-the-art systems such as FasterTransformer and Orca without compromising latency or model accuracy. The gains are even more pronounced with longer sequences and more complex decoding scenarios. Furthermore, memory savings range from 6.1% to 66.3% depending on the particular decoding algorithm employed, underscoring the significant efficiency gains achieved through PagedAttention.

Conclusion

PagedAttention and vLLM together address the memory management inefficiencies in serving LLMs. By borrowing concepts from operating systems and innovatively adapting them to the serving of deep learning models, vLLM enables significantly higher throughput and flexibility, pushing the boundaries of what's possible in LLM serving systems. As LLMs continue to grow and become more widely deployed, solutions like vLLM will be vital in ensuring they can be served efficiently and effectively.
