Hydragen: High-Throughput LLM Inference with Shared Prefixes

(2402.05099)
Published Feb 7, 2024 in cs.LG

Abstract

Transformer-based LLMs are now deployed to hundreds of millions of users. LLM inference is commonly performed on batches of sequences that share a prefix, such as few-shot examples or a chatbot system prompt. Decoding in this large-batch setting can be bottlenecked by the attention operation, which reads large key-value (KV) caches from memory and computes inefficient matrix-vector products for every sequence in the batch. In this work, we introduce Hydragen, a hardware-aware exact implementation of attention with shared prefixes. Hydragen computes attention over the shared prefix and unique suffixes separately. This decomposition enables efficient prefix attention by batching queries together across sequences, reducing redundant memory reads and enabling the use of hardware-friendly matrix multiplications. Our method can improve end-to-end CodeLlama-13b throughput by up to 32x against competitive baselines, with speedup growing with the batch size and shared prefix length. Hydragen also enables the use of very long shared contexts: with a large batch size, increasing the prefix length from 1K to 16K tokens decreases Hydragen throughput by less than 15%, while the throughput of baselines drops by over 90%. Hydragen generalizes beyond simple prefix-suffix decomposition and can be applied to tree-based prompt sharing patterns, allowing us to further reduce inference time on competitive programming problems by 55%.

Hydragen decomposes attention over shared prefixes and unique suffixes, leveraging the GPU's specialized matrix multiplication hardware for efficient large-batch inference.

Overview

  • Hydragen is a novel approach to transformer attention for efficient LLM inference, targeting sequences with shared prefixes.

  • It introduces attention decomposition to separately process shared and unique sequence components, improving memory and computational efficiency.

  • Inter-sequence batching in Hydragen increases arithmetic intensity and enhances hardware utilization on GPUs.

  • Benchmark results show a significant throughput increase, with up to 32x acceleration and robust performance across a range of batch sizes and prefix lengths.

Background and Motivation

The efficient execution of LLM inference, particularly for batches of sequences that share a prefix, is critical for applications like chatbots and domain-specific assistants. Traditional implementations can be hampered by computationally expensive attention operations during batched decoding: each sequence repeatedly reads its large key-value (KV) cache from memory and performs a matrix-vector product, an operation with low arithmetic intensity that leaves the GPU memory-bound. Understanding and overcoming this bottleneck is particularly pertinent as transformer-based LLMs continue to scale, with pronounced implications for deployment efficiency and throughput.
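
To make the bottleneck concrete, the back-of-the-envelope calculation below compares the arithmetic intensity (FLOPs per byte read) of per-sequence decode attention with that of attention batched across sequences over a shared prefix. The function name, the fp16 KV-cache assumption, and the example sizes are illustrative rather than taken from the paper.

```python
# Back-of-the-envelope arithmetic intensity for decode attention over a
# shared prefix. Assumes an fp16 KV cache (2 bytes per element); numbers
# are illustrative, not measurements from the paper.
def attention_intensity(batch_size, prefix_len, head_dim, bytes_per_elem=2):
    # FLOPs for Q @ K^T plus probs @ V over the prefix, for the whole batch.
    flops = 2 * 2 * batch_size * prefix_len * head_dim
    # Bytes in the shared prefix K and V caches.
    kv_bytes = 2 * prefix_len * head_dim * bytes_per_elem
    return {
        # Without batching, the prefix KV is re-read once per sequence.
        "per-sequence matvec": flops / (batch_size * kv_bytes),
        # With inter-sequence batching, it is read once for all queries.
        "batched matmul": flops / kv_bytes,
    }

print(attention_intensity(batch_size=256, prefix_len=2048, head_dim=128))
# {'per-sequence matvec': 1.0, 'batched matmul': 256.0}
```

Modern accelerators sustain on the order of a hundred or more FLOPs of fp16 tensor-core compute per byte of memory bandwidth, so the per-sequence case is deep in memory-bound territory, while the batched case scales with the batch size and can approach compute-bound operation.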

Hydragen Introduction

"Hydragen" is introduced as a novel, exact implementation of transformer attention that addresses the inefficiencies associated with shared prefixes in batched LLM inference scenarios. The two primary innovations of Hydragen are:

  1. Attention Decomposition: This step separates the shared and unique components of the sequences, computing attention over the shared prefix and over each distinct suffix separately. This reduces redundant memory reads and permits an exact recombination of the partial attention results.
  2. Inter-Sequence Batching: Leveraging the decomposition, Hydragen batches attention queries for the shared prefix across all sequences, converting numerous matrix-vector products into more efficient matrix-matrix products. This significantly raises arithmetic intensity and improves hardware utilization, especially on GPUs whose tensor cores are optimized for matrix multiplication. Both steps are sketched in the code below.
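
The following PyTorch sketch shows both ideas for a single decode step over a batch that shares one prefix. It is a minimal reference formulation under assumed tensor layouts, with illustrative function names (attend, hydragen_decode_attention); it is not the paper's optimized implementation.

```python
import torch

def attend(q, k, v):
    """Attention over one KV block, also returning the log-sum-exp (LSE)
    of the scores so partial results over disjoint blocks can be merged exactly.
    q: (B, H, M, D), k/v: (B, H, N, D) -> out: (B, H, M, D), lse: (B, H, M, 1)."""
    scale = q.shape[-1] ** -0.5
    scores = torch.einsum("bhmd,bhnd->bhmn", q, k) * scale
    lse = torch.logsumexp(scores, dim=-1, keepdim=True)
    out = torch.einsum("bhmn,bhnd->bhmd", torch.exp(scores - lse), v)
    return out, lse

def hydragen_decode_attention(q, prefix_k, prefix_v, suffix_k, suffix_v):
    """One decode step for a batch of B sequences that share one prefix.
    q:        (B, H, 1, D)  one new query per sequence
    prefix_k: (1, H, P, D)  shared prefix KV cache, stored once
    suffix_k: (B, H, S, D)  per-sequence suffix KV caches"""
    # Inter-sequence batching: fold the batch dimension into the query length
    # so prefix attention becomes one matrix-matrix product against the shared
    # KV cache instead of B separate matrix-vector products.
    q_stacked = q.transpose(0, 2)                               # (1, H, B, D)
    pre_out, pre_lse = attend(q_stacked, prefix_k, prefix_v)    # (1, H, B, D)
    pre_out, pre_lse = pre_out.transpose(0, 2), pre_lse.transpose(0, 2)

    # Per-sequence attention over the short, unique suffix KV caches.
    suf_out, suf_lse = attend(q, suffix_k, suffix_v)            # (B, H, 1, D)

    # Exact softmax recombination of the two partial results via their LSEs.
    max_lse = torch.maximum(pre_lse, suf_lse)
    w_pre, w_suf = torch.exp(pre_lse - max_lse), torch.exp(suf_lse - max_lse)
    return (w_pre * pre_out + w_suf * suf_out) / (w_pre + w_suf)
```

Because the merge uses only each block's output and log-sum-exp, the same recombination extends to tree-shaped sharing patterns: attention over each shared node is computed once, with queries batched across every sequence that descends from that node, and the partial results are folded together in the same way.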

Experimental Results and Insights

Through extensive benchmarks, Hydragen demonstrates significant performance improvements. In batched decoding settings, it accelerates end-to-end throughput by up to 32x over high-performance baselines such as vLLM, with the advantage growing as batch size and prefix length increase. Even when the shared prefix reaches 16K tokens, Hydragen's throughput degrades by less than 15%, while the baselines' throughput drops by over 90%. Hydragen's framework also extends beyond simple prefix-suffix splits to more complex tree-based sharing patterns, reducing inference time on competitive programming problems by a further 55%.

Conclusion

Hydragen exemplifies the impact of hardware-aware optimizations on LLM inference throughput, especially in large-batch, shared-prefix settings. Because it exploits tensor cores and eliminates redundant memory reads without requiring custom hardware-specific kernels, Hydragen is a broadly applicable optimization for LLM deployment, including on future hardware platforms such as TPUs. Further research building on this implementation could lead to inference strategies that make fuller use of shared context while remaining computationally efficient.
