
Hydragen: High-Throughput LLM Inference with Shared Prefixes (2402.05099v2)

Published 7 Feb 2024 in cs.LG

Abstract: Transformer-based LLMs are now deployed to hundreds of millions of users. LLM inference is commonly performed on batches of sequences that share a prefix, such as few-shot examples or a chatbot system prompt. Decoding in this large-batch setting can be bottlenecked by the attention operation, which reads large key-value (KV) caches from memory and computes inefficient matrix-vector products for every sequence in the batch. In this work, we introduce Hydragen, a hardware-aware exact implementation of attention with shared prefixes. Hydragen computes attention over the shared prefix and unique suffixes separately. This decomposition enables efficient prefix attention by batching queries together across sequences, reducing redundant memory reads and enabling the use of hardware-friendly matrix multiplications. Our method can improve end-to-end CodeLlama-13b throughput by up to 32x against competitive baselines, with speedup growing with the batch size and shared prefix length. Hydragen also enables the use of very long shared contexts: with a large batch size, increasing the prefix length from 1K to 16K tokens decreases Hydragen throughput by less than 15%, while the throughput of baselines drops by over 90%. Hydragen generalizes beyond simple prefix-suffix decomposition and can be applied to tree-based prompt sharing patterns, allowing us to further reduce inference time on competitive programming problems by 55%.

Citations (25)

Summary

  • The paper introduces an exact implementation of transformer attention that computes attention over the shared prefix and over each sequence's unique suffix separately, reducing redundant memory reads.
  • It batches the prefix-attention queries across sequences, turning many matrix-vector products into matrix-matrix multiplications that make better use of GPU hardware.
  • With a large batch size, growing the shared prefix from 1K to 16K tokens reduces Hydragen's throughput by less than 15%, while baseline throughput drops by over 90%.

Background and Motivation

LLM inference is commonly run on batches of sequences that share a prefix, such as few-shot examples or a chatbot system prompt, so efficiency in this setting matters for applications like chatbots and domain-specific assistants. With large batches, decoding can be bottlenecked by the attention operation: for every sequence in the batch, it reads a large key-value (KV) cache from memory and computes an inefficient matrix-vector product against it. When the prefix is shared, much of that memory traffic is redundant, because the same prefix keys and values are read once per sequence. The bottleneck grows with batch size and prefix length, with direct consequences for deployment throughput.
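To make the bottleneck concrete, here is a rough back-of-the-envelope sketch of arithmetic intensity (FLOPs per byte read) for attention over one KV cache. The function name, the 2-byte element size, and the decision to count only the two matrix products are illustrative assumptions, not figures from the paper.

```python
def attention_arithmetic_intensity(num_queries, kv_len, head_dim, bytes_per_elem=2):
    """Rough FLOPs-per-byte estimate for attention over a single KV cache.

    Assumptions (hypothetical, for illustration only): 16-bit elements,
    only the QK^T and AV matrix products are counted, and every tensor is
    read or written exactly once.
    """
    # 2 * kv_len * head_dim FLOPs per query for QK^T, and the same for AV.
    flops = 4 * num_queries * kv_len * head_dim
    # Keys and values are read once regardless of how many queries use them.
    kv_bytes = 2 * kv_len * head_dim * bytes_per_elem
    # Queries are read and outputs are written once per query.
    qo_bytes = 2 * num_queries * head_dim * bytes_per_elem
    return flops / (kv_bytes + qo_bytes)


# One query per KV cache read (matrix-vector): intensity stays near 1.
single = attention_arithmetic_intensity(num_queries=1, kv_len=1024, head_dim=128)
# 256 queries sharing one KV cache read (matrix-matrix): intensity grows with the batch.
batched = attention_arithmetic_intensity(num_queries=256, kv_len=1024, head_dim=128)
print(single, batched)
```

Under these assumptions, one query per cache read yields roughly one FLOP per byte, which leaves the GPU waiting on memory, while sharing a single read of the cache among many queries raises the ratio roughly in proportion to the number of queries.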

Hydragen Introduction

Hydragen is an exact implementation of transformer attention that targets the inefficiencies of shared prefixes in batched LLM inference. It rests on two ideas:

  1. Attention Decomposition: Attention over the full sequence is split into attention over the shared prefix and attention over each sequence's unique suffix. The two partial results are then recombined, each rescaled by its softmax normalization term, to give exactly the same output as attention over the whole sequence. Because the prefix is handled separately, its keys and values no longer need to be read once per sequence.
  2. Inter-Sequence Batching: With the prefix isolated, Hydragen batches the prefix-attention queries from all sequences together, replacing many matrix-vector products with a single matrix-matrix product. This raises arithmetic intensity and improves hardware utilization, especially on GPUs with tensor cores built for matrix multiplication.
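The two ideas above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: it assumes a single attention head, one decoding query per sequence, and suffixes of equal length, and the function names `attention_with_lse` and `shared_prefix_attention` are hypothetical. The recombination uses the log-sum-exp (LSE) of each partial softmax.

```python
import numpy as np


def attention_with_lse(q, k, v):
    """Softmax attention that also returns the log-sum-exp of the scores.

    q: (..., n_q, d); k and v: (..., n_kv, d).
    """
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    m = scores.max(axis=-1, keepdims=True)      # subtract the max for stability
    exp = np.exp(scores - m)
    denom = exp.sum(axis=-1, keepdims=True)
    out = (exp / denom) @ v
    lse = m + np.log(denom)                     # log of the softmax denominator
    return out, lse


def shared_prefix_attention(q, k_prefix, v_prefix, k_suffix, v_suffix):
    """Exact attention over [shared prefix; per-sequence suffix].

    q: (B, d), one query per sequence.
    k_prefix, v_prefix: (P, d), stored once for the whole batch.
    k_suffix, v_suffix: (B, S, d), unique to each sequence.
    """
    # Prefix: all B queries hit the same keys, so this is one matrix-matrix product.
    out_p, lse_p = attention_with_lse(q, k_prefix, v_prefix)            # (B, d), (B, 1)
    # Suffix: each query attends only to its own sequence's keys.
    out_s, lse_s = attention_with_lse(q[:, None, :], k_suffix, v_suffix)
    out_s, lse_s = out_s[:, 0], lse_s[:, 0]                             # (B, d), (B, 1)
    # Recombine: weight each partial output by its share of the full softmax denominator.
    m = np.maximum(lse_p, lse_s)
    w_p, w_s = np.exp(lse_p - m), np.exp(lse_s - m)
    return (w_p * out_p + w_s * out_s) / (w_p + w_s)


rng = np.random.default_rng(0)
B, P, S, d = 4, 16, 3, 8
q = rng.standard_normal((B, d))
k_prefix, v_prefix = rng.standard_normal((P, d)), rng.standard_normal((P, d))
k_suffix, v_suffix = rng.standard_normal((B, S, d)), rng.standard_normal((B, S, d))
out = shared_prefix_attention(q, k_prefix, v_prefix, k_suffix, v_suffix)
```

Comparing `out` against ordinary attention over the concatenated keys and values of each sequence is a quick way to confirm that the decomposition is exact rather than an approximation.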

Experimental Results and Insights

In the paper's benchmarks, Hydragen improves end-to-end CodeLlama-13b throughput by up to 32x over competitive baselines such as vLLM, and the speedup grows with batch size and shared prefix length. It also makes very long shared contexts practical: with a large batch size, increasing the prefix length from 1K to 16K tokens reduces Hydragen's throughput by less than 15%, while baseline throughput drops by over 90%. Hydragen also generalizes beyond a simple prefix-suffix split to tree-based prompt sharing patterns, which further reduces inference time on competitive programming problems by 55%.

Conclusion

Hydragen illustrates how much hardware-aware optimization can improve LLM inference throughput in large-batch, shared-prefix settings. It exploits tensor cores and eliminates redundant memory reads without requiring custom hardware-specific code, which suggests the approach could carry over to other accelerators such as TPUs. Because throughput degrades only modestly as the shared prefix grows, it also makes it practical to give batched workloads much longer shared contexts.
