
Efficient Streaming Language Models with Attention Sinks

(2309.17453)
Published Sep 29, 2023 in cs.CL and cs.AI

Abstract

Deploying LLMs in streaming applications such as multi-round dialogue, where long interactions are expected, is urgently needed but poses two major challenges. Firstly, during the decoding stage, caching previous tokens' Key and Value states (KV) consumes extensive memory. Secondly, popular LLMs cannot generalize to longer texts than the training sequence length. Window attention, where only the most recent KVs are cached, is a natural approach -- but we show that it fails when the text length surpasses the cache size. We observe an interesting phenomenon, namely attention sink, that keeping the KV of initial tokens will largely recover the performance of window attention. In this paper, we first demonstrate that the emergence of attention sink is due to the strong attention scores towards initial tokens as a "sink" even if they are not semantically important. Based on the above analysis, we introduce StreamingLLM, an efficient framework that enables LLMs trained with a finite length attention window to generalize to infinite sequence lengths without any fine-tuning. We show that StreamingLLM can enable Llama-2, MPT, Falcon, and Pythia to perform stable and efficient language modeling with up to 4 million tokens and more. In addition, we discover that adding a placeholder token as a dedicated attention sink during pre-training can further improve streaming deployment. In streaming settings, StreamingLLM outperforms the sliding window recomputation baseline by up to 22.2x speedup. Code and datasets are provided at https://github.com/mit-han-lab/streaming-llm.

Figure: Perplexity analysis on 20K-token texts across LLMs reveals trends in attention mechanism effectiveness.

Overview

  • The paper addresses the memory and generalization challenges faced by LLMs in streaming applications by proposing a new framework called StreamingLLM.

  • StreamingLLM retains a few initial 'attention sink' tokens, which receive disproportionately high attention scores, to stabilize performance once older tokens are evicted from the cache.

  • The framework sustains efficient performance over effectively unbounded input sequences without additional fine-tuning by caching the KV states of recent tokens together with those of the attention sink tokens.

  • Empirical evaluations demonstrate that StreamingLLM delivers significant speed improvements and maintains accuracy in streaming question-answering environments compared to existing methods.

  • StreamingLLM improves the deployment of LLMs in real-time applications that must interact with extensive text, marking an advance toward the continuous use of LLMs.

Introduction

The deployment of LLMs in streaming applications demands an approach that addresses both extensive memory consumption during decoding and limited generalization to text longer than the training sequence length. Existing methods, such as window attention and the sliding window with re-computation, have their own limitations. Window attention fails once the text length exceeds the cache size, and the sliding window with re-computation, despite its strong performance, suffers from impractical latency for live applications because attention over the window must be recomputed from scratch for every new token, giving quadratic complexity in the window size.
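A minimal sketch of the window-attention baseline, under assumed names and sizes rather than the paper's actual code, makes the trade-off concrete: window attention keeps only the most recent KV states, while re-computation rebuilds the whole window's states at every step.

```python
from collections import deque

# Hypothetical illustration of plain window attention caching. Only the most
# recent `window_size` tokens' KV states are kept; the earliest entries are
# evicted automatically once the cache fills up -- including the initial
# tokens, which is where window attention breaks down.
def window_attention_cache(kv_stream, window_size=1024):
    cache = deque(maxlen=window_size)
    for kv in kv_stream:
        cache.append(kv)
        yield list(cache)

# The sliding window with re-computation keeps quality but, for each of T
# generated tokens, recomputes attention over an L-token window -- roughly
# O(T * L^2) work, which is why its latency is impractical for live streaming.
```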

Attention Sink Phenomenon

The researchers behind StreamingLLM investigated the underlying issue with window attention and identified a key phenomenon they term the "attention sink." The term refers to the large attention scores allocated to the initial tokens even when they carry little semantic relevance. Their analysis reveals that, because of the softmax operation in the attention mechanism, LLMs tend to dump surplus attention onto these initial tokens, which provide a stable 'sink' for attention mass that does not necessarily correlate with semantic significance. Retaining just four initial tokens as attention sinks is enough to stabilize LLM performance, showing that these tokens function primarily as positionally biased anchors for the attention distribution.
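A small numeric illustration, with made-up logits rather than values from the paper, shows why evicting the sink tokens is destabilizing: softmax must distribute a total attention mass of 1, and trained LLMs park most of that mass on the first positions.

```python
import torch
import torch.nn.functional as F

# Assumed toy logits (not taken from the paper): the first two positions play
# the role of attention sinks with large raw scores.
logits = torch.tensor([6.0, 5.5, 0.2, 0.1, 0.3, 0.2])

full = F.softmax(logits, dim=-1)
print(full)      # ~[0.62, 0.37, 0.002, ...] -- the sinks absorb over 99% of the mass

# Evict the initial tokens, as plain window attention eventually does:
evicted = F.softmax(logits[2:], dim=-1)
print(evicted)   # ~[0.25, 0.23, 0.28, 0.25] -- the mass is redistributed onto tokens
                 # the model never learned to weight this heavily, which is why
                 # perplexity spikes once the sink tokens leave the cache
```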

StreamingLLM Framework

To address these challenges, StreamingLLM proposes a framework that maintains efficient performance over effectively unbounded input sequences without additional fine-tuning. By retaining the Key and Value (KV) states of a finite window of recent tokens alongside a fixed set of attention sink tokens, StreamingLLM sidesteps the model collapse experienced by window attention. The research further suggests that pre-training LLMs with a dedicated attention sink token significantly improves streaming performance, allowing a single placeholder token to serve as the attention anchor and thereby optimizing models for streaming deployment.
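A minimal sketch of the cache-eviction policy follows, with assumed function and parameter names (the actual implementation lives in the linked repository): keep the KV states of a few initial sink tokens plus a rolling window of recent tokens, and drop everything in between.

```python
# Hypothetical helper, not the repository's API: returns which token positions
# keep their KV states cached at a given sequence length.
def streaming_kv_keep(seq_len, num_sinks=4, window_size=1020):
    if seq_len <= num_sinks + window_size:
        return list(range(seq_len))                       # nothing to evict yet
    sinks = list(range(num_sinks))                        # always keep the initial sink tokens
    recent = list(range(seq_len - window_size, seq_len))  # plus the most recent window
    return sinks + recent

print(len(streaming_kv_keep(4_000_000)))  # 1024 -- the cache stays constant-sized
print(streaming_kv_keep(2048)[:6])        # [0, 1, 2, 3, 1028, 1029]
```

Because the cache size stays constant no matter how long the stream grows, memory use and per-token latency stay flat, which is what makes generation over millions of tokens feasible.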

Evaluation and Performance

Empirical results reinforce the efficacy of StreamingLLM across a variety of model families, including Llama-2, MPT, Falcon, and Pythia. The framework performs stable language modeling over texts of up to 4 million tokens and more, achieving up to a 22.2x speedup over the sliding window with re-computation baseline. In simulated streaming question-answering settings, StreamingLLM matches the accuracy of standard, non-streaming baselines while handling continuous input. Additionally, pre-training LLMs with a dedicated sink token was shown to preserve or marginally improve model performance in streaming cases. These findings offer a compelling path to deploying LLMs in real-time applications that require long-duration interactions and efficient processing of substantial text volumes.

Conclusion

StreamingLLM decouples an LLM's pre-training attention window size from the length of text it can process at inference time, enabling efficient streaming over prolonged text without fine-tuning the model. It represents a significant stride toward making the continuous deployment of LLMs practical across a breadth of platforms and applications. The insights and methodology could serve as an essential foundation for future research and implementation in the field of streaming language models.
