
Soaring from 4K to 400K: Extending LLM's Context with Activation Beacon

(2401.03462)
Published Jan 7, 2024 in cs.CL and cs.AI

Abstract

The utilization of long contexts poses a big challenge for LLMs due to their limited context window size. Although the context window can be extended through fine-tuning, it will result in a considerable cost at both training and inference time, and exert an unfavorable impact on the LLM's original capabilities. In this work, we propose a new method called Activation Beacon, which condenses the LLM's raw activations into compact forms such that the LLM can perceive a longer context with a limited context window. Activation Beacon is introduced as a plug-in module, which fully preserves the LLM's original capability in short contexts. It works with the sliding window to process the long context in a streaming manner, which leads to competitive memory and time efficiency in both training and inference. Activation Beacon is trained with short-sequence data of diversified condensing ratios. Thanks to such a treatment, it can be effectively learned to support different context lengths with a small training cost. Our experiment verifies Activation Beacon's effectiveness for context extension: it can accomplish high-quality extension of Llama-2-7B's context by 100 times (from 4K to 400K); meanwhile, it can also achieve superior performance across a variety of long-context language modeling and understanding tasks. The source code and model checkpoint are available at https://github.com/FlagOpen/FlagEmbedding.

Appending a beacon token prompts the LLM to condense activations for streamlined auto-regressive processing.

Overview

  • LLMs are constrained by a fixed, relatively small context window, which limits their use for understanding lengthy documents.

  • The Activation Beacon approach allows LLMs to handle expanded context by condensing internal data representations into a more compact form via special 'beacon' tokens.

  • The method trains efficiently on short-sequence data and keeps the base LLM's parameters frozen, enabling context extension without re-training the entire model.

  • Empirical evidence shows the Activation Beacon approach greatly increases context window size while sustaining performance on language modeling and understanding tasks.

  • Activation Beacon offers a scalable, cost-effective way to enhance LLMs, benefitting longer-form language applications while preserving investments in existing models.

Introduction

Large language models (LLMs) have transformed our ability to automate natural language tasks. However, their effectiveness is often shackled by an intrinsic limitation: they can only consider a fixed, relatively short snippet of text at any given time. This constraint on context window size has been a persistent challenge, restricting the potential uses of LLMs in scenarios where understanding lengthy documents or conversations is crucial. To remedy this, researchers have traditionally resorted to fine-tuning or re-training models to handle longer contexts, a procedure that comes at great computational cost and risks compromising the model's performance on shorter texts.

The Activation Beacon Approach

In a promising development, researchers have introduced a new methodology called "Activation Beacon", which targets the root of the context limitation problem. Taking cues from insights that LLM activations (the internal data representations the model uses) are information-dense, the Activation Beacon approach condenses these activations into a more compact form. The result? Even with a restricted window of attention, the LLM can access a broader range of context.

Activation Beacon works by inserting special tokens, known as "beacons", at intervals across the input data. These beacons actively condense information, allowing them to carry the essence of much larger text segments. This strategy not only increases the amount of textual content an LLM can consider but does so with remarkable efficiency and without affecting the performance on existing, shorter contexts.
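To make the chunk-and-beacon layout concrete, here is a minimal, illustrative sketch of splitting a long token stream into chunks and interleaving beacon tokens. The chunk size, beacon placeholder, and helper name are assumptions for illustration rather than the paper's actual implementation; the key idea is that each chunk of raw tokens is followed by a handful of beacon tokens whose activations summarize it.

```python
# Illustrative only: chunk size, beacon placeholder, and helper name are assumed.
CHUNK_LEN = 1024          # raw tokens processed per sliding-window step (assumed)
BEACON = "<bcn>"          # stand-in for the special beacon token

def interleave_beacons(tokens, condensing_ratio):
    """Split `tokens` into chunks and append beacon tokens to each chunk.

    With condensing ratio r, a chunk of n raw tokens is followed by roughly
    n // r beacons; only the beacons' activations are kept as compact memory
    for subsequent chunks.
    """
    chunks = []
    for start in range(0, len(tokens), CHUNK_LEN):
        chunk = list(tokens[start:start + CHUNK_LEN])
        num_beacons = max(1, len(chunk) // condensing_ratio)
        chunks.append(chunk + [BEACON] * num_beacons)
    return chunks

# Example: 4,096 tokens at a condensing ratio of 8 -> four 1,024-token chunks,
# each carrying 128 beacon tokens that summarize it.
tokens = [f"tok{i}" for i in range(4096)]
for i, chunk in enumerate(interleave_beacons(tokens, condensing_ratio=8)):
    print(f"chunk {i}: {len(chunk)} entries, {chunk.count(BEACON)} beacons")
```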

Streamlined Training and Compatibility

A remarkable aspect of Activation Beacon is its ability to train efficiently on short-sequence data, consuming considerably less time and compute resources compared to methods that rely on extensive re-training. The beacons are introduced as a plug-and-play module atop a pre-existing LLM, keeping the original language model parameters fixed. This approach retains model compatibility, letting Activation Beacon potentially extend its context-handling capabilities a hundredfold, effectively stretching a 4K context limit to a staggering 400K.
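The hundredfold figure follows from simple arithmetic: if most of the window holds condensed beacon activations, each standing in for many raw tokens, the effective context grows with the condensing ratio. The sketch below works through that arithmetic; the split of the 4K window into raw tokens versus memory slots is an assumed configuration, used only to show the order of magnitude.

```python
# Back-of-the-envelope arithmetic; the 1,024/3,072 split of the window is assumed.
window = 4096                        # Llama-2-7B's native context window
raw_chunk = 1024                     # raw tokens kept per step (assumption)
memory_slots = window - raw_chunk    # room left for condensed beacon activations

for ratio in (2, 4, 8, 16, 32, 64, 128):
    # each beacon activation stands in for `ratio` raw tokens of past context
    effective = raw_chunk + memory_slots * ratio
    print(f"condensing ratio {ratio:>3}: ~{effective:,} effective tokens")

# At high ratios the condensed memory spans hundreds of thousands of tokens,
# which is the intuition behind the reported x100 extension (4K -> 400K).
```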

Empirical Validation

Through comprehensive experiments, the effectiveness of Activation Beacon was assessed. The results showed that it extends the context window far beyond the model's original limit without the extensive costs typically associated with such extensions, delivering superior language modeling and understanding over long contexts while maintaining competitive processing speed and memory efficiency. The study also confirmed that Activation Beacon can be trained effectively with a mix of condensing ratios, which allows a single model to serve varying context lengths.
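For intuition, the diversified-condensing-ratio recipe can be pictured as sampling a ratio per training sequence from a candidate set, so the beacons learn to summarize at many granularities. The candidate set and sampling scheme below are assumptions for illustration, not the paper's exact schedule.

```python
import random

# Candidate ratios and sampling scheme are assumptions, not the paper's schedule.
CANDIDATE_RATIOS = [2, 4, 8, 16, 32, 64, 128]

def attach_condensing_ratios(sequences):
    """Pair each short training sequence with a randomly sampled condensing ratio."""
    return [
        {"tokens": seq, "condensing_ratio": random.choice(CANDIDATE_RATIOS)}
        for seq in sequences
    ]

batch = attach_condensing_ratios(
    [[f"tok{i}" for i in range(2048)] for _ in range(4)]
)
for item in batch:
    print(len(item["tokens"]), "tokens condensed at ratio", item["condensing_ratio"])
```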

Conclusion

In conclusion, Activation Beacon stands out as an inventive solution to the context window restriction in LLMs. It is a robust, scalable, and cost-effective module capable of significantly broadening the scope of contexts that LLMs can manage. Activation Beacon's plug-and-play nature coupled with its training efficiency opens up new horizons for longer-form language modeling and understanding tasks. Further, its compatibility ensures that existing LLM investments remain fruitful, adding yet another layer to the versatile applications of language models in modern computational linguistics.
