Long-Context Language Modeling with Parallel Context Encoding

(2402.16617)
Published Feb 26, 2024 in cs.CL

Abstract

Extending LLMs to process longer inputs is crucial for numerous applications. However, the considerable computational cost of transformers, coupled with limited generalization of positional encoding, restricts the size of their context window. We introduce Context Expansion with Parallel Encoding (CEPE), a framework that can be applied to any existing decoder-only LLMs to extend their context window. CEPE adopts a small encoder to process long inputs chunk by chunk and enables the frozen decoder to leverage additional contexts via cross-attention. CEPE is efficient, generalizable, and versatile: trained with 8K-token documents, CEPE extends the context window of LLAMA-2 to 128K tokens, offering 10x the throughput with only 1/6 of the memory. CEPE yields strong performance on language modeling and in-context learning. CEPE also excels in retrieval-augmented applications, while existing long-context models degenerate with retrieved contexts. We further introduce a CEPE variant that can extend the context window of instruction-tuned models with only unlabeled data, and showcase its effectiveness on LLAMA-2-CHAT, leading to a strong instruction-following model that can leverage very long context on downstream tasks.

Figure: Comparison of methods to extend LLMs' context window, including YaRN and other techniques.

Overview

  • The paper presents the Context Expansion with Parallel Encoding (CEPE) framework, designed to improve the context-handling capabilities of LLMs on long texts.

  • CEPE utilizes a compact encoder for processing long inputs in chunks and a cross-attention module within the decoder to enhance context understanding, improving efficiency and effectiveness.

  • Demonstrates significant improvements on language modeling and in-context learning tasks, including a notable increase in throughput and a decrease in memory usage when processing very long inputs.

  • Introduces a distilled variant, CEPE-Distilled (CEPED), which extends instruction-tuned models to handle long texts using only unlabeled data, opening avenues for future research in LLM context extension.

Enhancing the Context Window of LLMs with the CEPE Framework

Introduction

The paper introduces Context Expansion with Parallel Encoding (CEPE), a novel framework for extending the context-handling capabilities of existing LLMs. The framework responds to the pressing need for LLMs to parse and comprehend long contexts, which is essential for a multitude of complex tasks, from summarizing lengthy documents to answering questions over broad collections of web pages. However, the quadratic computational cost of transformer attention, together with the limited generalization of positional encodings, has traditionally made processing long sequences difficult.

CEPE Architecture

CEPE introduces a two-fold strategy: a compact encoder that processes long inputs chunk by chunk, and a cross-attention module inserted into the decoder layers for enriched context understanding. This design departs from purely decoder-only models: the encoder processes the segmented inputs in parallel, and the decoder attends to the resulting chunk representations through cross-attention, so the model scales to longer inputs without a drastic increase in computational cost.
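To make the architecture concrete, here is a minimal PyTorch sketch of the two trainable components: a small encoder applied to each chunk independently, and a cross-attention block through which the frozen decoder's hidden states attend over the concatenated chunk representations. All names, sizes, and module layouts below are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class ChunkEncoder(nn.Module):
    """Small encoder applied to every context chunk independently."""
    def __init__(self, d_model=512, n_layers=4, n_heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.layers = nn.TransformerEncoder(layer, n_layers)

    def forward(self, chunks):
        # chunks: (batch * n_chunks, chunk_len, d_model)
        return self.layers(chunks)

class CrossAttentionBlock(nn.Module):
    """Trainable cross-attention inserted alongside a frozen decoder layer."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, hidden, context):
        # hidden:  (batch, tgt_len, d_model)              decoder hidden states
        # context: (batch, n_chunks * chunk_len, d_model) encoded context
        attended, _ = self.attn(hidden, context, context)
        return self.norm(hidden + attended)  # residual connection + norm

def encode_in_chunks(encoder, embeds, chunk_len):
    """Split a long embedded input into fixed-size chunks, encode each chunk
    in parallel (no attention across chunks), and concatenate the outputs.
    Assumes the sequence length is a multiple of chunk_len."""
    b, t, d = embeds.shape
    chunks = embeds.reshape(b * (t // chunk_len), chunk_len, d)
    return encoder(chunks).reshape(b, t, d)
```

Because each chunk attends only to itself, the encoder's attention cost is bounded by the chunk size, and the decoder sees the long context solely through these cross-attention layers; per the paper, only the encoder and cross-attention modules are trained while the decoder stays frozen.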

Efficiency and Versatility

CEPE marks a significant step forward in efficiency and versatility for extending context windows in LLMs. Notably, although trained only on 8K-token documents, CEPE extends LLaMA-2's context window to 128K tokens, offering 10x the throughput of standard decoding with only 1/6 of the memory. Standard decoding, by contrast, sees its memory consumption grow linearly with input length as the key-value cache accumulates. The parallel processing of context chunks and the selective tuning of only the encoder and cross-attention modules considerably reduce the computational overhead, making CEPE a practical option for large-scale deployment.
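A back-of-the-envelope attention-cost comparison (standard transformer arithmetic, not a result stated in the paper) shows where these savings come from. For a context of $N$ tokens split into chunks of size $c$:

```latex
\underbrace{O(N^2)}_{\text{full self-attention}}
\quad\longrightarrow\quad
\frac{N}{c}\cdot O(c^2) \;=\; O(Nc)
\quad \text{(parallel encoding, chunk size } c\text{)}
```

For a fixed chunk size, encoding cost therefore grows linearly rather than quadratically with context length, which is consistent with the throughput and memory gains reported above.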

Practical Applications and Performance

CEPE's utility is demonstrated across a range of tasks, with notable gains in language modeling, in-context learning, and retrieval-augmented applications. For language modeling, CEPE outperforms existing methods on longer inputs with far better efficiency. In retrieval-augmented settings, where leveraging external documents is necessary, CEPE performs especially well, incorporating more retrieved documents without degradation in output quality (a hypothetical usage pattern is sketched below), whereas existing long-context models degenerate when given retrieved contexts. The paper also introduces the CEPE-Distilled (CEPED) variant, which extends instruction-tuned models for better performance on downstream tasks involving long texts while using only unlabeled data.
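As a purely illustrative sketch of that retrieval-augmented pattern, the snippet below reuses the hypothetical ChunkEncoder, CrossAttentionBlock, and encode_in_chunks from the architecture section: retrieved passages become encoder chunks processed in parallel, while the decoder handles only the query. The tensors are random stand-ins for checking shapes, not the paper's actual pipeline.

```python
import torch

batch, n_chunks, chunk_len, d_model = 1, 16, 256, 512

encoder = ChunkEncoder(d_model=d_model)
xattn = CrossAttentionBlock(d_model=d_model)

# Stand-ins for embedded retrieved passages and the decoder's hidden
# states for the user query.
passages = torch.randn(batch, n_chunks * chunk_len, d_model)
query_states = torch.randn(batch, 64, d_model)

context = encode_in_chunks(encoder, passages, chunk_len)  # chunks encoded in parallel
fused = xattn(query_states, context)                      # decoder attends to all passages
print(fused.shape)  # torch.Size([1, 64, 512])
```

Adding another retrieved document in this scheme only appends one more chunk for the encoder, rather than lengthening the decoder's own attention window, which is one way to read the paper's finding that CEPE absorbs more retrieved contexts gracefully.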

Future Directions

The paper positions CEPE as an enabling technology for future LLM research focused on cheap and effective context extension. While CEPE markedly improves existing models' ability to handle extended contexts efficiently, possible areas for enhancement include exploring different encoder sizes, learning rates, and data mixtures. Applying CEPE to a broader array of instruction-tuned models is another intriguing avenue for further exploration.

Conclusion

The CEPE framework represents a substantial advancement in the ability of LLMs to process and understand extended contexts. By strategically modifying the transformer architecture to incorporate a parallel encoding mechanism, CEPE not only improves efficiency and reduces computational cost but also extends the practical usability of LLMs on complex tasks involving large amounts of data. As LLM applications continue to expand, frameworks like CEPE will play a pivotal role in unlocking new potential and overcoming existing limitations.
