
Abstract

Since the proposal of transformers, these models have been limited to bounded input lengths, because of their need to attend to every token in the input. In this work, we propose Unlimiformer: a general approach that wraps any existing pretrained encoder-decoder transformer, and offloads the cross-attention computation to a single k-nearest-neighbor (kNN) index, while the returned kNN distances are the attention dot-product scores. This kNN index can be kept on either the GPU or CPU memory and queried in sub-linear time; this way, we can index practically unlimited input sequences, while every attention head in every decoder layer retrieves its top-k keys, instead of attending to every key. We evaluate Unlimiformer on several long-document and book-summarization benchmarks, showing that it can process even 500k token-long inputs from the BookSum dataset, without any input truncation at test time. We demonstrate that Unlimiformer improves pretrained models such as BART and Longformer by extending them to unlimited inputs without additional learned weights and without modifying their code. We make our code and models publicly available at https://github.com/abertsch72/unlimiformer .

Figure (from the paper): Encoding and indexing a 6-token input for cross-attention using Unlimiformer, given a 2-token context limit.

Overview

  • The paper introduces Unlimiformer, a technique that lets encoder-decoder transformers process input sequences of practically unlimited length, without attending to every input token, by leveraging $k$-nearest-neighbor ($k$NN) retrieval for the cross-attention computation.

  • Unlimiformer enhances existing encoder-decoder transformers by using $k$NN indices to retrieve and attend to the most relevant token embeddings, reducing the complexity of cross-attention and allowing for sub-linear time queries during decoding.

  • Empirical evaluations show that Unlimiformer significantly improves performance on long-document summarization tasks, scaling to input lengths of up to 500k tokens, and successfully integrates with pretrained models like BART without additional training.

Unlimiformer: Long-Range Transformers with Unlimited Length Input

The paper "Unlimiformer: Long-Range Transformers with Unlimited Length Input" introduces a novel approach for extending the length of input sequences that transformers can process, without the commonly associated computational cost increase. Concretely, the authors introduce Unlimiformer, a method that leverages $k$-nearest-neighbor ($k$NN) retrieval to offload cross-attention computation. This approach theoretically and empirically removes the limitation on input length for transformers while maintaining their performance.

Key Contributions

The primary contribution of this paper is the Unlimiformer technique, which facilitates unlimited input length by incorporating a $k$NN index. Rather than modifying the underlying architecture or adding learned parameters, Unlimiformer wraps existing encoder-decoder transformers. It does this by:

  1. Retrieval-based Cross-Attention: In a standard encoder-decoder transformer, every cross-attention head attends to all encoded input tokens, so the cost of processing long inputs grows steeply with input length (the encoder's self-attention is quadratic in it), which forces truncation. Unlimiformer instead uses the $k$NN index to retrieve, at each decoding step, the top-$k$ token embeddings most relevant to the current query, so attention can cover theoretically unlimited input sequences at a bounded per-step cost. The $k$NN distances serve directly as the attention dot-product scores.

  2. Encoding and Indexing: The input is first encoded in overlapping chunks that each fit within the model's context window, and a $k$NN index is then constructed over the encoded tokens. Every attention head in every decoder layer queries this index in sub-linear time, dynamically attending to different parts of the input as needed (a simplified code sketch follows this list).
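
To make these two steps concrete, here is a minimal sketch in Python (PyTorch plus FAISS). It is an illustration under assumptions rather than the repository's actual code: the function names, chunking parameters, and default top-$k$ are made up, and the query passed to the retrieval step is assumed to be already projected into the encoder key space so that the returned inner products are exactly the attention logits.

```python
import faiss   # any max-inner-product kNN library would work here
import torch

def encode_and_index(encoder, input_ids, chunk_len=1024, stride=512):
    """Encode an arbitrarily long input in overlapping chunks and build one
    kNN index over the encoder hidden states. Illustrative only: a real
    implementation would also deduplicate the overlapping positions."""
    states = []
    for start in range(0, input_ids.size(1), stride):
        chunk = input_ids[:, start:start + chunk_len]        # fits the model's context window
        out = encoder(input_ids=chunk).last_hidden_state     # (1, chunk_len, d), HF-style encoder
        states.append(out[0])
    states = torch.cat(states, dim=0)                        # (n_tokens, d)
    index = faiss.IndexFlatIP(states.size(1))                # exact inner-product search
    index.add(states.detach().cpu().numpy())                 # the index may live on CPU
    return index, states

def knn_cross_attention(query, index, states, w_v, k=16):
    """Cross-attention for one head over only the top-k retrieved tokens.
    `query` has shape (1, d) and is assumed to be projected into the encoder
    key space, so the kNN inner products are the attention dot-product scores."""
    scores, ids = index.search(query.detach().cpu().numpy(), k)   # (1, k) each
    retrieved = states[torch.from_numpy(ids[0]).long()]           # (k, d) encoder states
    logits = torch.from_numpy(scores[0]).to(retrieved.device)
    attn = torch.softmax(logits, dim=-1)                          # softmax over the k logits only
    return attn @ (retrieved @ w_v)                               # weighted sum of projected values
```

In the paper's formulation, a single shared index serves every attention head in every decoder layer, and it can be kept in GPU memory for speed or CPU memory for very long inputs, as noted in the abstract.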

The paper substantiates Unlimiformer's effectiveness through comprehensive evaluation on long-document and book-summarization tasks. Notably, the method is shown to scale to inputs of up to 500k tokens without any truncation at inference time.

Numerical Results

Evaluations on various long-document summarization datasets, including the BookSum dataset, show that Unlimiformer significantly enhances the performance of base models such as BART and Longformer. Key numerical results include:

  • GovReport: Using Unlimiformer, the BART model achieves a ROUGE-1 score of 56.6 and a BERTScore of 68.2, outperforming both the base BART model and PRIMERA, a model built on the Longformer-Encoder-Decoder.
  • SummScreen: Unlimiformer improves the ROUGE-1 score of BART from 29.7 to 34.7.
  • BookSum: The method doubles the entity recall compared to the baseline, further emphasizing its capacity to retain and utilize extensive input contexts effectively.

Practical and Theoretical Implications

The practical implications of this research are multifaceted:

  • Compatibility with Pretrained Models: Unlimiformer can wrap an existing pretrained (and already fine-tuned) encoder-decoder model at test time without any additional training or learned weights; when fine-tuning with Unlimiformer is applied, the added computational overhead remains minimal.
  • Scalability: The ability to handle arbitrarily long input sequences opens up new possibilities for tasks that require processing large documents, such as legal document review and comprehensive literature summarization.
  • Computational Efficiency: Despite the extended input capability, the sub-linear time complexity of $k$NN queries keeps Unlimiformer efficient in both memory and processing speed (see the small index sketch after this list).
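
As a rough illustration of the efficiency point above, the snippet below contrasts an exact flat FAISS index with an approximate IVF index; the dimensions and parameters are made-up stand-ins rather than the paper's configuration, but they show how a $k$NN query can avoid scanning every indexed token.

```python
import numpy as np
import faiss

d, n_tokens = 1024, 500_000                          # hidden size and indexed tokens (stand-ins)
xb = np.random.rand(n_tokens, d).astype("float32")   # placeholder for encoder hidden states

# An exact flat index scans every vector per query; an IVF index probes only a
# few clusters per query instead of scanning all vectors, at a small recall cost.
quantizer = faiss.IndexFlatIP(d)
index = faiss.IndexIVFFlat(quantizer, d, 1024, faiss.METRIC_INNER_PRODUCT)
index.train(xb)                                      # learn the coarse cluster centroids
index.add(xb)
index.nprobe = 8                                     # clusters visited per query

query = np.random.rand(1, d).astype("float32")       # placeholder for one decoder-side query
scores, ids = index.search(query, 16)                # scores and token ids of the top-16 keys
```

Whether an exact or approximate index is used, and whether it sits in GPU or CPU memory, is a deployment choice; the abstract notes that both memory placements are supported.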

Theoretically, recasting cross-attention over long inputs as a nearest-neighbor retrieval problem provides a novel perspective on managing transformer scalability. Moreover, by showing that 99% of the attention mass can be preserved through top-$k$ retrieval, the paper argues that $k$NN indices are a robust substitute for full attention over very long contexts.
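
A short formula sketch (notation assumed here, not copied from the paper) makes the argument explicit: a maximum-inner-product $k$NN index returns exactly the keys with the largest attention logits, so restricting the softmax to the retrieved set keeps the heaviest terms of the full attention sum.

```latex
% Full cross-attention of a decoder query q over all n encoder keys k_i and values v_i,
% versus the truncated sum over the retrieved top-k set S returned by the kNN index.
\mathrm{Attn}(q) \;=\; \sum_{i=1}^{n} \frac{e^{q^\top k_i}}{\sum_{j=1}^{n} e^{q^\top k_j}}\, v_i
\;\;\approx\;\;
\widetilde{\mathrm{Attn}}(q) \;=\; \sum_{i \in S} \frac{e^{q^\top k_i}}{\sum_{j \in S} e^{q^\top k_j}}\, v_i,
\qquad S = \text{indices of the $k$ largest } q^\top k_i .
```

When the retrieved keys carry 99% of the attention mass, as reported above, the two quantities are nearly identical.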

Future Developments in AI

The use of retrieval mechanisms like $k$NN for attention in Unlimiformer highlights potential future directions in AI research:

  • Cross-modal Retrieval: The principles could be extended to attention in multimodal transformers, enabling efficient processing of extensive datasets that include text and images.
  • Adaptive Retrieval: Future models may explore adaptive retrieval strategies that dynamically adjust the value of $k$ based on input characteristics, potentially optimizing performance further.
  • Memory-Augmented Models: Building on the efficiency of Unlimiformer, research could delve into hybrid models integrating both dense and sparse memory retrieval to encapsulate rich contextual information efficiently.

In conclusion, Unlimiformer represents a significant advancement in transformer models by enabling practically unlimited input lengths. Combined with its compatibility with pretrained models and its modest computational requirements, this marks a meaningful step toward overcoming the limitations of transformer architectures for long-range contextual processing.
