Since the proposal of transformers, these models have been limited to bounded input lengths, because of their need to attend to every token in the input. In this work, we propose Unlimiformer: a general approach that wraps any existing pretrained encoder-decoder transformer, and offloads the cross-attention computation to a single k-nearest-neighbor (kNN) index, while the returned kNN distances are the attention dot-product scores. This kNN index can be kept on either the GPU or CPU memory and queried in sub-linear time; this way, we can index practically unlimited input sequences, while every attention head in every decoder layer retrieves its top-k keys, instead of attending to every key. We evaluate Unlimiformer on several long-document and book-summarization benchmarks, showing that it can process even 500k token-long inputs from the BookSum dataset, without any input truncation at test time. We demonstrate that Unlimiformer improves pretrained models such as BART and Longformer by extending them to unlimited inputs without additional learned weights and without modifying their code. We make our code and models publicly available at https://github.com/abertsch72/unlimiformer .
The paper introduces Unlimiformer, a novel technique that allows transformers to process input sequences of unlimited length without increasing computational cost, by leveraging $k$-nearest-neighbor ($k$NN) retrieval for cross-attention computation.
Unlimiformer enhances existing encoder-decoder transformers by using $k$NN indices to retrieve and attend to the most relevant token embeddings, reducing the complexity of cross-attention and allowing for sub-linear time queries during decoding.
Empirical evaluations show that Unlimiformer significantly improves performance on long-document summarization tasks, scaling to input lengths of up to 500k tokens, and successfully integrates with pretrained models like BART without additional training.
The paper "Unlimiformer: Long-Range Transformers with Unlimited Length Input" introduces an approach for extending the input lengths that transformers can process, without the computational cost increase usually associated with longer inputs. Concretely, the authors introduce Unlimiformer, a method that leverages $k$-nearest-neighbor ($k$NN) retrieval to offload the cross-attention computation. This approach removes the limit on input length, both in principle and in practice, while maintaining the models' performance.
The primary contribution of this paper is the Unlimiformer technique, which enables unlimited input length via a $k$NN index. Rather than modifying the underlying architecture of existing transformers or requiring additional learned parameters, Unlimiformer wraps existing encoder-decoder transformers. It does this by:
Retrieval-based Cross-Attention: In a standard encoder-decoder transformer, every cross-attention head in the decoder attends to every encoded input token at each decoding step, so decoding cost grows with input length (and the encoder's self-attention grows quadratically with it), which bounds practical input lengths. Unlimiformer addresses this by using a $k$NN index to retrieve only the top-$k$ token embeddings relevant to the current decoding step, allowing attention over theoretically unlimited input sequences. Because the index uses an inner-product metric, the returned $k$NN distances are exactly the attention dot-product scores.
Encoding and Indexing: The approach begins by encoding overlapping chunks of the input sequence using the encoder, followed by constructing a $k$NN index over these encoded tokens. The attention heads in the decoder then utilize this index to perform sub-linear time queries, dynamically attending to different parts of the input as needed.
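The two steps above can be sketched together. The following is a minimal NumPy illustration, not the authors' implementation: the function names and the fixed-size chunking scheme are assumptions, and a flat array with exhaustive dot products stands in for the sub-linear kNN index.

```python
import numpy as np

def encode_and_index(tokens, encode_fn, chunk_len, stride):
    """Encode a long input as overlapping chunks and stack every token
    embedding into one flat 'index' that decoder heads can query.
    `encode_fn` maps a chunk of tokens to a (chunk_tokens, hidden_dim)
    array and stands in for a real encoder forward pass."""
    embeddings = []
    for start in range(0, max(1, len(tokens) - chunk_len + 1), stride):
        embeddings.append(encode_fn(tokens[start:start + chunk_len]))
    return np.concatenate(embeddings, axis=0)  # (total_tokens, hidden_dim)

def knn_cross_attention(query, keys, values, k):
    """One cross-attention head attending only to its top-k keys. Under an
    inner-product metric the retrieved kNN 'distances' are exactly the
    attention dot-product scores, so no rescoring is needed; exhaustive
    dot products stand in here for the sub-linear index lookup."""
    scores = keys @ query                    # dot-product attention scores
    topk = np.argpartition(scores, -k)[-k:]  # indices of the k best keys
    w = np.exp(scores[topk] - scores[topk].max())
    w /= w.sum()                             # softmax over retrieved keys only
    return w @ values[topk]
```

Setting `k` to the full index size recovers ordinary cross-attention, which is a useful sanity check; a real system would replace the flat array with a library such as FAISS to obtain sub-linear queries.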
The paper substantiates Unlimiformer's efficacy through comprehensive evaluation on long-document and book-summarization tasks. Notably, the method scales to inputs of up to 500k tokens without any truncation at inference time.
Evaluations on various long-document summarization datasets, including the BookSum dataset, show that Unlimiformer significantly enhances the performance of base models such as BART and Longformer. Key numerical results include:
The practical implications of this research are multifaceted:
Theoretically, using nearest-neighbor retrieval to reformulate cross-attention over large inputs provides a novel perspective on managing transformer scalability. Moreover, by showing that 99% of the attention mass can be preserved through top-$k$ retrieval, the paper affirms that $k$NN indices are a robust substitute for full attention in expansive contexts.
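The attention-mass claim is straightforward to measure numerically. The helper below is hypothetical (not from the paper's code); it computes the fraction of full softmax attention mass that the $k$ highest-scoring keys capture:

```python
import numpy as np

def topk_attention_mass(scores, k):
    """Fraction of the full softmax attention mass falling on the k
    highest-scoring keys. The paper's observation is that for trained
    models this fraction stays around 99% even when k is a tiny slice
    of a very long input."""
    p = np.exp(scores - scores.max())  # numerically stable softmax
    p /= p.sum()
    return np.sort(p)[-k:].sum()
```

Because softmax is dominated by its largest logits, a few strongly-matching keys absorb nearly all the mass, which is exactly what makes top-$k$ retrieval a faithful approximation of full attention.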
The use of retrieval mechanisms like $k$NN for attention in Unlimiformer highlights potential future directions in AI research:
In conclusion, Unlimiformer represents a significant advance for transformer models, enabling practically unlimited input lengths. This capability, combined with its compatibility with pretrained models and its modest computational requirements, marks meaningful progress in addressing transformers' limitations on long-range contextual processing.