- The paper introduces a kNN-augmented Transformer with external memory that extends model attention without requiring weight updates.
- It uses approximate nearest neighbor lookups to retrieve stored (key, value) pairs, ensuring scalability and efficient long-range processing.
- Experimental results demonstrate improved performance on tasks like long document processing, GitHub code analysis, and formal proofs as memory size increases.
Introduction
Transformers have demonstrated exceptional performance across fields such as natural language processing and mathematical reasoning. However, they are limited by the context length over which attention can be applied effectively. This paper proposes a way to extend that attention span without updating model weights, allowing a Transformer to memorize and retrieve information from far-distant parts of its input at inference time.
kNN-Augmented Transformers
The paper proposes a kNN-augmented Transformer architecture that incorporates an external, non-differentiable memory accessed via approximate k-nearest-neighbor (kNN) lookups. Attention layers retrieve the exact (key, value) pairs stored in memory (only the lookup itself is approximate), and no gradients are backpropagated into the memory, which is what makes the approach scale: keys and values computed on earlier training steps can be reused as-is for very large memories, drastically reducing computation.
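To make the mechanism concrete, below is a minimal NumPy sketch of kNN-augmented attention. It is an illustration under stated assumptions rather than the paper's implementation: the names `KNNMemory` and `memory_attention` are hypothetical, the lookup is an exact brute-force search standing in for an approximate kNN index (e.g. ScaNN or Faiss), and the fixed `gate` scalar stands in for the learned per-head gate the paper uses to combine local and memory attention.

```python
# Minimal sketch of kNN-augmented attention (illustrative, not the paper's code).
import numpy as np

class KNNMemory:
    """External, non-differentiable store of (key, value) pairs."""

    def __init__(self, dim: int):
        self.keys = np.empty((0, dim), dtype=np.float32)
        self.values = np.empty((0, dim), dtype=np.float32)

    def add(self, keys: np.ndarray, values: np.ndarray) -> None:
        # Keys/values from earlier context are appended as-is; since no
        # gradients flow into the memory, they never need to be recomputed.
        self.keys = np.concatenate([self.keys, keys], axis=0)
        self.values = np.concatenate([self.values, values], axis=0)

    def lookup(self, queries: np.ndarray, k: int):
        # Brute-force top-k by dot-product similarity; a real system would
        # delegate this step to an approximate nearest-neighbor index.
        scores = queries @ self.keys.T                       # (q, mem)
        top = np.argsort(-scores, axis=-1)[:, :k]            # (q, k)
        return self.keys[top], self.values[top]              # (q, k, dim)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def memory_attention(queries, local_keys, local_values, memory, k=8, gate=0.5):
    """Combine local attention with attention over retrieved memory pairs.

    `gate` is a fixed scalar here; the paper learns a per-head gate instead.
    """
    d = queries.shape[-1]
    # Standard local attention over the current context window.
    local_scores = softmax(queries @ local_keys.T / np.sqrt(d))
    local_out = local_scores @ local_values                   # (q, dim)

    # Attention over the k retrieved (key, value) pairs per query.
    mem_k, mem_v = memory.lookup(queries, k)                  # (q, k, dim)
    mem_scores = softmax(
        np.einsum("qd,qkd->qk", queries, mem_k) / np.sqrt(d))
    mem_out = np.einsum("qk,qkd->qd", mem_scores, mem_v)      # (q, dim)

    return gate * mem_out + (1.0 - gate) * local_out

# Toy usage: store keys/values from an earlier chunk, then attend to them
# while processing a later chunk.
rng = np.random.default_rng(0)
dim = 16
memory = KNNMemory(dim)
memory.add(rng.standard_normal((128, dim)).astype(np.float32),
           rng.standard_normal((128, dim)).astype(np.float32))
q = rng.standard_normal((4, dim)).astype(np.float32)
k_loc = rng.standard_normal((4, dim)).astype(np.float32)
v_loc = rng.standard_normal((4, dim)).astype(np.float32)
out = memory_attention(q, k_loc, v_loc, memory, k=8)
print(out.shape)  # (4, 16)
```

In practice the memory would be filled with the keys and values produced by the model itself as it processes earlier segments of a long document, so later segments can attend back to them without recomputation.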
Experimental Results
The architecture was evaluated on several long-context benchmarks, including long documents, GitHub code repositories, formal proofs, and mathematical papers. Across these tasks, performance improves as the size of the external memory grows: the model learns to look up and reuse earlier information, such as definitions and previously stated theorems, during inference. The gains also persist when models are scaled up significantly, indicating the technique remains effective for larger models.
Future Directions and Findings
The approach is attractive because it integrates with existing language models, retains its benefits at scale, and in some cases outperforms models with substantially more trainable parameters. Moreover, existing pre-trained models can be fine-tuned to use the external memory rather than being trained from scratch, which underscores the practicality of the method. Going forward, applying this capability to large knowledge bases and code repositories could open new avenues for AI applications and change how models store and retrieve information at scale.