- The paper introduces a kNN-augmented Transformer with external memory that extends model attention without requiring weight updates.
- It uses approximate nearest neighbor lookups to retrieve stored (key, value) pairs, ensuring scalability and efficient long-range processing.
- Experimental results demonstrate improved performance on tasks like long document processing, GitHub code analysis, and formal proofs as memory size increases.
Introduction
Transformers have demonstrated exceptional performance across fields such as natural language processing and mathematical reasoning. However, they are limited by the context length over which attention can be applied effectively. This paper proposes a way to extend that attention span without updating model weights, allowing a Transformer to memorize and retrieve information from far-distant parts of its input at inference time.
kNN-Augmented Transformers
The paper proposes a kNN-augmented Transformer architecture that incorporates an external, non-differentiable memory accessed via approximate k-nearest-neighbor (kNN) lookups. Attention layers retrieve the exact (key, value) pairs stored in memory (only the lookup itself is approximate), and no gradients are backpropagated into the memory, which is what makes the approach scale: keys and values computed on earlier training steps can be reused as-is for very large memories, drastically reducing computation.
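To make the mechanism concrete, below is a minimal NumPy sketch of kNN-augmented attention. It is an illustration under stated assumptions rather than the paper's implementation: the names `KNNMemory` and `memory_attention` are hypothetical, the lookup is an exact brute-force search standing in for an approximate kNN index (e.g. ScaNN or Faiss), and the fixed `gate` scalar stands in for the learned per-head gate the paper uses to combine local and memory attention.

```python
# Minimal sketch of kNN-augmented attention (illustrative, not the paper's code).
import numpy as np

class KNNMemory:
    """External, non-differentiable store of (key, value) pairs."""

    def __init__(self, dim: int):
        self.keys = np.empty((0, dim), dtype=np.float32)
        self.values = np.empty((0, dim), dtype=np.float32)

    def add(self, keys: np.ndarray, values: np.ndarray) -> None:
        # Keys/values from earlier context are appended as-is; since no
        # gradients flow into the memory, they never need to be recomputed.
        self.keys = np.concatenate([self.keys, keys], axis=0)
        self.values = np.concatenate([self.values, values], axis=0)

    def lookup(self, queries: np.ndarray, k: int):
        # Brute-force top-k by dot-product similarity; a real system would
        # delegate this step to an approximate nearest-neighbor index.
        scores = queries @ self.keys.T                       # (q, mem)
        top = np.argsort(-scores, axis=-1)[:, :k]            # (q, k)
        return self.keys[top], self.values[top]              # (q, k, dim)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def memory_attention(queries, local_keys, local_values, memory, k=8, gate=0.5):
    """Combine local attention with attention over retrieved memory pairs.

    `gate` is a fixed scalar here; the paper learns a per-head gate instead.
    """
    d = queries.shape[-1]
    # Standard local attention over the current context window.
    local_scores = softmax(queries @ local_keys.T / np.sqrt(d))
    local_out = local_scores @ local_values                   # (q, dim)

    # Attention over the k retrieved (key, value) pairs per query.
    mem_k, mem_v = memory.lookup(queries, k)                  # (q, k, dim)
    mem_scores = softmax(
        np.einsum("qd,qkd->qk", queries, mem_k) / np.sqrt(d))
    mem_out = np.einsum("qk,qkd->qd", mem_scores, mem_v)      # (q, dim)

    return gate * mem_out + (1.0 - gate) * local_out

# Toy usage: store keys/values from an earlier chunk, then attend to them
# while processing a later chunk.
rng = np.random.default_rng(0)
dim = 16
memory = KNNMemory(dim)
memory.add(rng.standard_normal((128, dim)).astype(np.float32),
           rng.standard_normal((128, dim)).astype(np.float32))
q = rng.standard_normal((4, dim)).astype(np.float32)
k_loc = rng.standard_normal((4, dim)).astype(np.float32)
v_loc = rng.standard_normal((4, dim)).astype(np.float32)
out = memory_attention(q, k_loc, v_loc, memory, k=8)
print(out.shape)  # (4, 16)
```

In practice the memory would be filled with the keys and values produced by the model itself as it processes earlier segments of a long document, so later segments can attend back to them without recomputation.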
Experimental Results
The architecture was evaluated on several long-context benchmarks, including long documents, GitHub code repositories, formal proofs, and mathematical papers. Across these tasks, performance improves as the size of the external memory grows: the model learns to look up and reuse earlier information, such as definitions and previously stated theorems, during inference. The gains also persist when models are scaled up significantly, indicating the technique remains effective for larger models.
Future Directions and Findings
The approach is attractive because it integrates with existing language models, retains its benefits at scale, and in some cases outperforms models with substantially more trainable parameters. Moreover, existing pre-trained models can be fine-tuned to use the external memory rather than being trained from scratch, which underscores the practicality of the method. Going forward, applying this capability to large knowledge bases and code repositories could open new avenues for AI applications and change how models store and retrieve information at scale.