
Memorizing Transformers

(2203.08913)
Published Mar 16, 2022 in cs.LG, cs.AI, and cs.CL

Abstract

Language models typically need to be trained or finetuned in order to acquire new knowledge, which involves updating their weights. We instead envision language models that can simply read and memorize new data at inference time, thus acquiring new knowledge immediately. In this work, we extend language models with the ability to memorize the internal representations of past inputs. We demonstrate that an approximate kNN lookup into a non-differentiable memory of recent (key, value) pairs improves language modeling across various benchmarks and tasks, including generic webtext (C4), math papers (arXiv), books (PG-19), code (Github), as well as formal theorems (Isabelle). We show that the performance steadily improves when we increase the size of memory up to 262K tokens. On benchmarks including code and mathematics, we find that the model is capable of making use of newly defined functions and theorems during test time.

Overview

  • Proposes a kNN-augmented Transformer that extends the effective attention span via external memory, acquiring new information at inference time without weight updates.

  • Introduces a non-differentiable external memory of (key, value) pairs from distant context, accessed via approximate kNN lookups.

  • Demonstrates that increased external memory size enhances model performance on various tasks.

  • Improvements persist as models are scaled up, and in some cases the memory-augmented model outperforms larger models with more trainable parameters.

  • Pre-trained models can be fine-tuned to adopt this external memory, showing practical utility for AI applications.

Introduction

Transformers have delivered strong results in fields such as natural language processing and mathematical reasoning, but they are limited by the attention span, or context length, they can handle effectively. This paper proposes a way to extend that attention span without updating model weights, allowing Transformers to access and memorize far-distant data at inference time.

kNN-Augmented Transformers

The paper proposes a kNN-augmented Transformer architecture that incorporates an external, non-differentiable memory holding exact (key, value) pairs from earlier in the input, accessed via approximate k-nearest-neighbor (kNN) lookups. Gradients are not backpropagated into this memory, which is what makes the approach scalable: keys and values computed on prior training steps can be reused for large memories instead of being recomputed, drastically reducing computation.
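
To make the mechanism concrete, below is a minimal sketch (not the authors' implementation) of a single attention head that combines local causal attention with top-k retrieval from an external memory. The function names, the scalar gate, and the brute-force top-k search standing in for approximate kNN are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def knn_memory_attention(q, local_k, local_v, mem_k, mem_v, gate, top_k=32):
    """q, local_k, local_v: (T, d); mem_k, mem_v: (M, d) with M >= top_k; gate: scalar in [0, 1]."""
    d = q.shape[-1]

    # Standard causal self-attention over the local context window.
    local_scores = q @ local_k.T / np.sqrt(d)                       # (T, T)
    causal_mask = np.triu(np.full_like(local_scores, -1e9), k=1)
    local_out = softmax(local_scores + causal_mask) @ local_v       # (T, d)

    # kNN lookup into the external memory. Brute-force top-k by dot product
    # stands in for the approximate search used at scale; no gradients flow
    # into the stored (key, value) pairs.
    mem_scores = q @ mem_k.T / np.sqrt(d)                           # (T, M)
    top_idx = np.argpartition(-mem_scores, top_k - 1, axis=-1)[:, :top_k]
    top_scores = np.take_along_axis(mem_scores, top_idx, axis=-1)   # (T, top_k)
    top_vals = mem_v[top_idx]                                       # (T, top_k, d)
    mem_out = np.einsum('tk,tkd->td', softmax(top_scores), top_vals)

    # A learned gate (here a fixed scalar for simplicity) interpolates
    # between retrieved-memory attention and local attention.
    return gate * mem_out + (1.0 - gate) * local_out
```

Because the memory is excluded from backpropagation, the (key, value) pairs it stores can be computed once and cached across steps, which is what keeps very large memories affordable.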

Experimental Results

The modified architecture was evaluated across a range of benchmarks, including long documents (PG-19), GitHub code repositories, formal proofs (Isabelle), and mathematical papers (arXiv). The findings show that language-modeling performance improves as the size of the external memory grows, with the model learning to look up and reuse prior information such as definitions and theorems at inference time. These gains persist when models are scaled up substantially, indicating the technique remains effective for larger models.
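
As a complement to the attention sketch above, here is a hedged sketch of how such a memory might be populated while a long document is processed in chunks at inference time. `KVMemory`, `encode_chunk`, the chunk contents, and the head dimension are illustrative placeholders; the 262K capacity is taken from the largest memory size reported in the paper.

```python
import numpy as np

class KVMemory:
    """FIFO store of (key, value) pairs; the oldest entries are evicted at capacity."""
    def __init__(self, capacity=262144, d=64):
        self.keys = np.empty((0, d))
        self.values = np.empty((0, d))
        self.capacity = capacity

    def add(self, new_keys, new_values):
        self.keys = np.concatenate([self.keys, new_keys])[-self.capacity:]
        self.values = np.concatenate([self.values, new_values])[-self.capacity:]

def encode_chunk(token_ids, d=64):
    # Placeholder encoder: a real model would produce per-token keys and values
    # from its attention layers; random vectors stand in here.
    rng = np.random.default_rng(len(token_ids))
    return rng.normal(size=(len(token_ids), d)), rng.normal(size=(len(token_ids), d))

memory = KVMemory()
# e.g. a chunk that introduces a definition, then a later chunk that uses it
for chunk in ([11, 12, 13, 14], [21, 22, 23, 24]):
    keys, values = encode_chunk(chunk)
    # Queries for this and later chunks can retrieve previously stored pairs
    # via the kNN lookup sketched earlier; the new pairs are then appended.
    memory.add(keys, values)
```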

Future Directions and Findings

The approach integrates readily with existing LLMs, maintains its improvements as models scale, and in certain instances outperforms models with far more trainable parameters. Moreover, existing pre-trained models can be fine-tuned to use the external memory, which underscores the practicality of the approach. Going forward, the ability to work with extensive knowledge bases and code repositories could open new pathways for AI applications and change how models access and memorize information at scale.
