Extended Mind Transformers

(2406.02332)
Published Jun 4, 2024 in cs.LG and cs.CL

Abstract

Pre-trained language models demonstrate general intelligence and common sense, but long inputs quickly become a bottleneck for memorizing information at inference time. We resurface a simple method, Memorizing Transformers (Wu et al., 2022), that gives the model access to a bank of pre-computed memories. We show that it is possible to fix many of the shortcomings of the original method, such as the need for fine-tuning, by critically assessing how positional encodings should be updated for the keys and values retrieved. This intuitive method uses the model's own key/query system to select and attend to the most relevant memories at each generation step, rather than using external embeddings. We demonstrate the importance of external information being retrieved in a majority of decoder layers, contrary to previous work. We open source a new counterfactual long-range retrieval benchmark, and show that Extended Mind Transformers outperform today's state of the art by 6% on average.

Figure: Fact retrieval accuracy across different document lengths for Extended Mind Llama-2-70b, RAG, and baselines.

Overview

  • The paper 'Extended Mind Transformers' addresses limitations in current pre-trained language models, particularly in handling long inputs and retrieving specialized information.

  • The authors introduce new positional encoding strategies and a method to incorporate external information into multiple decoder layers, resulting in improved performance over existing state-of-the-art techniques.

  • Experimental results demonstrate lower perplexity and improved efficiency and scalability, with practical applications ranging from research databases to healthcare.

Extended Mind Transformers: An Essay

Introduction

The paper "Extended Mind Transformers" by Phoebe Klett and Thomas Ahle at Normal Computing addresses critical shortcomings in the current use of pre-trained language models, particularly their handling of long inputs and retrieval of specialized or topical information. The research revisits and revitalizes the method proposed in Memorizing Transformers (Wu et al., 2022), refining it to overcome limitations such as the need for fine-tuning. Klett and Ahle introduce new positional encoding strategies and demonstrate the necessity of incorporating retrieved external information across a majority of decoder layers. The proposed method, termed Extended Mind Transformers, shows a significant performance improvement over existing state-of-the-art (SoTA) techniques.

Problem Statement and Objectives

Pre-trained language models exhibit remarkable general intelligence, yet they struggle with long inputs at inference time, leading to inefficiencies in both the attention mechanism and memory retrieval. The problem is decomposed into three sub-problems: extending the maximum input sequence length, improving the efficiency of the attention mechanism, and improving performance through effective memorization and retrieval of relevant information. The authors survey existing solutions to these sub-problems and situate their approach within this line of research.

Methodology

The methodology behind Extended Mind Transformers centers on leveraging the model's own key/query system to retrieve and attend to relevant memories at each generation step. By carefully updating the positional encodings applied to the retrieved keys and values, the authors mitigate the shortcomings of the original method. The key methodological building blocks include:

  1. Extended Sequence Length: Incorporating methodologies like ALiBi and rotary position embeddings to extend context windows without additional fine-tuning.
  2. Approximate Attention: Utilizing sparse attention factorizations and hardware-aware improvements to reduce computational cost.
  3. Retrieval Augmentations: Building on mechanisms such as Neural Turing Machines and Memory Networks, refined for the transformer architecture in methods like kNN-LM and RAG.

The proposed Extended Mind Transformers integrate external memories into the decoder-only transformer architecture, bypassing the need for fine-tuning. The key contributions include improvements in pre-computation and retrieval of external memories, strategic augmentation of decoder layers, and refined positional encoding methodologies that support long-range attention natively.
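To make this concrete, below is a minimal, single-head sketch of retrieval-augmented attention: memory keys and values are pre-computed once for the long document, each query retrieves its top-k most similar memory keys, and attention is computed jointly over the retrieved memories and the local context. The conventions here are assumptions for illustration (position-free queries for retrieval, a joint softmax rather than a learned gate, rotary embeddings applied only to the local branch); the function names and the exact positional-encoding rule are not the authors' implementation.

```python
import torch

def rotary(x, positions, base=10000.0):
    """Apply rotary position embeddings to x (seq, dim) at integer positions (seq,)."""
    dim = x.shape[-1]
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    angles = positions[:, None].float() * inv_freq[None, :]      # (seq, dim/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def extended_attention(q, k, v, mem_k, mem_v, positions, topk=8):
    """Single-head causal attention over local context plus retrieved memories.

    q, k, v      : (seq, dim)   local queries / keys / values
    mem_k, mem_v : (n_mem, dim) pre-computed memory keys / values (cached once)
    positions    : (seq,)       absolute positions of the local tokens
    """
    seq, dim = q.shape
    scale = dim ** -0.5

    # Local branch: standard rotary attention with a causal mask.
    q_rot, k_rot = rotary(q, positions), rotary(k, positions)
    local_scores = (q_rot @ k_rot.T) * scale                      # (seq, seq)
    causal = torch.tril(torch.ones(seq, seq, dtype=torch.bool))
    local_scores = local_scores.masked_fill(~causal, float("-inf"))

    # Memory branch: each query retrieves its top-k cached keys by inner product.
    # Assumed convention: raw (position-free) queries, so absolute positions
    # never distort retrieval of distant memories.
    sims = q @ mem_k.T                                            # (seq, n_mem)
    top = sims.topk(min(topk, mem_k.shape[0]), dim=-1).indices    # (seq, topk)
    sel_k, sel_v = mem_k[top], mem_v[top]                         # (seq, topk, dim)
    mem_scores = torch.einsum("sd,skd->sk", q, sel_k) * scale     # (seq, topk)

    # Joint softmax over retrieved memories and local tokens.
    weights = torch.cat([mem_scores, local_scores], dim=-1).softmax(dim=-1)
    w_mem, w_loc = weights[:, : top.shape[1]], weights[:, top.shape[1]:]
    return torch.einsum("sk,skd->sd", w_mem, sel_v) + w_loc @ v
```

In the full method, retrieval-augmented attention of this kind is applied in a majority of decoder layers, and the cached memory keys and values are produced by a single forward pass over the long document.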

Experimental Results

The authors present a variety of experiments to validate their approach:

  • Perplexity Experiments: On sequences of increasing length drawn from the WikiText-103 dataset, Extended Mind Transformers built on both MPT-7b and Llama-2-7b achieve lower perplexity than the baselines, particularly for long input sequences.
  • Counterfactual Retrieval Experiments: On a modified wikiQA benchmark, Extended Mind Transformers are evaluated against fine-tuned models and composite RAG methods, outperforming existing models, particularly in scenarios involving long documents.
  • Inference Times: Extended Mind Transformers offer substantial time savings on tasks that issue multiple queries over a lengthy document, quickly amortizing the upfront cost of generating the cached key-value pairs (see the sketch after this list).
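The amortization argument can be illustrated with a rough cost model in terms of tokens pushed through the model; the figures below are assumptions chosen for illustration, not measurements from the paper.

```python
# Assumed workload: one long document, many questions asked against it.
doc_tokens = 50_000      # length of the long document (assumption)
query_tokens = 200       # tokens per question prompt (assumption)
num_queries = 25         # questions asked about the same document (assumption)

# Naive long-context approach: re-process the full document with every query.
naive = num_queries * (doc_tokens + query_tokens)

# Extended-mind approach: one forward pass caches the document's key-value
# memories; each query then processes only its own tokens, attending to the
# cache via top-k retrieval instead of the full document.
cached = doc_tokens + num_queries * query_tokens

print(f"naive:  {naive:,} tokens processed")   # 1,255,000
print(f"cached: {cached:,} tokens processed")  #    55,000
# Under these assumptions the one-time caching cost roughly equals a single
# naive pass, so it is recouped after the first query; every additional query
# is nearly free by comparison.
```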

Theoretical and Practical Implications

The theoretical contributions of Extended Mind Transformers are substantial. By resolving the memory-staleness problem of Memorizing Transformers and eliminating the need for fine-tuning, the proposed method enhances the scalability and flexibility of transformers in handling long sequences and incorporating external information dynamically. This advancement parallels the shift from LSTMs to transformers, marking a progression towards more intuitive and efficient models.

Practically, the research implies far-reaching improvements in applications that require dynamic retrieval of information, such as research databases, legal document analysis, and advanced querying systems. Furthermore, the introduction of causal citations and active-learning-based generation can significantly enhance the explainability and reliability of AI models, which is crucial for sensitive applications like healthcare and finance.

Future Directions

The research by Klett and Ahle opens avenues for further developments in AI, such as:

  • Refining positional encoding strategies to further optimize how memories are utilized.
  • Enhancing retrieval mechanisms by integrating more sophisticated vector databases and downstream applications.
  • Expanding active-learning-based generation techniques to mitigate hallucinations and improve generation certainty.

Conclusion

Extended Mind Transformers represent a significant step forward in the domain of memory-augmented neural networks. By addressing the intrinsic limitations of existing retrieval and attention mechanisms, the contributions of Klett and Ahle pave the way for more efficient and accurate language models capable of handling extensive and dynamic input sequences. Their open-sourced benchmarks and methodological advancements should encourage further research and practical adoption, benefiting a broad spectrum of AI applications.
