HMT: Hierarchical Memory Transformer for Efficient Long Context Language Processing (2405.06067v3)
Abstract: Transformer-based large language models (LLMs) have been widely used in language processing applications. However, due to device memory constraints, most of them restrict the context window. Even though recurrent models in previous works can memorize past tokens to enable unlimited context and maintain effectiveness, they have "flat" memory architectures, which are limited in how they select and filter information. Since humans are good at learning and self-adjustment, we believe that imitating the brain's memory hierarchy is beneficial for model memorization. Thus, we propose the Hierarchical Memory Transformer (HMT), a novel framework that facilitates a model's long-context processing ability by imitating human memorization behavior. Leveraging memory-augmented segment-level recurrence, we organize the memory hierarchy by preserving tokens from early input segments, passing memory embeddings along the sequence, and recalling relevant information from history. Evaluating on general language modeling, question-answering, and summarization tasks, we show that HMT consistently improves the long-context processing ability of existing models. Furthermore, HMT achieves comparable or superior generation quality to long-context LLMs with $2 \sim 57\times$ fewer parameters and $2.5 \sim 116\times$ less inference memory, significantly outperforming previous memory-augmented models. Code on GitHub: https://github.com/OswaldHe/HMT-pytorch.
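To make the abstract's description of memory-augmented segment-level recurrence concrete, below is a minimal PyTorch sketch of the general idea: split a long input into segments, carry a memory embedding from segment to segment, and recall relevant past memories by attending over a cache of earlier memory embeddings. This is a hypothetical simplification for illustration only, not the HMT implementation (see the linked repository for that); the class name `SegmentRecurrentLM`, the `recall`/`to_memory` modules, and the generic `backbone` interface are all invented assumptions.

```python
import torch
import torch.nn as nn


class SegmentRecurrentLM(nn.Module):
    """Toy memory-augmented segment-level recurrence (illustrative only).

    Assumes `backbone` is any module mapping (batch, length, d_model)
    -> (batch, length, d_model), e.g. an nn.TransformerEncoder.
    """

    def __init__(self, backbone: nn.Module, d_model: int,
                 segment_len: int = 512, num_heads: int = 8):
        super().__init__()
        self.backbone = backbone
        self.segment_len = segment_len
        # Cross-attention used to recall relevant cached memories (d_model must divide by num_heads).
        self.recall = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.to_memory = nn.Linear(d_model, d_model)

    def forward(self, embeds: torch.Tensor) -> torch.Tensor:
        # embeds: (batch, total_len, d_model) token embeddings of the full sequence.
        outputs, memory_cache = [], []
        memory = torch.zeros(embeds.size(0), 1, embeds.size(-1), device=embeds.device)

        for segment in embeds.split(self.segment_len, dim=1):
            if memory_cache:
                # Recall: query the cache of past memory embeddings with the current memory.
                cache = torch.cat(memory_cache, dim=1)
                recalled, _ = self.recall(memory, cache, cache)
                memory = memory + recalled

            # Prepend the (recalled) memory slot to the segment and run the backbone.
            hidden = self.backbone(torch.cat([memory, segment], dim=1))
            memory = self.to_memory(hidden[:, :1])   # summarize the segment into a new memory
            memory_cache.append(memory.detach())     # preserve it for later recall
            outputs.append(hidden[:, 1:])            # drop the memory slot from the output

        return torch.cat(outputs, dim=1)


if __name__ == "__main__":
    # Quick shape check with a small Transformer encoder as the backbone.
    layer = nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True)
    backbone = nn.TransformerEncoder(layer, num_layers=2)
    model = SegmentRecurrentLM(backbone, d_model=256, segment_len=128)
    out = model(torch.randn(2, 1024, 256))
    print(out.shape)  # torch.Size([2, 1024, 256])
```

The key design point mirrored from the abstract is that per-segment compute stays constant while long-range information flows only through the recurrent memory embedding and the recall attention over cached memories, rather than through full attention over the entire context.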