HMT: Hierarchical Memory Transformer for Efficient Long Context Language Processing (2405.06067v3)
Abstract: Transformer-based large language models (LLMs) have been widely used in language processing applications. However, due to device memory constraints, most of them restrict the context window. Even though recurrent models in previous works can memorize past tokens to enable unlimited context and maintain effectiveness, they have "flat" memory architectures, which are limited in how they select and filter information. Since humans are good at learning and self-adjustment, we believe that imitating the brain's memory hierarchy is beneficial for model memorization. Thus, we propose the Hierarchical Memory Transformer (HMT), a novel framework that facilitates a model's long-context processing ability by imitating human memorization behavior. Leveraging memory-augmented segment-level recurrence, we organize the memory hierarchy by preserving tokens from early input segments, passing memory embeddings along the sequence, and recalling relevant information from history. Evaluating on general language modeling, question-answering, and summarization tasks, we show that HMT consistently improves the long-context processing ability of existing models. Furthermore, HMT achieves comparable or superior generation quality to long-context LLMs with $2 \sim 57\times$ fewer parameters and $2.5 \sim 116\times$ less inference memory, significantly outperforming previous memory-augmented models. Code on GitHub: https://github.com/OswaldHe/HMT-pytorch.
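To make the abstract's description of memory-augmented segment-level recurrence concrete, below is a minimal PyTorch sketch of the general idea: split a long input into segments, carry a memory embedding from segment to segment, and recall relevant past memories by attending over a cache of earlier memory embeddings. This is a hypothetical simplification for illustration only, not the HMT implementation (see the linked repository for that); the class name `SegmentRecurrentLM`, the `recall`/`to_memory` modules, and the generic `backbone` interface are all invented assumptions.

```python
import torch
import torch.nn as nn


class SegmentRecurrentLM(nn.Module):
    """Toy memory-augmented segment-level recurrence (illustrative only).

    Assumes `backbone` is any module mapping (batch, length, d_model)
    -> (batch, length, d_model), e.g. an nn.TransformerEncoder.
    """

    def __init__(self, backbone: nn.Module, d_model: int,
                 segment_len: int = 512, num_heads: int = 8):
        super().__init__()
        self.backbone = backbone
        self.segment_len = segment_len
        # Cross-attention used to recall relevant cached memories (d_model must divide by num_heads).
        self.recall = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.to_memory = nn.Linear(d_model, d_model)

    def forward(self, embeds: torch.Tensor) -> torch.Tensor:
        # embeds: (batch, total_len, d_model) token embeddings of the full sequence.
        outputs, memory_cache = [], []
        memory = torch.zeros(embeds.size(0), 1, embeds.size(-1), device=embeds.device)

        for segment in embeds.split(self.segment_len, dim=1):
            if memory_cache:
                # Recall: query the cache of past memory embeddings with the current memory.
                cache = torch.cat(memory_cache, dim=1)
                recalled, _ = self.recall(memory, cache, cache)
                memory = memory + recalled

            # Prepend the (recalled) memory slot to the segment and run the backbone.
            hidden = self.backbone(torch.cat([memory, segment], dim=1))
            memory = self.to_memory(hidden[:, :1])   # summarize the segment into a new memory
            memory_cache.append(memory.detach())     # preserve it for later recall
            outputs.append(hidden[:, 1:])            # drop the memory slot from the output

        return torch.cat(outputs, dim=1)


if __name__ == "__main__":
    # Quick shape check with a small Transformer encoder as the backbone.
    layer = nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True)
    backbone = nn.TransformerEncoder(layer, num_layers=2)
    model = SegmentRecurrentLM(backbone, d_model=256, segment_len=128)
    out = model(torch.randn(2, 1024, 256))
    print(out.shape)  # torch.Size([2, 1024, 256])
```

The key design point mirrored from the abstract is that per-segment compute stays constant while long-range information flows only through the recurrent memory embedding and the recall attention over cached memories, rather than through full attention over the entire context.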