Augmenting Language Models with Long-Term Memory (2306.07174v1)

Published 12 Jun 2023 in cs.CL

Abstract: Existing LLMs can only afford fixed-size inputs due to the input length limit, preventing them from utilizing rich long-context information from past inputs. To address this, we propose a framework, Language Models Augmented with Long-Term Memory (LongMem), which enables LLMs to memorize long history. We design a novel decoupled network architecture with the original backbone LLM frozen as a memory encoder and an adaptive residual side-network as a memory retriever and reader. Such a decoupled memory design can easily cache and update long-term past contexts for memory retrieval without suffering from memory staleness. Enhanced with memory-augmented adaptation training, LongMem can thus memorize long past context and use long-term memory for language modeling. The proposed memory retrieval module can handle unlimited-length context in its memory bank to benefit various downstream tasks. Typically, LongMem can enlarge the long-form memory to 65k tokens and thus cache many-shot extra demonstration examples as long-form memory for in-context learning. Experiments show that our method outperforms strong long-context models on ChapterBreak, a challenging long-context modeling benchmark, and achieves remarkable improvements on memory-augmented in-context learning over LLMs. The results demonstrate that the proposed method is effective in helping LLMs to memorize and utilize long-form contents. Our code is open-sourced at https://aka.ms/LongMem.
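
To make the decoupled-memory idea in the abstract concrete, the sketch below shows a frozen-encoder-style memory bank that caches key/value states from past context and a retriever/reader that fetches and attends over the top-k matches for a new query. This is not the released LongMem implementation: the class names, shapes, and the simple dot-product top-k retrieval here are illustrative assumptions (LongMem itself caches attention key-value pairs at chunk level and fuses retrieved memory inside a residual side-network).

```python
# Minimal sketch (assumed, not the official LongMem code) of cached long-term
# memory with retrieval-based reading.
import numpy as np


class MemoryBank:
    """Caches (key, value) states produced for past context chunks."""

    def __init__(self, dim: int):
        self.keys = np.empty((0, dim), dtype=np.float32)
        self.values = np.empty((0, dim), dtype=np.float32)

    def cache(self, keys: np.ndarray, values: np.ndarray) -> None:
        # Append states for a newly processed past chunk (the "memory encoder"
        # role the abstract assigns to the frozen backbone LLM).
        self.keys = np.vstack([self.keys, keys])
        self.values = np.vstack([self.values, values])

    def retrieve(self, query: np.ndarray, k: int = 4):
        # Dot-product similarity against all cached keys, return top-k pairs.
        scores = self.keys @ query
        top = np.argsort(scores)[::-1][:k]
        return self.keys[top], self.values[top]


def read_memory(query: np.ndarray, mem_keys: np.ndarray, mem_values: np.ndarray) -> np.ndarray:
    # Attend over the retrieved memory: softmax over similarities, then a
    # weighted sum of values (the "memory reader" role of the side-network).
    scores = mem_keys @ query
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ mem_values


if __name__ == "__main__":
    dim = 8
    rng = np.random.default_rng(0)
    bank = MemoryBank(dim)
    for _ in range(2):  # cache states for two past chunks
        bank.cache(rng.standard_normal((16, dim)).astype(np.float32),
                   rng.standard_normal((16, dim)).astype(np.float32))
    q = rng.standard_normal(dim).astype(np.float32)
    k_ret, v_ret = bank.retrieve(q, k=4)
    print(read_memory(q, k_ret, v_ret).shape)  # (8,)
```

Because the memory bank is only appended to and read from (never backpropagated through), cached entries stay valid as the reader is trained, which is the property the abstract refers to as avoiding memory staleness.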
