
Extending Context Window of Large Language Models via Positional Interpolation

(2306.15595)
Published Jun 27, 2023 in cs.CL, cs.AI, and cs.LG

Abstract

We present Position Interpolation (PI), which extends the context window sizes of RoPE-based pretrained LLMs such as LLaMA models to up to 32768 tokens with minimal fine-tuning (within 1000 steps), while demonstrating strong empirical results on various tasks that require long context, including passkey retrieval, language modeling, and long document summarization, on models from LLaMA 7B to 65B. Meanwhile, models extended by Position Interpolation preserve quality relatively well on tasks within their original context window. To achieve this, Position Interpolation linearly down-scales the input position indices to match the original context window size, rather than extrapolating beyond the trained context length, which may lead to catastrophically high attention scores that completely ruin the self-attention mechanism. Our theoretical study shows that the upper bound of interpolation is at least $\sim 600 \times$ smaller than that of extrapolation, further demonstrating its stability. Models extended via Position Interpolation retain their original architecture and can reuse most pre-existing optimization and infrastructure.

Position Interpolation expands LLaMA's context window faster than direct fine-tuning, achieving full size in 100 steps.

Overview

  • This paper introduces a method called Position Interpolation (PI) for extending the context window sizes of LLMs like LLaMA, enabling them to process up to 32768 tokens with minimal fine-tuning.

  • PI linearly down-scales input position indices so that longer inputs map back into the trained position range, interpolating RoPE position encodings between neighboring integer positions; this lets extended models handle longer contexts far more stably than direct extrapolation.

  • Experimental results show that LLaMA models extended with PI demonstrate improved performance on long-context tasks and preserve their performance on shorter tasks, all with negligible cost compared to pre-training.

  • The paper suggests that PI's ability to extend context windows without extensive retraining or architectural changes makes it a versatile tool for expanding the use of LLMs in processing long documents or conducting extended conversations.

Efficient Extension of Context Window Sizes in Pretrained LLMs via Position Interpolation

Introduction

Expanding the context window of LLMs, including the increasingly used LLaMA models, presents a computational and logistical challenge, particularly for applications that require processing long sequences. Traditional approaches to enlarging the context window involve extensive retraining and often substantial compute resources. This paper introduces Position Interpolation (PI), a novel approach that extends the context window of RoPE-based pretrained LLMs, such as LLaMA, to lengths of up to 32768 tokens with minimal fine-tuning. Remarkably, models extended with PI perform well on tasks demanding long context, including language modeling and document summarization, while effectively preserving performance on tasks within the original context limit.

Methodology

Extended Context via Position Interpolation (PI)

The essence of PI lies in down-scaling input position indices to fit within the pre-existing context window limits of a model, thus bypassing the limitations of direct extrapolation methods, which have been shown to result in unstable attention scores. By interpolating the position encodings at neighboring integer positions, PI ensures that extended models can adapt to longer contexts with greater stability. This method retains the original architecture of the models, allowing for most pre-existing optimizations and infrastructure to be reused effectively.
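
As a concrete illustration (a minimal sketch, not the authors' code), the snippet below folds linear Position Interpolation into a standard RoPE table: for an original (trained) window of L positions and an extended window of L' > L, each position index m is rescaled to m · L / L', so every rescaled index stays inside the trained range. The function name and signature are illustrative.

```python
import torch

def rope_cos_sin(head_dim: int, num_positions: int,
                 orig_window: int, extended_window: int,
                 base: float = 10000.0):
    """Cos/sin tables for RoPE with linear Position Interpolation.

    Each position index m is rescaled to m * orig_window / extended_window,
    so all rescaled indices stay inside the range seen during pre-training.
    Setting extended_window == orig_window recovers vanilla RoPE.
    """
    # Standard RoPE inverse frequencies, one per pair of dimensions.
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    # Linear interpolation of position indices: m -> m * L / L'.
    scale = orig_window / extended_window
    positions = torch.arange(num_positions).float() * scale
    angles = torch.outer(positions, inv_freq)  # (num_positions, head_dim // 2)
    return angles.cos(), angles.sin()

# Example: extend a 2048-token trained window to 8192 tokens (indices scaled by 1/4).
cos, sin = rope_cos_sin(head_dim=128, num_positions=8192,
                        orig_window=2048, extended_window=8192)
```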

Theoretical Underpinnings and Empirical Validation

The theoretical analysis of PI shows that the upper bound on interpolated attention scores is substantially smaller (about 600 times smaller in the setting of the LLaMA 7B model) than the corresponding bound for extrapolated attention scores, establishing the stability of the method. Empirically, LLaMA models extended via PI achieve lower perplexity on long-context tasks, corroborating the theoretical analysis.
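
The intuition can be stated in the standard RoPE formulation: attention scores depend only on the relative position of query and key, and down-scaling indices by L/L' compresses all relative distances back into the range [0, L) seen during pre-training rather than pushing them into an untrained regime. A sketch of the relevant identities, with L the original window and L' the extended one:

```latex
% RoPE attention scores depend only on the relative position m - n:
a(m, n) \;=\; \operatorname{Re}\,\bigl\langle f(\mathbf{q}, m),\, f(\mathbf{k}, n) \bigr\rangle
       \;=\; g(\mathbf{q}, \mathbf{k},\, m - n)

% Position Interpolation replaces each index m by m L / L'
% (L = original window, L' = extended window):
f'(\mathbf{x}, m) \;=\; f\!\left(\mathbf{x}, \frac{m L}{L'}\right)

% so every relative distance is compressed back into the trained range:
\frac{(m - n)\, L}{L'} \;<\; L \qquad \text{for } 0 \le n \le m < L'
```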

Experimental Results

Extending the LLaMA models' context window to sizes of up to 32768 tokens via PI requires only minimal fine-tuning (about 1000 steps) on the Pile dataset, incurring negligible cost compared to pre-training. Across the various extended context window sizes, the resulting models not only handle tasks that require long context but also largely maintain their performance on tasks designed for shorter contexts.
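
In practice, the same linear down-scaling is exposed in recent versions of Hugging Face transformers (4.31 and later) as a `rope_scaling` option on LLaMA-style configs. The sketch below is illustrative rather than the authors' training setup, and the checkpoint path is a placeholder.

```python
from transformers import AutoConfig, AutoModelForCausalLM

checkpoint = "path/to/llama-7b"  # placeholder; substitute a real LLaMA-style model

config = AutoConfig.from_pretrained(checkpoint)
# Linear RoPE scaling implements Position Interpolation: position indices are
# divided by `factor`, here extending a 2048-token window to 8192 tokens.
config.rope_scaling = {"type": "linear", "factor": 4.0}

model = AutoModelForCausalLM.from_pretrained(checkpoint, config=config)
# The model would then be fine-tuned for a modest number of steps on long
# sequences (e.g. from the Pile) to adapt to the extended window.
```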

In particular, the extended models achieved markedly lower perplexity on long-context language modeling and competitive performance on long document summarization when evaluated against established benchmarks. A synthetic passkey retrieval task further demonstrated that models extended via PI rapidly adapt to longer sequences during fine-tuning, suggesting they can effectively use the extended context window.
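
To give a sense of the passkey retrieval setup, here is a simplified sketch of how such a synthetic prompt can be constructed; the exact template in the paper may differ, and the filler sentence and function name are illustrative.

```python
import random

def make_passkey_prompt(n_filler_repeats: int = 400, key_digits: int = 5):
    """Hide a random numeric passkey inside long filler text and ask for it back."""
    passkey = "".join(random.choices("0123456789", k=key_digits))
    filler = "The grass is green. The sky is blue. The sun is yellow. " * n_filler_repeats
    # Insert the passkey sentence at a random depth within the filler text.
    cut = random.randrange(len(filler))
    prompt = (
        filler[:cut]
        + f" The pass key is {passkey}. Remember it. "
        + filler[cut:]
        + "\nWhat is the pass key? The pass key is"
    )
    return prompt, passkey

prompt, passkey = make_passkey_prompt()
# A model that can use its extended context should complete the prompt with `passkey`.
```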

Implications and Future Directions

The introduction of Position Interpolation as a method to extend context windows of LLMs paves the way for broader application of these models without the need for extensive retraining or architectural modifications. The paper's findings shed light on the inherent flexibility of Transformer models to adapt to extended sequences, thus potentially expanding the horizon for LLM applications in processing long documents or conducting extended conversations. Looking forward, the application of PI to models with different types of positional encodings could further diversify its utility across various LLM architectures, making it a universal tool for context window extension. This research also opens up avenues for exploring other methods of reducing interpolation/extrapolation bounds, which can enrich the existing toolkit for enhancing the capacity of LLMs.

Conclusion

Position Interpolation presents an efficient and theoretically grounded method for extending the context window sizes of pretrained LLMs with minimal fine-tuning. Its practicality, coupled with the ability to reuse existing infrastructure, positions PI as an attractive solution for leveraging the capabilities of LLMs across a wider range of applications that require processing long sequences of text.
