LLM Maybe LongLM: Self-Extend LLM Context Window Without Tuning (2401.01325v3)
Abstract: It is well known that LLMs cannot generalize well to contexts longer than their training sequence length. This poses challenges when employing LLMs to process long input sequences during inference. In this work, we argue that LLMs have an inherent ability to handle long contexts without fine-tuning. To realize this ability, we propose SelfExtend, which extends the context window of LLMs by constructing bi-level attention: grouped attention and neighbor attention. Grouped attention captures dependencies among tokens that are far apart, while neighbor attention captures dependencies among adjacent tokens within a specified range. Both levels are computed from the original model's self-attention mechanism during inference. With only a minor code modification, SelfExtend can extend the context window of existing LLMs without any fine-tuning. We conduct comprehensive experiments on multiple benchmarks, and the results show that SelfExtend effectively extends the context window length of existing LLMs. The code can be found at \url{https://github.com/datamllab/LongLM}.
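The bi-level idea lends itself to a compact illustration. Below is a minimal, hypothetical NumPy sketch of how the two levels of relative positions could be merged: ordinary distances are kept within a neighbor window, while distances beyond it are mapped to a coarser scale via floor division and shifted so the two scales join at the boundary. The function name `self_extend_positions` and the parameters `group_size` and `neighbor_window` are illustrative, not the identifiers used in the official repository, and the exact offsets there may differ.

```python
import numpy as np

def self_extend_positions(seq_len, group_size, neighbor_window):
    """Sketch of SelfExtend-style bi-level relative positions.

    Within `neighbor_window`, the ordinary relative distance i - j is kept
    (neighbor attention). Beyond it, positions are floor-divided by
    `group_size` (grouped attention) and shifted so the two levels meet
    at the window boundary.
    """
    q = np.arange(seq_len)[:, None]   # query positions i
    k = np.arange(seq_len)[None, :]   # key positions j
    dist = q - k                      # ordinary relative distances

    # Grouped relative distances, shifted to line up with the neighbor window.
    shift = neighbor_window - neighbor_window // group_size
    grouped = q // group_size - k // group_size + shift

    # Exact distances for nearby tokens, grouped distances for distant ones.
    rel = np.where(dist < neighbor_window, dist, grouped)

    # Causal mask: only keys with j <= i are ever attended to.
    return np.tril(rel)

# Illustrative numbers (assumed, not from the paper's tables): a model
# pretrained on 4k tokens with group_size=8 and a 1k neighbor window would
# cover roughly (4096 - 1024) * 8 + 1024 ≈ 25k tokens of context.
print(self_extend_positions(seq_len=12, group_size=4, neighbor_window=4))
```

Because the merged relative positions never exceed the range seen during pretraining, the original positional encodings can be reused as-is, which is why no fine-tuning is required.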