LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models (2309.12307v3)
Abstract: We present LongLoRA, an efficient fine-tuning approach that extends the context sizes of pre-trained large language models (LLMs) at limited computational cost. Training LLMs with long context sizes is typically expensive, requiring extensive training hours and GPU resources; for example, training with a context length of 8192 incurs 16x the self-attention computation of training at 2048. In this paper, we speed up the context extension of LLMs in two respects. On the one hand, although dense global attention is needed at inference time, fine-tuning can be done effectively and efficiently with sparse local attention. The proposed shifted sparse attention (S2-Attn) enables context extension with non-trivial computation savings and performance similar to fine-tuning with vanilla attention. In particular, it can be implemented in only two lines of code during training and is optional at inference. On the other hand, we revisit the parameter-efficient fine-tuning regime for context expansion. Notably, we find that LoRA works well for context extension provided the embedding and normalization layers are also trainable. LongLoRA combines this improved LoRA with S2-Attn and demonstrates strong empirical results on various tasks with Llama2 models from 7B/13B to 70B: it extends Llama2 7B from a 4k context to 100k, and Llama2 70B to 32k, on a single 8x A100 machine. LongLoRA extends models' context while retaining their original architectures and is compatible with most existing techniques, such as Flash-Attention2. In addition, we conduct supervised fine-tuning with LongLoRA and our long instruction-following LongAlpaca dataset.
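As a rough illustration of the shifted sparse attention idea described in the abstract, the sketch below groups tokens, runs standard attention inside each group, and shifts half of the attention heads by half a group so information flows across group boundaries. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name and tensor layout are illustrative, it uses PyTorch 2's `scaled_dot_product_attention` for the per-group attention, and the causal-mask handling for the wrapped-around shifted group is omitted for brevity.

```python
import torch
import torch.nn.functional as F

def shifted_sparse_attention(q, k, v, group_size):
    """q, k, v: (batch, seq_len, num_heads, head_dim), with seq_len % group_size == 0."""
    B, N, H, D = q.shape
    G = group_size

    def shift(x, direction):
        # Shift the second half of the heads by half a group along the sequence axis.
        first, second = x.chunk(2, dim=2)
        return torch.cat((first, second.roll(direction * (G // 2), dims=1)), dim=2)

    def to_groups(x):
        # Fold the sequence into groups so attention stays within each group.
        return x.view(B * N // G, G, H, D).transpose(1, 2)  # (B*N/G, H, G, D)

    q, k, v = (to_groups(shift(t, -1)) for t in (q, k, v))
    out = F.scaled_dot_product_attention(q, k, v)  # exact attention inside each group
    out = out.transpose(1, 2).reshape(B, N, H, D)
    return shift(out, +1)  # roll the shifted heads back into place

# Toy usage: 2 sequences of 16 tokens, 4 heads of dim 32, groups of 8 tokens.
q = torch.randn(2, 16, 4, 32)
out = shifted_sparse_attention(q, q.clone(), q.clone(), group_size=8)
print(out.shape)  # torch.Size([2, 16, 4, 32])
```

Per the abstract, this sparse pattern is only needed during training; inference can fall back to dense attention, and LongLoRA pairs the pattern with LoRA in which the embedding and normalization layers are additionally kept trainable.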
- NTK-aware scaled RoPE, 2023. URL https://www.reddit.com/r/LocalLLaMA/comments/14lz7j5/ntkaware_scaled_rope_allows_llama_models_to_have/.
- Neural kaleidoscopic space sculpting. In CVPR, pp. 4349–4358, 2023.
- L-Eval: Instituting standardized evaluation for long-context language models, 2023.
- Input-tuning: Adapting unfamiliar inputs to frozen pretrained models. CoRR, abs/2203.03131, 2022.
- Proof-pile, 2022. URL https://github.com/zhangir-azerbayev/proof-pile.
- Layer normalization. CoRR, abs/1607.06450, 2016.
- LongBench: A bilingual, multitask benchmark for long context understanding. CoRR, abs/2308.14508, 2023.
- Longformer: The long-document transformer. CoRR, abs/2004.05150, 2020.
- Recurrent memory transformer. In NeurIPS, 2022.
- Pixelated butterfly: Simple and efficient sparse training for neural network models. In ICLR, 2022.
- Extending context window of large language models via positional interpolation. CoRR, abs/2306.15595, 2023.
- Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality, March 2023. URL https://lmsys.org/blog/2023-03-30-vicuna/.
- Generating long sequences with sparse transformers. CoRR, abs/1904.10509, 2019.
- Together Computer. RedPajama: An open source recipe to reproduce LLaMA training dataset, 2023. URL https://github.com/togethercomputer/RedPajama-Data.
- Tri Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning. CoRR, abs/2307.08691, 2023.
- FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In NeurIPS, 2022.
- LongNet: Scaling transformers to 1,000,000,000 tokens. CoRR, abs/2307.02486, 2023.
- GLM: General language model pretraining with autoregressive blank infilling. In ACL, pp. 320–335, 2022.
- REALM: Retrieval-augmented language model pre-training. CoRR, abs/2002.08909, 2020.
- LM-Infinite: Simple on-the-fly length generalization for large language models. CoRR, abs/2308.16137, 2023.
- LoRA: Low-rank adaptation of large language models. In ICLR, 2022.
- Few-shot learning with retrieval augmented language models. CoRR, abs/2208.03299, 2022.
- Dense passage retrieval for open-domain question answering. In EMNLP, pp. 6769–6781, 2020.
- Reformer: The efficient transformer. In ICLR, 2020.
- The power of scale for parameter-efficient prompt tuning. In EMNLP, pp. 3045–3059, 2021.
- How long can open-source LLMs truly promise on context length? June 2023. URL https://lmsys.org/blog/2023-06-29-longchat.
- Prefix-tuning: Optimizing continuous prompts for generation. In ACL, pp. 4582–4597, 2021.
- Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning. In NeurIPS, 2022.
- Swin transformer: Hierarchical vision transformer using shifted windows. In ICCV, pp. 9992–10002, 2021.
- Decoupled weight decay regularization. In ICLR, 2019.
- PEFT: State-of-the-art parameter-efficient fine-tuning methods. https://github.com/huggingface/peft, 2022.
- Landmark attention: Random-access infinite context length for transformers. CoRR, abs/2305.16300, 2023.
- Pytorch: An imperative style, high-performance deep learning library. In NeurIPS, pp. 8024–8035, 2019.
- YaRN: Efficient context window extension of large language models. CoRR, abs/2309.00071, 2023.
- Train short, test long: Attention with linear biases enables input length extrapolation. In ICLR, 2022.
- 3D graph neural networks for RGBD semantic segmentation. In ICCV, pp. 5209–5218, 2017.
- Blockwise self-attention for long document understanding. In Findings of EMNLP, pp. 2555–2565, 2020.
- Compressive transformers for long-range sequence modelling. In ICLR, 2020.
- DeepSpeed: System optimizations enable training deep learning models with over 100 billion parameters. In KDD, pp. 3505–3506, 2020.
- RoFormer: Enhanced transformer with rotary position embedding. CoRR, abs/2104.09864, 2021.
- Training neural networks with fixed sparse masks. In NeurIPS, pp. 24193–24205, 2021.
- MosaicML NLP Team. Introducing MPT-30B: Raising the bar for open-source foundation models, 2023a. URL www.mosaicml.com/blog/mpt-30b.
- MosaicML NLP Team. Introducing MPT-7B: A new standard for open-source, commercially usable LLMs, 2023b. URL www.mosaicml.com/blog/mpt-7b.
- LLaMA: Open and efficient foundation language models. CoRR, abs/2302.13971, 2023a.
- Llama 2: Open foundation and fine-tuned chat models. CoRR, abs/2307.09288, 2023b.
- Focused transformer: Contrastive training for context scaling. CoRR, abs/2307.03170, 2023.
- Attention is all you need. In NeurIPS, pp. 5998–6008, 2017.
- Linformer: Self-attention with linear complexity. CoRR, abs/2006.04768, 2020.
- Memorizing transformers. In ICLR, 2022.
- Big bird: Transformers for longer sequences. In NeurIPS, 2020.
- BitFit: Simple parameter-efficient fine-tuning for transformer-based masked language-models. In ACL, pp. 1–9, 2022.
- SafeConv: Explaining and correcting conversational unsafe behavior. In ACL, pp. 22–35, 2023.
- PoSE: Efficient context window extension of LLMs via positional skip-wise training, 2023.