Outlier-weighed Layerwise Sampling for LLM Fine-tuning (2405.18380v3)
Abstract: The rapid advancements in LLMs have revolutionized various natural language processing tasks. However, the substantial size of LLMs presents significant challenges in training or fine-tuning. While parameter-efficient approaches such as low-rank adaptation (LoRA) have gained popularity, they often compromise performance compared to full-rank fine-tuning. In this paper, we propose Outlier-weighed Layerwise Sampling (OWS), a new memory-efficient fine-tuning approach, inspired by the layerwise outlier distribution of LLMs. Unlike LoRA, which adds extra adapters to all layers, OWS strategically assigns higher sampling probabilities to layers with more outliers, selectively sampling only a few layers and fine-tuning their pre-trained weights. To further increase the number of fine-tuned layers without a proportional rise in memory costs, we incorporate gradient low-rank projection, further boosting the approach's performance. Our extensive experiments across various architectures, including LLaMa2 and Mistral, demonstrate that OWS consistently outperforms baseline approaches, including full fine-tuning. Specifically, it achieves up to a 1.1% average accuracy gain on the Commonsense Reasoning benchmark, a 3.0% improvement on MMLU, and a notable 10% boost on MT-Bench, while being more memory efficient. OWS allows us to fine-tune 7B LLMs with only 21GB of memory. Our code is available at https://github.com/pixeli99/OWS.
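The following is a minimal, illustrative sketch of the layerwise sampling idea described in the abstract; it is not the authors' implementation. It assumes the outlier ratio of a layer is the fraction of weight magnitudes, scaled by input activation norms, that exceed M times the layer mean (as in the OWL pruning metric the method is inspired by), and that layers are sampled with probability proportional to that ratio. The threshold `m`, the toy layer shapes, and the helper names are assumptions for illustration only.

```python
# Illustrative sketch of outlier-weighed layerwise sampling (not the paper's code).
# Assumptions: outlier ratio = fraction of |W_ij| * ||X_j|| entries above m * mean
# (OWL-style metric); sampling probability proportional to the outlier ratio.
import torch

def layer_outlier_ratio(weight: torch.Tensor, act_norm: torch.Tensor, m: float = 5.0) -> float:
    """Fraction of |W_ij| * ||X_j|| entries exceeding m times their mean (assumed metric)."""
    scores = weight.abs() * act_norm           # (out, in) scaled column-wise by activation norms
    return (scores > m * scores.mean()).float().mean().item()

def sample_layers(outlier_ratios: torch.Tensor, n_layers_to_tune: int) -> torch.Tensor:
    """Sample layer indices without replacement, proportionally to their outlier ratios."""
    probs = outlier_ratios / outlier_ratios.sum()
    return torch.multinomial(probs, n_layers_to_tune, replacement=False)

# Toy usage: 8 layers with random weights and activation norms.
torch.manual_seed(0)
ratios = torch.tensor([
    layer_outlier_ratio(torch.randn(64, 64), torch.rand(64)) for _ in range(8)
])
active = sample_layers(ratios, n_layers_to_tune=2)
print("layers selected for fine-tuning this period:", active.tolist())
```

In the method described above, the pre-trained weights of the sampled layers are then updated directly while the rest of the model stays frozen, and gradient low-rank projection (as in GaLore) can be applied to those updates to keep memory costs low.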
- Palm 2 technical report. arXiv preprint arXiv:2305.10403, 2023.
- Lora learns less and forgets less. arXiv preprint arXiv:2405.09673, 2024.
- Piqa: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pages 7432–7439, 2020.
- Freezeout: Accelerate training by progressively freezing layers. arXiv preprint arXiv:1706.04983, 2017.
- Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
- Boolq: Exploring the surprising difficulty of natural yes/no questions. arXiv preprint arXiv:1905.10044, 2019.
- Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018.
- LLM.int8(): 8-bit matrix multiplication for transformers at scale. Advances in Neural Information Processing Systems, 35, 2022.
- Qlora: Efficient finetuning of quantized llms. Advances in Neural Information Processing Systems, 36, 2024.
- Black-box prompt learning for pre-trained language models. arXiv preprint arXiv:2201.08531, 2022.
- Warp: Word-level adversarial reprogramming. arXiv preprint arXiv:2101.00121, 2021.
- Towards a unified view of parameter-efficient transfer learning. arXiv preprint arXiv:2110.04366, 2021.
- Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020.
- B. M. Hill. A simple general approach to inference about the tail of a distribution. The annals of statistics, pages 1163–1174, 1975.
- Parameter-efficient transfer learning for nlp. In International conference on machine learning, pages 2790–2799. PMLR, 2019.
- Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
- Llm-adapters: An adapter family for parameter-efficient fine-tuning of large language models. arXiv preprint arXiv:2304.01933, 2023.
- Mistral 7b. arXiv preprint arXiv:2310.06825, 2023.
- Is chatgpt a good translator? yes with gpt-4 as the engine. arXiv preprint arXiv:2301.08745, 2023.
- T. Kloek and H. K. Van Dijk. Bayesian estimates of equation system parameters: an application of integration by monte carlo. Econometrica: Journal of the Econometric Society, pages 1–19, 1978.
- Chatgpt: Jack of all trades, master of none. Information Fusion, 99:101861, 2023.
- Vera: Vector-based random matrix adaptation. arXiv preprint arXiv:2310.11454, 2023.
- Bert busters: Outlier dimensions that disrupt transformers. arXiv preprint arXiv:2105.06990, 2021.
- The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691, 2021.
- Smartfrz: An efficient training framework using attention-based layer freezing. arXiv preprint arXiv:2401.16720, 2024.
- X. L. Li and P. Liang. Prefix-tuning: Optimizing continuous prompts for generation. arXiv preprint arXiv:2101.00190, 2021.
- Relora: High-rank training through low-rank updates. In Workshop on Advancing Neural Network Training: Computational Efficiency, Scalability, and Resource Optimization (WANT@ NeurIPS 2023), 2023.
- Stack more layers differently: High-rank training through low-rank updates. arXiv preprint arXiv:2307.05695, 2023.
- Dora: Weight-decomposed low-rank adaptation. arXiv preprint arXiv:2402.09353, 2024.
- Gpt understands, too. arXiv preprint arXiv:2103.10385, 2021.
- Autofreeze: Automatically freezing model blocks to accelerate fine-tuning. arXiv preprint arXiv:2102.01386, 2021.
- Parameter-efficient multi-task fine-tuning for transformers via shared hypernetworks. arXiv preprint arXiv:2106.04489, 2021.
- Traditional and heavy-tailed self regularization in neural network models. arXiv preprint arXiv:1901.08276, 2019.
- Heavy-tailed universality predicts trends in test accuracies for very large pre-trained deep neural networks. In Proceedings of the 2020 SIAM International Conference on Data Mining, pages 505–513. SIAM, 2020.
- Implicit self-regularization in deep neural networks: Evidence from random matrix theory and implications for learning. Journal of Machine Learning Research, 22(165):1–73, 2021.
- Meta. Llama3. https://github.com/meta-llama/llama3, 2024.
- Can a suit of armor conduct electricity? a new dataset for open book question answering. arXiv preprint arXiv:1809.02789, 2018.
- Lisa: Layerwise importance sampling for memory-efficient large language model fine-tuning. arXiv preprint arXiv:2403.17919, 2024.
- Instruction tuning with gpt-4. arXiv preprint arXiv:2304.03277, 2023.
- Outlier dimensions that disrupt transformers are driven by frequency. arXiv preprint arXiv:2205.11380, 2022.
- Tied-lora: Enhancing parameter efficiency of lora with weight tying. arXiv preprint arXiv:2311.09578, 2023.
- Winogrande: An adversarial winograd schema challenge at scale. Communications of the ACM, 64(9):99–106, 2021.
- Socialiqa: Commonsense reasoning about social interactions. arXiv preprint arXiv:1904.09728, 2019.
- S-lora: Serving thousands of concurrent lora adapters. arXiv preprint arXiv:2311.03285, 2023.
- A simple and effective pruning approach for large language models. arXiv preprint arXiv:2306.11695, 2023.
- Use chat gpt to solve programming bugs. International Journal of Information Technology and Computer Engineering, (31):17–22, 2023.
- Is chatgpt the ultimate programming assistant–how far is it? arXiv preprint arXiv:2304.11938, 2023.
- Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
- Chain of lora: Efficient fine-tuning of language models via residual learning. arXiv preprint arXiv:2401.04151, 2024.
- Smoothquant: Accurate and efficient post-training quantization for large language models. In International Conference on Machine Learning, pages 38087–38099. PMLR, 2023.
- Heavy-tailed regularization of weight matrices in deep neural networks. In International Conference on Artificial Neural Networks, pages 236–247. Springer, 2023.
- Test accuracy vs. generalization gap: Model selection in nlp without accessing training or testing data. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 3011–3021, 2023.
- Outlier weighed layerwise sparsity (owl): A missing secret sauce for pruning llms to high sparsity. In International Conference on Machine Learning. PMLR, 2024.
- Hellaswag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830, 2019.
- Adaptive budget allocation for parameter-efficient fine-tuning. In The Eleventh International Conference on Learning Representations, 2023.
- Galore: Memory-efficient llm training by gradient low-rank projection. arXiv preprint arXiv:2403.03507, 2024.
- P. Zhao and T. Zhang. Stochastic optimization with importance sampling for regularized loss minimization. In international conference on machine learning, pages 1–9. PMLR, 2015.
- Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in Neural Information Processing Systems, 36, 2024.
- Factual probing is [mask]: Learning vs. learning to recall. arXiv preprint arXiv:2104.05240, 2021.
- Temperature balancing, layer-wise weight analysis, and neural network training. Advances in Neural Information Processing Systems, 36, 2024.