MoFO: Momentum-Filtered Optimizer for Mitigating Forgetting in LLM Fine-Tuning (2407.20999v3)
Abstract: Large language models (LLMs) have demonstrated remarkable capabilities across a wide range of tasks. Typically, LLMs are first pre-trained on large corpora and subsequently fine-tuned on task-specific datasets. However, during fine-tuning, LLMs may forget some of the knowledge acquired in the pre-training stage, leading to a decline in general capabilities. Existing approaches to mitigating forgetting often rely on access to pre-training data, which may be unavailable in many real-world scenarios, such as fine-tuning open-source LLMs for which only model checkpoints are released. To address this challenge, we propose a new fine-tuning algorithm termed Momentum-Filtered Optimizer (MoFO). MoFO is an extension of greedy block coordinate descent (BCD) methods: in each iteration, MoFO updates only the model parameters with the largest momentum magnitudes, while keeping all other parameters fixed. MoFO achieves fine-tuning performance similar to that of the default fine-tuning algorithm while effectively mitigating knowledge forgetting. We validate MoFO through rigorous convergence analysis and extensive experiments, demonstrating its effectiveness in mitigating forgetting without pre-training data.
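To make the update rule in the abstract concrete (each iteration, update only the parameters with the largest momentum magnitudes and keep the rest fixed), the sketch below illustrates one way such a step could look on top of an Adam-style optimizer. This is a minimal illustration, not the authors' implementation: the per-tensor filtering, the `update_fraction` hyperparameter, and the state layout are all assumptions made for clarity.

```python
# Minimal sketch of a momentum-filtered Adam-style step (illustrative only).
# Assumption: the filter is applied per parameter tensor, and `update_fraction`
# controls what fraction of each tensor's entries are updated per step.
import torch


@torch.no_grad()
def momentum_filtered_step(params, states, lr=1e-5, betas=(0.9, 0.999),
                           eps=1e-8, update_fraction=0.1):
    """Apply an Adam-style update only to the entries of each parameter tensor
    whose first-moment (momentum) magnitude is among the largest
    `update_fraction` of that tensor; all other entries are left unchanged."""
    beta1, beta2 = betas
    for p in params:
        if p.grad is None:
            continue
        state = states.setdefault(p, {
            "step": 0,
            "exp_avg": torch.zeros_like(p),      # first moment (momentum)
            "exp_avg_sq": torch.zeros_like(p),   # second moment
        })
        state["step"] += 1

        # Standard Adam moment estimates.
        state["exp_avg"].mul_(beta1).add_(p.grad, alpha=1 - beta1)
        state["exp_avg_sq"].mul_(beta2).addcmul_(p.grad, p.grad, value=1 - beta2)

        bias_c1 = 1 - beta1 ** state["step"]
        bias_c2 = 1 - beta2 ** state["step"]
        denom = (state["exp_avg_sq"] / bias_c2).sqrt().add_(eps)
        full_update = lr * (state["exp_avg"] / bias_c1) / denom

        # Momentum filter: keep only the top-`update_fraction` entries of this
        # tensor by |momentum|; the update is zeroed everywhere else, so those
        # parameters stay at their current (pre-trained or fine-tuned) values.
        k = max(1, int(update_fraction * p.numel()))
        threshold = torch.topk(state["exp_avg"].abs().flatten(), k).values.min()
        mask = (state["exp_avg"].abs() >= threshold).to(p.dtype)

        p.sub_(full_update * mask)
```

Because most entries receive no update in a given step, the fine-tuned model stays closer to the pre-trained checkpoint in parameter space, which is the intuition behind the forgetting mitigation described above.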