Scaling Laws for Forgetting When Fine-Tuning Large Language Models (2401.05605v1)

Published 11 Jan 2024 in cs.CL and cs.LG

Abstract: We study and quantify the problem of forgetting when fine-tuning pre-trained LLMs on a downstream task. We find that parameter-efficient fine-tuning (PEFT) strategies, such as Low-Rank Adapters (LoRA), still suffer from catastrophic forgetting. In particular, we identify a strong inverse linear relationship between the fine-tuning performance and the amount of forgetting when fine-tuning LLMs with LoRA. We further obtain precise scaling laws that show forgetting increases as a shifted power law in the number of parameters fine-tuned and the number of update steps. We also examine the impact of forgetting on knowledge, reasoning, and the safety guardrails trained into Llama 2 7B chat. Our study suggests that forgetting cannot be avoided through early stopping or by varying the number of parameters fine-tuned. We believe this opens up an important safety-critical direction for future research to evaluate and develop fine-tuning schemes which mitigate forgetting.


Summary

  • The paper finds an inverse linear relationship between fine-tuning performance and forgetting, quantifying the loss of pre-trained knowledge.
  • It introduces a novel cross-entropy metric to rigorously evaluate knowledge retention during fine-tuning.
  • Forgetting and fine-tuning loss follow shifted power laws in the number of parameters fine-tuned and the number of update steps, informing strategies for balancing performance and knowledge retention in LLMs.

Scaling Laws for Forgetting When Fine-Tuning LLMs

This essay examines the empirical study of forgetting in LLMs during fine-tuning presented in "Scaling Laws for Forgetting When Fine-Tuning LLMs" (2401.05605). The paper offers a rigorous analysis of the interplay between fine-tuning performance, the number of parameters fine-tuned, and the extent of forgetting. Its methodology, empirical results, and implications for future work are discussed below.

Introduction to Forgetting in Fine-Tuning

The paper addresses catastrophic forgetting, a failure mode in which LLMs lose previously acquired knowledge during fine-tuning, especially when the downstream task requires knowledge not represented in their pre-training data. This is particularly concerning in safety-critical applications where models must retain essential capabilities. The investigation revealed that even parameter-efficient fine-tuning (PEFT) methods such as Low-Rank Adapters (LoRA) are susceptible to significant forgetting; a typical LoRA configuration is sketched after Figure 1.

Figure 1: Generation examples of the pre-trained model and a model fine-tuned with LoRA, showing updated knowledge, forgotten knowledge, and forgotten alignment behavior.
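
The sketch below shows how a LoRA fine-tuning run is commonly configured with the Hugging Face peft library. It is purely illustrative: the base model name, rank, scaling factor, and target modules are assumptions for the example, not the paper's exact experimental setup.

```python
# Illustrative LoRA setup using Hugging Face `peft`; the hyperparameters
# (rank, alpha, dropout, target modules) are placeholder choices, not the
# paper's exact configuration.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                                  # LoRA rank; trainable parameter count scales with r
    lora_alpha=32,                         # scaling applied to the low-rank update
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()         # only the low-rank adapter weights are trainable
```

Varying the rank r changes the number of parameters fine-tuned, which is one of the two quantities the paper's scaling laws are expressed in.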

Empirical Analysis of Forgetting

Inverse Linear Relationship

The core finding of the paper is an inverse linear relationship between fine-tuning performance (quantified as fine-tuning loss) and the extent of forgetting: as the fine-tuning loss decreases, forgetting of pre-trained capabilities increases roughly linearly. Achieving stronger downstream performance therefore comes at a predictable cost in retained pre-training knowledge (see the fitting sketch after Figure 2).

Figure 2: Fine-tuning performance vs. forgetting on the OpenOrca and News datasets, illustrating the linear relationship and its implications.
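
The sketch below shows how such a linear relationship could be checked on logged measurements; the (fine-tuning loss, forgetting) arrays are hypothetical placeholders, not values from the paper.

```python
# Fit a straight line to hypothetical (fine-tuning loss, forgetting) pairs
# logged over a fine-tuning run; the data below are placeholders.
import numpy as np

finetune_loss = np.array([2.1, 1.8, 1.5, 1.3, 1.1, 0.9])   # decreasing as training proceeds
forgetting    = np.array([0.4, 0.7, 1.0, 1.2, 1.4, 1.6])   # increasing as training proceeds

slope, intercept = np.polyfit(finetune_loss, forgetting, deg=1)
pred = slope * finetune_loss + intercept
r_squared = 1 - np.sum((forgetting - pred) ** 2) / np.sum((forgetting - forgetting.mean()) ** 2)

print(f"forgetting ≈ {slope:.2f} * loss + {intercept:.2f}  (R² = {r_squared:.3f})")
# A negative slope with high R² corresponds to the inverse linear
# relationship reported in the paper.
```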

Forgetting Metrics and Evaluation

To quantify forgetting rigorously, the paper introduces a metric based on the cross-entropy between the predictions of the pre-trained model and those of the fine-tuned model, providing a measure that accounts for prior knowledge and prediction shifts. This metric distinguishes genuine forgetting from mere changes in output format or task performance, offering a more precise evaluation of knowledge retention; a minimal computation is sketched below.
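
The following sketch assumes forgetting is measured as the average token-level cross-entropy between the base model's next-token distribution and the fine-tuned model's on evaluation text; the fine-tuned model path, evaluation data, and averaging details are assumptions and may differ from the paper's protocol.

```python
# Sketch: average token-level cross-entropy between a base model's and a
# fine-tuned model's next-token distributions on an evaluation passage.
# The fine-tuned model path and the evaluation text are placeholders.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
tuned = AutoModelForCausalLM.from_pretrained("path/to/fine-tuned-model")  # placeholder

@torch.no_grad()
def forgetting_metric(text: str) -> float:
    inputs = tokenizer(text, return_tensors="pt")
    p_base = F.softmax(base(**inputs).logits, dim=-1)           # reference distribution
    log_p_tuned = F.log_softmax(tuned(**inputs).logits, dim=-1)
    # Cross-entropy H(p_base, p_tuned), averaged over token positions.
    return -(p_base * log_p_tuned).sum(dim=-1).mean().item()

print(forgetting_metric("An evaluation passage drawn from pre-training-style text."))
```

Because the base model's own distribution serves as the reference, the metric grows only as the fine-tuned model's predictions drift away from what the pre-trained model produced, rather than whenever raw task accuracy changes.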

Power Law Scaling

In accordance with prior research on scaling laws for LLMs, the paper finds that both forgetting and fine-tuning loss follow power law relationships in the number of parameters fine-tuned and the number of update steps. These are modeled as shifted power law functions, showing that fine-tuning more parameters and taking more update steps both exacerbate forgetting. Together they suggest a fundamental scaling behavior that can guide the design of fine-tuning strategies; a generic curve-fitting sketch follows.
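
As an illustration of fitting such a relationship, the sketch below fits a generic shifted power law of the form f(x) = a + b·(x + c)^α to hypothetical (update steps, forgetting) measurements. Both the functional form and the data are assumptions for illustration; the paper's exact parameterization, which couples the number of fine-tuned parameters and update steps, may differ.

```python
# Fit a generic shifted power law f(x) = a + b * (x + c)**alpha to
# hypothetical (update steps, forgetting) data. Illustrative only; the
# paper's exact functional form and constants may differ.
import numpy as np
from scipy.optimize import curve_fit

def shifted_power_law(x, a, b, c, alpha):
    return a + b * np.power(x + c, alpha)

steps = np.array([50, 100, 200, 400, 800, 1600], dtype=float)
forgetting = np.array([0.30, 0.45, 0.65, 0.90, 1.20, 1.55])   # placeholder measurements

params, _ = curve_fit(
    shifted_power_law, steps, forgetting,
    p0=[0.1, 0.01, 1.0, 0.5],   # rough initial guess for (a, b, c, alpha)
    maxfev=10000,
)
a, b, c, alpha = params
print(f"forgetting(t) ≈ {a:.3f} + {b:.3f} * (t + {c:.1f})^{alpha:.2f}")
```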

Observation of Forgetting Effects

Generation examples from fine-tuned models provide concrete evidence of forgetting. For instance, fine-tuning Llama 2 7B chat on the OpenOrca instruction dataset or on a recent-news dataset led to noticeable forgetting of knowledge, reasoning ability, and alignment behavior that the pre-trained model had exhibited.

Figure 3: Forgetting and fine-tuning loss trajectories and fitted curves for varying ranks and datasets, showcasing the empirical fit and trends.

Implications and Future Directions

These findings have several implications. First, they underscore the need for mitigation strategies that balance fine-tuning performance against retention of critical pre-trained capabilities. Such strategies could include hybrid optimization algorithms that refine PEFT methods, or procedures that dynamically adjust fine-tuning based on a forgetting metric tracked in real time; a hypothetical monitoring loop is sketched below.
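
One hypothetical way to act on a forgetting metric during fine-tuning is a control loop that evaluates forgetting alongside the fine-tuning loss and halts or flags the run once a tolerance is exceeded. The sketch below is a speculative illustration of the kind of mitigation the paper motivates, not a method the paper proposes, and the helper functions simulate trajectories purely so the example runs.

```python
# Hypothetical control loop tracking a forgetting metric alongside the
# fine-tuning loss. The three helpers are simulated stand-ins, not real
# training code; only the monitoring pattern is the point.
import math

def train_step(step: int) -> None:
    pass                                             # placeholder for one optimizer update

def eval_finetune_loss(step: int) -> float:
    return 2.0 * math.exp(-step / 500) + 0.8         # simulated, decreasing fine-tuning loss

def forgetting_metric(step: int) -> float:
    return 0.2 + 0.03 * (step + 10) ** 0.5           # simulated, power-law-like growth

MAX_STEPS = 2000
EVAL_EVERY = 200
FORGETTING_BUDGET = 1.0                              # hypothetical tolerance, task-dependent

for step in range(1, MAX_STEPS + 1):
    train_step(step)
    if step % EVAL_EVERY == 0:
        ft_loss, forget = eval_finetune_loss(step), forgetting_metric(step)
        print(f"step {step}: fine-tune loss {ft_loss:.3f}, forgetting {forget:.3f}")
        if forget > FORGETTING_BUDGET:
            print("Forgetting budget exceeded; stop or adjust the run here.")
            break
```

Note that the paper's own results suggest early stopping alone does not avoid forgetting; a loop like this only makes the trade-off explicit rather than eliminating it.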

Furthermore, the paper opens avenues for a more detailed exploration of how specific types of knowledge (e.g., reasoning versus safety protocols) are affected by different fine-tuning regimes, informing targeted adjustments in fine-tuning methodologies to preserve core competencies of LLMs.

Conclusion

This investigation extends the understanding of scaling laws in the context of LLM fine-tuning, emphasizing the interplay between downstream performance, the number of parameters fine-tuned, and forgetting. By establishing a quantitative framework for measuring and analyzing forgetting, the work charts a direction for improving the robustness and reliability of fine-tuned LLMs. Its consistent results provide a reference point for subsequent empirical and theoretical studies aimed at mitigating forgetting in LLMs.
