Understanding Catastrophic Forgetting in Language Models via Implicit Inference (2309.10105v2)
Abstract: We lack a systematic understanding of the effects of fine-tuning (via methods such as instruction tuning or reinforcement learning from human feedback), particularly on tasks outside the narrow fine-tuning distribution. In a simplified scenario, we demonstrate that improving performance on tasks within the fine-tuning data distribution comes at the expense of capabilities on other tasks. We hypothesize that LLMs implicitly infer the task of the prompt and that fine-tuning skews this inference towards tasks in the fine-tuning distribution. To test this, we propose Conjugate Prompting, which artificially makes the task look farther from the fine-tuning distribution while requiring the same capability, and we find that this recovers some of the pretraining capabilities in our synthetic setup. Since real-world fine-tuning distributions are predominantly English, we apply conjugate prompting to recover pretrained capabilities in LLMs by simply translating the prompts to different languages. This allows us to recover in-context learning abilities lost via instruction tuning, natural reasoning capability lost during code fine-tuning, and, more concerningly, harmful content generation suppressed by safety fine-tuning in chatbots like ChatGPT.
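The abstract describes conjugate prompting for real-world LLMs as a translate-query-translate-back recipe. The sketch below is a minimal illustration of that recipe, not the authors' code; `translate` and `query_llm` are hypothetical stand-ins for whatever machine-translation service and model endpoint one has available, and the choice of French as the pivot language is an arbitrary assumption.

```python
# Minimal sketch of conjugate prompting via translation (illustrative only).

def translate(text: str, source_lang: str, target_lang: str) -> str:
    """Hypothetical wrapper around a machine-translation service."""
    raise NotImplementedError("plug in a translation API here")

def query_llm(prompt: str) -> str:
    """Hypothetical wrapper around a fine-tuned language model endpoint."""
    raise NotImplementedError("plug in a model endpoint here")

def conjugate_prompt(prompt: str, pivot_lang: str = "fr") -> str:
    """Move the prompt away from the (predominantly English) fine-tuning
    distribution, query the model, then map the answer back."""
    # 1. Transform the prompt so it looks unlike the fine-tuning data
    #    while still requiring the same underlying capability.
    translated_prompt = translate(prompt, source_lang="en", target_lang=pivot_lang)

    # 2. Query the fine-tuned model on the transformed prompt; the paper's
    #    hypothesis is that this shifts the model's implicit task inference
    #    back towards its pretrained behaviour.
    translated_answer = query_llm(translated_prompt)

    # 3. Invert the transformation to read the answer in the original language.
    return translate(translated_answer, source_lang=pivot_lang, target_lang="en")
```

The structure mirrors the abstract's argument: translation makes the prompt look farther from the English fine-tuning distribution while the task itself, and the capability it requires, stays the same.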