- The paper presents a Recall and Learn mechanism that minimizes catastrophic forgetting by simulating pretraining objectives using a quadratic penalty.
- It introduces objective shifting with an annealing coefficient, balancing retention of pretraining knowledge with learning new tasks.
- Experiments on GLUE show that BERT-base gains +1.7% on average on tasks with limited training data, and that ALBERT-xxlarge achieves state-of-the-art performance.
 
 
Recall and Learn: Fine-tuning Deep Pretrained Language Models with Less Forgetting
The paper "Recall and Learn: Fine-tuning Deep Pretrained Language Models with Less Forgetting" by Chen et al. addresses a persistent problem in NLP: catastrophic forgetting during the fine-tuning of deep pretrained language models. The problem arises when a model tuned for a specific downstream task loses knowledge acquired during pretraining, which often results in suboptimal performance.
Main Contributions
The authors introduce a novel approach to mitigate catastrophic forgetting by combining sequential transfer learning with multi-task learning principles. This methodology is encapsulated in their "Recall and Learn" mechanism, comprising two key components: Pretraining Simulation and Objective Shifting. Together, these mechanisms enable a model to simultaneously recall pretraining knowledge and adapt to new tasks, thus reducing the extent of forgetting.
- Pretraining Simulation: This technique approximates the pretraining objectives with a quadratic penalty on the distance between the current parameters and the pretrained parameters, enabling the model to recall knowledge without accessing the actual pretraining data. The penalty draws on the Fisher information matrix, which is then replaced by a more computationally tractable approximation, so the pretraining corpus is not needed during fine-tuning.
- Objective Shifting: By incorporating an annealing coefficient, this method dynamically balances the focus between maintaining pretraining knowledge and learning new tasks. Over time, it allows the learning process to gradually shift toward optimizing performance on downstream tasks.
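The two components above can be sketched together as a single combined loss. The following is a minimal illustration, not the authors' implementation: it assumes a sigmoid form for the annealing coefficient and a single scalar penalty strength, and the names `annealing_coefficient`, `quadratic_penalty`, `recall_and_learn_loss`, and the default values of `k`, `t0`, and `gamma` are hypothetical placeholders chosen for readability.

```python
import math


def annealing_coefficient(step, k=0.1, t0=1000):
    """Sigmoid schedule (assumed form): close to 0 early, close to 1 late.

    k controls how fast the shift happens; t0 is the step at which the
    coefficient equals 0.5. Both defaults are illustrative only.
    """
    z = k * (step - t0)
    # Numerically stable sigmoid, avoids overflow for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def quadratic_penalty(params, pretrained_params, gamma=1.0):
    """Pretraining simulation: 0.5 * gamma * squared distance to the pretrained weights."""
    return 0.5 * gamma * sum(
        (p - p0) ** 2 for p, p0 in zip(params, pretrained_params)
    )


def recall_and_learn_loss(task_loss, params, pretrained_params, step,
                          k=0.1, t0=1000, gamma=1.0):
    """Objective shifting: blend the task loss and the recall penalty."""
    lam = annealing_coefficient(step, k, t0)
    penalty = quadratic_penalty(params, pretrained_params, gamma)
    return lam * task_loss + (1.0 - lam) * penalty
```

Early in training the coefficient is near 0, so the penalty dominates and the parameters stay close to the pretrained ones. As `step` passes `t0` the weight moves to the task loss, which matches the gradual shift toward the downstream objective described above.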
Additionally, the paper introduces the Recall Adam (RecAdam) optimizer, which integrates these mechanisms into the standard Adam optimizer. RecAdam decouples the quadratic penalty and the annealing coefficient from Adam's adaptive gradient update, which supports more effective fine-tuning.
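One way to picture the decoupling is a single update step in which only the task gradient feeds Adam's moment estimates, while the recall term and the annealing coefficient are applied directly to the parameters. The sketch below is a simplified pure-Python illustration under that assumption, not the released RecAdam code; the function name `recadam_step`, its signature, and the default hyperparameters are hypothetical.

```python
import math


def recadam_step(theta, theta_star, grad, m, v, t, lam,
                 lr=1e-3, gamma=1.0, beta1=0.9, beta2=0.999, eps=1e-8):
    """One decoupled update over flat lists of floats (illustrative sketch).

    theta       current parameters
    theta_star  pretrained parameters (the values to "recall")
    grad        gradient of the task loss only
    m, v        Adam first and second moment estimates
    t           1-based step count
    lam         annealing coefficient in [0, 1]
    """
    new_theta, new_m, new_v = [], [], []
    for p, p0, g, m_i, v_i in zip(theta, theta_star, grad, m, v):
        # Standard Adam moments, computed from the task gradient alone.
        m_i = beta1 * m_i + (1.0 - beta1) * g
        v_i = beta2 * v_i + (1.0 - beta2) * g * g
        m_hat = m_i / (1.0 - beta1 ** t)
        v_hat = v_i / (1.0 - beta2 ** t)
        adam_update = m_hat / (math.sqrt(v_hat) + eps)
        # Recall term: gradient of the quadratic penalty, kept outside
        # the adaptive scaling (this is the assumed decoupling).
        recall_update = gamma * (p - p0)
        p = p - lr * (lam * adam_update + (1.0 - lam) * recall_update)
        new_theta.append(p)
        new_m.append(m_i)
        new_v.append(v_i)
    return new_theta, new_m, new_v
```

With `lam=1.0` the step reduces to plain Adam on the task loss, and with `lam=0.0` it only pulls the parameters back toward `theta_star`. Keeping the penalty out of the moment estimates means its strength is not rescaled by Adam's per-parameter normalization.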
Empirical Results
The experiments, conducted using BERT-base and ALBERT-xxlarge models, demonstrate the efficacy of the proposed approach on the General Language Understanding Evaluation (GLUE) benchmark. Key findings include:
- With BERT-base, performance improves on 7 of the 8 GLUE tasks. The gains are largest on tasks with limited labeled data, where the average improvement is +1.7%.
- With RecAdam, BERT-base achieves performance comparable, if not superior, to that of BERT-large on a number of tasks, despite having fewer parameters.
- With ALBERT-xxlarge, the approach attains state-of-the-art results, with an average improvement of +1.5% over standard fine-tuning on tasks with smaller training datasets.
Theoretical and Practical Implications
The proposed method has both theoretical and practical implications for NLP:
- Theory: Integrating multi-task learning principles into fine-tuning allows a model to generalize better without sacrificing previously learned knowledge. This encourages further exploration of training paradigms in which multi-task learning helps alleviate forgetting in sequential learning settings.
- Practice: Using the RecAdam optimizer when fine-tuning existing pretrained language models improves performance on tasks with limited data and makes more efficient use of pretrained resources. This could ease the deployment of NLP applications in real-world scenarios where labeled data is scarce.
Future Directions
Future research could focus on refining the annealing strategies and the quadratic penalty approximation to further improve adaptation to downstream tasks. Exploring similar paradigms in other domains and architectures beyond NLP could also yield broader insights into managing catastrophic forgetting.
This work is a notable contribution to the fine-tuning of pretrained language models. It strikes a balance between retaining prior knowledge and adapting to new information, which is crucial for advancing performance on complex NLP tasks.