
Continual Learning of Large Language Models: A Comprehensive Survey

(2404.16789)
Published Apr 25, 2024 in cs.LG, cs.AI, and cs.CL

Abstract

The recent success of LLMs trained on static, pre-collected, general datasets has sparked numerous research directions and applications. One such direction addresses the non-trivial challenge of integrating pre-trained LLMs into dynamic data distributions, task structures, and user preferences. Pre-trained LLMs, when tailored for specific needs, often experience significant performance degradation in previous knowledge domains -- a phenomenon known as "catastrophic forgetting". While extensively studied in the continual learning (CL) community, it presents new manifestations in the realm of LLMs. In this survey, we provide a comprehensive overview of the current research progress on LLMs within the context of CL. This survey is structured into four main sections: we first describe an overview of continually learning LLMs, consisting of two directions of continuity: vertical continuity (or vertical continual learning), i.e., continual adaptation from general to specific capabilities, and horizontal continuity (or horizontal continual learning), i.e., continual adaptation across time and domains (Section 3). We then summarize three stages of learning LLMs in the context of modern CL: Continual Pre-Training (CPT), Domain-Adaptive Pre-training (DAP), and Continual Fine-Tuning (CFT) (Section 4). Then we provide an overview of evaluation protocols for continual learning with LLMs, along with the current available data sources (Section 5). Finally, we discuss intriguing questions pertaining to continual learning for LLMs (Section 6). The full list of papers examined in this survey is available at https://github.com/Wang-ML-Lab/llm-continual-learning-survey.

Figure: Overview of stages in continually pre-training and fine-tuning large language models, with strategies to prevent forgetting.

Overview

  • The paper discusses how LLMs, typically built on transformer architectures, have made substantial progress but need continual learning (CL) to stay relevant by adapting to new data and knowledge over time.

  • It introduces concepts like vertical continuity, which focuses on adapting existing models from broad to specific tasks, and horizontal continuity, which deals with integrating new information over time without forgetting past knowledge.

  • The paper also outlines the stages in continual learning for LLMs: Continual Pre-training (CPT), Domain-Adaptive Pre-training (DAP), and Continual Fine-Tuning (CFT), along with rehearsal, regularization, and architectural strategies that help models absorb new knowledge while mitigating forgetting.

Overview of Continual Learning for LLMs

Introduction to LLMs and Continual Learning

Recent advancements in LLMs, built on transformer-based architectures, have ushered in a new era in understanding and generating human language. Fueled by the massive training data they consume, these models excel across a spectrum of tasks, from translation and summarization to more complex challenges such as dialogue systems. However, most of these models are trained on static, extensive datasets and gradually become outdated unless they are continually updated with new data and knowledge, a requirement that has given rise to the subfield of continual learning (CL) for LLMs.

CL in LLMs involves training models on a sequence of data or tasks while retaining knowledge acquired from past data. A core challenge here is avoiding catastrophic forgetting, the tendency of a neural network to entirely and abruptly forget previously learned information upon learning new data.
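To make catastrophic forgetting concrete, the sketch below (a minimal illustration, not code from the survey) measures it the way it is often reported: compare a causal LM's perplexity on held-out old-domain text before and after adaptation. The model name `gpt2` and the text lists `general_texts` and `adapted` model are placeholders assumed for the example.

```python
# Minimal sketch: quantify forgetting by comparing perplexity on old-domain
# text before and after adapting the model to new data. Assumes a Hugging Face
# causal LM and a list of held-out strings `general_texts`.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

@torch.no_grad()
def perplexity(model, tokenizer, texts, device="cpu"):
    """Approximate perplexity of `model` over a list of strings
    (exp of the mean per-text language-modeling loss)."""
    model.eval().to(device)
    losses = []
    for text in texts:
        enc = tokenizer(text, return_tensors="pt", truncation=True).to(device)
        out = model(**enc, labels=enc["input_ids"])  # causal LM loss
        losses.append(out.loss.item())
    return math.exp(sum(losses) / len(losses))

# Usage sketch (placeholders):
# tok = AutoTokenizer.from_pretrained("gpt2")
# base = AutoModelForCausalLM.from_pretrained("gpt2")
# ppl_before = perplexity(base, tok, general_texts)
# ... fine-tune `base` on new-domain data to obtain `adapted` ...
# ppl_after = perplexity(adapted, tok, general_texts)
# A large rise in ppl_after relative to ppl_before signals forgetting.
```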

Detailed Insights into Continual Learning for LLMs

Two Axes of Continuity in LLM Training

  1. Vertical Continuity: Vertical continuity involves adapting an established LLM from broad tasks with large data scopes to specific tasks with a narrower focus. The adaptation proceeds in stages, from general large-scale datasets to smaller, more specialized ones. The risk here is 'vertical forgetting', where the model loses its broader, general capabilities once it is tuned for a narrow specialty.
  2. Horizontal Continuity: This form concerns adapting models over time to incorporate new trends, knowledge, or shifts in the data distribution without degrading performance on past data streams. The primary challenge is managing 'horizontal forgetting' over extended periods or across distinct data domains; a sketch of how such forgetting is commonly quantified follows this list.
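
One common way to quantify such forgetting, standard continual-learning bookkeeping rather than anything specific to this survey, is to record performance on each earlier domain after every training phase and average the drop from each domain's best score to its final score. A minimal sketch, with a hypothetical accuracy matrix `acc`:

```python
# acc[i][j] = performance on domain j measured after training on domain i.
# Forgetting for domain j is its best earlier score minus its final score;
# the overall measure averages these drops over all but the last domain.
def average_forgetting(acc):
    T = len(acc)
    drops = []
    for j in range(T - 1):
        best_earlier = max(acc[i][j] for i in range(j, T - 1))
        drops.append(best_earlier - acc[T - 1][j])
    return sum(drops) / len(drops)

# Example with three domains learned in sequence (None = not yet seen):
# acc = [[0.90, None, None],
#        [0.85, 0.88, None],
#        [0.70, 0.80, 0.91]]
# average_forgetting(acc)  # (0.20 + 0.08) / 2 = 0.14
```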

Stages of Learning in Continually Adaptive LLMs

Three main stages define the spectrum of adapting LLMs in a continual learning context:

  1. Continual Pre-training (CPT): The model is repeatedly updated on sequentially or periodically collected new and diverse datasets to maintain and extend its general capabilities.
  2. Domain-Adaptive Pre-training (DAP): Before deployment or further task-specific training, the LLM continues pre-training on domain-specific corpora so that it performs well under specialized conditions.
  3. Continual Fine-Tuning (CFT): The final adaptation on narrow, task-specific datasets before deployment, ensuring the LLM performs well on its end-use tasks, such as specific language-understanding problems in defined contexts. A minimal sketch of how these stages chain together follows the list.
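
The sketch below outlines how the three stages chain together in practice. It is a simplified illustration, not code from the survey: the causal LM `model` and the three data loaders (`general_stream`, `domain_corpus`, `task_data`) are assumed to be defined elsewhere, and only the sequencing is shown.

```python
# Minimal staged-training sketch: the same update loop applied in sequence to
# general, domain-specific, and task-specific data. Batches are assumed to
# already contain `labels`, so model(**batch) returns a loss.
import torch

def train_stage(model, dataloader, lr, epochs=1, device="cpu"):
    """One training stage: standard loss-driven parameter updates."""
    model.train().to(device)
    optim = torch.optim.AdamW(model.parameters(), lr=lr)
    for _ in range(epochs):
        for batch in dataloader:
            batch = {k: v.to(device) for k, v in batch.items()}
            loss = model(**batch).loss
            optim.zero_grad()
            loss.backward()
            optim.step()
    return model

# Stage 1 -- Continual Pre-training (CPT): keep absorbing fresh general data.
# model = train_stage(model, general_stream, lr=1e-4)
# Stage 2 -- Domain-Adaptive Pre-training (DAP): continue pre-training on a
# domain corpus (e.g., biomedical or legal text) before any task tuning.
# model = train_stage(model, domain_corpus, lr=5e-5)
# Stage 3 -- Continual Fine-Tuning (CFT): adapt to the downstream task(s).
# model = train_stage(model, task_data, lr=2e-5)
```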

CL Techniques in LLMs and Their Implementation

Commonly used techniques in continual learning involve:

  • Rehearsal: Replaying a stored subset of earlier data alongside new data so the model revisits its past learning.
  • Regularization: Penalizing changes to parameters that are important for past tasks.
  • Architectural Strategies: Dynamically expanding the model's architecture to accommodate new knowledge without displacing the components that encode old information. A combined sketch of the first two strategies follows this list.
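
Below is a combined sketch of rehearsal and an EWC-style regularizer. It is illustrative only, not the survey's reference implementation: `replay_buffer` (stored earlier batches), `old_params` (a snapshot of parameters after the previous task), and `fisher` (diagonal Fisher importance estimates on the same device as the model) are assumed to exist.

```python
# Rehearsal + EWC-style regularization in a single training step.
import random
import torch

def ewc_penalty(model, old_params, fisher):
    """Quadratic penalty on drift, weighted by per-parameter importance."""
    penalty = 0.0
    for name, p in model.named_parameters():
        if name in fisher:
            penalty = penalty + (fisher[name] * (p - old_params[name]) ** 2).sum()
    return penalty

def continual_step(model, optim, new_batch, replay_buffer, old_params, fisher,
                   ewc_lambda=0.4, replay_prob=0.5):
    """One update on new data, occasionally rehearsing a stored old batch."""
    batch = new_batch
    if replay_buffer and random.random() < replay_prob:
        batch = random.choice(replay_buffer)   # rehearsal: revisit old data
    loss = model(**batch).loss                 # batch is assumed to contain labels
    loss = loss + ewc_lambda * ewc_penalty(model, old_params, fisher)
    optim.zero_grad()
    loss.backward()
    optim.step()
    return loss.item()
```

Architectural strategies (e.g., adding adapters or new expert modules per task) change the model structure itself rather than the loss, so they are not captured by this loop.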

Prospect and Future of Continually Learning LLMs

The continually evolving nature of data and tasks necessitates corresponding advances in LLMs' learning strategies. Future work might explore more efficient memory usage, adaptation techniques that allow personalizing models without extensive re-training, and theoretical underpinnings that can help better predict and manage model behavior over continual cycles.

Conclusion

Continual learning in LLMs is an active area of research poised for substantial growth. It integrates innovative AI research with practical applications, aiming to develop LLMs that can learn continuously and adaptively, similar to human learning processes. The path forward includes enhancing model robustness, improving learning efficiency, and developing models that can quickly adapt to new information while retaining valuable past knowledge.
