Orthogonal Subspace Learning for Language Model Continual Learning

(arXiv:2310.14152)
Published Oct 22, 2023 in cs.CL and cs.LG

Abstract

Benefiting from massive corpora and advanced hardware, LLMs exhibit remarkable capabilities in language understanding and generation. However, their performance degrades when multiple tasks are learned sequentially, a phenomenon known as catastrophic forgetting. In this paper, we propose orthogonal low-rank adaptation (O-LoRA), a simple and efficient approach for continual learning in language models, effectively mitigating catastrophic forgetting while learning new tasks. Specifically, O-LoRA learns tasks in different (low-rank) vector subspaces that are kept orthogonal to each other in order to minimize interference. Our method induces only marginal additional parameter costs and requires no user data storage for replay. Experimental results on continual learning benchmarks show that our method outperforms state-of-the-art methods. Furthermore, compared to previous approaches, our method excels in preserving the generalization ability of LLMs on unseen tasks.

Overview

  • The paper introduces orthogonal low-rank adaptation (O-LoRA), a method to reduce catastrophic forgetting in language models by training on different tasks using orthogonal vector subspaces.

  • O-LoRA fine-tunes language model parameters for each task in a distinct low-rank subspace, avoiding overlap with previously learned information (a brief code sketch follows this list).

  • Empirical results show that O-LoRA outperforms existing methods on sequential learning benchmarks and maintains model generalization on unseen tasks.

  • The approach is privacy-friendly and parameter-efficient, adapting to new tasks without storing any user data for replay.
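
As a concrete illustration of the bullets above, the following PyTorch sketch shows one way a linear layer could keep its pretrained weight frozen while accumulating one low-rank adapter per task, freezing earlier adapters whenever a new task begins. This is not the authors' implementation; the class and method names (LoRALinearWithTaskAdapters, start_new_task) are hypothetical.

```python
import torch
import torch.nn as nn


class LoRALinearWithTaskAdapters(nn.Module):
    """Frozen base linear layer plus one low-rank (LoRA) adapter per task."""

    def __init__(self, in_features: int, out_features: int, rank: int = 8):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad_(False)  # pretrained weight stays frozen
        self.rank = rank
        self.A_list = nn.ParameterList()  # one (rank x in_features) matrix per task
        self.B_list = nn.ParameterList()  # one (out_features x rank) matrix per task

    def start_new_task(self) -> None:
        """Freeze the adapters of all earlier tasks and add a fresh trainable one."""
        for p in list(self.A_list) + list(self.B_list):
            p.requires_grad_(False)
        self.A_list.append(nn.Parameter(0.01 * torch.randn(self.rank, self.base.in_features)))
        self.B_list.append(nn.Parameter(torch.zeros(self.base.out_features, self.rank)))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Base output plus the sum of all low-rank task updates (B_t A_t) x.
        out = self.base(x)
        for A, B in zip(self.A_list, self.B_list):
            out = out + x @ A.t() @ B.t()
        return out
```

Because only the newest adapter receives gradients, no data from earlier tasks needs to be stored or replayed; prior knowledge lives in the frozen adapters themselves.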

Introduction

The paper addresses a significant problem faced by LLMs: catastrophic forgetting. LLMs are proficient at understanding and generating language; however, when trained sequentially on multiple tasks, they tend to lose knowledge of earlier ones. The proposed orthogonal low-rank adaptation (O-LoRA) method counters this by learning each task within its own low-rank vector subspace, kept orthogonal to the subspaces of previous tasks. The approach adds only a marginal number of parameters, requires no replay of data from prior tasks, and demonstrates promising results on continual learning benchmarks.

Approach

O-LoRA introduces a new way to fine-tune LLMs in which the parameter updates for each task are confined to a separate subspace, avoiding interference with knowledge acquired on other tasks. Building on the hypothesis that learning for a given task occurs within a low-rank subspace, the gradient subspace of each previous task can be captured by its (frozen) LoRA parameters, preserving prior knowledge while new information is absorbed. O-LoRA then constrains the gradient updates for the current task to be orthogonal to the subspaces spanned by the LoRA parameters of previous tasks. This orthogonality constraint is the critical component in preventing catastrophic forgetting and is illustrated in the paper's Figure 1.
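
To make the constraint concrete, below is a minimal Python sketch, not taken from the paper's code: it assumes the orthogonality is enforced as a differentiable penalty on the overlap between the current task's LoRA A matrix and the frozen A matrices of earlier tasks, added to the task loss with a weighting coefficient (here called lambda_orth, an illustrative name).

```python
import torch


def orthogonality_penalty(current_A: torch.Tensor, previous_As) -> torch.Tensor:
    """Sum of squared entries of A_prev @ A_curr^T over all earlier tasks.

    Each A has shape (rank, in_features); the penalty is zero exactly when every
    row of the current adapter is orthogonal to every row of each previous adapter.
    """
    penalty = current_A.new_zeros(())
    for prev_A in previous_As:
        overlap = prev_A.detach() @ current_A.t()  # shape: (rank_prev, rank_curr)
        penalty = penalty + (overlap ** 2).sum()
    return penalty


# Illustrative training step: combine the usual task loss with the weighted penalty.
# loss = task_loss + lambda_orth * orthogonality_penalty(A_current, frozen_previous_As)
```

In this sketch the previous tasks' matrices are detached so that only the current task's adapter is pushed toward the orthogonal subspace, matching the idea that earlier LoRA parameters stay fixed.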

Empirical Results

The empirical evaluation highlights the performance benefits of O-LoRA. On continual learning benchmarks with sequential tasks, the method outperforms existing state-of-the-art approaches. Most notably, O-LoRA not only mitigates forgetting but also preserves the generalization capabilities of LLMs on unseen tasks, indicating that it shapes the model's learning without compromising its ability to handle novel situations. These results address a persistent weakness of prior methods, which tended to suffer either from catastrophic forgetting or from poor generalization to unseen tasks.

Discussion and Conclusion

The presented orthogonal subspace learning approach is elegant, privacy-friendly in that it eliminates the need to store user data, and parameter-efficient, adding only a small number of extra parameters. The paper provides convincing evidence that O-LoRA is effective in practice, as shown by its performance across several benchmarks. The approach is aligned with the requirements of modern AI systems, where model adaptability, data security, and computational efficiency are central. Beyond addressing continual learning by learning new information orthogonally to past knowledge, the paper also examines O-LoRA's scalability to longer task sequences and its behavior across different architectures and sizes of language models, reflecting the depth of the analysis. Notably, the method approaches multitask learning performance without compromising the adaptability and generalization that make LLMs so valuable.
