Orthogonal Subspace Learning for Language Model Continual Learning

(arXiv:2310.14152)
Published Oct 22, 2023 in cs.CL and cs.LG

Abstract

Benefiting from massive corpora and advanced hardware, LLMs exhibit remarkable capabilities in language understanding and generation. However, their performance degrades when multiple tasks are learned sequentially, a phenomenon known as catastrophic forgetting. In this paper, we propose orthogonal low-rank adaptation (O-LoRA), a simple and efficient approach for continual learning in language models, effectively mitigating catastrophic forgetting while learning new tasks. Specifically, O-LoRA learns tasks in different (low-rank) vector subspaces that are kept orthogonal to each other in order to minimize interference. Our method induces only marginal additional parameter costs and requires no user data storage for replay. Experimental results on continual learning benchmarks show that our method outperforms state-of-the-art methods. Furthermore, compared to previous approaches, our method excels in preserving the generalization ability of LLMs on unseen tasks.

Overview

  • The paper introduces orthogonal low-rank adaptation (O-LoRA), a method to reduce catastrophic forgetting in language models by training on different tasks using orthogonal vector subspaces.

  • O-LoRA fine-tunes language model parameters for each task in a distinct low-rank subspace, avoiding overlap with previously learned information (a brief code sketch follows this list).

  • Empirical results show that O-LoRA outperforms existing methods on sequential learning benchmarks and maintains model generalization on unseen tasks.

  • The approach is privacy-friendly and parameter-efficient, adapting to new tasks without storing any user data for replay.
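
As a concrete illustration of the bullets above, the following PyTorch sketch shows one way a linear layer could keep its pretrained weight frozen while accumulating one low-rank adapter per task, freezing earlier adapters whenever a new task begins. This is not the authors' implementation; the class and method names (LoRALinearWithTaskAdapters, start_new_task) are hypothetical.

```python
import torch
import torch.nn as nn


class LoRALinearWithTaskAdapters(nn.Module):
    """Frozen base linear layer plus one low-rank (LoRA) adapter per task."""

    def __init__(self, in_features: int, out_features: int, rank: int = 8):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad_(False)  # pretrained weight stays frozen
        self.rank = rank
        self.A_list = nn.ParameterList()  # one (rank x in_features) matrix per task
        self.B_list = nn.ParameterList()  # one (out_features x rank) matrix per task

    def start_new_task(self) -> None:
        """Freeze the adapters of all earlier tasks and add a fresh trainable one."""
        for p in list(self.A_list) + list(self.B_list):
            p.requires_grad_(False)
        self.A_list.append(nn.Parameter(0.01 * torch.randn(self.rank, self.base.in_features)))
        self.B_list.append(nn.Parameter(torch.zeros(self.base.out_features, self.rank)))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Base output plus the sum of all low-rank task updates (B_t A_t) x.
        out = self.base(x)
        for A, B in zip(self.A_list, self.B_list):
            out = out + x @ A.t() @ B.t()
        return out
```

Because only the newest adapter receives gradients, no data from earlier tasks needs to be stored or replayed; prior knowledge lives in the frozen adapters themselves.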

Introduction

The paper addresses a significant problem faced by LLMs: catastrophic forgetting. LLMs are proficient at understanding and generating language; however, when trained sequentially on multiple tasks, they tend to lose knowledge of earlier ones. The proposed orthogonal low-rank adaptation (O-LoRA) method counters this by learning each task within its own low-rank vector subspace, kept orthogonal to the subspaces of previous tasks. The approach adds only a marginal number of parameters, requires no replay of data from prior tasks, and demonstrates promising results on continual learning benchmarks.

Approach

O-LoRA introduces a new way to fine-tune LLMs in which the parameter updates for each task are confined to a separate subspace, avoiding interference with knowledge acquired on other tasks. Building on the hypothesis that learning for a given task occurs within a low-rank subspace, the gradient subspace of each previous task can be captured by its (frozen) LoRA parameters, preserving prior knowledge while new information is absorbed. O-LoRA then constrains the gradient updates for the current task to be orthogonal to the subspaces spanned by the LoRA parameters of previous tasks. This orthogonality constraint is the critical component in preventing catastrophic forgetting and is illustrated in the paper's Figure 1.
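
To make the constraint concrete, below is a minimal Python sketch, not taken from the paper's code: it assumes the orthogonality is enforced as a differentiable penalty on the overlap between the current task's LoRA A matrix and the frozen A matrices of earlier tasks, added to the task loss with a weighting coefficient (here called lambda_orth, an illustrative name).

```python
import torch


def orthogonality_penalty(current_A: torch.Tensor, previous_As) -> torch.Tensor:
    """Sum of squared entries of A_prev @ A_curr^T over all earlier tasks.

    Each A has shape (rank, in_features); the penalty is zero exactly when every
    row of the current adapter is orthogonal to every row of each previous adapter.
    """
    penalty = current_A.new_zeros(())
    for prev_A in previous_As:
        overlap = prev_A.detach() @ current_A.t()  # shape: (rank_prev, rank_curr)
        penalty = penalty + (overlap ** 2).sum()
    return penalty


# Illustrative training step: combine the usual task loss with the weighted penalty.
# loss = task_loss + lambda_orth * orthogonality_penalty(A_current, frozen_previous_As)
```

In this sketch the previous tasks' matrices are detached so that only the current task's adapter is pushed toward the orthogonal subspace, matching the idea that earlier LoRA parameters stay fixed.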

Empirical Results

The empirical evaluation highlights the performance benefits of O-LoRA. On continual learning benchmarks with sequential tasks, the method outperforms existing state-of-the-art approaches. Most notably, O-LoRA not only mitigates forgetting but also preserves the generalization capabilities of LLMs on unseen tasks, indicating that it shapes the model's learning without compromising its ability to handle novel situations. These results address a persistent weakness of prior methods, which tended to suffer either from catastrophic forgetting or from poor generalization to unseen tasks.

Discussion and Conclusion

The presented orthogonal subspace learning approach is elegant, privacy-friendly in that it eliminates the need to store user data, and parameter-efficient, adding only a small number of extra parameters. The paper provides convincing evidence that O-LoRA is effective in practice, as shown by its performance across several benchmarks. The approach is aligned with the requirements of modern AI systems, where model adaptability, data security, and computational efficiency are central. Beyond addressing continual learning by learning new information orthogonally to past knowledge, the paper also examines O-LoRA's scalability to longer task sequences and its behavior across different architectures and sizes of language models, reflecting the depth of the analysis. Notably, the method approaches multitask learning performance without compromising the adaptability and generalization that make LLMs so valuable.
