Remembering Transformer for Continual Learning

(2404.07518)
Published Apr 11, 2024 in cs.LG and cs.CV

Abstract

Neural networks encounter the challenge of Catastrophic Forgetting (CF) in continual learning, where new task learning interferes with previously learned knowledge. Existing data fine-tuning and regularization methods necessitate task identity information during inference and cannot eliminate interference among different tasks, while soft parameter sharing approaches encounter the problem of an increasing model parameter size. To tackle these challenges, we propose the Remembering Transformer, inspired by the brain's Complementary Learning Systems (CLS). Remembering Transformer employs a mixture-of-adapters architecture and a generative model-based novelty detection mechanism in a pretrained Transformer to alleviate CF. Remembering Transformer dynamically routes task data to the most relevant adapter with enhanced parameter efficiency based on knowledge distillation. We conducted extensive experiments, including ablation studies on the novelty detection mechanism and model capacity of the mixture-of-adapters, in a broad range of class-incremental split tasks and permutation tasks. Our approach demonstrated SOTA performance surpassing the second-best method by 15.90% in the split tasks, reducing the memory footprint from 11.18M to 0.22M in the five-split CIFAR10 task.

The Remembering Transformer combines low-rank adapter modules and sparse adapter activation with generative novelty detection for more efficient memory handling.

Overview

  • Addresses catastrophic forgetting (CF) in neural networks by introducing a novel Remembering Transformer architecture, leveraging a mixture-of-adapters framework and generative model-based routing to preserve task-specific knowledge.

  • Empirical evaluation on CIFAR10 and CIFAR100 datasets demonstrates the model's superior performance in class-incremental learning tasks, setting new benchmarks and achieving significant accuracy improvements with efficient parameter usage.

  • Highlights the importance of emulating biological learning systems to enhance artificial neural network architectures for continual learning, reducing computational and memory requirements.

  • Opens new research avenues in neural network design and continual learning strategies, suggesting future exploration into the scalability of the approach and its application across diverse tasks and network types.

Addressing Catastrophic Forgetting in Neural Networks with Remembering Transformer

Introduction to Catastrophic Forgetting in Continual Learning

Catastrophic Forgetting (CF) has been a critical barrier to neural networks achieving genuine continual learning. The phenomenon occurs when a network fails to preserve previously learned knowledge while acquiring new information, a limitation that contrasts sharply with the versatility of biological neural networks. Traditional remedies such as fine-tuning with memory replay offer partial relief, yet fall short in scalability and efficacy as tasks grow in number and complexity.

The Remembering Transformer Architecture

Inspired by the brain's Complementary Learning Systems (CLS) theory, the Remembering Transformer introduces an architecture that couples a mixture-of-adapters framework with generative model-based routing. Incoming task data are dynamically routed to the most relevant adapter, which substantially mitigates CF. The design rests on the premise that isolating learning paths for different tasks limits interference, preserving previously acquired knowledge without compromising learning on new tasks.
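
To make the routing idea concrete, the following minimal PyTorch sketch assumes the per-task generative models are small autoencoders scored by mean squared reconstruction error, with a hypothetical novelty_threshold that triggers allocation of a new adapter. The names and the exact detection criterion are illustrative assumptions, not the paper's reported design.

import torch
import torch.nn as nn

class TaskAutoencoder(nn.Module):
    """Small autoencoder that models the embedding distribution of one task."""
    def __init__(self, dim: int, bottleneck: int = 16):
        super().__init__()
        self.encoder = nn.Linear(dim, bottleneck)
        self.decoder = nn.Linear(bottleneck, dim)

    def reconstruction_error(self, x: torch.Tensor) -> torch.Tensor:
        # Mean squared reconstruction error over a batch of embeddings.
        recon = self.decoder(torch.relu(self.encoder(x)))
        return ((recon - x) ** 2).mean()

def route_to_adapter(x, autoencoders, novelty_threshold):
    """Pick the adapter whose generative model best explains x; return -1
    when every model reconstructs x poorly, signalling a novel task."""
    errors = torch.stack([ae.reconstruction_error(x) for ae in autoencoders])
    best = int(torch.argmin(errors))
    return -1 if errors[best] > novelty_threshold else best

In this sketch, a high minimum reconstruction error is read as evidence that none of the existing generative models explains the input, which is the cue to grow the mixture with a fresh adapter and autoencoder pair; no task identity is needed at inference time.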

Key Components and Innovations

  • Mixture-of-Adapters: Building on a pretrained Vision Transformer (ViT), the system attaches a set of trainable adapters, each dedicated to a subset of the tasks. Low-rank adaptation (LoRA) keeps each adapter's fine-tuning parameter-efficient and scalable; a minimal sketch follows this list.
  • Generative Model-based Routing: At the core of the adaptive learning process lies a collection of generative models, each encoding the data distribution of a task. This setup enables active routing of data to the most relevant adapter, ensuring efficient knowledge segregation and retrieval without explicit task identification during inference.
  • Efficient Task Learning in Real-world Scenarios: Evaluation on class-incremental learning without task identity information and under parameter size constraints showcases Remembering Transformer's practical applicability and robustness.
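
The sketch below, in the same PyTorch style, shows how a single low-rank adapter can wrap a frozen linear projection of the pretrained ViT. The rank, alpha scaling, and initialization are illustrative defaults rather than the paper's reported settings; in the full model, one adapter of this kind exists per task and the router decides which is active.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen pretrained linear layer plus a trainable low-rank update,
    so each wrapped layer adds only rank * (in_features + out_features)
    trainable weights per task."""
    def __init__(self, pretrained: nn.Linear, rank: int = 4, alpha: float = 8.0):
        super().__init__()
        self.base = pretrained
        for p in self.base.parameters():
            p.requires_grad = False  # the ViT backbone stays frozen
        in_f, out_f = pretrained.in_features, pretrained.out_features
        self.lora_a = nn.Parameter(torch.randn(rank, in_f) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(out_f, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen projection plus the low-rank correction learned for this task.
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)

Because lora_b starts at zero, the adapter initially leaves the frozen projection unchanged, and only the two small matrices are updated during continual training, which is where the parameter savings come from.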

Empirical Evaluation and Results

Experiments on CIFAR10 and CIFAR100, partitioned into distinct class-incremental learning tasks, confirm that the model substantially alleviates CF. The Remembering Transformer establishes new state-of-the-art (SOTA) results across these setups while remaining parameter-efficient, improving average accuracy by 15.90% over the second-best method and maintaining a markedly smaller memory footprint.

Practical Implications and Theoretical Significance

This study accentuates the potential of emulating biological systems' learning mechanisms to refine artificial neural network architectures. By effectively addressing CF, the Remembering Transformer broadens the horizons for deploying neural networks in dynamic, multi-task environments without the trade-offs previously associated with continual learning. The methodology presents a scalable framework that adapts to task intricacies with lower computational and memory requirements, a critical consideration for real-world applications.

Speculations on Future Directions

The introduction of generative model-based routing in the Remembering Transformer reveals an exciting avenue for research in neural network design and continual learning strategies. Future developments could explore the scalability of this approach across more diverse and complex task sets, delve into the optimization of adapter and model capacities, and test the applicability of similar principles in other types of neural network architectures beyond vision-based models.

Conclusion

The Remembering Transformer offers a significant leap forward in addressing the longstanding challenge of Catastrophic Forgetting in neural networks. By leveraging the complementary strengths of mixture-of-adapters and generative model-based routing, it achieves remarkable performance and efficiency in continual learning tasks, paving the way for more adaptive and resilient artificial intelligence systems.
