Remembering Transformer for Continual Learning (2404.07518v3)

Published 11 Apr 2024 in cs.LG and cs.CV

Abstract: Neural networks encounter the challenge of Catastrophic Forgetting (CF) in continual learning, where new task learning interferes with previously learned knowledge. Existing data fine-tuning and regularization methods necessitate task identity information during inference and cannot eliminate interference among different tasks, while soft parameter sharing approaches encounter the problem of an increasing model parameter size. To tackle these challenges, we propose the Remembering Transformer, inspired by the brain's Complementary Learning Systems (CLS). Remembering Transformer employs a mixture-of-adapters architecture and a generative model-based novelty detection mechanism in a pretrained Transformer to alleviate CF. Remembering Transformer dynamically routes task data to the most relevant adapter with enhanced parameter efficiency based on knowledge distillation. We conducted extensive experiments, including ablation studies on the novelty detection mechanism and model capacity of the mixture-of-adapters, in a broad range of class-incremental split tasks and permutation tasks. Our approach demonstrated SOTA performance surpassing the second-best method by 15.90% in the split tasks, reducing the memory footprint from 11.18M to 0.22M in the five splits CIFAR10 task.

Authors (5)
  1. Yuwei Sun
  2. Jun Sakuma
  3. Ryota Kanai
  4. Ippei Fujisawa
  5. Arthur Juliani

Summary

  • The paper introduces a novel model that integrates a mixture-of-adapters with generative model-based routing to mitigate catastrophic forgetting.
  • It employs low-rank adaptation (LoRA) for parameter-efficient fine-tuning and improves average accuracy by 15.90% over the second-best method on class-incremental split tasks.
  • The approach scales with the number of tasks while reducing the memory footprint from 11.18M to 0.22M parameters on the five-split CIFAR-10 task, paving the way for more resilient continual learning systems.

Addressing Catastrophic Forgetting in Neural Networks with Remembering Transformer

Introduction to Catastrophic Forgetting in Continual Learning

Catastrophic Forgetting (CF) has been a critical barrier to achieving genuine continual learning in neural networks. The phenomenon occurs when a network overwrites previously learned knowledge while acquiring new information, a limitation that contrasts sharply with the flexibility of biological neural networks. Traditional remedies such as fine-tuning with memory replay offer only partial relief, and their cost and efficacy scale poorly as the number and complexity of tasks grow.

The Remembering Transformer Architecture

Inspired by the brain's Complementary Learning Systems theory, the Remembering Transformer introduces an architecture that combines a mixture-of-adapters framework with generative model-based routing. This setup dynamically distributes task-specific data across the relevant adapters, significantly mitigating CF. It operates on the premise that routing different tasks through isolated learning paths limits interference, preserving previously acquired, task-specific knowledge while new tasks are learned.
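One way to make the routing rule concrete is the following sketch, under the assumption that each adapter k is paired with a small generative model, for instance an autoencoder with encoder E_k and decoder D_k trained only on that task's data; each input is then sent to the adapter whose model reconstructs it best:

```latex
k^{*}(x) \;=\; \operatorname*{arg\,min}_{k \in \{1, \dots, K\}} \; \mathcal{L}_k(x),
\qquad
\mathcal{L}_k(x) \;=\; \bigl\lVert\, x - D_k\bigl(E_k(x)\bigr) \,\bigr\rVert_2^{2}
```

Uniformly high reconstruction error across all existing models can then be read as novelty, triggering the allocation of a new adapter, which is the role of the novelty detection mechanism described in the abstract.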

Key Components and Innovations

  • Mixture-of-Adapters: Building upon the Vision Transformer (ViT) model, the proposed system incorporates a series of trainable adapters, each tasked with a subset of the learning objectives. The low-rank adaptation (LoRA) methodology facilitates efficient fine-tuning and scalability.
  • Generative Model-based Routing: At the core of the adaptive learning process lies a collection of generative models, each encoding the data distribution of one task. This enables routing of data to the most relevant adapter, ensuring efficient knowledge segregation and retrieval without explicit task identification during inference; see the sketch after this list.
  • Efficient Task Learning in Real-world Scenarios: Evaluation on class-incremental learning without task identity information and under parameter size constraints showcases Remembering Transformer's practical applicability and robustness.
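
To make the first two components above concrete, here is a minimal PyTorch sketch, not the authors' implementation: a LoRA-style adapter wrapped around a frozen linear layer, and a small per-task autoencoder used to route each sample to the adapter with the lowest reconstruction error. Class names, the rank, and hidden sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn


class LoRAAdapter(nn.Module):
    """Adds a trainable low-rank update W + (alpha/r) * B A around a frozen linear layer."""

    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # pretrained weights stay frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)


class TaskAutoencoder(nn.Module):
    """Tiny autoencoder trained on one task's data; used only for routing."""

    def __init__(self, dim: int, hidden: int = 64):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU())
        self.dec = nn.Linear(hidden, dim)

    def reconstruction_error(self, x):
        x = x.flatten(1)                                 # (batch, dim)
        return ((x - self.dec(self.enc(x))) ** 2).mean(dim=1)


def route(x, autoencoders):
    """Return, per sample, the index of the adapter whose autoencoder reconstructs x best."""
    errors = torch.stack([ae.reconstruction_error(x) for ae in autoencoders], dim=1)
    return errors.argmin(dim=1)                          # shape: (batch,)
```

At inference time, `route` selects an adapter index for each sample, which is how the model avoids needing task identity; the knowledge-distillation step the paper uses to keep the adapter collection parameter-efficient is omitted from this sketch.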

Empirical Evaluation and Results

Experiments on CIFAR-10 and CIFAR-100, partitioned into distinct class-incremental learning tasks, confirm that the model substantially alleviates CF. The Remembering Transformer establishes new state-of-the-art (SOTA) results across these setups with notable parameter efficiency: it improves average accuracy by 15.90% over the second-best method in the split tasks and reduces the memory footprint from 11.18M to 0.22M parameters on the five-split CIFAR-10 task. A rough sketch of how such class-incremental splits can be constructed is shown below.
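For reference, a class-incremental split of CIFAR-10 into five two-class tasks (the five-split setting reported above) can be built roughly as follows with torchvision; the exact class ordering and preprocessing used in the paper may differ.

```python
import torch
from torchvision import datasets, transforms

transform = transforms.ToTensor()
train_set = datasets.CIFAR10(root="./data", train=True, download=True,
                             transform=transform)

targets = torch.tensor(train_set.targets)
splits = [(2 * t, 2 * t + 1) for t in range(5)]   # task t sees classes (2t, 2t+1)

task_subsets = []
for classes in splits:
    mask = (targets == classes[0]) | (targets == classes[1])
    indices = mask.nonzero(as_tuple=True)[0].tolist()
    task_subsets.append(torch.utils.data.Subset(train_set, indices))

# The five subsets are then presented sequentially; data from earlier tasks
# is not revisited, which is what makes the setting prone to forgetting.
```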

The Practical Implications and Theoretical Significance

This paper underscores the potential of emulating the learning mechanisms of biological systems to refine artificial neural network architectures. By effectively addressing CF, the Remembering Transformer broadens the scope for deploying neural networks in dynamic, multi-task environments without the trade-offs previously associated with continual learning. The methodology presents a scalable framework that adapts to task intricacies with lower computational and memory requirements, a critical consideration for real-world applications.

Speculations on Future Directions

The introduction of generative model-based routing in the Remembering Transformer reveals an exciting avenue for research in neural network design and continual learning strategies. Future developments could explore the scalability of this approach across more diverse and complex task sets, delve into the optimization of adapter and model capacities, and test the applicability of similar principles in other types of neural network architectures beyond vision-based models.

Conclusion

The Remembering Transformer offers a significant leap forward in addressing the longstanding challenge of Catastrophic Forgetting in neural networks. By leveraging the complementary strengths of mixture-of-adapters and generative model-based routing, it achieves remarkable performance and efficiency in continual learning tasks, paving the way for more adaptive and resilient artificial intelligence systems.