Remembering Transformer for Continual Learning (2404.07518v3)
Abstract: Neural networks face Catastrophic Forgetting (CF) in continual learning, where learning a new task interferes with previously acquired knowledge. Existing data fine-tuning and regularization methods require task identity information at inference time and cannot eliminate interference among tasks, while soft parameter sharing approaches suffer from ever-growing parameter counts. To tackle these challenges, we propose the Remembering Transformer, inspired by the brain's Complementary Learning Systems (CLS). The Remembering Transformer equips a pretrained Transformer with a mixture-of-adapters architecture and a generative model-based novelty detection mechanism to alleviate CF, dynamically routing task data to the most relevant adapter and improving parameter efficiency through knowledge distillation. We conducted extensive experiments, including ablation studies on the novelty detection mechanism and the capacity of the mixture-of-adapters, across a broad range of class-incremental split tasks and permutation tasks. Our approach achieved state-of-the-art (SOTA) performance, surpassing the second-best method by 15.90% on the split tasks and reducing the memory footprint from 11.18M to 0.22M parameters on the five-split CIFAR-10 task.
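To make the routing idea concrete, below is a minimal PyTorch sketch of how a mixture of low-rank adapters over a frozen pretrained layer might be paired with per-task generative novelty detectors. The detectors are approximated here by small autoencoders whose reconstruction error selects the adapter; the class names, dimensions, and the autoencoder-based detector are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch (assumptions: LoRA-style adapters, autoencoder novelty detectors).
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Low-rank (LoRA-style) adapter added on top of a frozen layer."""
    def __init__(self, dim, rank=4):
        super().__init__()
        self.down = nn.Linear(dim, rank, bias=False)
        self.up = nn.Linear(rank, dim, bias=False)
        nn.init.zeros_(self.up.weight)  # start as a no-op update

    def forward(self, x):
        return self.up(self.down(x))

class NoveltyDetector(nn.Module):
    """Small autoencoder; low reconstruction error means the data looks familiar."""
    def __init__(self, dim, hidden=32):
        super().__init__()
        self.enc = nn.Linear(dim, hidden)
        self.dec = nn.Linear(hidden, dim)

    def error(self, x):
        recon = self.dec(torch.relu(self.enc(x)))
        return ((recon - x) ** 2).mean()  # mean reconstruction error over the batch

class MixtureOfAdapters(nn.Module):
    """One (adapter, detector) pair per task; route inputs to the best match."""
    def __init__(self, dim):
        super().__init__()
        self.dim = dim
        self.adapters = nn.ModuleList()
        self.detectors = nn.ModuleList()

    def add_task(self, rank=4):
        self.adapters.append(Adapter(self.dim, rank))
        self.detectors.append(NoveltyDetector(self.dim))

    def forward(self, x, frozen_layer):
        # Route the batch to the adapter whose detector reconstructs it best.
        errors = torch.stack([d.error(x) for d in self.detectors])
        idx = int(errors.argmin())
        return frozen_layer(x) + self.adapters[idx](x), idx

# Illustrative usage with a stand-in for a frozen pretrained Transformer layer.
moe = MixtureOfAdapters(dim=768)
moe.add_task()                        # register one adapter/detector for a new task
frozen = nn.Linear(768, 768).requires_grad_(False)
out, chosen = moe(torch.randn(8, 768), frozen)
```

In the full method, the detectors are trained per task and knowledge distillation keeps the number of adapters small; the sketch above only illustrates the novelty-based routing step.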
Authors: Yuwei Sun, Jun Sakuma, Ryota Kanai, Ippei Fujisawa, Arthur Juliani