- The paper introduces Adversarial Feature Alignment (AFA), a novel method using adversarial and MMD-based feature alignment to prevent catastrophic forgetting in incremental lifelong learning.
- AFA employs a two-stream network architecture and aligns features at different levels using a GAN-like adversarial game for low-level visual features and Maximum Mean Discrepancy (MMD) for high-level semantic features.
- Experiments demonstrate that AFA significantly mitigates catastrophic forgetting on old tasks while achieving high accuracy on new tasks, often outperforming state-of-the-art methods like LwF, EWC, SI, and MAS.
The paper introduces a novel activation regularization method, termed Adversarial Feature Alignment (AFA), to mitigate catastrophic forgetting in incremental multi-task image classification scenarios. The core idea is to use intermediate activations of a pre-trained model, which encapsulates knowledge from previous tasks, as soft targets to guide the training process when adapting to new data.
AFA's framework comprises a two-stream model representing the old and new networks. Beyond the cross-entropy loss for the new task and the distillation loss between classification probabilities, the method leverages both low-level visual features and high-level semantic features as soft targets. This multilevel feature alignment is intended to provide more comprehensive supervisory information about the old tasks.
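As a rough illustration of this setup (not the authors' code; the function and variable names are assumptions), the two streams can be built as a frozen copy of the previously trained model plus a trainable copy that is adapted to the new task:

```python
import copy
import torch.nn as nn

def make_two_stream(trained_model: nn.Module):
    """Build the old (frozen) and new (trainable) streams from a model trained
    on the previous tasks. Both streams see the same new-task images; the old
    stream only supplies soft targets."""
    old_model = copy.deepcopy(trained_model)
    for p in old_model.parameters():
        p.requires_grad_(False)               # never updated during new-task training
    old_model.eval()                          # fixed reference network
    new_model = copy.deepcopy(trained_model)  # initialized from old weights, then fine-tuned
    return old_model, new_model
```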
The method aligns convolutional visual features by introducing a trainable discriminator network that plays a GAN-like minimax game with the feature extractors of the old and new models. The discriminator tries to tell apart the latent representations of the old and new networks, where each representation is obtained by applying an activation-based mapping function to the network's convolutional feature maps. The mapping function $F_{att}$ takes a 3D tensor $A \in \mathbb{R}^{C \times H \times W}$ as input and outputs a spatial attention map; it is defined as
$$F_{att}(A) = \sum_{ch=1}^{C} \lvert A_{ch} \rvert^{2}, \quad A_{ch} \in \mathbb{R}^{H \times W},$$

where $A_{ch}$ is the $ch$-th feature map of the activation tensor $A$.
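In code, this mapping is simply a channel-wise sum of squared activations; a minimal sketch (the batched shape handling is an assumption for convenience):

```python
import torch

def f_att(activation: torch.Tensor) -> torch.Tensor:
    """F_att: collapse a (B, C, H, W) activation tensor into a (B, H, W) spatial
    attention map by summing the squared absolute values over the channel axis."""
    return activation.abs().pow(2).sum(dim=1)
```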
The discriminator is optimized via:
$$\mathcal{L}_{adv_D} = \max_{D}\; \mathbb{E}_{z^{*} \sim Z^{*}}\!\left[\log D(z^{*})\right] + \mathbb{E}_{z \sim Z}\!\left[\log\left(1 - D(z)\right)\right]$$
where $Z^{*}$ and $Z$ are the latent representations from the old and new feature extractors, respectively.
The feature extractor $F$ is updated by playing a minimax game with the discriminator $D$ via

$$\mathcal{L}_{adv_F} = \min_{F}\; -\mathbb{E}_{z \sim Z}\!\left[\log D(z)\right].$$
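A hedged sketch of the two adversarial objectives, written as binary cross-entropy losses to be minimized (equivalent to the max/min formulations above). Here `discriminator` is any small trainable network producing one logit per sample, and `z_old` / `z_new` are the attention-map representations of the old and new feature extractors; these names and shapes are illustrative assumptions rather than the paper's implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def discriminator_loss(discriminator: nn.Module,
                       z_old: torch.Tensor,
                       z_new: torch.Tensor) -> torch.Tensor:
    """D step: maximize E[log D(z*)] + E[log(1 - D(z))], implemented as
    minimizing the corresponding binary cross-entropy."""
    logit_old = discriminator(z_old.detach())   # "real" samples from the old network
    logit_new = discriminator(z_new.detach())   # "fake" samples from the new network
    loss_real = F.binary_cross_entropy_with_logits(logit_old, torch.ones_like(logit_old))
    loss_fake = F.binary_cross_entropy_with_logits(logit_new, torch.zeros_like(logit_new))
    return loss_real + loss_fake

def feature_extractor_adv_loss(discriminator: nn.Module,
                               z_new: torch.Tensor) -> torch.Tensor:
    """F step: minimize -E[log D(z)], i.e. push the new network's attention
    maps to be indistinguishable from the old network's."""
    logit_new = discriminator(z_new)            # gradients flow into the new extractor
    return F.binary_cross_entropy_with_logits(logit_new, torch.ones_like(logit_new))
```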
The paper aligns high-level semantic features using Maximum Mean Discrepancy (MMD). MMD is expressed as the distance between the means of two data distributions $P$ and $Q$ after mapping into a reproducing kernel Hilbert space (RKHS):

$$\mathrm{MMD}^{2}(P, Q) = \left\lVert \mathbb{E}_{p \sim P}[\phi(p)] - \mathbb{E}_{q \sim Q}[\phi(q)] \right\rVert^{2}$$

where $\phi(\cdot)$ denotes the mapping into the RKHS.
An unbiased estimator of $\mathrm{MMD}^{2}$ is given by:

$$\mathcal{L}_{mmd}(P, Q) = \mathbb{E}_{p, p' \sim P,\; q, q' \sim Q}\left[k(p, p') + k(q, q') - 2\,k(p, q)\right]$$

where $k(p, q) = \langle \phi(p), \phi(q) \rangle$ is the kernel function and $p'$, $q'$ are independent samples drawn from $P$ and $Q$.
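For concreteness, a small PyTorch sketch of this estimator over two batches of fully connected features, using a Gaussian RBF kernel (the kernel choice and bandwidth are assumptions, not details taken from the paper):

```python
import torch

def mmd2_unbiased(p: torch.Tensor, q: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    """Unbiased empirical MMD^2 between feature batches p (old network) and
    q (new network), each of shape (batch, dim), with an RBF kernel."""
    def rbf(x, y):
        dist2 = torch.cdist(x, y).pow(2)               # pairwise squared distances
        return torch.exp(-dist2 / (2.0 * sigma ** 2))  # Gaussian kernel matrix
    k_pp, k_qq, k_pq = rbf(p, p), rbf(q, q), rbf(p, q)
    n, m = p.size(0), q.size(0)
    # drop diagonal entries so E[k(p, p')] is estimated from distinct samples
    term_pp = (k_pp.sum() - k_pp.diagonal().sum()) / (n * (n - 1))
    term_qq = (k_qq.sum() - k_qq.diagonal().sum()) / (m * (m - 1))
    return term_pp + term_qq - 2.0 * k_pq.mean()
```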
The overall loss function is a weighted sum:
$$\mathcal{L} = \mathcal{L}_{cls} + \lambda_{1}\,\mathcal{L}_{dist} + \lambda_{2}\,\mathcal{L}_{adv_F} + \lambda_{3}\,\mathcal{L}_{fc}$$

where $\mathcal{L}_{cls}$ is the cross-entropy loss, $\mathcal{L}_{dist}$ is the distillation loss, $\mathcal{L}_{adv_F}$ is the adversarial loss for feature alignment, and $\mathcal{L}_{fc}$ is the MMD loss for high-level feature alignment.
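Putting the pieces together, one training step's objective might be assembled as below, reusing the helper functions sketched earlier; the distillation formulation (temperature-scaled KL divergence) and the default weights are assumptions rather than values reported in the paper. The discriminator itself is updated in a separate step with `discriminator_loss`.

```python
import torch.nn.functional as F

def afa_total_loss(new_task_logits, labels,
                   old_head_logits_new, old_head_logits_old,
                   z_new, fc_new, fc_old,
                   discriminator,
                   lambdas=(1.0, 1.0, 1.0), temperature=2.0):
    """Weighted sum of the four AFA loss terms for one batch."""
    l_cls = F.cross_entropy(new_task_logits, labels)
    # distillation on the old-task outputs of the two streams (temperature-scaled KL)
    l_dist = F.kl_div(F.log_softmax(old_head_logits_new / temperature, dim=1),
                      F.softmax(old_head_logits_old / temperature, dim=1),
                      reduction="batchmean") * temperature ** 2
    l_adv_f = feature_extractor_adv_loss(discriminator, z_new)   # low-level alignment
    l_fc = mmd2_unbiased(fc_old, fc_new)                         # high-level alignment
    lam1, lam2, lam3 = lambdas
    return l_cls + lam1 * l_dist + lam2 * l_adv_f + lam3 * l_fc
```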
The paper details experiments in incremental task scenarios, including two-task settings (starting from the ImageNet or Oxford Flowers dataset) and a five-task setting (Scenes, Birds, Flowers, Aircraft, and Cars datasets), comparing AFA to joint training, fine-tuning, Learning without Forgetting (LwF), Encoder-Based Lifelong Learning (EBLL), Elastic Weight Consolidation (EWC), Synaptic Intelligence (SI), and Memory Aware Synapses (MAS).
Key findings include:
- AFA generally suffers the least performance drop on old tasks while achieving high accuracy on new tasks.
- AFA and LwF outperform joint training when the new tasks have much smaller datasets than the initial dataset (ImageNet), likely because the regularization toward the old model helps prevent overfitting on the small new datasets.
- Parameter regularization strategies (EWC, SI, MAS) can struggle when tasks have different output domains or start from small datasets.
Ablation studies demonstrate the individual contributions of adversarial attention alignment and MMD-based high-level feature alignment. The paper also explores alternative constraints for visual and fully connected features, such as L2 regularization, and provides implementation details, including network architecture, training parameters, and hyperparameter selection.