Orthogonal Gradient Descent for Continual Learning

Published 15 Oct 2019 in cs.LG and stat.ML | (1910.07104v1)

Abstract: Neural networks are achieving state of the art and sometimes super-human performance on learning tasks across a variety of domains. Whenever these problems require learning in a continual or sequential manner, however, neural networks suffer from the problem of catastrophic forgetting; they forget how to solve previous tasks after being trained on a new task, despite having the essential capacity to solve both tasks if they were trained on both simultaneously. In this paper, we propose to address this issue from a parameter space perspective and study an approach to restrict the direction of the gradient updates to avoid forgetting previously-learned data. We present the Orthogonal Gradient Descent (OGD) method, which accomplishes this goal by projecting the gradients from new tasks onto a subspace in which the neural network output on previous task does not change and the projected gradient is still in a useful direction for learning the new task. Our approach utilizes the high capacity of a neural network more efficiently and does not require storing the previously learned data that might raise privacy concerns. Experiments on common benchmarks reveal the effectiveness of the proposed OGD method.

Abstract PDF Upgrade to Chat

Authors (4)

Citations (307)

View on Semantic Scholar

Summary

The paper introduces the Orthogonal Gradient Descent method that projects new task gradients onto a subspace orthogonal to previous ones to preserve past learning.
It details variants such as OGD-ALL, OGD-AVE, and OGD-GTL, with OGD-GTL showing superior efficiency by focusing on ground truth logit gradients.
Experimental results on Permuted Mnist, Rotated Mnist, and Split Mnist benchmarks validate OGD’s effectiveness in reducing catastrophic forgetting compared to state-of-the-art methods.

Orthogonal Gradient Descent for Continual Learning

The paper "Orthogonal Gradient Descent for Continual Learning" presents a methodological advancement aimed at addressing the problem of catastrophic forgetting in neural networks. This phenomenon is prevalent when neural networks are exposed to sequential tasks: upon learning a new task, previously acquired information tends to be overwritten, leading to a degradation in performance on former tasks.

Methodological Insights

The primary contribution of this research is the Orthogonal Gradient Descent (OGD) method, which alters the conventional gradient descent procedure to mitigate forgetting. The approach capitalizes on the high-dimensional nature of neural network parameters. It projects the gradient updates for new tasks onto a subspace orthogonal to the gradient directions from previous tasks. This projection ensures minimal interference with prior learning, thereby preserving past knowledge while still accommodating new information.

The authors delineate several variants of the OGD mechanism, notably OGD-ALL, OGD-AVE, and OGD-GTL. These variants differ in their strategy for selecting the gradient dimensions to preserve. Empirical validation suggests that while each variant provides benefits, the OGD-GTL often outperforms the others due to its focus on the gradients of only the ground truth logit, resulting in a more efficient use of memory resources.

Empirical Validation

Experiments were conducted on three well-recognized benchmarks for continual learning: Permuted Mnist, Rotated Mnist, and Split Mnist. The OGD method demonstrated competitive performance against state-of-the-art techniques such as EWC and A-GEM. Notably, in scenarios involving multiple tasks, the OGD method retained a superior ability to remember previous tasks compared to its counterparts, reinforcing its effectiveness as a continual learning tool.

The experiments further explored the robustness of the OGD method under various configurations, such as changes in the learning rate, the number of training epochs, and storage capacity for gradients. The findings confirmed the method's adaptability and resilience across different empirical settings.

Theoretical Implications and Future Directions

Theoretically, the paper asserts the value of retaining gradient orthogonality to maintain invariant representations of prior tasks, opening up pathways for more sophisticated memory-efficient continual learning algorithms. The discussion in the paper suggests potential expansions of the method, such as advanced memory management strategies or the application of higher-order derivatives for preservation in more complex scenarios.

Future work could explore integration with different types of learning architectures and application domains beyond standard image classification tasks. Additionally, the implications of OGD extend to settings where observation over time is continuous, and discrete task boundaries are not apparent, making it a promising solution for real-world applications where continuous adaptation is essential.

Conclusion

Orthogonal Gradient Descent provides an innovative approach to the enduring challenge of catastrophic forgetting in continual learning. By efficiently utilizing neural network capacity and storage for gradient information, OGD demonstrates substantial promise, backed by thorough experimental support. This work lays a foundation for continued exploration in preserving learned knowledge within neural networks, promoting robust and adaptable machine learning systems.

Markdown Report Issue