Gradient Surgery for Multi-Task Learning

Published 19 Jan 2020 in cs.LG, cs.CV, cs.RO, and stat.ML | (2001.06782v4)

Abstract: While deep learning and deep reinforcement learning (RL) systems have demonstrated impressive results in domains such as image classification, game playing, and robotic control, data efficiency remains a major challenge. Multi-task learning has emerged as a promising approach for sharing structure across multiple tasks to enable more efficient learning. However, the multi-task setting presents a number of optimization challenges, making it difficult to realize large efficiency gains compared to learning tasks independently. The reasons why multi-task learning is so challenging compared to single-task learning are not fully understood. In this work, we identify a set of three conditions of the multi-task optimization landscape that cause detrimental gradient interference, and develop a simple yet general approach for avoiding such interference between task gradients. We propose a form of gradient surgery that projects a task's gradient onto the normal plane of the gradient of any other task that has a conflicting gradient. On a series of challenging multi-task supervised and multi-task RL problems, this approach leads to substantial gains in efficiency and performance. Further, it is model-agnostic and can be combined with previously-proposed multi-task architectures for enhanced performance.


Authors (6)

Citations (1,044)


Summary

The paper introduces PCGrad, a method that projects conflicting gradients to mitigate interference in multi-task learning.
It provides a convergence guarantee under convexity assumptions, along with sufficient conditions under which a PCGrad update reaches a lower loss than a standard gradient descent update.
Empirical results on benchmarks like CIFAR-100 and Meta-World show significant accuracy and success rate improvements.


Gradient Surgery for Multi-Task Learning presents Projecting Conflicting Gradients (PCGrad), a method for addressing optimization challenges inherent in multi-task learning (MTL). Rather than changing the model architecture, PCGrad directly modifies task gradients to reduce negative interactions between tasks, improving multi-task performance in both supervised learning and reinforcement learning (RL).

Background and Motivation

Deep learning and deep RL have demonstrated remarkable success on tasks such as image classification and robotic control. However, learning efficiency often drops when a single model is trained on multiple tasks simultaneously, the setting known as multi-task learning. The multi-task optimization landscape is less well understood than the single-task one, and multi-task training often yields worse performance and poorer data efficiency than expected. Previous work has struggled to pinpoint the causes and has often resorted to training task-specific models first and combining them afterwards, which undermines the efficiency gains that motivate multi-task learning.

Insight and Approach

The paper's central insight is that detrimental gradient interference is a primary cause of poor multi-task optimization. The authors characterize this interference through three conditions that are harmful when they occur together:

Conflicting Gradients: Gradients from different tasks point in opposing directions, that is, their cosine similarity is negative.
Dominating Gradients: Gradient magnitudes differ substantially, so one task's gradient swamps the others in the combined update.
High Curvature: High positive curvature along the multi-task gradient direction exaggerates the conflict, because a step then overestimates the gain on the dominating task and underestimates the harm to the dominated task.
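
The first two conditions can be checked directly from a pair of task gradients. The sketch below is illustrative rather than the authors' code: it uses plain Python lists as gradient vectors, and the function names are hypothetical. The magnitude-similarity formula, 2‖g_i‖‖g_j‖ / (‖g_i‖² + ‖g_j‖²), follows our reading of the paper's definition and should be treated as an assumption. Curvature is not measured here, since it requires second-order information.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    """Euclidean norm."""
    return math.sqrt(dot(u, u))

def cosine_similarity(g_i, g_j):
    """Negative values indicate conflicting gradients."""
    return dot(g_i, g_j) / (norm(g_i) * norm(g_j))

def magnitude_similarity(g_i, g_j):
    """Equals 1 when the norms match and approaches 0 as one gradient dominates.

    Assumed form: 2*|g_i|*|g_j| / (|g_i|^2 + |g_j|^2).
    """
    n_i, n_j = norm(g_i), norm(g_j)
    return 2.0 * n_i * n_j / (n_i ** 2 + n_j ** 2)

# Hypothetical toy gradients: the second is about 14x larger and points partly against the first.
g_small = [1.0, 0.0]
g_large = [-10.0, 10.0]

print("cosine similarity:", cosine_similarity(g_small, g_large))
print("magnitude similarity:", magnitude_similarity(g_small, g_large))
```

In this toy pair the cosine similarity is negative (conflict) and the magnitude similarity is far below 1 (domination), which is the combination the paper identifies as problematic when curvature is also high.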

The core contribution of the paper is PCGrad, a method that mitigates this interference by modifying gradients during optimization. When two task gradients conflict, PCGrad projects each task's gradient onto the normal plane of the other task's gradient, removing the conflicting component; gradients that do not conflict are left unchanged. The authors support the method with theoretical analysis and extensive empirical evaluation.
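
The projection step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: gradients are flat Python lists, the function name `pcgrad` and its `rng` parameter are hypothetical, the other tasks are visited in random order, and the projected per-task gradients are summed to form the update direction.

```python
import random

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def pcgrad(task_grads, rng=None):
    """Return the sum of per-task gradients after projecting away conflicts.

    task_grads: list of flat gradient vectors, one per task.
    rng: optional random.Random used to shuffle the order of the other tasks.
    """
    rng = rng or random.Random(0)
    projected = []
    for i, g_i in enumerate(task_grads):
        g_pc = list(g_i)  # working copy; the original gradients stay untouched
        others = [j for j in range(len(task_grads)) if j != i]
        rng.shuffle(others)
        for j in others:
            g_j = task_grads[j]
            inner = dot(g_pc, g_j)
            if inner < 0.0:  # conflict: negative inner product
                # Project onto the normal plane of g_j by removing
                # the component of g_pc that lies along g_j.
                scale = inner / dot(g_j, g_j)
                g_pc = [a - scale * b for a, b in zip(g_pc, g_j)]
        projected.append(g_pc)
    # Combine the projected task gradients coordinate by coordinate.
    return [sum(components) for components in zip(*projected)]

# Conflicting pair: each gradient loses its component along the other.
print(pcgrad([[1.0, 0.0], [-1.0, 1.0]]))
# Non-conflicting pair: the result is the plain sum of the gradients.
print(pcgrad([[1.0, 0.0], [1.0, 1.0]]))
```

Because the function only needs one gradient vector per task, it can sit between the backward pass and the optimizer step of any model, which is what makes the approach model-agnostic.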

Theoretical Foundations

PCGrad's theoretical backing includes a convergence guarantee under standard convex optimization assumptions. The authors show that:

Convergence: With convex task losses and a sufficiently small step size, PCGrad converges either to the optimal value or to a point where the task gradients conflict entirely, that is, point in exactly opposite directions.
Local Optimality: The authors provide sufficient conditions under which a PCGrad update achieves a lower loss than a standard gradient descent update. These conditions are most relevant when dominating gradients and high positive curvature coexist with conflicting gradients.
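
For reference, the update rule and the convergence claim can be written out for the two-task case. The block below is our reading of the paper, not a verbatim restatement. The Lipschitz constant is written L_lip here to avoid clashing with the loss symbol, and that notation is our own assumption. Consult the paper for the exact theorem statement and proof.

```latex
% Two tasks with losses L_1, L_2; total loss L = L_1 + L_2;
% task gradients g_1 = \nabla L_1(\theta), g_2 = \nabla L_2(\theta).
\begin{align*}
  % Projection, applied only when g_i \cdot g_j < 0:
  g_i^{PC} &= g_i - \frac{g_i \cdot g_j}{\lVert g_j \rVert^2}\, g_j \\
  % Parameter update with step size t:
  \theta &\leftarrow \theta - t \left( g_1^{PC} + g_2^{PC} \right)
\end{align*}
% Convergence claim (assumptions as we understand them):
% if L_1 and L_2 are convex and differentiable, \nabla L is Lipschitz
% with constant L_{lip} > 0, and t \le 1 / L_{lip}, then the iterates reach either
%   (1) a point where \cos \phi_{12} = -1, with \phi_{12} the angle between g_1 and g_2, or
%   (2) the optimal value L(\theta^*).
```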

Empirical Results

The authors evaluate PCGrad on both multi-task supervised learning and multi-task reinforcement learning problems.

Supervised Learning

On CIFAR-100, CelebA, and NYUv2, PCGrad yields marked performance improvements. On multi-task CIFAR-100, adding PCGrad to routing networks raises accuracy by 2.8%. Combined with the multi-task architecture MTAN on NYUv2, PCGrad outperforms the compared models across multiple metrics.

Reinforcement Learning

The multi-task reinforcement learning experiments further demonstrate PCGrad's effectiveness. On the MT10 and MT50 benchmarks from Meta-World, PCGrad significantly outperforms vanilla SAC (Soft Actor-Critic) and multi-head models. It raises average success rates, indicating improved data efficiency and robust performance across diverse manipulation tasks.

Implications and Future Work

PCGrad has both practical and theoretical implications. Practically, it is a simple, model-agnostic way to improve multi-task learning and can make training more efficient in both RL and supervised settings. Theoretically, it clarifies how task gradients interact during multi-task optimization, which may inform more refined optimization techniques.

Future work might apply PCGrad beyond the domains examined here. Potential avenues include meta-learning, continual learning, and multi-agent systems, where gradient projection could help with stability and scalability.

Conclusion

In summary, "Gradient Surgery for Multi-Task Learning" introduces PCGrad, a theoretically grounded and empirically validated method that mitigates gradient conflicts in multi-task learning. Because it operates only on gradients, it applies to current multi-task problems and may extend to other settings in which multiple objectives are optimized jointly.
