Residual Networks Behave Like Ensembles of Relatively Shallow Networks

Published 20 May 2016 in cs.CV, cs.AI, cs.LG, and cs.NE | (1605.06431v2)

Abstract: In this work we propose a novel interpretation of residual networks showing that they can be seen as a collection of many paths of differing length. Moreover, residual networks seem to enable very deep networks by leveraging only the short paths during training. To support this observation, we rewrite residual networks as an explicit collection of paths. Unlike traditional models, paths through residual networks vary in length. Further, a lesion study reveals that these paths show ensemble-like behavior in the sense that they do not strongly depend on each other. Finally, and most surprising, most paths are shorter than one might expect, and only the short paths are needed during training, as longer paths do not contribute any gradient. For example, most of the gradient in a residual network with 110 layers comes from paths that are only 10-34 layers deep. Our results reveal one of the key characteristics that seem to enable the training of very deep networks: Residual networks avoid the vanishing gradient problem by introducing short paths which can carry gradient throughout the extent of very deep networks.


Summary

The paper's main contribution is reinterpreting Residual Networks as ensembles of shorter, effective paths rather than as a single monolithic deep network.
It demonstrates through experiments that in a 110-layer ResNet, the majority of gradient updates stem from paths only 10 to 34 layers deep.
The findings challenge the assumption that depth alone drives performance and suggest that future network designs focus on the short paths that actually carry gradient.

Residual Networks: An Ensemble Perspective on Training Deep Architectures

The paper authored by Veit, Wilber, and Belongie presents an innovative interpretation of Residual Networks (ResNets), proposing that their architecture can be conceptualized as a collection of paths with varying lengths rather than a single monolithic deep network. This reinterpretation challenges traditional views on neural networks' depth and introduces new insights into why ResNets outperform their predecessors in training very deep networks.

The authors examine ResNets through what they call the "unraveled view", which shows how a ResNet can be decomposed into many paths. The identity skip connections that characterize ResNets make this decomposition possible: at each residual module, a signal can either pass through the module or bypass it, so data traverses paths of differing lengths. Experiments provide empirical support for this view: the paths behave like an ensemble, in that removing individual path components has limited impact on overall network performance. This contrasts with conventional architectures such as VGG or AlexNet, where removing even a single layer causes substantial performance degradation.
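The decomposition described above can be checked on a toy model. The following is a minimal sketch, not the authors' code: it assumes three residual modules that are scalar linear maps (the names `WEIGHTS`, `residual_forward`, and `unraveled_forward` are hypothetical), so that the sum over paths can be written out exactly. With nonlinear modules the paths share intermediate activations and the picture is less tidy.

```python
from itertools import product
from math import prod

# Hypothetical toy setup: three residual modules, each a scalar
# linear map f_i(x) = w_i * x. Real modules are nonlinear conv blocks.
WEIGHTS = [0.5, -0.3, 0.8]


def residual_forward(x, weights, keep=None):
    """Recursive view: y_i = y_{i-1} + f_i(y_{i-1}).

    Modules whose index is not in `keep` are lesioned, so only
    their identity skip connection remains.
    """
    keep = set(range(len(weights))) if keep is None else set(keep)
    y = x
    for i, w in enumerate(weights):
        if i in keep:
            y = y + w * y
    return y


def unraveled_forward(x, weights):
    """Unraveled view: sum the contribution of all 2^n paths.

    Each path either passes through module i (factor w_i) or
    takes the skip connection (factor 1).
    """
    total = 0.0
    lengths = []
    for choice in product([0, 1], repeat=len(weights)):
        factor = prod(w for w, used in zip(weights, choice) if used)
        total += factor * x
        lengths.append(sum(choice))  # number of modules on this path
    return total, lengths


if __name__ == "__main__":
    y = residual_forward(2.0, WEIGHTS)
    u, lengths = unraveled_forward(2.0, WEIGHTS)
    print(y, u)             # the two views agree up to rounding
    print(sorted(lengths))  # path lengths range from 0 to 3 modules
    print(residual_forward(2.0, WEIGHTS, keep=[0, 2]))  # module 1 removed
```

In this toy setting, lesioning one module removes exactly the paths that pass through it, which is half of them, while the remaining paths still connect input to output. A strictly sequential network has a single path, so removing any layer breaks it.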

A standout finding is that most of the gradient during ResNet training comes from paths considerably shorter than the total network depth. For instance, in a 110-layer ResNet, most of the paths that contribute gradient are only 10 to 34 layers deep. The authors argue that longer paths, although present, contribute little to gradient updates because the gradient vanishes along them. This reveals a fundamental aspect of how ResNets work: by relying on short paths, they sidestep the vanishing gradient problem, which allows training to succeed even at much greater depth.
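To put the 10 to 34 layer range in context, it helps to count how many paths of each length exist. The sketch below rests on assumptions not stated in this summary: that the 110-layer network consists of 54 residual modules of two layers each, and that each module is independently either traversed or skipped, so path lengths follow a binomial distribution. The function names are hypothetical. The code only counts paths; it does not measure gradient.

```python
from math import comb


def path_length_counts(n_modules):
    """Number of paths that pass through exactly k modules, for k = 0..n."""
    return [comb(n_modules, k) for k in range(n_modules + 1)]


def fraction_of_paths(n_modules, lo, hi):
    """Fraction of all 2^n paths whose length lies in [lo, hi] modules."""
    counts = path_length_counts(n_modules)
    return sum(counts[lo:hi + 1]) / 2 ** n_modules


# Assumption: a 110-layer ResNet built from 54 two-layer residual modules.
N_MODULES = 54

counts = path_length_counts(N_MODULES)
mean_length = sum(k * c for k, c in enumerate(counts)) / 2 ** N_MODULES

if __name__ == "__main__":
    print("total paths:", sum(counts))
    print("mean path length (modules):", mean_length)
    # Paths of 5 to 17 modules correspond to 10 to 34 layers
    # under the two-layers-per-module assumption.
    print("share of paths with 5-17 modules:",
          fraction_of_paths(N_MODULES, 5, 17))
    print("share of paths with 19-35 modules:",
          fraction_of_paths(N_MODULES, 19, 35))
```

Under these assumptions, paths in the 10 to 34 layer range are a small minority of all paths, since the typical path passes through about half of the modules. That they nevertheless supply most of the gradient is consistent with the argument that the gradient carried by a path shrinks quickly as the path gets longer.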

The implications of these findings are manifold. Practically, this suggests that while it is possible to go deeper with network architectures, the effective pathways that contribute to learning remain relatively shallow, thus questioning the straightforward notion that "deeper is better". Theoretically, this reevaluation shifts the discourse from deep feature hierarchies to the ensemble behavior of paths in networks, with the number of paths playing a role analogous to ensemble size in conventional ensemble learning.

The results underline the importance of re-examining established paradigms in neural network design. That short paths are the primary contributors during ResNet training challenges the community to reconsider how network depth is conceptualized, and may invite new network designs that prioritize effective short-path configurations.

Future work could apply the ensemble and path-centric frameworks beyond ResNets. A better understanding of gradient flow in networks might carry over to other architectures, potentially leading to advances in managing model complexity and in interpretability.

In conclusion, the interpretations and experiments presented by Veit et al. prompt a rethinking of what depth means in deep networks: the idea of depth may be complemented by a focus on effective path configurations and ensemble-like behavior within architectures. These insights open the way for further theoretical work and for practical network design strategies.
