- The paper introduces a residual learning framework that trains very deep networks by having layers learn residual functions with reference to their inputs rather than unreferenced direct mappings.
- Shortcut connections performing identity mapping address the degradation problem that appears as networks get deeper; an ensemble of residual networks up to 152 layers deep achieves 3.57% top-5 error on ImageNet.
- Residual networks also improve results on CIFAR-10 and on object detection (PASCAL VOC and MS COCO), while being less complex than traditional architectures such as VGG nets.
Deep Residual Learning for Image Recognition
Introduction to Deep Residual Learning
The paper "Deep Residual Learning for Image Recognition" introduces a residual learning framework to tackle the difficulty of training deeper neural networks. As network depth increases, accuracy saturates and then degrades rapidly; the authors show that this degradation is not caused by overfitting, and that the vanishing/exploding gradient problem is largely handled by normalized initialization and batch normalization. They propose a residual learning paradigm in which layers learn residual functions with reference to the layer inputs instead of learning unreferenced mappings directly. This approach allows networks to become substantially deeper while remaining easy to optimize.
Residual Learning Framework
The central innovation of the paper is the reformulation of network layers to learn residual functions. Instead of hoping a stack of layers directly fits a desired mapping H(x), the layers are made to approximate the residual F(x) = H(x) − x, and the original mapping is recovered as F(x) + x. The framework realizes this with shortcut connections that perform identity mapping, so the added layers only need to learn the residual component, and the identity shortcuts introduce neither extra parameters nor extra computational complexity.
The shortcut connections are identity mappings by default; when dimensions change (for example, when the number of channels increases), linear projections are used to match dimensions. This modification is pivotal in addressing the degradation problem in deep networks: it preconditions the optimization, so that if the optimal mapping is close to the identity, the solver only has to push the residual toward zero rather than fit the identity mapping from scratch.
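A minimal sketch of such a block in PyTorch is shown below. PyTorch and the name BasicBlock are assumptions for illustration, not the paper's own code; the structure (two 3x3 convolutions with batch normalization, plus an identity or 1x1-projection shortcut) follows the paper's description.

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """Two 3x3 conv layers computing F(x); the block outputs H(x) = F(x) + x."""
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3, stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)
        # Identity shortcut when shapes match; 1x1 projection when dimensions change.
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_channels),
            )
        else:
            self.shortcut = nn.Identity()

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))  # first 3x3 convolution
        out = self.bn2(self.conv2(out))           # second 3x3 convolution -> F(x)
        return self.relu(out + self.shortcut(x))  # add the shortcut: F(x) + x
```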
Network Architectures and Implementation
The paper evaluates two network configurations: a plain network inspired by VGG nets and a residual network that adds shortcut connections to it. The 34-layer baseline has markedly lower complexity than VGG-19 (about 18% of its FLOPs) while reaching higher accuracy thanks to its greater depth.
- Plain Network: Inspired by the VGG architecture, it stacks 3x3 convolutions and doubles the number of filters whenever the feature-map size is halved, preserving per-layer time complexity.
- Residual Network: Adds shortcut connections every few stacked layers (every two 3x3 layers in the 34-layer model), turning the plain network into its residual counterpart and significantly easing optimization.
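Continuing the BasicBlock sketch above, such blocks can be stacked stage by stage into a small CIFAR-style residual network; the 16/32/64-filter stages follow the paper's 6n+2 design (n=3 gives 20 layers, n=18 gives 110), while the class and helper names here are illustrative.

```python
import torch.nn as nn  # BasicBlock is the sketch defined above

def make_stage(in_channels, out_channels, num_blocks, stride):
    """Stack residual blocks; only the first block of a stage downsamples."""
    blocks = [BasicBlock(in_channels, out_channels, stride)]
    blocks += [BasicBlock(out_channels, out_channels) for _ in range(num_blocks - 1)]
    return nn.Sequential(*blocks)

class SmallResNet(nn.Module):
    """3x3 stem, three residual stages, global average pooling, linear classifier."""
    def __init__(self, blocks_per_stage=3, num_classes=10):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1, bias=False),
            nn.BatchNorm2d(16),
            nn.ReLU(inplace=True),
        )
        self.stage1 = make_stage(16, 16, blocks_per_stage, stride=1)
        self.stage2 = make_stage(16, 32, blocks_per_stage, stride=2)  # filters double as the map halves
        self.stage3 = make_stage(32, 64, blocks_per_stage, stride=2)
        self.head = nn.Linear(64, num_classes)

    def forward(self, x):
        x = self.stage3(self.stage2(self.stage1(self.stem(x))))
        x = x.mean(dim=(2, 3))  # global average pooling
        return self.head(x)
```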
Training uses batch normalization after each convolution, SGD with momentum and weight decay (no dropout), and scale/color augmentation of the inputs. Evaluation adopts 10-crop testing for comparison studies and a fully convolutional, multi-scale form for the best results, achieving competitive accuracy with fewer FLOPs than VGG-based models.
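As a rough illustration of that training setup, the snippet below wires up SGD with the momentum and weight decay reported in the paper; the learning-rate milestones, the ColorJitter-based augmentation, and the SmallResNet stand-in are assumptions rather than the paper's exact recipe (the paper divides the rate by 10 when the error plateaus and uses AlexNet-style color augmentation).

```python
import torch
import torchvision.transforms as T

model = SmallResNet()  # stand-in; the ImageNet experiments use 18- to 152-layer variants
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)
# Illustrative schedule: divide the learning rate by 10 at fixed epochs.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[30, 60], gamma=0.1)
criterion = torch.nn.CrossEntropyLoss()

# Simplified scale/color augmentation for training images.
train_transform = T.Compose([
    T.RandomResizedCrop(224),
    T.RandomHorizontalFlip(),
    T.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def train_step(images, labels):
    """One SGD step with backpropagation on a mini-batch."""
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```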
Experimental Results and Analysis
Extensive experiments conducted on the ImageNet dataset demonstrate the efficacy of residual networks:
- Depth and Accuracy: Plain networks suffer higher training error as depth grows, whereas residual networks keep gaining accuracy with depth, culminating in a 152-layer residual network that outperforms its shallower counterparts.
- Performance Gains: An ensemble of residual networks achieves 3.57% top-5 error on the ImageNet test set, winning first place in the ILSVRC 2015 classification task.
Additionally, experiments on CIFAR-10 showcase the optimization benefits of residual learning: networks with more than 100 layers train successfully, and even a 1202-layer model can be optimized, although it overfits and performs slightly worse than the 110-layer version.
Object Detection and Practical Implications
Residual networks also lead to remarkable improvements in object detection on the PASCAL VOC and MS COCO datasets. Simply replacing the VGG-16 backbone with ResNet-101 in the Faster R-CNN framework yields a roughly 28% relative improvement on COCO, with the gains coming solely from the better learned representations used for both recognition and localization.
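A convenient way to experiment with a ResNet backbone inside Faster R-CNN today is torchvision's detection API, sketched below; note that this off-the-shelf model uses a ResNet-50 backbone with a feature pyramid network and assumes a recent torchvision, so it approximates rather than reproduces the paper's ResNet-101 setup.

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

# ResNet-backed Faster R-CNN with pretrained weights (torchvision >= 0.13 API).
detector = fasterrcnn_resnet50_fpn(weights="DEFAULT")
detector.eval()

images = [torch.rand(3, 480, 640)]  # one dummy RGB image with values in [0, 1]
with torch.no_grad():
    predictions = detector(images)  # list of dicts with "boxes", "labels", "scores"
print(predictions[0]["boxes"].shape)
```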
Conclusion
The proposed residual learning framework addresses a fundamental obstacle to training deep networks, delivering significant accuracy gains with lower complexity than prior architectures. The shortcut connections make optimization efficient, and the approach generalizes well across tasks, setting a precedent for future network designs in both vision and non-vision domains.
The paper's contributions establish residual learning as a robust methodology for harnessing depth without succumbing to the degradation problem. Future work may explore stronger regularization for extremely deep models and further applications across diverse fields.