Training Deeper Convolutional Networks with Deep Supervision (1505.02496v1)

Published 11 May 2015 in cs.CV

Abstract: One of the most promising ways of improving the performance of deep convolutional neural networks is by increasing the number of convolutional layers. However, adding layers makes training more difficult and computationally expensive. In order to train deeper networks, we propose to add auxiliary supervision branches after certain intermediate layers during training. We formulate a simple rule of thumb to determine where these branches should be added. The resulting deeply supervised structure makes the training much easier and also produces better classification results on ImageNet and the recently released, larger MIT Places dataset.

Citations (178)

Summary

  • The paper proposes adding auxiliary classifiers at intermediate layers to mitigate vanishing gradients and enable efficient training of deeper CNNs.
  • The methodology uses a gradient-based heuristic to strategically place deep supervision, resulting in improved convergence and robustness.
  • Experiments on ImageNet and MIT Places show enhanced accuracy and reduced training time compared to traditional deep network training.

Training Deeper Convolutional Networks with Deep Supervision

The paper focuses on enhancing convolutional neural networks (CNNs) by effectively training deeper architectures through the introduction of deep supervision. As recent recognition benchmarks such as the ILSVRC have demonstrated, increasing both the depth and width of CNNs tends to yield improved accuracy. However, doing so introduces significant challenges related to computational cost and the risk of overfitting due to the sheer number of parameters involved.

Methodology

In this paper, the authors propose to mitigate these challenges by integrating auxiliary supervision branches at strategically chosen intermediate layers during training. The motivation is to address the vanishing gradient problem, which often hampers the training of very deep networks. By connecting auxiliary classifiers to certain convolutional layers, feature maps at lower layers contribute directly to the final classification labels, facilitating more efficient gradient propagation throughout the network.
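
As a rough illustration of this idea, the sketch below attaches an auxiliary classifier to an intermediate convolutional stage and adds its down-weighted loss to the main loss. This is a minimal PyTorch-style sketch under assumed settings; the layer sizes, class count, and auxiliary weight are illustrative and do not reproduce the paper's exact architecture or loss weighting.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeeplySupervisedCNN(nn.Module):
    """Toy CNN with one auxiliary classifier branch (illustrative, not the paper's exact model)."""
    def __init__(self, num_classes=10):
        super().__init__()
        # Lower convolutional stage: the auxiliary branch taps its output
        self.stage1 = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        # Upper convolutional stage feeding the main classifier
        self.stage2 = nn.Sequential(
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.main_head = nn.Linear(128, num_classes)
        # Auxiliary classifier attached after stage1; used during training only
        self.aux_head = nn.Linear(64, num_classes)

    def forward(self, x):
        feat1 = self.stage1(x)
        feat2 = self.stage2(feat1)
        main_logits = self.main_head(F.adaptive_avg_pool2d(feat2, 1).flatten(1))
        aux_logits = self.aux_head(F.adaptive_avg_pool2d(feat1, 1).flatten(1))
        return main_logits, aux_logits

def combined_loss(main_logits, aux_logits, targets, aux_weight=0.3):
    """Main loss plus a down-weighted auxiliary loss; aux_weight is an assumed value."""
    return (F.cross_entropy(main_logits, targets)
            + aux_weight * F.cross_entropy(aux_logits, targets))
```

At test time only the main branch's prediction is used; the auxiliary head and its loss exist solely to inject gradient signal into the lower layers during training.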

The placement of these supervision branches follows a gradient-based heuristic: during the initial iterations of training, the mean gradient values across layers are examined, and supervision is added where these values tend to vanish. This approach was applied to train models with 8 and 13 convolutional layers, exploring depths beyond the 5 convolutional layers of the original AlexNet.
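
The sketch below shows one way such a heuristic might be implemented, assuming a standard PyTorch classifier whose forward pass returns plain logits: run a few warm-up iterations, record the mean absolute weight gradient of each convolutional layer, and flag layers whose average stays below a threshold as candidate attachment points. The function name, iteration count, and threshold are assumptions for illustration, not values from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def find_vanishing_gradient_layers(model, data_loader, num_iters=10, threshold=1e-7):
    """Return names of conv layers whose mean |gradient| stays tiny during warm-up
    iterations -- candidate points for attaching auxiliary supervision (illustrative)."""
    grad_sums, counts = {}, {}
    model.train()
    for i, (images, targets) in enumerate(data_loader):
        if i >= num_iters:
            break
        model.zero_grad()
        # Assumes model(images) returns a single logits tensor
        loss = F.cross_entropy(model(images), targets)
        loss.backward()
        for name, module in model.named_modules():
            if isinstance(module, nn.Conv2d) and module.weight.grad is not None:
                grad_sums[name] = grad_sums.get(name, 0.0) + module.weight.grad.abs().mean().item()
                counts[name] = counts.get(name, 0) + 1
    # Layers whose average gradient magnitude falls below the threshold
    return [name for name in grad_sums
            if grad_sums[name] / counts[name] < threshold]
```

Once candidate layers are identified in this way, an auxiliary branch can be attached there and the network trained from scratch with the combined objective.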

Experimental Results

The authors present a comprehensive performance comparison between networks trained conventionally and those trained with deep supervision. On the ImageNet dataset, models trained with deep supervision (CNDS) showed superior classification accuracy over their baseline counterparts, indicating improved convergence and robustness at lower computational cost. Notably, an 8-layer CNDS model achieved 1% higher accuracy than a conventionally trained counterpart while requiring less training time, and a 13-layer CNDS model demonstrated further gains.

Additionally, experimental validation was conducted on the MIT Places dataset, whose scene-centric images differ from the object-centric nature of ImageNet. The CNDS approach again offered notable improvements, outperforming a 5-layer baseline model and providing accuracies comparable to deeper structures such as GoogLeNet, while training efficiently.

Implications and Future Directions

The proposed deep supervision method exhibits clear benefits for the efficiency and accuracy of training very deep networks. It suggests potential advantages over traditional pre-training, since it depends less on labor-intensive, iterative deepening strategies. Practically, the method simplifies training procedures and fosters swifter convergence to well-generalizing solutions. Theoretically, the auxiliary supervision opens new avenues for exploring deeper architectural designs and their potential for handling more complex visual tasks.

As future developments, this methodology could be integrated with more complex models, potentially including residual connections or hybrid architectures that incorporate other forms of regularization and optimization constraints. The paper points the community towards deeper neural networks that avoid the prohibitive costs of pre-training and overly complex parameter tuning.

Overall, while not achieving the absolute highest accuracy compared to some other architectures examined, deep supervision presents itself as a promising paradigm for refining the training process of large-scale deep learning models.