Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification (1502.01852v1)

Published 6 Feb 2015 in cs.CV, cs.AI, and cs.LG

Abstract: Rectified activation units (rectifiers) are essential for state-of-the-art neural networks. In this work, we study rectifier neural networks for image classification from two aspects. First, we propose a Parametric Rectified Linear Unit (PReLU) that generalizes the traditional rectified unit. PReLU improves model fitting with nearly zero extra computational cost and little overfitting risk. Second, we derive a robust initialization method that particularly considers the rectifier nonlinearities. This method enables us to train extremely deep rectified models directly from scratch and to investigate deeper or wider network architectures. Based on our PReLU networks (PReLU-nets), we achieve 4.94% top-5 test error on the ImageNet 2012 classification dataset. This is a 26% relative improvement over the ILSVRC 2014 winner (GoogLeNet, 6.66%). To our knowledge, our result is the first to surpass human-level performance (5.1%, Russakovsky et al.) on this visual recognition challenge.

Citations (17,813)


Summary

The paper introduces PReLU, a novel activation function that learns adaptive negative slopes to significantly reduce classification errors.
It presents a robust initialization strategy designed for very deep rectifier networks, enabling stable training of architectures with up to 30 layers.
Experimental results achieve a 4.94% top-5 error rate on ImageNet, marking a breakthrough that surpasses human performance.

The paper "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification" introduces innovations in rectifier neural networks, achieving a reported 4.94% top-5 error rate on the ImageNet 2012 classification dataset. This result marks a 26% relative improvement over the ILSVRC 2014 winner, GoogLeNet (6.66%), and is the first reported result to surpass human-level performance (5.1%) on this visual recognition challenge. The key contributions of this work are the introduction of Parametric Rectified Linear Units (PReLU) and a robust initialization method tailored for rectifier networks.
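The 26% figure follows directly from the two top-5 error rates quoted in the abstract. As a quick check of the arithmetic:

```latex
\frac{6.66 - 4.94}{6.66} = \frac{1.72}{6.66} \approx 0.258 \approx 26\%
```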

The authors propose PReLU (Parametric Rectified Linear Unit), a generalization of the traditional ReLU (Rectified Linear Unit). The formulation of PReLU is given by: $f(y_i) = \begin{cases} y_i, & \mbox{if } y_i > 0 \\ a_i y_i, & \mbox{if } y_i \leq 0 \end{cases}$, where:

$y_i$ is the input to the activation function for the $i$ -th channel.
$a_i$ is a learnable coefficient that controls the slope of the negative part, allowing the nonlinear activation to vary across different channels.

This is equivalent to $f(y_i) = \max(0, y_i)+a_i\min(0, y_i)$ .
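The two forms of the activation can be compared numerically. The following is a minimal scalar sketch, not the authors' implementation; the function names `prelu` and `prelu_max_min` and the slope value `a = 0.25` are chosen here purely for illustration.

```python
def prelu(y, a):
    """Piecewise form: y if y > 0, otherwise a * y."""
    return y if y > 0 else a * y


def prelu_max_min(y, a):
    """Equivalent form: max(0, y) + a * min(0, y)."""
    return max(0.0, y) + a * min(0.0, y)


samples = [-2.0, -0.5, 0.0, 0.5, 2.0]
a = 0.25  # illustrative slope, not taken from the text above

outputs = [prelu(y, a) for y in samples]
print(outputs)  # [-0.5, -0.125, 0.0, 0.5, 2.0]

# Both forms agree on every sample.
assert outputs == [prelu_max_min(y, a) for y in samples]
```

With `a = 0` the function reduces to the ordinary ReLU, which is the sense in which PReLU generalizes it.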

The authors detail the optimization process for PReLU, emphasizing that the parameters $a_i$ are learned using backpropagation. The gradient of $a_i$ is calculated as: $\frac{\partial \mathcal{E}}{\partial a_i} = \sum_{y_i} \frac{\partial \mathcal{E}}{\partial f(y_i)}\frac{\partial f(y_i)}{\partial a_i}$ , where:

$\mathcal{E}$ represents the objective function.
$\frac{\partial \mathcal{E}}{\partial f(y_i)}$ is the gradient propagated from the deeper layer.
$\sum_{y_i}$ runs over all positions of the feature map.

The gradient of the activation is defined as: $\frac{\partial f(y_i)}{\partial a_i} = \begin{cases} 0, & \mbox{if } y_i > 0 \\ y_i, & \mbox{if } y_i \leq 0 \end{cases}$.
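To make the chain rule concrete, here is a small sketch of the slope gradient for a single channel, summing over the positions of that channel's feature map. The helper name `prelu_slope_grad` and the input values are hypothetical.

```python
def prelu_slope_grad(pre_activations, upstream_grads):
    """dE/da_i for one channel: sum over positions of dE/df(y_i) * df(y_i)/da_i."""
    total = 0.0
    for y, g in zip(pre_activations, upstream_grads):
        df_da = 0.0 if y > 0 else y  # piecewise derivative given in the text
        total += g * df_da
    return total


# Hypothetical pre-activations y_i at four positions and the gradients
# dE/df(y_i) arriving from the deeper layer.
y_channel = [1.5, -2.0, 0.0, -0.5]
upstream = [0.3, 0.1, 0.7, 0.4]

grad_a = prelu_slope_grad(y_channel, upstream)
print(grad_a)  # only the two negative positions contribute: 0.1*(-2.0) + 0.4*(-0.5)
```

Positions with positive input contribute nothing, so the slope is only trained by the negative part of the feature map.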

The update rule for $a_i$ incorporates momentum: $\Delta a_i := \mu \Delta a_i + \epsilon \frac{\partial \mathcal{E}}{\partial a_i}$ , where:

$\mu$ is the momentum.
$\epsilon$ is the learning rate.
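A sketch of one momentum step for a single coefficient follows. The text gives only the accumulation of $\Delta a_i$; applying the increment by subtracting it from $a_i$ (so that the objective decreases) is an assumption made here, as are the function name `momentum_step`, the learning rate, and the starting values.

```python
def momentum_step(a, delta_a, grad_a, momentum=0.9, lr=0.01):
    """One update of a PReLU slope; returns the new (a, delta_a)."""
    # Accumulation rule from the text: delta := mu * delta + eps * dE/da.
    delta_a = momentum * delta_a + lr * grad_a
    # Assumption: the increment is applied against the gradient direction.
    a = a - delta_a
    return a, delta_a


a, delta = 0.25, 0.0        # illustrative starting slope and zero velocity
a, delta = momentum_step(a, delta, grad_a=-0.4)
print(a, delta)             # slope moves from 0.25 to about 0.254
a, delta = momentum_step(a, delta, grad_a=-0.4)
print(a, delta)             # second step is larger because velocity accumulates
```

No weight-decay term appears in the step, matching the choice not to regularize $a_i$.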

The authors do not apply weight decay ( $l_2$ regularization) to $a_i$, because weight decay pushes $a_i$ toward zero and thus biases PReLU towards ReLU.

The paper introduces a novel initialization method designed for very deep rectifier networks. This initialization technique accounts for the nonlinearities of ReLU/PReLU and facilitates the training of networks with up to 30 weight layers from scratch.

In the forward propagation case, the variance of the responses in each layer is analyzed. For a convolutional layer, the response is given by: $y_l = W_l x_l + b_l$ , where:

$x_l$ is a $k^2c$ -by-1 vector representing co-located $k \times k$ pixels in $c$ input channels.
$W$ is a $d$ -by- $n$ matrix, with $n = k^2c$ denoting the number of connections and $d$ being the number of filters.
$b$ is a vector of biases.
$y$ is the response at a pixel of the output map.
$l$ indexes the layer.

The variance of $y_l$ is expressed as: $Var[y_{l}]=n_lVar[w_{l}x_l] = n_lVar[w_{l}]E[x^2_{l}]$ , where:

$w_{l}$ is assumed to have zero mean.
$E[x^2_{l}]$ is the expectation of the square of $x_l$ .

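For ReLU activations, assuming $w_{l-1}$ is symmetrically distributed around zero and $b_{l-1}=0$, the response $y_{l-1}$ has zero mean and a symmetric distribution, so $E[x^2_{l}]=\frac{1}{2}Var[y_{l-1}]$. Substituting gives: $Var[y_{l}]=\frac{1}{2}n_lVar[w_{l}]Var[y_{l-1}]$ . After $L$ layers: $Var[y_{L}]=Var[y_{1}]\left(\prod_{l=2}^{L}\frac{1}{2}n_lVar[w_{l}]\right)$ . To keep this product from shrinking or growing exponentially with depth, the initialization satisfies: $\frac{1}{2}n_lVar[w_{l}]=1, \quad \forall l$ .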

This results in a zero-mean Gaussian distribution with a standard deviation of $\sqrt{2/n_l}$ for initializing the weights.
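The effect of the forward condition can be checked with a small simulation. This is a sketch under simplifying assumptions that are not from the paper: fully connected layers of equal width stand in for convolutions (so $n_l$ is the layer width), biases are zero, and the mean squared response is used as a proxy for $Var[y_l]$. The function name `response_ratio` and all sizes are hypothetical.

```python
import math
import random


def response_ratio(std_for_fan_in, width=64, depth=8, samples=100, seed=0):
    """Return mean(y_L^2) / mean(y_1^2) for a zero-bias ReLU network."""
    rng = random.Random(seed)
    std = std_for_fan_in(width)
    # One square weight matrix per layer: standard normals scaled by std.
    weights = [[[std * rng.gauss(0.0, 1.0) for _ in range(width)]
                for _ in range(width)]
               for _ in range(depth)]
    first_sq, last_sq = 0.0, 0.0
    for _ in range(samples):
        x = [rng.gauss(0.0, 1.0) for _ in range(width)]  # x_1: raw input
        for layer, W in enumerate(weights):
            y = [sum(w * v for w, v in zip(row, x)) for row in W]  # y_l = W_l x_l
            if layer == 0:
                first_sq += sum(t * t for t in y)
            if layer == depth - 1:
                last_sq += sum(t * t for t in y)
            x = [max(0.0, t) for t in y]  # x_{l+1} = ReLU(y_l)
    return last_sq / first_sq


he_ratio = response_ratio(lambda n: math.sqrt(2.0 / n))      # std = sqrt(2 / n_l)
halved_ratio = response_ratio(lambda n: math.sqrt(1.0 / n))  # factor of 2 dropped
print(he_ratio, halved_ratio)
```

With $\sqrt{2/n_l}$ the ratio should stay on the order of 1, while dropping the factor of 2 multiplies each of the $L-1$ later layers by $\frac{1}{2}$, so the second ratio is smaller by $2^{-(L-1)}$. Because both runs share the same seed and ReLU is positively homogeneous, that relation holds up to rounding error in this sketch.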

For back-propagation, the gradient is computed by: $\Delta x_l = \hat{W}_l \Delta y_l$ , where:

$\Delta x$ and $\Delta y$ denote the gradients with respect to $x$ and $y$ , respectively.
$\hat{W}$ is a $c$ -by- $\hat{n}$ matrix with filters rearranged for back-propagation, where $\hat{n}=k^2d$ .

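The variance of the gradient is: $Var[\Delta x_l] = \hat{n}_lVar[w_l]Var[\Delta y_l] = \frac{1}{2}\hat{n}_lVar[w_l]Var[\Delta x_{l+1}]$ . The factor $\frac{1}{2}$ arises because, for ReLU, $\Delta y_l = f'(y_l)\Delta x_{l+1}$ and $f'(y_l)$ is zero or one with equal probability.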

After $L$ layers: $Var[\Delta x_2] = Var[\Delta x_{L+1}]\left(\prod_{l=2}^{L}\frac{1}{2}\hat{n}_{l}Var[w_{l}]\right)$ , leading to the initialization condition: $\frac{1}{2}\hat{n}_lVar[w_{l}]=1, \quad \forall l$ .

This gives a zero-mean Gaussian distribution with a standard deviation of $\sqrt{2/\hat{n}_l}$ .

For PReLU, the initialization conditions become: $\frac{1}{2}(1+a^2)n_lVar[w_{l}]=1, \quad \forall l$ for the forward case and $\frac{1}{2}(1+a^2)\hat{n}_lVar[w_{l}]=1, \quad \forall l$ for the backward case, where $a$ is the initialized value of the PReLU coefficients. Setting $a=0$ recovers the ReLU conditions.
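Solving either condition for the standard deviation gives $\sqrt{2/((1+a^2)\,n)}$, with $n = n_l = k^2c$ in the forward case and $n = \hat{n}_l = k^2d$ in the backward case. The helper below is a sketch of that calculation; the name `rectifier_init_std`, the `mode` argument, and the example layer sizes are hypothetical.

```python
import math


def rectifier_init_std(k, c, d, a=0.0, mode="forward"):
    """Std of zero-mean Gaussian weights for a k x k conv layer with
    c input channels and d filters; a is the initial PReLU slope."""
    if mode == "forward":
        fan = k * k * c   # n_l = k^2 c
    elif mode == "backward":
        fan = k * k * d   # n_hat_l = k^2 d
    else:
        raise ValueError("mode must be 'forward' or 'backward'")
    # Solve (1/2) * (1 + a^2) * fan * Var[w] = 1 for the standard deviation.
    return math.sqrt(2.0 / ((1.0 + a * a) * fan))


# Illustrative layer: 3x3 filters, 64 input channels, 128 filters.
std_fwd = rectifier_init_std(3, 64, 128)                    # sqrt(2 / 576)
std_bwd = rectifier_init_std(3, 64, 128, mode="backward")   # sqrt(2 / 1152)
std_prelu = rectifier_init_std(3, 64, 128, a=0.25)          # slightly below std_fwd
print(std_fwd, std_bwd, std_prelu)
```

A nonzero slope lets more signal through the negative half, so the weights need a slightly smaller standard deviation to keep the same variance.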

The architectures used in this paper are based on the VGG-19 model with modifications such as adjusting filter sizes, strides, and incorporating Spatial Pyramid Pooling (SPP). Model A is a 19-layer network, model B is a deeper variant with 22 layers, and model C is a wider version of B with more filters.

The training algorithm involves data augmentation techniques such as random cropping, scale jittering (with scales $s$ in the range of $[256, 512]$ ), horizontal flipping, and random color altering. The weight decay is 0.0005, and momentum is 0.9. Dropout (50%) is applied to the first two fully connected layers. The mini-batch size is 128.
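The hyperparameters above can be collected into a small configuration object, together with a sketch of how the random augmentation choices for one image might be drawn. Several details here are assumptions not stated in the text: the 224-pixel crop size, treating $s$ as the shorter side after resizing, drawing $s$ uniformly over integers, and all names (`TrainConfig`, `sample_augmentation`). Color altering is omitted.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    weight_decay: float = 0.0005
    momentum: float = 0.9
    dropout_fc: float = 0.5   # applied to the first two fully connected layers
    batch_size: int = 128
    scale_min: int = 256
    scale_max: int = 512
    crop_size: int = 224      # assumption: not stated in the text above


def sample_augmentation(cfg, rng, width, height):
    """Draw scale, crop offset, and flip for one image of the given size."""
    s = rng.randint(cfg.scale_min, cfg.scale_max)  # jittered scale (assumed uniform)
    ratio = s / min(width, height)                 # assumed: s is the shorter side
    new_w, new_h = round(width * ratio), round(height * ratio)
    return {
        "resized": (new_w, new_h),
        "crop_xy": (rng.randint(0, new_w - cfg.crop_size),
                    rng.randint(0, new_h - cfg.crop_size)),
        "flip": rng.random() < 0.5,
    }


cfg = TrainConfig()
aug = sample_augmentation(cfg, random.Random(0), width=500, height=375)
print(aug)
```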

For testing, a multi-view testing strategy on feature maps is used, combined with a dense sliding window method. The convolutional layers are applied to the resized full image, and the SPP layer is used for pooling. The scores are averaged across all dense sliding windows and multiple scales.

The authors conduct experiments on the ImageNet 2012 dataset, comparing ReLU and PReLU on model A. The results demonstrate that PReLU reduces the top-1 error by 1.05 percentage points and the top-5 error by 0.23 percentage points compared to ReLU in the multi-scale combination. The best single model (C, PReLU) achieves a 5.71% top-5 error. The multi-model combination of six models achieves a 4.94% top-5 error on the test set, outperforming the ILSVRC 2014 winner, GoogLeNet (6.66%).
