
Resnet in Resnet: Generalizing Residual Architectures (1603.08029v1)

Published 25 Mar 2016 in cs.LG, cs.CV, cs.NE, and stat.ML

Abstract: Residual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.

Citations (734)

Summary

  • The paper introduces a novel architecture by integrating a residual and a transient stream in a generalized residual block.
  • It enhances learning by allowing flexible feature processing, overcoming limitations of standard ResNets.
  • Experiments show that RiR achieves state-of-the-art performance on CIFAR-100 while enabling deeper residual learning.

This paper introduces a new type of neural network architecture called ResNet in ResNet (RiR) that builds upon existing residual networks (ResNets). ResNets are a popular type of deep learning model that have achieved excellent performance on image recognition tasks. The key idea behind ResNets is the use of "identity shortcut connections," which allow information to flow through the network without being altered.

Here's a breakdown of the paper:

Background and Motivation

Deep learning models, especially ResNets, have become very successful in areas like image classification. ResNets rely on identity shortcut connections, which help information flow through the network's layers and make training easier. However, the authors point out some possible limitations of the standard ResNet design:

  • In ResNets, these identity connections cause a mix of different types of features to accumulate in each layer. The problem is that some features learned in earlier layers might not be helpful in later layers.
  • ResNets might have trouble removing unnecessary information because it's difficult for them to "unlearn" the identity weights (the weights associated with the identity shortcut connections).
  • The structure of ResNet blocks forces the network to learn residuals (the difference between the input and output of a layer) using shallow subnetworks, even though deeper networks might be better at expressing complex relationships.

The Proposed Solution: ResNet in ResNet (RiR)

To address these limitations, the authors introduce a new architecture called ResNet in ResNet (RiR). RiR combines ResNets with standard convolutional neural networks (CNNs) in a parallel structure. This means that the RiR architecture has two streams of information flowing through it: a "residual stream" (like in a regular ResNet) and a "transient stream" (like in a regular CNN).

Generalized Residual Block

The basic building block of the RiR architecture is the "generalized residual block." This block has two parallel streams:

  • Residual Stream (r): This stream is similar to a standard ResNet block, with identity shortcut connections and a convolutional layer.
  • Transient Stream (t): This stream is a standard convolutional layer without shortcut connections.

These streams are connected in such a way that information can flow between them. Specifically, there are convolutional filters that transfer information from the residual stream to the transient stream ($W_{l, r \to t}$) and from the transient stream to the residual stream ($W_{l, t \to r}$).

Here are the equations that define how information flows through the generalized residual block:

$\mathbf{r}_{l+1} = \sigma(\mbox{conv}(\mathbf{r}_l, W_{l, r \to r}) + \mbox{conv}(\mathbf{t}_l, W_{l, t \to r}) + \mbox{shortcut}(\mathbf{r}_l))$

$\mathbf{t}_{l+1} = \sigma(\mbox{conv}(\mathbf{r}_l, W_{l, r \to t}) + \mbox{conv}(\mathbf{t}_l, W_{l, t \to t}))$

Where:

  • $\mathbf{r}_l$ is the output of the residual stream at layer $l$.
  • $\mathbf{t}_l$ is the output of the transient stream at layer $l$.
  • $W_{l, r \to r}$ is the convolutional filter applied within the residual stream.
  • $W_{l, t \to r}$ is the convolutional filter from the transient stream to the residual stream.
  • $W_{l, r \to t}$ is the convolutional filter from the residual stream to the transient stream.
  • $W_{l, t \to t}$ is the convolutional filter applied within the transient stream.
  • $\mbox{conv}(x, W)$ denotes a convolution of $x$ with the filter $W$.
  • $\mbox{shortcut}(\mathbf{r}_l)$ is the identity shortcut connection, which passes $\mathbf{r}_l$ through unchanged.
  • $\sigma$ denotes batch normalization followed by a ReLU nonlinearity.

In essence, the residual stream works like a regular ResNet, while the transient stream adds the ability to process information non-linearly and discard information from earlier layers.
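To make the data flow concrete, here is a minimal PyTorch sketch of the generalized residual block written directly from the two equations above. The class name, channel sizes, and the choice to apply batch normalization and ReLU to the full sum (including the shortcut) are illustrative assumptions, not the authors' reference implementation.

```python
import torch
import torch.nn as nn

class GeneralizedResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # Four 3x3 convolutions: r->r, t->r, r->t, t->t (same channels, same resolution).
        self.conv_r_to_r = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.conv_t_to_r = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.conv_r_to_t = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.conv_t_to_t = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        # sigma = batch normalization followed by ReLU, one per stream.
        self.bn_r = nn.BatchNorm2d(channels)
        self.bn_t = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, r, t):
        # r_{l+1} = sigma(conv(r_l, W_rr) + conv(t_l, W_tr) + shortcut(r_l))
        r_next = self.relu(self.bn_r(self.conv_r_to_r(r) + self.conv_t_to_r(t) + r))
        # t_{l+1} = sigma(conv(r_l, W_rt) + conv(t_l, W_tt))  -- no shortcut on the transient stream
        t_next = self.relu(self.bn_t(self.conv_r_to_t(r) + self.conv_t_to_t(t)))
        return r_next, t_next

# Quick shape check on random CIFAR-sized inputs.
block = GeneralizedResidualBlock(channels=32)
r = torch.randn(8, 32, 32, 32)  # residual stream
t = torch.randn(8, 32, 32, 32)  # transient stream
r, t = block(r, t)
```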

ResNet Init

The authors also introduce a modified initialization scheme called "ResNet Init." This scheme allows the generalized residual block to be implemented as a single convolutional layer with a special initialization of its weights.
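The summary does not spell out the implementation, but one way to read "a single convolutional layer with a special initialization" is the sketch below: concatenate the two streams along the channel axis, process them with one convolution, and fold the identity shortcut into the initialization of the weight's $r \to r$ portion. The helper name and the exact initialization details are assumptions.

```python
import torch
import torch.nn as nn

def resnet_init_conv(channels):
    """Single convolution over the concatenated [r, t] streams whose r->r
    portion gets an added identity kernel, so the shortcut is expressed
    through initialization rather than a separate addition (a sketch of
    the 'ResNet Init' idea, not the authors' code)."""
    conv = nn.Conv2d(2 * channels, 2 * channels, kernel_size=3, padding=1, bias=False)
    with torch.no_grad():
        # Output channel i (residual stream) copies input channel i (residual
        # stream) at the kernel centre, on top of the random initialization.
        for i in range(channels):
            conv.weight[i, i, 1, 1] += 1.0
    return conv
```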

How RiR Works

RiR replaces the standard convolutional layers within a ResNet block with the generalized residual blocks described above. This creates a "ResNet inside a ResNet" structure, hence the name. The generalized residual block can function as a standard CNN layer or a single-layer ResNet block by learning to zero out either the residual or transient stream. By repeating the generalized residual block, the architecture gains the ability to learn anything between these two extremes. This enables the network to learn residuals with a varying number of processing steps before being added back into the residual stream.
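As a rough illustration of how these blocks might be assembled end to end, the sketch below stacks several generalized residual blocks (reusing the GeneralizedResidualBlock sketch above) between a two-stream stem and a simple classifier head. It is a deliberately reduced toy model: the stem, the merge-by-concatenation, and the absence of downsampling stages and widening are all assumptions, not the paper's full CIFAR architecture.

```python
import torch
import torch.nn as nn

class TinyRiR(nn.Module):
    def __init__(self, channels=32, depth=4, num_classes=10):
        super().__init__()
        # Split the input image into a residual and a transient stream.
        self.stem_r = nn.Conv2d(3, channels, 3, padding=1)
        self.stem_t = nn.Conv2d(3, channels, 3, padding=1)
        self.blocks = nn.ModuleList(
            GeneralizedResidualBlock(channels) for _ in range(depth))
        self.head = nn.Linear(2 * channels, num_classes)

    def forward(self, x):
        r, t = self.stem_r(x), self.stem_t(x)
        for block in self.blocks:
            r, t = block(r, t)
        # Merge the streams by concatenation, then global average pool.
        merged = torch.cat([r, t], dim=1).mean(dim=(2, 3))
        return self.head(merged)

logits = TinyRiR()(torch.randn(2, 3, 32, 32))  # -> shape (2, 10)
```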

Experiments and Results

The authors evaluated their RiR architecture on the CIFAR-10 and CIFAR-100 datasets, which are commonly used for image classification. They compared RiR to other state-of-the-art architectures, including standard ResNets and CNNs.

The results showed that RiR consistently outperformed ResNets and achieved state-of-the-art results on CIFAR-100. The authors also found that a wider version of RiR (with more filters in the convolutional layers) was particularly effective.

Key Findings

  • RiR improves upon ResNets by combining residual and non-residual streams.
  • The generalized residual block allows the network to learn more complex representations.
  • RiR achieves state-of-the-art performance on CIFAR-100.
  • The RiR architecture is robust to increasing the depth of residual blocks and allows for training of deeper residuals compared to the original ResNet.

In Simple Terms

Imagine you're trying to assemble a puzzle. A regular ResNet is like having a set of instructions that you must follow step-by-step. RiR is like having those same instructions, but also having the freedom to explore other possibilities and try different combinations of pieces. The "residual stream" keeps you on track with the original instructions, while the "transient stream" allows you to experiment and potentially find a better solution. This combination of structure and flexibility is what makes RiR so powerful.