- The paper's main contribution is the demonstration that a generic TCN consistently outperforms RNNs on a suite of sequence modeling benchmarks.
- It leverages causal and dilated convolutions with residual connections to achieve longer effective memory and stable gradients.
- Empirical results on synthetic and real-world tasks, including language and music modeling, support TCN's superior performance.
This paper, "An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling" (An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling, 2018), challenges the common assumption that recurrent neural networks (RNNs), such as LSTMs and GRUs, are the default architecture for sequence modeling tasks. The authors conduct a systematic empirical evaluation comparing a generic convolutional architecture, termed a Temporal Convolutional Network (TCN), against canonical RNNs on a wide range of sequence modeling benchmarks.
The core contribution is the demonstration that a simple TCN architecture consistently outperforms generic recurrent networks on tasks commonly used to benchmark RNNs, including synthetic stress tests and real-world datasets for polyphonic music modeling and language modeling. Furthermore, the paper presents evidence that TCNs exhibit significantly longer effective memory in practice than LSTMs and GRUs of similar capacity.
The proposed Temporal Convolutional Network (TCN) is a generic convolutional architecture designed for sequence modeling. Its distinguishing characteristics for practical implementation are:
- Causal Convolutions: Ensure that the prediction at time t depends only on inputs from time t and earlier. This is achieved with a 1D fully-convolutional network (FCN) that uses zero padding of length (kernel size - 1) to keep the sequence length constant across layers, with each convolution shifted so that it connects only to current and past inputs. This is crucial for autoregressive tasks where future information is not available.
- Dilated Convolutions: Enable the network to have an exponentially large receptive field with respect to its depth. For a 1D sequence x and filter f of size k, the dilated convolution at position s is defined as $F(s) = (x *_d f)(s) = \sum_{i=0}^{k-1} f(i) \cdot x_{s - d \cdot i}$, where d is the dilation factor and k is the filter size. The authors use exponentially increasing dilation factors ($d = O(2^i)$ at layer i) to cover a wide historical context efficiently. This is a practical way to increase the effective memory of the network without growing the kernel size or the network depth linearly with the desired history length.
- Residual Connections: Borrowed from ResNet (He et al., 2016), these connections add the input of a layer to its output, o = Activation(x + F(x)). The TCN residual block consists of two dilated causal convolutions, each followed by weight normalization, a ReLU activation, and spatial dropout. A 1x1 convolution on the residual path matches dimensions when the input and output widths differ. Residual connections are important for building the very deep networks needed for large receptive fields, and they help stabilize training and improve performance. A minimal code sketch of such a block follows this list.
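A minimal PyTorch sketch of one such residual block, assuming the standard torch.nn API; the layer widths, the left-padding strategy, and the use of ordinary rather than spatial (channel-wise) dropout are illustrative choices, not the authors' exact configuration:

```python
import torch
import torch.nn as nn

class CausalConv1d(nn.Module):
    """1D convolution padded on the left so that output[t] depends only on input[<= t]."""
    def __init__(self, in_ch, out_ch, kernel_size, dilation):
        super().__init__()
        self.left_pad = (kernel_size - 1) * dilation
        self.conv = nn.utils.weight_norm(
            nn.Conv1d(in_ch, out_ch, kernel_size, dilation=dilation))

    def forward(self, x):                             # x: (batch, channels, time)
        x = nn.functional.pad(x, (self.left_pad, 0))  # pad only the past (left) side
        return self.conv(x)

class TemporalBlock(nn.Module):
    """TCN residual block: two dilated causal convolutions with weight norm,
    ReLU, and dropout, plus a 1x1 convolution on the residual path if the
    channel widths differ."""
    def __init__(self, in_ch, out_ch, kernel_size, dilation, dropout=0.2):
        super().__init__()
        self.net = nn.Sequential(
            CausalConv1d(in_ch, out_ch, kernel_size, dilation),
            nn.ReLU(),
            nn.Dropout(dropout),
            CausalConv1d(out_ch, out_ch, kernel_size, dilation),
            nn.ReLU(),
            nn.Dropout(dropout),
        )
        self.downsample = nn.Conv1d(in_ch, out_ch, 1) if in_ch != out_ch else nn.Identity()

    def forward(self, x):
        return torch.relu(self.net(x) + self.downsample(x))
```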
Practical Advantages of TCNs Highlighted in the Paper:
- Parallelism: Convolutions can be computed in parallel across timesteps within a layer, unlike the step-by-step processing of RNNs, so a long sequence can be processed as a whole. This leads to faster training and inference on long sequences (illustrated in the sketch after this list).
- Flexible Receptive Field Size: The effective history size can be easily controlled by adjusting the network depth, filter size, and dilation factors. This makes TCNs adaptable to tasks requiring different amounts of historical context.
- Stable Gradients: The backpropagation path in TCNs does not follow the temporal direction, mitigating vanishing/exploding gradient problems common in deep RNNs. This simplifies training compared to basic RNNs.
- Low Memory Requirement for Training: TCNs generally require less memory during training compared to gated RNNs (like LSTMs/GRUs) which store states for multiple gates.
- Variable Length Inputs: Like RNNs, TCNs can process sequences of arbitrary length using 1D convolutional kernels that slide over the input.
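As a quick illustration of the parallelism and variable-length points, a single forward pass handles every timestep at once and the same weights apply to sequences of any length; this sketch reuses the hypothetical TemporalBlock defined earlier:

```python
import torch

# One TCN block processes all timesteps of a batch in a single call, and the
# same module accepts sequences of different lengths without modification.
block = TemporalBlock(in_ch=8, out_ch=16, kernel_size=3, dilation=2)

short = torch.randn(4, 8, 100)     # (batch, channels, time=100)
long = torch.randn(4, 8, 10_000)   # same model, time=10000

print(block(short).shape)  # torch.Size([4, 16, 100])
print(block(long).shape)   # torch.Size([4, 16, 10000])
```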
Practical Disadvantages/Considerations:
- Data Storage during Evaluation: Unlike RNNs, which can process sequence elements one at a time while maintaining a fixed-size hidden state summarizing the past, a TCN needs the raw input sequence up to its receptive field length in order to make a prediction. This can increase memory usage during inference on very long sequences, since that history has to be buffered (a buffering sketch follows this list).
- Domain Transfer: If transferring a pre-trained TCN from a domain requiring little memory to one requiring much longer memory, the network's receptive field (determined by architecture hyperparameters) might be insufficient, potentially requiring changes to the architecture parameters.
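For streaming inference, one workable pattern is to keep a rolling buffer of the last receptive-field-many input frames and rerun the network on that window at each step; the stream_predict helper below is an illustrative sketch (not from the paper's code) that trades extra compute for bounded memory:

```python
from collections import deque

import torch

def stream_predict(model, frames, receptive_field):
    """Run a TCN frame by frame, buffering only the last `receptive_field`
    inputs rather than the full history."""
    buffer = deque(maxlen=receptive_field)
    outputs = []
    for frame in frames:                      # frame: tensor of shape (channels,)
        buffer.append(frame)
        window = torch.stack(list(buffer), dim=-1).unsqueeze(0)  # (1, C, t <= R)
        with torch.no_grad():
            y = model(window)[:, :, -1]       # prediction for the newest timestep
        outputs.append(y.squeeze(0))
    return torch.stack(outputs)               # (T, out_channels)
```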
Experimental Evaluation and Results:
The authors evaluated TCNs and canonical RNNs (LSTM, GRU, vanilla RNN) on a suite of tasks including:
- Synthetic: Adding Problem, Sequential MNIST, Permuted MNIST, and Copy Memory. These tasks specifically test the network's ability to capture long-term dependencies and retain information over long spans (the adding problem data generation is sketched after this list).
- Real-world: Polyphonic music modeling (JSB Chorales, Nottingham), word-level language modeling (PTB, WikiText-103, LAMBADA), and character-level language modeling (PTB, text8).
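For reference, the adding problem can be generated in a few lines. This is a sketch of the standard formulation (two channels: random values and a 0/1 marker selecting two positions; the target is the sum of the two marked values), not the authors' exact data pipeline:

```python
import torch

def adding_problem_batch(batch_size, seq_len):
    """Inputs: (batch, 2, seq_len) with random values in channel 0 and a
    marker channel that is 1 at exactly two positions. Target: the sum of
    the two marked values, which forces the model to retain information
    across the whole sequence."""
    values = torch.rand(batch_size, 1, seq_len)
    markers = torch.zeros(batch_size, 1, seq_len)
    for b in range(batch_size):
        idx = torch.randperm(seq_len)[:2]     # two distinct marked positions
        markers[b, 0, idx] = 1.0
    x = torch.cat([values, markers], dim=1)
    y = (values * markers).sum(dim=(1, 2))    # (batch,)
    return x, y
```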
The results consistently showed that the generic TCN architecture outperformed canonical LSTMs and GRUs across these tasks. For instance, on the Copy Memory task with sequence length T=1000, the TCN achieved a loss of 3.5e-5, significantly better than the LSTM (0.0204) and GRU (0.0197). On Permuted MNIST, the TCN reached 97.2% accuracy compared to the LSTM's 85.7% and the GRU's 87.3%. On language modeling benchmarks such as WikiText-103 and LAMBADA, the TCN achieved lower perplexity than the LSTM results reported in prior work, indicating better performance on tasks that require long context.
A key finding is the analysis of effective memory. On the Copy Memory task with increasing sequence length T, the TCN maintained 100% accuracy, while LSTM and GRU accuracy dropped sharply for T as low as 50 and 200, respectively (for models with roughly 10K parameters). This empirically shows that TCNs can capture much longer dependencies in practice, challenging the notion that the theoretically unbounded memory of RNNs translates into an advantage in real-world applications.
Implementation Considerations:
- The TCN architecture is based on standard convolutional layers, dilation, and residual connections, which are readily available in deep learning frameworks (e.g., PyTorch, TensorFlow).
- Implementing causal convolutions requires careful padding so that the output at time t depends only on inputs up to time t. A common recipe is to pad with `padding=(kernel_size - 1) * dilation_factor` and keep only the causally valid positions, as sketched below.
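In PyTorch, for example, the padding argument of nn.Conv1d pads both ends of the sequence, so the trailing elements have to be trimmed to restore causality; a minimal sketch of that recipe (matching the formula above, not necessarily the authors' exact implementation):

```python
import torch
import torch.nn as nn

kernel_size, dilation = 3, 4
pad = (kernel_size - 1) * dilation

conv = nn.Conv1d(16, 16, kernel_size, dilation=dilation, padding=pad)

x = torch.randn(2, 16, 100)        # (batch, channels, time)
y = conv(x)[:, :, :-pad]           # Conv1d pads both sides; dropping the last
                                   # `pad` steps makes output[t] depend only on input[<= t]
assert y.shape[-1] == x.shape[-1]  # sequence length is preserved
```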
- The receptive field of a stack of dilated causal convolution layers with kernel size k and dilation factors $d_0, d_1, \ldots, d_{n-1}$ is $1 + \sum_{i=0}^{n-1} (k-1)\, d_i$. With exponentially increasing dilations $d_i = b^i$ for a base b, the receptive field grows exponentially with the number of layers n. Practitioners need to estimate the task's maximum dependency length and choose k, n, and the dilation base so that the receptive field covers it; a small helper for this calculation is sketched below.
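A small helper makes this calculation concrete; it implements the formula above (one convolution per level by default; set convs_per_layer=2 for a residual block containing two convolutions per level):

```python
def tcn_receptive_field(kernel_size, num_layers, dilation_base=2, convs_per_layer=1):
    """Receptive field of a stack of dilated causal convolutions with
    dilation dilation_base**i at level i."""
    return 1 + sum(convs_per_layer * (kernel_size - 1) * dilation_base ** i
                   for i in range(num_layers))

# e.g. kernel size 7 with 8 levels and dilations 1, 2, 4, ..., 128:
print(tcn_receptive_field(kernel_size=7, num_layers=8))  # 1531
```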
- Hyperparameter tuning for TCNs involves selecting the number of layers (n), kernel size (k), and dilation factors (e.g., base for exponential increase). The authors found TCNs relatively insensitive to hyperparameters provided the receptive field is sufficient. They also used standard techniques like weight normalization and spatial dropout for regularization.
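Putting the hyperparameters together, a full network is just a stack of residual blocks with exponentially increasing dilations; this sketch reuses the hypothetical TemporalBlock from earlier:

```python
import torch.nn as nn

def build_tcn(in_ch, hidden_ch, num_layers, kernel_size=3, dilation_base=2, dropout=0.2):
    """Stack `num_layers` TemporalBlock levels with dilations 1, b, b**2, ...
    so the receptive field grows exponentially with depth."""
    layers = []
    for i in range(num_layers):
        layers.append(TemporalBlock(
            in_ch if i == 0 else hidden_ch,
            hidden_ch,
            kernel_size,
            dilation=dilation_base ** i,
            dropout=dropout,
        ))
    return nn.Sequential(*layers)
```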
- The authors' code release at http://github.com/locuslab/TCN provides a practical starting point for implementing TCNs.
Conclusion:
The paper concludes that TCNs, with their combination of causal convolutions, dilated convolutions, and residual connections, provide a powerful and often more effective alternative to recurrent networks for sequence modeling tasks. Their advantages in parallelism, flexible receptive field control, and stable gradients make them a compelling starting point for practitioners working with sequential data. While state-of-the-art on some tasks might be achieved by highly specialized RNN variants, the generic TCN demonstrated superior performance over generic LSTMs and GRUs, suggesting a shift in the perception of which architecture is the natural first choice for sequence modeling.