Pay Less Attention with Lightweight and Dynamic Convolutions (1901.10430v2)

Published 29 Jan 2019 in cs.CL

Abstract: Self-attention is a useful mechanism to build generative models for language and images. It determines the importance of context elements by comparing each element to the current time step. In this paper, we show that a very lightweight convolution can perform competitively to the best reported self-attention results. Next, we introduce dynamic convolutions which are simpler and more efficient than self-attention. We predict separate convolution kernels based solely on the current time-step in order to determine the importance of context elements. The number of operations required by this approach scales linearly in the input length, whereas self-attention is quadratic. Experiments on large-scale machine translation, language modeling and abstractive summarization show that dynamic convolutions improve over strong self-attention models. On the WMT'14 English-German test set dynamic convolutions achieve a new state of the art of 29.7 BLEU.

Summary

The paper introduces lightweight convolutions that share weights and normalize via softmax, significantly reducing parameters while maintaining accuracy.
It presents dynamic convolutions that predict distinct kernels per time step, offering a linear complexity alternative to quadratic self-attention.
Experiments report a state-of-the-art 29.7 BLEU on WMT'14 English-German and a 20% faster runtime than an optimized self-attention baseline, with competitive results on language modeling and abstractive summarization.

Pay Less Attention with Lightweight and Dynamic Convolutions: An Overview

This paper presents a compelling alternative to self-attention in sequence modeling through the introduction of lightweight and dynamic convolutions. The proposed methods challenge the traditional dominance of self-attention by offering a simpler and more computationally efficient approach.

Key Contributions

Lightweight Convolutions: These are depth-wise convolutions that share weights across groups of channels and normalize the kernel weights with a softmax over the kernel width. This approach significantly reduces the number of required parameters compared to non-separable convolutions, and it maintains strong performance even though the same weights are reused for context elements regardless of the time step.
Dynamic Convolutions: Building upon lightweight convolutions, dynamic convolutions predict a separate kernel for each time step, based solely on the input at that step. The weights can therefore vary over time, akin to self-attention, but the computational complexity scales linearly with input length rather than quadratically.
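
The two operations described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`softmax`, `apply_kernels`, `lightweight_conv`, `dynamic_conv`), the centered window with zero padding, and the nested-list data layout are all choices made for this example, and surrounding layers such as input projections and dropout are omitted.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def apply_kernels(x, kernel_at):
    """Depth-wise convolution over time with zero padding (assumed centered).

    x: list of n time steps, each a list of d channel values.
    kernel_at(i, c): the k normalized weights used at step i for channel c.
    """
    n, d = len(x), len(x[0])
    out = [[0.0] * d for _ in range(n)]
    for i in range(n):
        for c in range(d):
            w = kernel_at(i, c)
            pad = len(w) // 2
            for j, wj in enumerate(w):
                t = i + j - pad            # context position covered by weight j
                if 0 <= t < n:             # positions outside the input count as zero
                    out[i][c] += wj * x[t][c]
    return out

def lightweight_conv(x, weight):
    """weight: H rows of k raw weights; each row is shared by d // H channels."""
    d, heads = len(x[0]), len(weight)
    assert d % heads == 0, "channels must split evenly into heads"
    per_head = d // heads
    normalized = [softmax(row) for row in weight]   # softmax over kernel width
    # The same kernel is used at every time step i.
    return apply_kernels(x, lambda i, c: normalized[c // per_head])

def dynamic_conv(x, proj):
    """proj[h][j]: d weights of a linear map from x[i] to kernel entry (h, j)."""
    d, heads = len(x[0]), len(proj)
    assert d % heads == 0, "channels must split evenly into heads"
    per_head = d // heads
    # Predict one softmax-normalized kernel per time step and head from x[i] alone.
    kernels = [
        [softmax([sum(p * v for p, v in zip(pj, xi)) for pj in head]) for head in proj]
        for xi in x
    ]
    return apply_kernels(x, lambda i, c: kernels[i][c // per_head])
```

With all raw weights set to zero the softmax is uniform, so `lightweight_conv` reduces to a moving average over the window. The only difference in `dynamic_conv` is that the kernel lookup depends on the time step `i`.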

Experimental Results

The experimental validation across multiple tasks (machine translation, language modeling, and abstractive summarization) demonstrates the efficacy of the proposed methods:

Machine Translation: Dynamic convolutions set a new state-of-the-art BLEU score of 29.7 on the WMT'14 English-German test set and matched previous best results on other benchmarks such as WMT English-French.
Efficiency: Dynamic convolutions achieved a 20% faster runtime compared to optimized self-attention models while maintaining or exceeding accuracy.
Task Versatility: The methods were competitive across various tasks, confirming their applicability beyond just translation.

Theoretical and Practical Implications

The findings suggest that the necessity of content-based self-attention may be overestimated for some applications. Dynamic and lightweight convolutions offer practical benefits in scenarios where computational resources are limited or where efficiency is paramount, such as real-time language processing tasks.
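
To make the resource argument concrete, the following sketch counts how many context weights each mechanism computes for an input of length n, and how many kernel parameters each convolution variant needs. The function names and the example sizes (d = 1024 channels, kernel width k = 7, H = 16 heads) are assumptions chosen for illustration. Biases and the projection that predicts dynamic kernels are ignored, so the numbers are rough counts rather than measured costs.

```python
def attention_weights(n):
    """Self-attention scores every position against every position: n * n."""
    return n * n

def conv_weights(n, k):
    """A width-k convolution weights at most k context elements per position."""
    return n * min(k, n)

def kernel_parameters(d, k, heads):
    """Kernel parameter counts for one layer, ignoring biases."""
    return {
        "regular": d * d * k,      # every output channel sees every input channel
        "depthwise": d * k,        # one width-k kernel per channel
        "lightweight": heads * k,  # one width-k kernel per head of shared channels
    }

# Doubling the input length quadruples the attention count but only doubles
# the convolution count.
for n in (500, 1000, 2000):
    print(n, attention_weights(n), conv_weights(n, k=7))

print(kernel_parameters(d=1024, k=7, heads=16))
```

The counts only show how the cost scales with n. Actual runtime also depends on hardware and implementation details that this sketch does not model.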

Future Directions

This work opens avenues for further exploration in sequence modeling, especially in extending dynamic convolutions to other domains like computer vision or large-scale question answering systems. Additionally, the integration of these methods with reinforcement learning approaches could enhance performance further, particularly in scenarios with long input sequences.

In summary, this paper provides a well-articulated challenge to the self-attention paradigm, proposing a viable alternative with potential widespread applicability in natural language processing and beyond. The use of lightweight and dynamic convolutions may lead to more efficient and scalable models in future AI developments.
