- The paper proposes a novel feedforward kernel approach that approximates self-attention, reducing computational complexity from quadratic to linear.
- It utilizes trainable kernel functions, including configurations with GLUs and orthogonal regularization, to model efficient attention mechanisms with minimal parameter overhead.
- Empirical evaluations on Long Range Arena tasks demonstrate that the proposed method achieves competitive or superior performance compared to traditional Transformer models.
Linear Self-Attention Approximation via Trainable Feedforward Kernel
Overview
The paper "Linear Self-Attention Approximation via Trainable Feedforward Kernel" addresses the computational limitations inherent in standard Transformer models due to the quadratic complexity of self-attention, especially for long sequences. The researchers propose a method to approximate the self-attention mechanism using trainable kernel functions implemented through feedforward neural networks (FFNs), significantly reducing the computational complexity to linear. The work is evaluated using tasks from the Long Range Arena benchmark, providing an empirical basis to assess the model's performance.
Kernelized Attention Mechanism
The paper builds on kernelized self-attention, in which the attention operation is decomposed into projections and dot products. The softmax similarity is replaced by a kernel κ(q_i, k_j) approximated as ϕ(q_i)ᵀϕ(k_j), where ϕ(⋅) projects its input into a higher-dimensional space. Because the kernel factorizes, the attention output for each query can be rewritten as ϕ(q_i)ᵀ Σ_j ϕ(k_j)v_jᵀ, normalized by ϕ(q_i)ᵀ Σ_j ϕ(k_j); the key-value sums are shared across queries, so the cost becomes linear rather than quadratic in sequence length. Instead of fixed or random feature maps, the paper makes ϕ(⋅) a trainable feedforward network using positive projection functions such as Softplus, relying on the capacity of FFNs to act as universal approximators.
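To make the factorization concrete, below is a minimal sketch (not the authors' implementation) of kernelized linear attention with a Softplus feature map, assuming PyTorch; the function name `linear_attention` and the tensor layout `(batch, seq_len, dim)` are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, phi=F.softplus, eps=1e-6):
    """Kernelized attention: softmax(QK^T)V is approximated by
    phi(Q) [phi(K)^T V], normalized row-wise, so the cost is linear
    in sequence length instead of quadratic.

    q, k, v: tensors of shape (batch, seq_len, dim)
    """
    q, k = phi(q), phi(k)                                # positive feature maps
    kv = torch.einsum("bnd,bne->bde", k, v)              # sum_j phi(k_j) v_j^T
    z = 1.0 / (torch.einsum("bnd,bd->bn", q, k.sum(dim=1)) + eps)  # normalizer
    return torch.einsum("bnd,bde,bn->bne", q, kv, z)     # phi(q_i)^T kv, rescaled
```

Note that this sketch computes bidirectional (non-causal) attention, which matches classification-style Long Range Arena tasks; a causal variant would accumulate the key-value sums with a prefix scan.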
Feedforward and Gated Linear Units (GLU)
The authors explore several configurations of the projection function ϕ(⋅). Starting with a basic single-layer FFN using Softplus activation, they find it offers competitive results compared to other efficient attention mechanisms like Performer. The paper also investigates more complex layers, such as Gated Linear Units (GLUs), which incorporate a gating mechanism. These units introduce element-wise nonlinearity and allow modeling of more complex functions, though at the cost of increased parameterization.
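The two configurations can be sketched as follows, again assuming PyTorch; the class names and the exact placement of the Softplus and sigmoid in the gated variant are assumptions and may differ from the paper's precise setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FFNFeatureMap(nn.Module):
    """Single-layer feedforward projection: phi(x) = Softplus(W x + b)."""
    def __init__(self, dim, proj_dim):
        super().__init__()
        self.proj = nn.Linear(dim, proj_dim)

    def forward(self, x):
        return F.softplus(self.proj(x))

class GLUFeatureMap(nn.Module):
    """Gated variant: phi(x) = Softplus(W x + b) * sigmoid(V x + c).
    The sigmoid gate adds element-wise nonlinearity, while the Softplus
    on the value path keeps the features positive, at the cost of a
    second weight matrix."""
    def __init__(self, dim, proj_dim):
        super().__init__()
        self.value = nn.Linear(dim, proj_dim)
        self.gate = nn.Linear(dim, proj_dim)

    def forward(self, x):
        return F.softplus(self.value(x)) * torch.sigmoid(self.gate(x))
```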
To further improve performance, the authors employ an orthogonality-based strategy, initializing the projection weights orthogonally and drawing on techniques from the literature on orthogonal deep networks. Empirically, the regularized GLUs (OGLUs) converge quickly and show reduced variance across training runs.
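One common way to realize such a strategy is sketched below, assuming a soft-orthogonality penalty combined with orthogonal initialization; the coefficient `beta` and the penalty form are illustrative and may differ from the paper's exact formulation.

```python
import torch
import torch.nn as nn

def orthogonality_penalty(weight, beta=1e-3):
    """Soft orthogonality regularizer beta * ||W^T W - I||_F^2,
    added to the task loss during training."""
    wtw = weight.t() @ weight
    eye = torch.eye(wtw.shape[0], device=weight.device)
    return beta * torch.norm(wtw - eye, p="fro") ** 2

# Illustrative usage: orthogonally initialize a projection layer and
# regularize it while training on a placeholder objective.
proj = nn.Linear(64, 128)
nn.init.orthogonal_(proj.weight)       # orthogonal initialization

x = torch.randn(8, 64)
task_loss = proj(x).pow(2).mean()      # stand-in for the real task loss
loss = task_loss + orthogonality_penalty(proj.weight)
loss.backward()
```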
Experimental Evaluation
The researchers conduct experiments on three tasks from the Long Range Arena: text classification, document matching, and ListOps. To keep the comparison fair, the extra parameters introduced by the trainable kernel are kept below 10% of the baseline model's parameter count. The results show that the single-layer OGLU achieves performance superior or competitive to state-of-the-art efficient-attention models on these tasks.
Implications and Future Directions
The findings suggest a promising direction for reducing the computational footprint of Transformer-like architectures using trainable kernels. By using feedforward neural networks as kernel approximators, Transformers can be applied to much longer sequences at manageable cost, which is crucial for tasks involving long documents or time-series data.
Future work may further optimize kernel approximation techniques, explore more sophisticated gating mechanisms, or apply these methods in other domains, such as computer vision or reinforcement learning. Additionally, potential applications could draw benefits from the architecture's reduced complexity for deployment in resource-constrained environments, such as mobile devices or real-time systems.
Conclusion
This paper presents a novel approach to approximate self-attention in Transformers using trainable feedforward kernels. By reducing complexity from quadratic to linear, the authors offer an efficient solution that can handle long sequences without sacrificing performance. The experimental results validate the efficacy of the proposed method, opening avenues for further research in efficient attention mechanisms and their applications in large-scale data processing tasks.