- The paper establishes that Transformers are universal approximators for continuous permutation equivariant seq2seq functions using self-attention and feed-forward layers.
- It demonstrates that adding trainable positional encodings allows approximation of arbitrary continuous seq2seq functions on compact domains.
- The work formalizes contextual mappings and shows that self-attention layers can compute them, suggesting avenues for more efficient Transformer architectures in practice.
The paper "Are Transformers universal approximators of sequence-to-sequence functions?" presents a pivotal analysis of the expressive capacity of Transformer models, specifically in the context of sequence-to-sequence (seq2seq) functions. Despite the widespread utilization of Transformers in NLP for tasks such as machine translation and LLMing, a rigorous understanding of their expressive power has remained elusive. This paper offers a comprehensive mathematical examination, proving that Transformers are universal approximators of continuous permutation equivariant seq2seq functions with compact support.
Main Contributions
The paper makes several key contributions:
- Universal Approximation Proof: It establishes that Transformers, built from self-attention and token-wise feed-forward layers, are universal approximators of continuous permutation equivariant seq2seq functions. This is notable given the architecture's constraints: parameters are shared across tokens, and inter-token interactions occur only through pairwise dot-product attention (a toy numerical sketch of these properties appears after this list).
- Extension with Positional Encodings: By adding trainable positional encodings, the paper shows that Transformers can approximate arbitrary continuous seq2seq functions on compact domains, removing the restriction to permutation equivariant functions.
- Contextual Mappings: The authors introduce and formalize contextual mappings, in which each token's output depends on the entire input sequence, so that the same token receives a distinct representation in each distinct context. They show that self-attention layers can compute such mappings, underscoring their role in the universal approximation capability of Transformers.
- Experimental Evaluation: The paper also examines other layers that can compute contextual mappings to some extent, such as bilinear projections and separable convolutions, and reports experiments suggesting that substituting some self-attention layers in Transformers with these alternatives can be competitive and in some cases improve performance.
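The following is a minimal numerical sketch, not taken from the paper: a single-head self-attention block with a token-wise feed-forward layer, built in NumPy with random weights (names such as `transformer_block`, `Wq`, and `E` are my own). It illustrates three properties the contributions above rely on: permutation equivariance of the block, context dependence of self-attention outputs, and the loss of permutation equivariance once positional encodings are added.

```python
# Toy sketch (assumed architecture, random weights): one self-attention head
# plus a token-wise feed-forward layer, with d=4 embedding dims and n=3 tokens.
import numpy as np

rng = np.random.default_rng(0)
d, n = 4, 3

# Random projection matrices for the attention head and the FFN.
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
W1, W2 = rng.normal(size=(d, 8)), rng.normal(size=(8, d))


def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def transformer_block(X):
    """X: (n, d) token embeddings -> (n, d). Single head, no masking."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    attn = softmax(Q @ K.T / np.sqrt(d)) @ V        # token mixing via dot-products
    H = X + attn                                    # residual connection
    return H + np.maximum(H @ W1, 0.0) @ W2         # token-wise ReLU FFN


X = rng.normal(size=(n, d))
perm = np.array([2, 0, 1])

# (i) Permutation equivariance: permuting input rows permutes output rows.
assert np.allclose(transformer_block(X[perm]), transformer_block(X)[perm])

# (ii) Context dependence: the same token embedded in two different sequences
# gets two different outputs, because attention mixes in the other tokens.
X2 = X.copy()
X2[1:] = rng.normal(size=(n - 1, d))                # keep token 0, change the rest
print("token 0 output differs across contexts:",
      not np.allclose(transformer_block(X)[0], transformer_block(X2)[0]))

# (iii) Adding (here random) positional encodings E breaks permutation
# equivariance, which is what allows non-equivariant functions to be approximated.
E = rng.normal(size=(n, d))
print("still equivariant with positions?",
      np.allclose(transformer_block(X[perm] + E), transformer_block(X + E)[perm]))
```

Checks (i) and (iii) hold for any weights because the feed-forward layer acts on each token independently and attention depends only on pairwise dot-products; check (ii) holds generically for random weights.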
Theoretical Framework and Implications
The theoretical framework of universal approximation originates in classical neural network theory, where results show that feed-forward networks can approximate any continuous function on a compact set to arbitrary precision. This paper extends such results to Transformer architectures for seq2seq functions, highlighting the distinct roles played by the self-attention and feed-forward layers.
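Stated informally, and omitting the exact widths and head counts given in the paper's theorems, the results take the following form: for any $1 \le p < \infty$, any $\epsilon > 0$, and any continuous permutation equivariant seq2seq function $f: \mathbb{R}^{d \times n} \to \mathbb{R}^{d \times n}$ with compact support, there exists a Transformer network $g$ such that

$$
d_p(f, g) \;=\; \Big( \int \lVert f(X) - g(X) \rVert_p^p \, dX \Big)^{1/p} \;\le\; \epsilon,
$$

and, once trainable positional encodings are added, an analogous guarantee holds for arbitrary continuous seq2seq functions on a compact domain.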
The implications of this research are significant for both theory and practice. Theoretically, it solidifies the understanding of Transformers' capacity to represent complex functions. Practically, it suggests that more efficient architectures might be designed by exploiting the distinct roles of Transformer components, potentially reducing training and deployment costs in NLP tasks.
Future Directions
The research opens several avenues for future exploration. Investigating alternative layers that can implement contextual mappings at reduced computational cost could lead to novel and efficient model designs. Further empirical studies across different tasks and domains would strengthen the practical applicability of these theoretical insights. Additionally, examining the mechanisms behind the contextual embeddings that trained Transformers actually compute, potentially beyond the mathematical formalization presented here, could yield deeper insight into their expressiveness.
In summary, the paper provides a rigorous and insightful analysis of the universal approximation capabilities of Transformers, offering valuable contributions to the understanding of their expressive power and setting the stage for future innovation in model architecture design.