Saturated Transformers are Constant-Depth Threshold Circuits

Published 30 Jun 2021 in cs.CL, cs.CC, and cs.LG | (2106.16213v3)

Abstract: Transformers have become a standard neural network architecture for many NLP problems, motivating theoretical analysis of their power in terms of formal languages. Recent work has shown that transformers with hard attention are quite limited in power (Hahn, 2020), as they can be simulated by constant-depth AND/OR circuits (Hao et al. 2021). However, hard attention is a strong assumption, which may complicate the relevance of these results in practice. In this work, we analyze the circuit complexity of transformers with saturated attention: a generalization of hard attention that more closely captures the attention patterns learnable in practical transformers. We first show that saturated transformers transcend the known limitations of hard-attention transformers. We then prove saturated transformers with floating-point values can be simulated by constant-depth threshold circuits, giving the class $\mathsf{TC}^0$ as an upper bound on the formal languages they recognize.

Summary

The paper shows that saturated attention lets transformers recognize languages beyond the reach of hard-attention transformers, while saturated transformers with floating-point values can still be simulated by constant-depth threshold circuits ($\mathsf{TC}^0$).
The methodology uses formal circuit complexity analysis to compare hard and saturated attention and to bound the power of each.
The findings open new research avenues by linking neural network architectures with circuit complexity theory for practical NLP applications.

The Power of Saturated Transformers

The paper "Saturated Transformers are Constant-Depth Threshold Circuits" analyzes the theoretical capabilities of transformers with saturated attention. It addresses the limitations that arise when transformers are assumed to use hard attention, where attention is placed wholly on a single position, and characterizes the additional power gained from saturated attention.

Background and Context

Transformers have become a standard architecture for NLP tasks, which motivates a closer look at their theoretical capabilities. Recent work has shown that transformers with hard attention are quite limited: they can be simulated by $\mathsf{AC}^0$ circuits, that is, constant-depth AND/OR circuits (Hahn, 2020; Hao et al. 2021). However, hard attention is a strong assumption, and real-world transformers typically learn more nuanced attention distributions. This motivates the study of saturated attention, which more closely captures the attention patterns learnable in practice.

Saturated vs Hard Attention

Saturated attention generalizes hard attention by averaging over multiple positions: when several positions are tied, attention is distributed across the whole tied subset rather than placed on a single index. This form of attention is argued to align better with the patterns that transformers learn during training. The paper establishes that saturated attention extends the power of transformers beyond the limits identified for hard-attention models.
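The difference between the two attention variants can be made concrete with a minimal sketch. The function names, the use of exact fractions, and the leftmost tie-breaking rule for hard attention are assumptions made for illustration; they are not taken from the paper.

```python
from fractions import Fraction


def hard_attention(scores):
    """All weight on a single maximal position (leftmost on ties: an assumed convention)."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return [Fraction(1) if i == best else Fraction(0) for i in range(len(scores))]


def saturated_attention(scores):
    """Weight spread uniformly over every position tied for the maximum score."""
    top = max(scores)
    tied = sum(1 for s in scores if s == top)
    return [Fraction(1, tied) if s == top else Fraction(0) for s in scores]


def attend(weights, values):
    """Attention output: the weighted sum of the values."""
    return sum(w * v for w, v in zip(weights, values))
```

For scores `[3, 1, 3, 3]` and values `[1, 0, 0, 1]`, the hard head looks only at position 0 and returns 1, while the saturated head averages over the three tied positions and returns 2/3. The saturated head can therefore report a proportion, which a single hard-attention lookup cannot.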

Key Results and Implications

Through formal circuit complexity analysis, the paper demonstrates two main results. First, saturated transformers can recognize languages beyond $\mathsf{AC}^0$, including the majority language, which is known to lie outside $\mathsf{AC}^0$. Because hard-attention transformers can be simulated by $\mathsf{AC}^0$ circuits, they cannot recognize majority, so saturated attention gives access to languages that hard attention does not. Second, saturated transformers operating on floating-point representations can be simulated by $\mathsf{TC}^0$ circuits, which makes $\mathsf{TC}^0$ an upper bound on the languages they recognize. The floating-point restriction matters here: without such a constraint (for example, with rational-valued representations) the model might be able to recognize arbitrary languages. Under the restriction, saturated transformers are more expressive than hard-attention models yet remain within a well-understood complexity class.
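To see why majority is a natural fit for saturated attention, consider a head whose scores all tie, so that it attends uniformly to every position. The sketch below is illustrative only and is not the paper's construction: the function names are hypothetical, the input is assumed to be a non-empty list of 0/1 values, and majority is taken to mean "strictly more 1s than 0s". It also shows the same predicate as a single threshold gate, the gate type that $\mathsf{TC}^0$ circuits add to AND/OR circuits.

```python
from fractions import Fraction


def uniform_head_mean(bits):
    """Saturated head with all scores tied: the output is the mean of the values.

    Assumes a non-empty input.
    """
    n = len(bits)
    weights = [Fraction(1, n)] * n  # every position ties for the maximum score
    return sum(w * b for w, b in zip(weights, bits))


def in_majority(bits):
    """Accept iff strictly more than half of the bits are 1 (assumed convention)."""
    return uniform_head_mean(bits) > Fraction(1, 2)


def threshold_gate(bits, k):
    """Threshold gate: fires iff at least k of the inputs are 1."""
    return sum(bits) >= k
```

On any non-empty input, `in_majority(bits)` agrees with `threshold_gate(bits, len(bits) // 2 + 1)`. For example, `[1, 1, 0]` is accepted and `[1, 0]` is rejected by both.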

Future Directions

These findings suggest several directions for future work. One is to investigate further the connection between neural network architectures and circuit complexity theory, with the aim of pinning down the boundaries and hierarchies of expressive power across model types. Another is to study the implications of uniformity constraints on these models, a topic the paper touches on when considering how practical implementations might resemble the uniform variants of standard circuit complexity classes.

Conclusion

The paper argues for reevaluating theoretical limitations derived under hard attention and proposes saturated attention as a model that better captures practical transformer capabilities. By placing floating-point saturated transformers within $\mathsf{TC}^0$, it gives a sharper picture of what transformers can compute and a starting point for further work in computational linguistics and artificial intelligence.