Graph Convolutions Enrich the Self-Attention in Transformers! (2312.04234v5)
Abstract: Transformers, renowned for their self-attention mechanism, have achieved state-of-the-art performance across various tasks in natural language processing, computer vision, time-series modeling, and other domains. However, deep Transformer models suffer from the oversmoothing problem, in which representations across layers converge to indistinguishable values, causing significant performance degradation. We interpret the original self-attention as a simple graph filter and redesign it from a graph signal processing (GSP) perspective. We propose graph-filter-based self-attention (GFSA), which learns a more general yet effective attention mechanism at only slightly higher complexity than the original self-attention. We demonstrate that GFSA improves the performance of Transformers in various fields, including computer vision, natural language processing, graph-level tasks, speech recognition, and code classification.
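To make the graph-filter view of self-attention concrete, below is a minimal PyTorch-style sketch. It treats the row-stochastic attention matrix A as a graph adjacency (shift) operator and applies a short, learnable matrix polynomial in A to the value tensor instead of a single multiplication. The class name, the per-head coefficients `w0`, `w1`, `w2`, and the choice of a second-order polynomial are illustrative assumptions for exposition, not the paper's exact GFSA parameterization.

```python
import torch
import torch.nn as nn


class GraphFilterSelfAttention(nn.Module):
    """Self-attention whose score matrix is treated as a graph filter.

    Instead of applying the attention matrix A once (A @ V), the output is a
    low-order matrix polynomial (w0*I + w1*A + w2*A^2) applied to V. The
    scalar coefficients per head are learned; standard self-attention is
    recovered with w0 = w2 = 0 and w1 = 1.
    """

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        # Polynomial filter coefficients, one set per attention head.
        self.w0 = nn.Parameter(torch.zeros(num_heads))
        self.w1 = nn.Parameter(torch.ones(num_heads))
        self.w2 = nn.Parameter(torch.zeros(num_heads))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)            # each: (B, H, N, d)
        scores = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5
        A = scores.softmax(dim=-1)                       # row-stochastic "adjacency"
        # Graph filter: (w0*I + w1*A + w2*A^2) V, applied head-wise.
        w0 = self.w0.view(1, -1, 1, 1)
        w1 = self.w1.view(1, -1, 1, 1)
        w2 = self.w2.view(1, -1, 1, 1)
        out = w0 * v + w1 * (A @ v) + w2 * (A @ (A @ v))
        out = out.transpose(1, 2).reshape(B, N, C)
        return self.proj(out)
```

The only change relative to standard attention is the filtering step: the identity and higher-order terms give each layer control over how aggressively it mixes (smooths) token representations, which is the mechanism by which a more general graph filter can mitigate oversmoothing. The extra matrix products add a modest overhead on top of the original attention cost.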