Exploring Sparsity in Graph Transformers (2312.05479v1)

Published 9 Dec 2023 in cs.LG and cs.AI

Abstract: Graph Transformers (GTs) have achieved impressive results on various graph-related tasks. However, the huge computational cost of GTs hinders their deployment and application, especially in resource-constrained environments. Therefore, in this paper, we explore the feasibility of sparsifying GTs, a significant yet under-explored topic. We first discuss the redundancy of GTs based on the characteristics of existing GT models, and then propose a comprehensive Graph Transformer SParsification (GTSP) framework that reduces the computational complexity of GTs along four dimensions: the input graph data, attention heads, model layers, and model weights. Specifically, GTSP designs differentiable masks for each individual compressible component, enabling effective end-to-end pruning. We evaluate GTSP through extensive experiments on prominent GTs, including GraphTrans, Graphormer, and GraphGPS. The experimental results substantiate that GTSP effectively cuts computational costs, accompanied by only marginal decreases in accuracy or, in some cases, even improvements. For instance, GTSP yields a 30% reduction in Floating Point Operations while contributing to a 1.8% increase in Area Under the Curve accuracy on the OGBG-HIV dataset. Furthermore, we provide several insights into the characteristics of attention heads and the behavior of attention mechanisms, all of which have immense potential to inspire future research in this domain.
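
The abstract describes differentiable masks over prunable components (input tokens, attention heads, layers, weights) trained end-to-end. Below is a minimal sketch of one such mask applied to attention heads, using a learnable gate with a straight-through estimator and an L1-style sparsity penalty. This is an illustrative assumption about the general technique, not the authors' GTSP implementation; all names and hyperparameters are hypothetical.

```python
# Sketch of a differentiable pruning mask (PyTorch), loosely in the spirit of
# per-component masks used for end-to-end pruning. Not the GTSP code.
import torch
import torch.nn as nn


class DifferentiableMask(nn.Module):
    """Learnable gate in [0, 1] per prunable unit (e.g., one gate per attention head)."""

    def __init__(self, num_units: int, init_logit: float = 2.0):
        super().__init__()
        # Initialize near 1.0 so pruning is learned gradually during training.
        self.logits = nn.Parameter(torch.full((num_units,), init_logit))

    def forward(self) -> torch.Tensor:
        soft = torch.sigmoid(self.logits)      # differentiable relaxation of the gate
        hard = (soft > 0.5).float()            # binarized mask used in the forward pass
        # Straight-through estimator: forward uses the hard mask, gradients flow via soft.
        return hard + soft - soft.detach()

    def sparsity_loss(self) -> torch.Tensor:
        # Regularizer pushing gates toward zero, i.e., encouraging more pruning.
        return torch.sigmoid(self.logits).mean()


# Usage: scale multi-head attention outputs by the per-head mask.
num_nodes, num_heads, head_dim = 32, 8, 16
head_mask = DifferentiableMask(num_heads)
attn_out = torch.randn(num_nodes, num_heads, head_dim)      # dummy attention output
masked = attn_out * head_mask().view(1, num_heads, 1)        # pruned heads contribute zero
task_loss = masked.pow(2).mean()                             # stand-in for the real task loss
total_loss = task_loss + 0.01 * head_mask.sparsity_loss()    # joint task + sparsity objective
total_loss.backward()
```

Masks of the same form could in principle be attached to input tokens, layers, or individual weights, with the sparsity penalty weight controlling the trade-off between compression and accuracy.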

Authors (8)
  1. Chuang Liu (71 papers)
  2. Yibing Zhan (73 papers)
  3. Xueqi Ma (13 papers)
  4. Liang Ding (159 papers)
  5. Dapeng Tao (28 papers)
  6. Jia Wu (93 papers)
  7. Wenbin Hu (50 papers)
  8. Bo Du (264 papers)
Citations (3)
