GeoT: Tensor Centric Library for Graph Neural Network via Efficient Segment Reduction on GPU (2404.03019v2)
Abstract: In recent years, Graph Neural Networks (GNNs) have ignited a surge of innovation, significantly enhancing the processing of geometric data structures such as graphs, point clouds, and meshes. As the domain continues to evolve, a series of frameworks and libraries are being developed to push GNN efficiency to new heights. While graph-centric libraries have achieved success in the past, the advent of efficient tensor compilers has highlighted the urgent need for tensor-centric libraries. Yet, efficient tensor-centric frameworks for GNNs remain scarce due to unique challenges and limitations encountered when implementing segment reduction in GNN contexts. We introduce GeoT, a cutting-edge tensor-centric library designed specifically for GNNs via efficient segment reduction. GeoT debuts innovative parallel algorithms that not only introduce new design principles but also expand the available design space. Importantly, GeoT is engineered for straightforward fusion within a computation graph, ensuring compatibility with contemporary tensor-centric machine learning frameworks and compilers. Setting a new performance benchmark, GeoT marks a considerable advancement by showcasing an average operator speedup of 1.80x and an end-to-end speedup of 1.68x.
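Segment reduction, the primitive the abstract centers on, aggregates rows of a feature tensor that share the same segment (e.g., destination-node) index — the tensor-centric formulation of GNN message aggregation. As a minimal illustrative sketch (the function name and NumPy-based formulation are our own, not GeoT's API):

```python
import numpy as np

def segment_sum(values, segment_ids, num_segments):
    """Sum rows of `values` that share a segment id (illustrative sketch).

    values:      (E, F) array of per-edge features
    segment_ids: (E,) array mapping each row to its output segment
    """
    out = np.zeros((num_segments, values.shape[1]), dtype=values.dtype)
    # np.add.at handles repeated indices correctly (unbuffered scatter-add),
    # mirroring the atomic/parallel reduction a GPU kernel would perform.
    np.add.at(out, segment_ids, values)
    return out

values = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
segment_ids = np.array([0, 0, 1])  # rows 0 and 1 reduce into segment 0
print(segment_sum(values, segment_ids, 2))  # [[4. 6.] [5. 6.]]
```

The challenge the paper addresses is making this reduction both fast on GPUs (where repeated indices create contention) and fusible into a tensor compiler's computation graph.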
Authors: Zhongming Yu, Genghan Zhang, Hanxian Huang, Xin Chen, Jishen Zhao