MGG: Accelerating Graph Neural Networks with Fine-grained intra-kernel Communication-Computation Pipelining on Multi-GPU Platforms (2209.06800v3)
Abstract: The increasing size of input graphs for graph neural networks (GNNs) highlights the demand for using multi-GPU platforms. However, existing multi-GPU GNN systems optimize the computation and communication individually based on the conventional practice of scaling dense DNNs. For irregularly sparse and fine-grained GNN workloads, such solutions miss the opportunity to jointly schedule/optimize the computation and communication operations for high-performance delivery. To this end, we propose MGG, a novel system design to accelerate full-graph GNNs on multi-GPU platforms. The core of MGG is its novel dynamic software pipeline to facilitate fine-grained computation-communication overlapping within a GPU kernel. Specifically, MGG introduces GNN-tailored pipeline construction and GPU-aware pipeline mapping to facilitate workload balancing and operation overlapping. MGG also incorporates an intelligent runtime design with analytical modeling and optimization heuristics to dynamically improve the execution performance. Extensive evaluation reveals that MGG outperforms state-of-the-art full-graph GNN systems across various settings: on average 4.41X, 4.81X, and 10.83X faster than DGL, MGG-UVM, and ROC, respectively.
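The overlap of communication and computation described in the abstract can be illustrated with a small CPU-side sketch. This is not MGG's implementation, which pipelines inside a GPU kernel; it only shows the general idea of issuing the next remote fetch before working on the current node, so that local aggregation hides part of the fetch latency. All names (`fetch_remote`, `aggregate_pipelined`, the dictionary-based feature stores) are hypothetical, and a background thread stands in for a remote-GPU read.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_remote(remote_feats, ids):
    # Hypothetical stand-in for reading embeddings that live on another GPU.
    return [remote_feats[i] for i in ids]


def add_into(acc, vec):
    # Element-wise accumulate vec into acc (sum aggregation).
    for k, value in enumerate(vec):
        acc[k] += value


def aggregate_pipelined(local_feats, remote_feats, local_nbrs, remote_nbrs, dim):
    """Sum neighbor embeddings per node.

    The remote fetch for node v+1 is submitted before node v is processed,
    so fetches run in the background while local neighbors are accumulated.
    """
    n = len(local_nbrs)
    out = [[0.0] * dim for _ in range(n)]
    if n == 0:
        return out
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(fetch_remote, remote_feats, remote_nbrs[0])
        for v in range(n):
            current = pending
            if v + 1 < n:
                # Issue the next fetch before doing any compute for node v.
                pending = pool.submit(fetch_remote, remote_feats, remote_nbrs[v + 1])
            for u in local_nbrs[v]:
                # Local aggregation proceeds while fetches are in flight.
                add_into(out[v], local_feats[u])
            for vec in current.result():
                # Block only at the point where the remote data is needed.
                add_into(out[v], vec)
    return out


# Toy example: two nodes, 2-dimensional embeddings (made-up data).
local_feats = {0: [1.0, 0.0], 1: [0.0, 1.0]}
remote_feats = {10: [2.0, 2.0], 11: [3.0, 1.0]}
result = aggregate_pipelined(
    local_feats, remote_feats,
    local_nbrs=[[1], [0]],
    remote_nbrs=[[10], [10, 11]],
    dim=2,
)
print(result)
```

The sketch pipelines at the granularity of one node per stage purely for readability; the granularity and mapping that MGG actually uses are chosen by its runtime and are not reproduced here.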
- Yuke Wang
- Boyuan Feng
- Zheng Wang
- Tong Geng
- Kevin Barker
- Ang Li
- Yufei Ding