Exploring Fully Offloaded GPU Stream-Aware Message Passing (2306.15773v1)

Published 27 Jun 2023 in cs.DC, cs.NI, and cs.PF

Abstract: Modern heterogeneous supercomputing systems are composed of CPUs, GPUs, and high-speed network interconnects. Communication libraries that support efficient data transfers involving buffers resident in GPU memory typically require the CPU to orchestrate the data transfer operations. A new offload-friendly communication strategy, stream-triggered (ST) communication, was explored to allow the synchronization and data movement operations to be offloaded from the CPU to the GPU. An implementation based on Message Passing Interface (MPI) one-sided active target synchronization was used as an exemplar to illustrate the proposed strategy, and a latency-sensitive nearest-neighbor microbenchmark was used to explore the various performance aspects of the implementation. The offloaded implementation shows significant on-node performance advantages over standard MPI active RMA (36%) and point-to-point (61%) communication. The current multi-node improvement is smaller (23% faster than standard active RMA but 11% slower than point-to-point), and work is in progress to pursue further improvements.
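To make the baseline concrete, the sketch below shows standard MPI one-sided active target synchronization (post/start/complete/wait) driving a 1-D nearest-neighbor halo exchange from the CPU. This is not the paper's code: buffer names, message sizes, and the use of host memory are illustrative assumptions. The comments mark the CPU-issued synchronization and Put operations that the paper's stream-triggered (ST) strategy would instead enqueue on a GPU stream.

```c
/* Minimal sketch (assumptions noted above, not the paper's implementation):
 * the host-driven MPI active-target RMA pattern that the ST approach offloads. */
#include <mpi.h>
#include <stdlib.h>

#define HALO_COUNT 1024   /* illustrative halo size in doubles */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int left  = (rank - 1 + size) % size;
    int right = (rank + 1) % size;

    /* Window over the receive buffer; with a GPU-aware MPI this could be
     * GPU memory, but plain host memory keeps the sketch portable. */
    double *recv_buf = malloc(2 * HALO_COUNT * sizeof(double));
    double *send_buf = malloc(2 * HALO_COUNT * sizeof(double));
    MPI_Win win;
    MPI_Win_create(recv_buf, 2 * HALO_COUNT * sizeof(double), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    /* Access/exposure group: the two nearest neighbors. */
    MPI_Group world_group, nbr_group;
    MPI_Comm_group(MPI_COMM_WORLD, &world_group);
    int nbrs[2] = { left, right };
    MPI_Group_incl(world_group, 2, nbrs, &nbr_group);

    /* One exchange iteration.  In the ST strategy, these synchronization and
     * Put operations would be triggered from a GPU stream rather than issued
     * by the CPU as they are here. */
    MPI_Win_post(nbr_group, 0, win);     /* expose local window to neighbors  */
    MPI_Win_start(nbr_group, 0, win);    /* open access epoch to neighbors    */
    MPI_Put(&send_buf[0], HALO_COUNT, MPI_DOUBLE, left,
            HALO_COUNT, HALO_COUNT, MPI_DOUBLE, win);   /* halo to left rank  */
    MPI_Put(&send_buf[HALO_COUNT], HALO_COUNT, MPI_DOUBLE, right,
            0, HALO_COUNT, MPI_DOUBLE, win);            /* halo to right rank */
    MPI_Win_complete(win);               /* local Puts delivered              */
    MPI_Win_wait(win);                   /* neighbors' Puts into recv_buf done */

    MPI_Group_free(&nbr_group);
    MPI_Group_free(&world_group);
    MPI_Win_free(&win);
    free(recv_buf);
    free(send_buf);
    MPI_Finalize();
    return 0;
}
```

Each rank both exposes its window (post/wait) and accesses its neighbors' windows (start/complete), so every CPU-side call here is a candidate for offload in the ST design.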
