RDMA-Based Algorithms for Sparse Matrix Multiplication on GPUs (2311.18141v2)

Published 29 Nov 2023 in cs.DC

Abstract: Sparse matrix multiplication is an important kernel for large-scale graph processing and other data-intensive applications. In this paper, we implement various asynchronous, RDMA-based sparse-times-dense (SpMM) and sparse-times-sparse (SpGEMM) matrix multiplication algorithms and evaluate their performance in a distributed-memory setting on GPUs. Our RDMA-based implementations use the NVSHMEM communication library for direct, asynchronous, one-sided communication between GPUs. We compare our asynchronous implementations to state-of-the-art bulk-synchronous GPU libraries as well as a CUDA-aware MPI implementation of the SUMMA algorithm. We find that asynchronous RDMA-based implementations offer favorable performance compared to bulk-synchronous implementations, while also allowing for the straightforward implementation of novel work-stealing algorithms.

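To make the communication pattern concrete, here is a minimal CUDA/NVSHMEM sketch of the pull-based, one-sided SpMM idea the abstract describes. It assumes a 1-D block-row distribution: each PE owns a block of CSR rows of A, the matching row block of C, and a row block of the dense matrix B placed in NVSHMEM symmetric memory, so any GPU can fetch remote rows of B directly from their owner without involving it. This is an illustrative sketch, not the paper's implementation; the kernel and parameter names (spmm_pull_kernel, rows_per_pe, stage) are invented for the example.

```cuda
// Sketch only: pull-based distributed SpMM, C = A * B, 1-D block-row distribution.
// A is a local CSR block; B's row blocks live in NVSHMEM symmetric memory so any
// GPU can issue one-sided gets for the rows it needs.
#include <nvshmem.h>

__global__ void spmm_pull_kernel(const int *row_ptr, const int *col_idx,
                                 const float *vals,   // local CSR block of A
                                 const float *B_sym,  // symmetric: this PE's row block of B
                                 float *C_local,      // local row block of C (dense, row-major)
                                 float *stage,        // symmetric staging space, gridDim.x * k floats
                                 int local_rows, int rows_per_pe, int k)
{
    float *b_row = stage + (size_t)blockIdx.x * k;    // one staged row of B per thread block
    for (int r = blockIdx.x; r < local_rows; r += gridDim.x) {
        for (int j = row_ptr[r]; j < row_ptr[r + 1]; ++j) {
            int col     = col_idx[j];
            int owner   = col / rows_per_pe;          // PE owning global row `col` of B
            int loc_row = col % rows_per_pe;
            // One-sided get of the remote row of B; the owning GPU is not involved.
            // (nvshmem_getmem_nbi + nvshmem_quiet would allow overlapping several gets.)
            if (threadIdx.x == 0)
                nvshmem_getmem(b_row, B_sym + (size_t)loc_row * k,
                               (size_t)k * sizeof(float), owner);
            __syncthreads();
            float a = vals[j];
            for (int c = threadIdx.x; c < k; c += blockDim.x)
                C_local[(size_t)r * k + c] += a * b_row[c];
            __syncthreads();
        }
    }
}
```

On the host, one would initialize NVSHMEM with nvshmem_init, allocate B_sym and stage from the symmetric heap with nvshmem_malloc, launch the kernel on each PE, and synchronize with nvshmem_barrier_all. The same symmetric-memory model also suggests how work stealing can be layered on top, e.g. a shared work counter updated with nvshmem_int_atomic_fetch_add, though the paper's actual work-stealing scheme may differ from this sketch.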
