RDMA-Based Algorithms for Sparse Matrix Multiplication on GPUs (2311.18141v2)

Published 29 Nov 2023 in cs.DC

Abstract: Sparse matrix multiplication is an important kernel for large-scale graph processing and other data-intensive applications. In this paper, we implement several asynchronous, RDMA-based sparse-times-dense (SpMM) and sparse-times-sparse (SpGEMM) matrix multiplication algorithms and evaluate their performance in a distributed-memory setting on GPUs. Our RDMA-based implementations use the NVSHMEM communication library for direct, asynchronous, one-sided communication between GPUs. We compare our asynchronous implementations against state-of-the-art bulk-synchronous GPU libraries as well as a CUDA-aware MPI implementation of the SUMMA algorithm. We find that asynchronous RDMA-based implementations offer favorable performance compared to bulk-synchronous implementations, while also allowing for the straightforward implementation of novel work-stealing algorithms.
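
The one-sided communication pattern the abstract describes is the heart of the approach: a GPU that needs remote data issues an NVSHMEM get against another GPU's memory directly, with no matching receive or collective on the owner's side. Below is a minimal CUDA sketch of that pattern for SpMM, assuming a 1-D block-row distribution of the dense matrix B across PEs. It illustrates the communication style only, not the paper's implementation; the distribution scheme, the toy CSR setup, and all names (spmm_get_kernel, B_sym, b_stage, rows_per_pe) are our own assumptions.

```cuda
// Hypothetical sketch: each PE owns a CSR block of sparse A and a slab of
// dense B in NVSHMEM's symmetric heap. For every nonzero A(i,k), the kernel
// pulls row k of B from whichever PE owns it via a one-sided get.
#include <nvshmem.h>
#include <cuda_runtime.h>
#include <vector>

__global__ void spmm_get_kernel(const int *row_ptr, const int *col_idx,
                                const float *vals, const float *B_sym,
                                float *C, float *b_stage, int local_rows,
                                int n_cols, int rows_per_pe) {
    float *b_row = b_stage + (size_t)blockIdx.x * n_cols;  // per-block staging row
    for (int i = blockIdx.x; i < local_rows; i += gridDim.x) {
        for (int nz = row_ptr[i]; nz < row_ptr[i + 1]; ++nz) {
            int k = col_idx[nz];
            int owner = k / rows_per_pe;      // assumed 1-D block-row distribution
            int local_k = k % rows_per_pe;
            if (threadIdx.x == 0)             // one thread issues the blocking get;
                nvshmem_float_get(b_row,      // the owner's GPU does not participate
                                  B_sym + (size_t)local_k * n_cols,
                                  n_cols, owner);
            __syncthreads();                  // fetched row visible to whole block
            for (int j = threadIdx.x; j < n_cols; j += blockDim.x)
                C[(size_t)i * n_cols + j] += vals[nz] * b_row[j];
            __syncthreads();                  // don't overwrite b_row too early
        }
    }
}

int main() {
    nvshmem_init();
    const int pe = nvshmem_my_pe();
    const int npes = nvshmem_n_pes();
    const int rows_per_pe = 4, n_cols = 8, blocks = 4;
    const int n = rows_per_pe * npes;         // global row dimension of B

    // B and the staging buffer live in the symmetric heap so every PE can
    // address them with the same pointer, locally or remotely.
    float *B_sym   = (float *)nvshmem_malloc((size_t)rows_per_pe * n_cols * sizeof(float));
    float *b_stage = (float *)nvshmem_malloc((size_t)blocks * n_cols * sizeof(float));

    // Toy CSR block of A: one nonzero per local row, with column indices
    // shifted onto the next PE so the gets are genuinely remote.
    std::vector<int> h_ptr(rows_per_pe + 1), h_idx(rows_per_pe);
    std::vector<float> h_val(rows_per_pe, 1.0f);
    for (int i = 0; i <= rows_per_pe; ++i) h_ptr[i] = i;
    for (int i = 0; i < rows_per_pe; ++i)
        h_idx[i] = (pe * rows_per_pe + i + rows_per_pe) % n;

    int *d_ptr, *d_idx; float *d_val, *d_C;
    cudaMalloc(&d_ptr, h_ptr.size() * sizeof(int));
    cudaMalloc(&d_idx, h_idx.size() * sizeof(int));
    cudaMalloc(&d_val, h_val.size() * sizeof(float));
    cudaMalloc(&d_C, (size_t)rows_per_pe * n_cols * sizeof(float));
    cudaMemcpy(d_ptr, h_ptr.data(), h_ptr.size() * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_idx, h_idx.data(), h_idx.size() * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_val, h_val.data(), h_val.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_C, 0, (size_t)rows_per_pe * n_cols * sizeof(float));

    // Fill this PE's slab of B with a recognizable value.
    std::vector<float> h_B((size_t)rows_per_pe * n_cols, (float)(pe + 1));
    cudaMemcpy(B_sym, h_B.data(), h_B.size() * sizeof(float), cudaMemcpyHostToDevice);

    nvshmem_barrier_all();                    // every PE's slab of B is populated
    spmm_get_kernel<<<blocks, 64>>>(d_ptr, d_idx, d_val, B_sym, d_C, b_stage,
                                    rows_per_pe, n_cols, rows_per_pe);
    cudaDeviceSynchronize();
    nvshmem_barrier_all();                    // all gets drained before teardown

    nvshmem_free(b_stage);
    nvshmem_free(B_sym);
    cudaFree(d_ptr); cudaFree(d_idx); cudaFree(d_val); cudaFree(d_C);
    nvshmem_finalize();
    return 0;
}
```

Building this requires relocatable device code (nvcc -rdc=true) and linking the NVSHMEM libraries, with one PE launched per GPU (e.g. via nvshmrun or an MPI launcher). The design point worth noting is that each get is independent, so no PE ever stalls in a global exchange; that is what makes asynchronous variants such as the work-stealing algorithms the abstract mentions straightforward to layer on top.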
