JITSPMM: Just-in-Time Instruction Generation for Accelerated Sparse Matrix-Matrix Multiplication (2312.05639v1)
Abstract: Achieving high performance for Sparse Matrix-Matrix Multiplication (SpMM) has received increasing research attention, especially on multi-core CPUs, due to the large input data sizes in applications such as graph neural networks (GNNs). Most existing solutions for SpMM computation follow the ahead-of-time (AOT) compilation approach, which compiles a program entirely before it is executed. AOT compilation for SpMM faces three key limitations: unnecessary memory accesses, additional branch overhead, and redundant instructions. These limitations stem from the fact that crucial information pertaining to SpMM is not known until runtime. In this paper, we propose JITSPMM, a just-in-time (JIT) assembly code generation framework to accelerate SpMM computation on multi-core CPUs with SIMD extensions. First, JITSPMM integrates the JIT assembly code generation technique into three widely-used workload division methods for SpMM to achieve balanced workload distribution among CPU threads. Next, with the availability of runtime information, JITSPMM employs a novel technique, coarse-grain column merging, to maximize instruction-level parallelism by unrolling the performance-critical loop. Furthermore, JITSPMM intelligently allocates registers to cache frequently accessed data to minimize memory accesses, and employs selected SIMD instructions to enhance arithmetic throughput. We conduct a performance evaluation of JITSPMM against two AOT baselines: existing SpMM implementations compiled using the Intel icc compiler with auto-vectorization, and the highly optimized SpMM routine provided by Intel MKL. Our results show that JITSPMM provides average speedups of 3.8x and 1.4x over these baselines, respectively.
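To make the setting concrete, below is a minimal sketch of a conventional CSR-based SpMM kernel, the kind of AOT-compiled loop nest the paper's baselines represent. The function name, signature, and row-major layout are illustrative assumptions, not code from the paper; the comment marks where runtime-unknown loop bounds limit AOT optimization.

```cpp
#include <cstddef>

// Baseline CSR SpMM sketch: C = A * B, where A is an M x K sparse matrix in
// CSR format and B is a dense K x N matrix (row-major). Illustrative only.
void spmm_csr(int M, int N,
              const int* row_ptr,    // CSR row pointers, length M + 1
              const int* col_idx,    // CSR column indices, length nnz
              const float* values,   // CSR nonzero values, length nnz
              const float* B,        // dense K x N input, row-major
              float* C)              // dense M x N output, row-major
{
    for (int i = 0; i < M; ++i) {
        // Initialize the output row.
        for (int j = 0; j < N; ++j)
            C[i * N + j] = 0.0f;
        // The trip count (row_ptr[i+1] - row_ptr[i]) is unknown until runtime,
        // which is why an AOT compiler cannot fully unroll or specialize this loop;
        // JIT code generation can, once the sparsity pattern is available.
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k) {
            const float a = values[k];
            const float* b_row = &B[static_cast<std::size_t>(col_idx[k]) * N];
            for (int j = 0; j < N; ++j)
                C[i * N + j] += a * b_row[j];
        }
    }
}
```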