Benchmarking and Dissecting the Nvidia Hopper GPU Architecture (2402.13499v1)
Abstract: Graphics processing units (GPUs) are continually evolving to cater to the computational demands of contemporary general-purpose workloads, particularly those driven by AI using deep learning techniques. A substantial body of studies has been dedicated to dissecting the microarchitectural metrics that characterize diverse GPU generations, which helps researchers understand the hardware details and leverage them to optimize GPU programs. However, the latest Hopper GPUs present a set of novel attributes, including new tensor cores supporting FP8, the DPX instruction set, and distributed shared memory, whose performance and operational characteristics remain largely undocumented. In this research, we propose an extensive benchmarking study focused on the Hopper GPU. The objective is to unveil its microarchitectural intricacies through an examination of the new instruction-set architecture (ISA) of Nvidia GPUs and the use of new CUDA APIs. Our approach involves two main aspects. First, we conduct conventional latency and throughput comparison benchmarks across the three most recent GPU architectures, namely Hopper, Ada, and Ampere. Second, we provide a comprehensive discussion and benchmarking of the latest Hopper features, encompassing the Hopper DPX dynamic programming (DP) instruction set, distributed shared memory, and FP8 tensor cores. The microbenchmarking results we present offer a deeper understanding of the novel GPU AI function units and programming features introduced by the Hopper architecture. This understanding is expected to greatly facilitate software optimization and modeling efforts for GPU architectures. To the best of our knowledge, this study makes the first attempt to demystify the tensor core performance and programming instruction sets unique to Hopper GPUs.
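To make the DPX feature mentioned in the abstract concrete, the sketch below shows the kind of integer dynamic-programming recurrence that Hopper's DPX instructions target: a Smith-Waterman local-alignment inner step built around a max-of-three with a clamp at zero. This is a plain host-side C++ reference, not the paper's benchmark code; the mapping to specific CUDA intrinsics named in the comments (such as `__vimax3_s32` and its `_relu` variant) is an assumption based on NVIDIA's published DPX documentation.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Host-side reference of the Smith-Waterman local-alignment recurrence,
// the archetypal max-of-three integer DP step that Hopper's DPX
// instructions accelerate (on device, the three-way max would map to an
// intrinsic such as __vimax3_s32, and the clamp-at-zero to a fused
// "_relu" variant -- names assumed from NVIDIA's DPX documentation).
int smith_waterman_score(const std::string& a, const std::string& b,
                         int match = 2, int mismatch = -1, int gap = -2) {
    const std::size_t n = a.size(), m = b.size();
    std::vector<std::vector<int>> H(n + 1, std::vector<int>(m + 1, 0));
    int best = 0;
    for (std::size_t i = 1; i <= n; ++i) {
        for (std::size_t j = 1; j <= m; ++j) {
            int diag = H[i - 1][j - 1] + (a[i - 1] == b[j - 1] ? match : mismatch);
            // DPX-style step: max over three candidate scores...
            int v = std::max({diag, H[i - 1][j] + gap, H[i][j - 1] + gap});
            // ...followed by the local-alignment clamp at zero.
            H[i][j] = std::max(v, 0);
            best = std::max(best, H[i][j]);
        }
    }
    return best;
}
```

Because the recurrence for each cell is just adds, compares, and a clamp, a single DPX instruction can replace several separate integer instructions per cell, which is where the speedup for genomics and similar DP workloads comes from.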