Papers
Topics
Authors
Recent
Detailed Answer
Quick Answer
Concise responses based on abstracts only
Detailed Answer
Well-researched responses based on abstracts and relevant paper content.
Custom Instructions Pro
Preferences or requirements that you'd like Emergent Mind to consider when generating responses
Gemini 2.5 Flash
Gemini 2.5 Flash 48 tok/s
Gemini 2.5 Pro 48 tok/s Pro
GPT-5 Medium 26 tok/s Pro
GPT-5 High 19 tok/s Pro
GPT-4o 107 tok/s Pro
Kimi K2 205 tok/s Pro
GPT OSS 120B 473 tok/s Pro
Claude Sonnet 4 37 tok/s Pro
2000 character limit reached

Dissecting the NVidia Turing T4 GPU via Microbenchmarking (1903.07486v1)

Published 18 Mar 2019 in cs.DC

Abstract: In 2019, the rapid rate at which GPU manufacturers refresh their designs, coupled with their reluctance to disclose microarchitectural details, is still a hurdle for those software designers who want to extract the highest possible performance. Last year, these very reasons motivated us to dissect the Volta GPU architecture using microbenchmarks. The introduction in August 2018 of Turing, NVidia's latest architecture, pressed us to update our study. In this report, we examine Turing and compare it quantitatively against previous NVidia GPU generations. Specifically, we study the T4 GPU: a low-power board aiming at inference applications. We describe its improvements against its inference-oriented predecessor: the P4 GPU based on the Pascal architecture. Both T4 and P4 GPUs achieve significantly higher frequency-per-Watt figures than their full-size counterparts. We study the performance of the T4's TensorCores, finding a much higher throughput on low-precision operands than on the P4 GPU. We reveal that Turing introduces new instructions that express matrix math more succinctly. We map Turing's instruction space, finding the same encoding as Volta, and additional instructions. We reveal that the Turing TU104 chip has the same memory hierarchy depth as the Volta GV100; cache levels sizes on the TU104 are frequently twice as large as those found on the Pascal GP104. We benchmark each constituent of the T4 memory hierarchy and find substantial overall performance improvements over its P4 predecessor. We studied how clock throttling affects compute-intensive workloads that hit power or thermal limits. Many of our findings are novel, published here for the first time. All of them can guide high-performance software developers get closer to the GPU's peak performance.

Citations (101)

Summary

We haven't generated a summary for this paper yet.

List To Do Tasks Checklist Streamline Icon: https://streamlinehq.com

Collections

Sign up for free to add this paper to one or more collections.

Lightbulb On Streamline Icon: https://streamlinehq.com

Continue Learning

We haven't generated follow-up questions for this paper yet.