Dissecting GPU Memory Hierarchy through Microbenchmarking

Published 8 Sep 2015 in cs.AR and cs.DC | (1509.02308v2)

Abstract: Memory access efficiency is a key factor in fully utilizing the computational power of graphics processing units (GPUs). However, many details of the GPU memory hierarchy are not released by GPU vendors. In this paper, we propose a novel fine-grained microbenchmarking approach and apply it to three generations of NVIDIA GPUs, namely Fermi, Kepler and Maxwell, to expose the previously unknown characteristics of their memory hierarchies. Specifically, we investigate the structures of different GPU cache systems, such as the data cache, the texture cache and the translation look-aside buffer (TLB). We also investigate the throughput and access latency of GPU global memory and shared memory. Our microbenchmark results offer a better understanding of the mysterious GPU memory hierarchy, which will facilitate the software optimization and modelling of GPU architectures. To the best of our knowledge, this is the first study to reveal the cache properties of Kepler and Maxwell GPUs, and the superiority of Maxwell in shared memory performance under bank conflict.





Summary

The paper reveals distinct cache designs in NVIDIA GPUs, including unequal cache sets and non-LRU policies, through innovative microbenchmarking.
The study identifies performance disparities across Fermi, Kepler, and Maxwell architectures, highlighting Maxwell’s superior shared memory efficiency under bank conflicts.
The paper provides actionable insights for optimizing GPU applications by linking memory latency, throughput, and cache structure analysis.

An Analytical Assessment of GPU Memory Hierarchy via Microbenchmarking

The paper "Dissecting GPU Memory Hierarchy through Microbenchmarking" by Xinxin Mei and Xiaowen Chu examines the memory hierarchy of NVIDIA GPUs in detail. Using microbenchmarks, it uncovers properties of the GPU cache structures that vendors do not document, and that matter for software optimization and architectural modeling.

The research covers three generations of NVIDIA GPUs (Fermi, Kepler, and Maxwell) and explores the components of their memory hierarchies. It focuses on the data cache, the texture cache, and the translation look-aside buffer (TLB), as well as the throughput and access latency of global and shared memory. To the best of the authors' knowledge, it is the first study to reveal the cache properties of Kepler and Maxwell GPUs, and the first to show Maxwell's advantage in shared memory performance under bank conflicts.

GPU Cache Exploration and Insights

The authors design a fine-grained microbenchmarking approach to expose the non-traditional properties of GPU caches. Conventional CPU caches typically use a regular set-associative mapping; the GPU caches studied here depart from that model in several ways:

Unequal Cache Sets and Associativities: The study finds cache sets of unequal size, along with differing cache line sizes, in both the Kepler and Maxwell architectures, in contrast to the uniform sets typical of CPU caches.
Non-LRU Replacement Policies: The L1 data cache shows evidence of a replacement policy other than LRU (least recently used), most clearly on Fermi GPUs.
Cache and TLB Structure: The authors determine the structure and configuration of the L1 and L2 TLBs, and show how differences in memory mapping affect memory latency.
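To make the idea of inferring cache parameters from access behavior concrete, here is a minimal sketch in Python. It is not the authors' benchmark: it runs a pointer-chasing loop against a simulated cache that is assumed to be a regular set-associative cache with LRU replacement, and it observes hits and misses directly, whereas on real hardware they would have to be inferred from measured access latency. The names `SetAssociativeLRUCache`, `misses_in_timed_pass`, and `infer_parameters` are hypothetical.

```python
from collections import OrderedDict


class SetAssociativeLRUCache:
    """Toy model (assumption): equal-sized sets, LRU replacement."""

    def __init__(self, size_bytes, line_bytes, ways):
        self.line_bytes = line_bytes
        self.ways = ways
        self.num_sets = size_bytes // (line_bytes * ways)
        self.sets = [OrderedDict() for _ in range(self.num_sets)]

    def access(self, addr):
        """Return True on a hit, False on a miss."""
        line = addr // self.line_bytes
        cache_set = self.sets[line % self.num_sets]
        if line in cache_set:
            cache_set.move_to_end(line)      # mark as most recently used
            return True
        if len(cache_set) >= self.ways:
            cache_set.popitem(last=False)    # evict the least recently used line
        cache_set[line] = None
        return False


def misses_in_timed_pass(make_cache, array_bytes, elem_bytes=4):
    """Chase pointers through the array twice; count misses in the second pass."""
    cache = make_cache()
    n = array_bytes // elem_bytes
    chain = [(i + 1) % n for i in range(n)]  # each element points to the next one
    j = 0
    for _ in range(n):                       # warm-up pass fills the cache
        cache.access(j * elem_bytes)
        j = chain[j]
    misses = 0
    for _ in range(n):                       # "timed" pass: record each access
        if not cache.access(j * elem_bytes):
            misses += 1
        j = chain[j]
    return misses


def infer_parameters(make_cache, max_bytes, elem_bytes=4):
    """Grow the array one element at a time and read parameters off the miss counts."""
    sizes = range(elem_bytes, max_bytes + 1, elem_bytes)
    counts = {s: misses_in_timed_pass(make_cache, s, elem_bytes) for s in sizes}
    # Largest array with no misses in the timed pass gives the cache size.
    first_miss = next(s for s in sizes if counts[s] > 0)
    cache_bytes = first_miss - elem_bytes
    # Overflowing by one line makes one set thrash: ways + 1 lines all miss.
    base = counts[first_miss]
    ways = base - 1
    # The miss count jumps again once the overflow spills into a second line.
    second_jump = next(s for s in sizes if s > first_miss and counts[s] > base)
    line_bytes = second_jump - first_miss
    return cache_bytes, line_bytes, ways, cache_bytes // (line_bytes * ways)


if __name__ == "__main__":
    hidden = lambda: SetAssociativeLRUCache(size_bytes=1024, line_bytes=32, ways=4)
    print(infer_parameters(hidden, max_bytes=2048))  # (1024, 32, 4, 8)
```

The inference works only because the simulated cache is regular. With unequal sets or a non-LRU policy, the aggregate miss counts no longer follow this clean pattern, which suggests why recording the outcome of every individual access matters on real GPUs.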

Throughput and Latency Findings

Global memory throughput differs markedly across the generations: the Kepler GPUs achieve the highest figure, owing to a wider memory bus. Maxwell, by contrast, shows gains in energy efficiency and a simpler design. The paper ties these differences to a design evolution that relies less on raw memory bandwidth and more on efficient shared memory. Most notably, Maxwell reduces shared memory latency under bank conflicts, which benefits GPGPU workloads that depend heavily on shared memory access.
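As a rough illustration of what a bank conflict is, the following Python sketch counts how many distinct words from one warp fall into the same shared memory bank for a strided access. It assumes 32 banks, 4-byte words, and a 32-thread warp, and the function names are hypothetical. It models only the degree of serialization, not the latency that each architecture pays for it, which is what the paper measures.

```python
from collections import defaultdict


def conflict_degree(word_indices, num_banks=32):
    """Largest number of distinct words that map to a single bank.

    Assumption: word w lives in bank w % num_banks. Threads that read
    the same word count once, since that case is served as a broadcast.
    """
    banks = defaultdict(set)
    for w in word_indices:
        banks[w % num_banks].add(w)
    return max(len(words) for words in banks.values())


def strided_degree(stride, warp_size=32, num_banks=32):
    """Conflict degree when thread t accesses word t * stride."""
    return conflict_degree([t * stride for t in range(warp_size)], num_banks)


if __name__ == "__main__":
    for stride in (1, 2, 4, 32, 33):
        print(stride, strided_degree(stride))
    # 1 -> 1, 2 -> 2, 4 -> 4, 32 -> 32, 33 -> 1
```

Under these assumptions, reading a column of a 32x32 tile of 4-byte words has stride 32 and is fully serialized, while padding each row to 33 words brings the degree back to 1. This is the usual reason for padding shared memory arrays.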

Practical Implications and Theoretical Advancements

In practical terms, the microbenchmark results point to ways of optimizing memory-intensive workloads on GPUs: developers can tune their applications to the measured cache characteristics and memory throughput of each architecture. On the theoretical side, a clearer picture of how GPU memory behaves, and how it has evolved across generations, can inform architectural modeling and the design of future GPUs.

Future Prospects

The study invites similar analysis of newer GPU architectures as NVIDIA releases them. Examining how those architectures support evolving software paradigms and growing computational demands could in turn benefit large-scale parallel computing frameworks.

Overall, "Dissecting GPU Memory Hierarchy through Microbenchmarking" offers a comprehensive analysis for researchers and developers who want to use GPU memory efficiently. Its empirical findings and structural details provide a basis for both architectural modeling and practical application tuning.
