Papers
Topics
Authors
Recent
Detailed Answer
Quick Answer
Concise responses based on abstracts only
Detailed Answer
Well-researched responses based on abstracts and relevant paper content.
Custom Instructions Pro
Preferences or requirements that you'd like Emergent Mind to consider when generating responses
Gemini 2.5 Flash
Gemini 2.5 Flash 77 tok/s
Gemini 2.5 Pro 33 tok/s Pro
GPT-5 Medium 25 tok/s Pro
GPT-5 High 27 tok/s Pro
GPT-4o 75 tok/s Pro
Kimi K2 220 tok/s Pro
GPT OSS 120B 465 tok/s Pro
Claude Sonnet 4 36 tok/s Pro
2000 character limit reached

A Compiler Framework for Optimizing Dynamic Parallelism on GPUs (2201.02789v1)

Published 8 Jan 2022 in cs.DC and cs.AR

Abstract: Dynamic parallelism on GPUs allows GPU threads to dynamically launch other GPU threads. It is useful in applications with nested parallelism, particularly where the amount of nested parallelism is irregular and cannot be predicted beforehand. However, prior works have shown that dynamic parallelism may impose a high performance penalty when a large number of small grids are launched. The large number of launches results in high launch latency due to congestion, and the small grid sizes result in hardware underutilization. To address this issue, we propose a compiler framework for optimizing the use of dynamic parallelism in applications with nested parallelism. The framework features three key optimizations: thresholding, coarsening, and aggregation. Thresholding involves launching a grid dynamically only if the number of child threads exceeds some threshold, and serializing the child threads in the parent thread otherwise. Coarsening involves executing the work of multiple thread blocks by a single coarsened block to amortize the common work across them. Aggregation involves combining multiple child grids into a single aggregated grid. Our evaluation shows that our compiler framework improves the performance of applications with nested parallelism by a geometric mean of 43.0x over applications that use dynamic parallelism, 8.7x over applications that do not use dynamic parallelism, and 3.6x over applications that use dynamic parallelism with aggregation alone as proposed in prior work.

Citations (4)
List To Do Tasks Checklist Streamline Icon: https://streamlinehq.com

Collections

Sign up for free to add this paper to one or more collections.

Summary

We haven't generated a summary for this paper yet.

Dice Question Streamline Icon: https://streamlinehq.com

Follow-Up Questions

We haven't generated follow-up questions for this paper yet.