Papers
Topics
Authors
Recent
Assistant
AI Research Assistant
Well-researched responses based on relevant abstracts and paper content.
Custom Instructions Pro
Preferences or requirements that you'd like Emergent Mind to consider when generating responses.
Gemini 2.5 Flash
Gemini 2.5 Flash 158 tok/s
Gemini 2.5 Pro 47 tok/s Pro
GPT-5 Medium 29 tok/s Pro
GPT-5 High 29 tok/s Pro
GPT-4o 117 tok/s Pro
Kimi K2 182 tok/s Pro
GPT OSS 120B 439 tok/s Pro
Claude Sonnet 4.5 38 tok/s Pro
2000 character limit reached

Forecasting GPU Performance for Deep Learning Training and Inference (2407.13853v3)

Published 18 Jul 2024 in cs.LG and cs.PF

Abstract: Deep learning kernels exhibit predictable memory accesses and compute patterns, making GPUs' parallel architecture well-suited for their execution. Software and runtime systems for GPUs are optimized to better utilize the stream multiprocessors, on-chip cache, and off-chip high-bandwidth memory. As deep learning models and GPUs evolve, access to newer GPUs is often limited, raising questions about the performance of new model architectures on existing GPUs, existing models on new GPUs, and new model architectures on new GPUs. To address these questions, we introduce NeuSight, a framework to predict the performance of various deep learning models, for both training and inference, on unseen GPUs without requiring actual execution. The framework leverages both GPU hardware behavior and software library optimizations to estimate end-to-end performance. Previous work uses regression models that capture linear trends or multilayer perceptrons to predict the overall latency of deep learning kernels on GPUs. These approaches suffer from higher error percentages when forecasting performance on unseen models and new GPUs. Instead, NeuSight decomposes the prediction problem into smaller problems, bounding the prediction through fundamental performance laws. NeuSight decomposes a single deep learning kernel prediction into smaller working sets called tiles, which are executed independently on the GPU. Tile-granularity predictions are determined using a machine learning approach and aggregated to estimate end-to-end latency. NeuSight outperforms prior work across various deep learning workloads and the latest GPUs. It reduces the percentage error from 121.4% and 30.8% to 2.3% in predicting the latency of GPT3 model for training and inference on H100, compared to state-of-the-art prior work, where both GPT3 and H100 were not used to train the framework.

Summary

We haven't generated a summary for this paper yet.

Dice Question Streamline Icon: https://streamlinehq.com

Open Problems

We haven't generated a list of open problems mentioned in this paper yet.

Lightbulb Streamline Icon: https://streamlinehq.com

Continue Learning

We haven't generated follow-up questions for this paper yet.

List To Do Tasks Checklist Streamline Icon: https://streamlinehq.com

Collections

Sign up for free to add this paper to one or more collections.