An Optimal Resource Allocator of Elastic Training for Deep Learning Jobs on Cloud (2109.03389v1)

Published 8 Sep 2021 in eess.SY, cs.DC, and cs.SY

Abstract: Cloud training platforms, such as Amazon Web Services and Huawei Cloud, provide users with computational resources to train their deep learning jobs. Elastic training is a service embedded in cloud training platforms that dynamically scales up or down the resources allocated to a job. The core technique of an elastic training system is to allocate limited resources among heterogeneous jobs so as to achieve shorter queuing delay and higher training efficiency. This paper presents an optimal resource allocator for an elastic training system that leverages a mixed-integer programming (MIP) model to maximize the training progress of deep learning jobs. We use real-world job data obtained from ModelArts, the deep learning training platform of Huawei Cloud, and conduct simulation experiments to compare the optimal resource allocator with a greedy allocator as the benchmark. Numerical results show that the proposed allocator can reduce queuing time by up to 32% and improve training efficiency by up to 24% relative to the greedy resource allocator, thereby greatly improving user experience with Huawei ModelArts and potentially enabling higher profits for the product. The optimal resource allocator is also fast in decision-making, taking merely 0.4 seconds on average.
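
To make the contrast between an optimal and a greedy allocator concrete, here is a minimal Python sketch. It is not the paper's MIP model: the abstract does not give the formulation, so the job names, the throughput tables, the capacity of 4 GPUs, and the first-come-first-served greedy rule are all hypothetical assumptions chosen for illustration. Exhaustive enumeration stands in for a MIP solver, which is only practical at this toy scale.

```python
from itertools import product

# Hypothetical data (not from the paper): for each job, the training
# progress per unit time it achieves at each allowed GPU count.
THROUGHPUT = {
    "job_a": {0: 0.0, 1: 10.0, 2: 18.0, 4: 26.0},
    "job_b": {0: 0.0, 1: 8.0, 2: 15.0, 4: 28.0},
    "job_c": {0: 0.0, 1: 12.0, 2: 14.0},
}
CAPACITY = 4  # assumed total number of GPUs in the pool


def optimal_allocation(throughput, capacity):
    """Enumerate every feasible allocation and keep the one with the
    highest total progress. A stand-in for solving an integer program."""
    jobs = list(throughput)
    best_alloc, best_value = None, -1.0
    for choice in product(*(sorted(throughput[j]) for j in jobs)):
        if sum(choice) > capacity:
            continue  # violates the capacity constraint
        value = sum(throughput[j][g] for j, g in zip(jobs, choice))
        if value > best_value:
            best_alloc, best_value = dict(zip(jobs, choice)), value
    return best_alloc, best_value


def greedy_allocation(throughput, capacity):
    """Assumed greedy baseline: serve jobs in arrival (dict) order and
    give each the largest allowed GPU count that still fits."""
    alloc, remaining = {}, capacity
    for job, table in throughput.items():
        gpus = max(size for size in table if size <= remaining)
        alloc[job] = gpus
        remaining -= gpus
    value = sum(throughput[j][g] for j, g in alloc.items())
    return alloc, value


if __name__ == "__main__":
    print("optimal:", optimal_allocation(THROUGHPUT, CAPACITY))
    print("greedy: ", greedy_allocation(THROUGHPUT, CAPACITY))
```

With these made-up numbers, the greedy rule hands all four GPUs to the first job and leaves the other two queued, while the exhaustive search spreads the GPUs across all three jobs for higher total progress. This mirrors, in miniature, the two effects the abstract reports (shorter queuing and higher efficiency), though the magnitudes here say nothing about the paper's results.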

Citations (3)
