
Slapo: A Schedule Language for Progressive Optimization of Large Deep Learning Model Training (2302.08005v2)

Published 16 Feb 2023 in cs.LG

Abstract: Recent years have seen an increase in the development of large deep learning (DL) models, which makes training efficiency crucial. Common practice is struggling with the trade-off between usability and performance. On one hand, DL frameworks such as PyTorch use dynamic graphs to facilitate model developers at a price of sub-optimal model training performance. On the other hand, practitioners propose various approaches to improving the training efficiency by sacrificing some of the flexibility, ranging from making the graph static for more thorough optimization (e.g., XLA) to customizing optimization towards large-scale distributed training (e.g., DeepSpeed and Megatron-LM). In this paper, we aim to address the tension between usability and training efficiency through separation of concerns. Inspired by DL compilers that decouple the platform-specific optimizations of a tensor-level operator from its arithmetic definition, this paper proposes a schedule language, Slapo, to decouple model execution from definition. Specifically, Slapo works on a PyTorch model and uses a set of schedule primitives to convert the model for common model training optimizations such as high-performance kernels, effective 3D parallelism, and efficient activation checkpointing. Compared to existing optimization solutions, Slapo progressively optimizes the model "as-needed" through high-level primitives, and thus preserving programmability and debuggability for users to a large extent. Our evaluation results show that by scheduling the existing hand-crafted optimizations in a systematic way using Slapo, we are able to improve training throughput by up to 2.92x on a single machine with 8 NVIDIA V100 GPUs, and by up to 1.41x on multiple machines with up to 64 GPUs, when compared to the out-of-the-box performance of DeepSpeed and Megatron-LM.

Citations (5)

Summary

  • The paper introduces Slapo, a novel schedule language that decouples model execution from definition to optimize large deep learning model training.
  • The paper demonstrates that Slapo achieves up to 2.92× speedup on single-machine setups and 1.41× in multi-node environments, significantly boosting training throughput.
  • The paper highlights key features such as dynamic graph compatibility, structure-preserving scheduling, and auto-tuning to enable efficient and minimal-disruption optimizations.

Slapo: A Schedule Language for Progressive Optimization of Large Deep Learning Model Training

Introduction to Slapo

The paper introduces Slapo, a schedule language designed to optimize large deep learning (DL) model training by decoupling model execution from model definition. Slapo targets frameworks with dynamic graphs, such as PyTorch, which trade some training performance for flexibility and ease of development. It addresses the tension between usability and performance by providing a set of high-level primitives that let users tune how a model is trained without modifying its original definition. This separation enables performance engineers to optimize execution strategies while the model's original structure stays intact.
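As a rough illustration of this separation of concerns, the sketch below defines an ordinary PyTorch module and then attaches a schedule to it. The workflow (create a schedule, apply a per-module primitive, build the result) follows the paper's description of Slapo, but the package name, method names, and return value shown here are assumptions for illustration, not the library's verbatim API.

    import torch.nn as nn
    import slapo  # assumed package name for the paper's open-source artifact

    # The model definition is plain PyTorch; no optimization logic lives here.
    class MLP(nn.Module):
        def __init__(self, d=1024):
            super().__init__()
            self.fc1 = nn.Linear(d, 4 * d)
            self.act = nn.GELU()
            self.fc2 = nn.Linear(4 * d, d)

        def forward(self, x):
            return self.fc2(self.act(self.fc1(x)))

    # Execution decisions are expressed separately, on a schedule attached to the model.
    sch = slapo.create_schedule(MLP())   # wrap the model; its module hierarchy is preserved
    sch["fc1"].checkpoint()              # e.g., activation checkpointing on a single submodule
    opt_model = slapo.build(sch)         # materialize the scheduled model for training (return signature assumed)

Because the schedule lives outside the model class, the same MLP definition can be trained unscheduled for debugging or scheduled for performance without touching its forward logic.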

Core Features of Slapo

Slapo defines a comprehensive set of schedule primitives covering common DL training optimizations, including high-performance kernel replacement, 3D parallelism, and activation checkpointing. Because these primitives can be applied progressively and selectively, Slapo minimizes disruption to the existing model definition. Key features include the following (a short code sketch follows the list):

  1. Dynamic Graph Compatibility: Unlike static graph frameworks, Slapo operates directly on PyTorch models, preserving their dynamic characteristics while enabling significant optimizations.
  2. Structure-Preserving Scheduling: By maintaining model structure hierarchy, Slapo allows developers to troubleshoot and optimize specific modules without altering the entire model.
  3. Auto-Tuning: Slapo includes an auto-tuner capable of exploring optimal configurations within a defined parameter space, reducing the time and effort required to identify the best execution strategies.
  4. Framework Dialects: Slapo supports integration with existing distributed frameworks like DeepSpeed and Megatron-LM, allowing models scheduled by Slapo to leverage advanced runtime optimizations.
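To make the progressive, selective application of primitives concrete, the sketch below assumes a schedule `sch` has been created (as in the earlier sketch) over a Transformer-style encoder, so that submodule paths like encoder.layer.0.attention exist. The shard/sync and replace calls mirror the kinds of primitives the paper describes for tensor parallelism and fused kernels, but the submodule paths, argument names, and the FusedBiasLayerNorm module are illustrative placeholders rather than the exact API.

    # Tensor-parallel sharding of one attention block, with a synchronization point after it.
    attn = sch["encoder.layer.0.attention"]
    attn.shard("qkv.weight", axis=0)
    attn.sync(mode="fwd_post", sync_op_or_fn="all_reduce")

    # Swap one slow bias-add + layernorm pattern for a hand-written fused module.
    sch["encoder.layer.0.output"].replace(FusedBiasLayerNorm())

    # Everything not named in the schedule keeps its original PyTorch implementation,
    # so optimizations can be adopted layer by layer ("as needed") and rolled back easily.

Only the scheduled submodules change behavior; the rest of the model remains debuggable, ordinary PyTorch, which is the structure-preserving property described above.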

Implementation and Evaluation

Slapo is implemented on top of PyTorch and uses torch.fx as its intermediate representation (IR) for static graph tracing. This keeps it compatible with PyTorch's existing ecosystem while allowing optimizations to be expressed at a high level. The evaluation shows that Slapo significantly improves training throughput for various large models across configurations, achieving up to 2.92× speedup on a single machine (8 NVIDIA V100 GPUs) and up to 1.41× speedup in multi-node settings (up to 64 GPUs) compared to state-of-the-art baselines such as DeepSpeed and Megatron-LM.

Figure 1: Overview of Slapo.
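Because torch.fx is the IR, a traced submodule becomes an ordinary FX GraphModule that scheduling passes can inspect and rewrite. The self-contained snippet below uses plain torch.fx (standard PyTorch, not Slapo-specific code) to show what that graph-level view looks like for a small module.

    import torch
    import torch.nn as nn
    import torch.fx

    class Block(nn.Module):
        def __init__(self):
            super().__init__()
            self.linear = nn.Linear(16, 16)

        def forward(self, x):
            return torch.relu(self.linear(x))

    gm = torch.fx.symbolic_trace(Block())  # GraphModule: a static-graph view of the dynamic model
    print(gm.graph)                        # node-level IR that passes can pattern-match and rewrite
    out = gm(torch.randn(2, 16))           # the traced module still runs like the original

Tracing only the submodules that need graph-level rewriting (rather than the whole model) is what lets the approach keep most of the model dynamic and debuggable.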

Practical Implications and Future Work

Slapo's ability to optimize model execution without modifying the original model makes it a valuable tool for both researchers and industry practitioners seeking to maximize training efficiency. It allows for rapid prototyping and experimentation with different optimization techniques without the risk of introducing errors into the model's core logic. Future work includes developing auto-schedulers that can automatically generate schedule primitives and expanding the library of optimizations to include emerging techniques in DL training.

Conclusion

Slapo represents a significant advancement in the optimization of large DL model training by providing a flexible, user-friendly interface for applying complex scheduling and optimization strategies. Its compatibility with existing frameworks and focus on preserving model structure make it a practical and effective tool for improving training efficiency in real-world applications. Through continued development, Slapo has the potential to become a standard component in the toolkit of DL practitioners looking to optimize training performance.
