OneFlow: Redesign the Distributed Deep Learning Framework from Scratch

Published 28 Oct 2021 in cs.DC, cs.AI, and cs.LG | (2110.15032v6)

Abstract: Deep learning frameworks such as TensorFlow and PyTorch provide a productive interface for expressing and training a deep neural network (DNN) model on a single device or using data parallelism. Still, they may not be flexible or efficient enough in training emerging large models on distributed devices, which require more sophisticated parallelism beyond data parallelism. Plugins or wrappers have been developed to strengthen these frameworks for model or pipeline parallelism, but they complicate the usage and implementation of distributed deep learning. Aiming at a simple, neat redesign of distributed deep learning frameworks for various parallelism paradigms, we present OneFlow, a novel distributed training framework based on an SBP (split, broadcast and partial-value) abstraction and the actor model. SBP enables much easier programming of data parallelism and model parallelism than existing frameworks, and the actor model provides a succinct runtime mechanism to manage the complex dependencies imposed by resource constraints, data movement and computation in distributed deep learning. We demonstrate the general applicability and efficiency of OneFlow for training various large DNN models with case studies and extensive experiments. The results show that OneFlow outperforms many well-known customized libraries built on top of the state-of-the-art frameworks. The code of OneFlow is available at: https://github.com/Oneflow-Inc/oneflow.


Authors (12)

Citations (33)


Summary

The paper introduces OneFlow, a framework that leverages SBP abstraction and the actor model to simplify data, model, and pipeline parallelism.
The paper details an automated execution plan that converts logical graphs into optimized physical graphs with efficient runtime operations.
The paper demonstrates superior performance on models like ResNet and BERT while reducing manual configuration and engineering overhead.

Overview of OneFlow: A Redesign of the Distributed Deep Learning Framework

The paper presents OneFlow, a novel distributed deep learning framework designed to address the limitations of current frameworks such as TensorFlow and PyTorch, particularly in their handling of large-scale models and distributed training. OneFlow is structured around two primary innovations: the SBP (split, broadcast, and partial-value) abstraction and the actor model. These concepts streamline the expression and execution of diverse parallelism strategies, including data parallelism, model parallelism, and pipeline parallelism.

Key Innovations

OneFlow's SBP abstraction describes how a global (logical) tensor maps onto distributed devices. Each tensor is annotated as divided along an axis (split), fully replicated on every device (broadcast), or held as per-device partial results that must be reduced to obtain the global value (partial-value). From these annotations the framework can determine the inter-device communication that is required, so developers can express complex parallelism strategies without writing low-level communication code.
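To make the three placements concrete, the following pure-Python sketch models two "devices" as list entries and checks that three placement combinations for a matrix multiplication all reproduce the same global result. The helper names (`split`, `broadcast`, `concat`, `reduce_partial_sum`) are hypothetical and are not OneFlow's API; the combinations shown follow from ordinary block-matrix identities and are meant only as an illustration of the idea.

```python
# Illustrative model of SBP placements on 2 "devices".
# All names are hypothetical, not OneFlow's API. Matrices are lists of rows.

def split(matrix, axis, parts):
    """S(axis): cut a matrix into `parts` equal shards (assumes divisibility)."""
    if axis == 0:
        step = len(matrix) // parts
        return [matrix[i * step:(i + 1) * step] for i in range(parts)]
    step = len(matrix[0]) // parts
    return [[row[i * step:(i + 1) * step] for row in matrix] for i in range(parts)]

def broadcast(matrix, parts):
    """B: every device holds a full copy."""
    return [[row[:] for row in matrix] for _ in range(parts)]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def concat(shards, axis):
    """Reassemble a split tensor along `axis`."""
    if axis == 0:
        return [row for shard in shards for row in shard]
    return [sum(rows, []) for rows in zip(*shards)]

def reduce_partial_sum(shards):
    """P(sum): the global value is the element-wise sum of all shards."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*shards)]

X = [[1, 2], [3, 4]]
W = [[5, 6], [7, 8]]
expected = matmul(X, W)

# Data parallelism: X is S(0), W is B, so each device yields a row block of Y.
y_data = concat(
    [matmul(x, w) for x, w in zip(split(X, 0, 2), broadcast(W, 2))], axis=0)

# Model parallelism: X is B, W is S(1), so each device yields a column block.
y_model = concat(
    [matmul(x, w) for x, w in zip(broadcast(X, 2), split(W, 1, 2))], axis=1)

# X is S(1), W is S(0): each device yields a partial sum of the full Y.
y_partial = reduce_partial_sum(
    [matmul(x, w) for x, w in zip(split(X, 1, 2), split(W, 0, 2))])
```

In each case the local computation is the same `matmul`; only the placement of the inputs, and therefore the step needed to recover the global tensor, differs.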

Complementing SBP, the actor model is employed to manage runtime operations. Each actor encapsulates a single operation and coordinates with other actors purely through message passing. Dependencies that arise from computation, data movement, and resource constraints are all handled by the same asynchronous protocol rather than by separate mechanisms, which reduces complexity.
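A minimal sketch of this message-driven style is shown below. It assumes a simplified protocol with two message kinds, `req` (an input is ready to read) and `ack` (a consumer has released an output slot), and a single-threaded mailbox standing in for the real runtime. The class, field, and message names are illustrative assumptions, not OneFlow's implementation.

```python
from collections import deque

class Actor:
    """One operation wrapped as a message-driven state machine (illustrative)."""

    def __init__(self, name, fn, out_registers=1):
        self.name, self.fn = name, fn
        self.pending = deque()         # inputs announced by "req" messages
        self.free_out = out_registers  # output slots not held by a consumer
        self.upstream = None
        self.downstream = None
        self.outputs = []              # only filled by the last actor

    def handle(self, kind, payload, mailbox):
        if kind == "req":
            self.pending.append(payload)
        elif kind == "ack":
            self.free_out += 1
        # Act only while an input is ready AND an output slot is free.
        while self.pending and self.free_out > 0:
            value = self.fn(self.pending.popleft())
            if self.upstream is not None:
                mailbox.append((self.upstream, "ack", None))  # release input
            if self.downstream is not None:
                self.free_out -= 1                            # occupy a slot
                mailbox.append((self.downstream, "req", value))
            else:
                self.outputs.append(value)

def run(actors, inputs):
    """Chain the actors and deliver messages until the mailbox is empty."""
    for up, down in zip(actors, actors[1:]):
        up.downstream, down.upstream = down, up
    mailbox = deque((actors[0], "req", x) for x in inputs)
    while mailbox:
        target, kind, payload = mailbox.popleft()
        target.handle(kind, payload, mailbox)
    return actors[-1].outputs

pipeline = [
    Actor("double", lambda x: x * 2),
    Actor("add_one", lambda x: x + 1),
    Actor("sink", lambda x: x),
]
result = run(pipeline, [1, 2, 3])
```

No actor inspects another actor's state: data dependencies and slot availability are both expressed as messages, which is the unification the paragraph above describes.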

Implementation Highlights

A significant advantage of OneFlow is that it generates execution plans automatically. The framework translates the logical graph into an optimized physical graph, identifying where data must be re-routed between devices and inserting the corresponding operations, which minimizes manual intervention. This automation can reduce the overhead and inefficiency found in existing frameworks that rely on additional plugins or complex customization for model or pipeline parallelism.
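The insertion of data-routing operations can be sketched as a small graph pass. The example below assumes each logical operation declares the placement it expects for its input and the placement it produces, written as strings such as `"S(0)"`, `"B"`, and `"P"`. The mapping from placement mismatches to collectives is the conventional choice and is an assumption here, as are all function and operation names; the paper's actual planner is not reproduced.

```python
# Hypothetical lowering pass: insert a routing op wherever the producer's
# placement differs from what the next op expects. Not OneFlow's planner.

ROUTING = {
    ("S", "B"): "all_gather",      # collect shards into full copies
    ("P", "B"): "all_reduce",      # sum partial results everywhere
    ("P", "S"): "reduce_scatter",  # sum partial results, keep one shard each
    ("B", "S"): "local_slice",     # each device slices its own copy
}

def routing_op(src, dst):
    """Return the routing op needed to convert placement src into dst."""
    if src == dst:
        return None
    if src[0] == "S" and dst[0] == "S":
        return "all_to_all"        # re-split along a different axis
    try:
        return ROUTING[(src[0], dst[0])]
    except KeyError:
        raise ValueError(f"conversion {src} -> {dst} not covered by this sketch")

def lower(logical_ops, source_placement):
    """logical_ops: list of (name, input_placement, output_placement)."""
    physical, current = [], source_placement
    for name, in_placement, out_placement in logical_ops:
        op = routing_op(current, in_placement)
        if op is not None:
            physical.append(f"{op}[{current}->{in_placement}]")
        physical.append(name)
        current = out_placement
    return physical

logical = [
    ("matmul_1", "B", "S(1)"),   # weights split by column
    ("matmul_2", "S(1)", "P"),   # weights split by row, output is partial
    ("softmax", "B", "B"),       # needs the fully reduced tensor
]
plan = lower(logical, source_placement="B")
```

Here the user writes only the three logical operations; the single `all_reduce` appears because `matmul_2` produces partial values while `softmax` needs a complete tensor.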

Because the runtime is built on actors, it supports pipeline parallelism and back-pressure flow control natively, which improves resource utilization and helps prevent deadlocks when dependencies are complex. It also allows data preprocessing, computation, and communication to overlap, increasing throughput.
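The effect of back-pressure on overlap can be illustrated with a simplified timing model of one producer stage feeding one consumer stage through a fixed number of buffers. This is an assumption-laden sketch, not OneFlow's scheduler: the function name, its parameters, and the rule that a buffer is freed when the consumer finishes with it are all chosen for illustration.

```python
def simulate(num_batches, buffers, produce_time, consume_time):
    """Timing model of a two-stage pipeline with a bounded number of buffers.

    The producer may start batch i only when it has finished batch i-1 AND
    the consumer has released the buffer used by batch i-buffers.
    Returns (producer_start_times, total_time).
    """
    prod_start, prod_done, cons_done = [], [], []
    for i in range(num_batches):
        ready = prod_done[i - 1] if i >= 1 else 0
        freed = cons_done[i - buffers] if i >= buffers else 0
        start = max(ready, freed)          # back-pressure: wait for a buffer
        prod_start.append(start)
        prod_done.append(start + produce_time)
        prev_cons = cons_done[i - 1] if i >= 1 else 0
        cons_done.append(max(prod_done[i], prev_cons) + consume_time)
    return prod_start, cons_done[-1]

# One buffer: the stages strictly alternate, so nothing overlaps.
_, serial_time = simulate(4, buffers=1, produce_time=1, consume_time=1)

# Two buffers: the producer prepares batch i+1 while batch i is consumed.
_, overlapped_time = simulate(4, buffers=2, produce_time=1, consume_time=1)

# A fast producer with a slow consumer is throttled instead of running ahead.
throttled_starts, _ = simulate(4, buffers=2, produce_time=1, consume_time=3)
```

With two buffers the four batches finish in 5 time units rather than 8, and in the throttled case the producer's third batch cannot start until the consumer releases the first buffer, which bounds the memory held by in-flight data.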

Comparative Evaluation

The paper reports a series of empirical evaluations to demonstrate OneFlow's advantages. OneFlow achieves superior or comparable performance to leading frameworks and specialized libraries across a range of deep learning models, including ResNet, BERT, InsightFace, and large-scale recommender systems. It also requires less engineering effort and offers greater flexibility and ease of use, especially in hybrid parallelism scenarios.

Implications and Future Directions

OneFlow's design provides a robust foundation for scalable and efficient distributed deep learning, showcasing potential improvements in computational throughput and resource management. Practically, this could accelerate training times for emerging large-scale neural networks and adapt to diverse workloads with minimal modification.

Theoretically, the abstractions and runtime design proposed in OneFlow may influence future deep learning frameworks. The successful use of the actor model here suggests that it merits further exploration in other distributed computing systems.

Moving forward, OneFlow's development could focus on enhancing elasticity and fault tolerance, as well as automated parallelism configuration. This may involve developing sophisticated cost models and integration strategies to adapt dynamically to changing computational and network environments.

In conclusion, OneFlow represents a significant step toward a more coherent approach to distributed deep learning, merging cutting-edge theoretical concepts with practical, scalable system design.
