
DSP: Dynamic Sequence Parallelism for Multi-Dimensional Transformers (2403.10266v3)

Published 15 Mar 2024 in cs.DC and cs.LG

Abstract: Scaling multi-dimensional transformers to long sequences is indispensable across various domains. However, the challenges of large memory requirements and slow speeds of such sequences necessitate sequence parallelism. All existing approaches fall under the category of embedded sequence parallelism, which is limited to sharding along a single sequence dimension, thereby introducing significant communication overhead. Yet the nature of multi-dimensional transformers involves independent calculations across multiple sequence dimensions. To this end, we propose Dynamic Sequence Parallelism (DSP) as a novel abstraction of sequence parallelism. DSP dynamically switches the parallel dimension among all sequences according to the computation stage with an efficient resharding strategy. DSP offers significant reductions in communication costs, adaptability across modules, and ease of implementation with minimal constraints. Experimental evaluations demonstrate DSP's superiority over state-of-the-art embedded sequence parallelism methods, with throughput improvements ranging from 32.2% to 10x and less than 25% communication volume.


Summary

  • The paper introduces DSP to dynamically adjust parallelism in multi-dimensional transformers, enhancing throughput and reducing communication costs.
  • It achieves throughput improvements between 42.0% and 216.8% while reducing communication volume by at least 75% compared to existing methods.
  • DSP supports large models and diverse attention kernels with minimal integration overhead, paving the way for scalable long sequence processing.

Dynamic Sequence Parallelism: Enhancing the Efficiency of Multi-Dimensional Transformers

Introduction

In the landscape of deep learning, especially within domains like natural language generation, video generation, and multimodal applications, managing long sequences remains a pivotal challenge. Traditional sequence parallelism techniques, developed to distribute long sequences across multiple devices, struggle with the inherent structure of multi-dimensional transformers, leading to inefficiencies such as high activation memory costs and slow generation speeds. This paper introduces Dynamic Sequence Parallelism (DSP), a method specifically designed to address these issues by efficiently scaling multi-dimensional transformers to long sequences.

System Design and Key Contributions

DSP's design philosophy is to dynamically adjust the parallel dimension to match the current computation stage, in contrast to existing methods that fix the sharded sequence dimension statically, regardless of the computation phase. This dynamic approach keeps communication costs minimal and significantly boosts end-to-end throughput.
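
To make the mechanism concrete, below is a minimal single-process sketch (our illustration, not the authors' implementation) that simulates N devices as a list of tensor shards and emulates the AlltoAll resharding step with explicit chunk exchange: the sequence is sharded along the temporal dimension for spatial attention, then switched to be sharded along the spatial dimension for temporal attention. The shapes, the helper `simulated_all_to_all`, and the variable names are illustrative assumptions.

```python
# Minimal sketch of DSP-style dynamic dimension switching (single process).
# N "devices" are simulated as a list of shards; the AlltoAll is emulated
# by explicit chunk exchange. Shapes and names are illustrative assumptions.
import torch

N = 4                     # simulated number of devices
T, S, H = 8, 16, 32       # temporal length, spatial length, hidden size

x = torch.randn(T, S, H)  # full multi-dimensional sequence (never materialized in practice)

# Stage 1: spatial attention -> shard along the temporal dimension,
# so every device holds complete spatial sequences and attends locally.
shards_T = list(x.chunk(N, dim=0))          # each shard: (T/N, S, H)

def simulated_all_to_all(shards, split_dim, cat_dim):
    """Emulate AlltoAll: each device splits its shard along `split_dim`,
    sends piece j to device j, and concatenates what it receives along `cat_dim`."""
    n = len(shards)
    pieces = [list(s.chunk(n, dim=split_dim)) for s in shards]   # pieces[i][j]
    return [torch.cat([pieces[i][j] for i in range(n)], dim=cat_dim) for j in range(n)]

# Dynamic switch: reshard from temporal-sharded to spatial-sharded with ONE AlltoAll.
shards_S = simulated_all_to_all(shards_T, split_dim=1, cat_dim=0)  # each: (T, S/N, H)

# Stage 2: temporal attention now runs locally over the full temporal dimension.
assert shards_S[0].shape == (T, S // N, H)

# Sanity check: resharding is lossless -- switching back reproduces the original layout.
back = simulated_all_to_all(shards_S, split_dim=0, cat_dim=1)
assert torch.equal(torch.cat(back, dim=0), x)
```

In a real multi-GPU setting, the chunk exchange would be a single `torch.distributed` all-to-all collective per switch, which is where DSP's communication savings come from.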

The key contributions of the paper are manifold:

  • Introduction of DSP to scale long sequences efficiently in multi-dimensional transformers, leveraging dynamic dimension switching.
  • Demonstrated end-to-end throughput improvements of 42.0% to 216.8%, alongside a communication volume reduction of at least 75% compared to leading sequence parallelism methods.
  • Compatibility of DSP with large model sizes and long sequence lengths, enhanced by its support for various attention kernels and integration with ZeRO for reduced memory overhead.
  • DSP's design promotes ease of use and portability, requiring minimal code modifications for integration into existing frameworks.

Theoretical Underpinnings and Comparative Analysis

Communication and Memory Analysis

A detailed communication analysis shows how DSP cuts both the number of required communication operations and their volume. For instance, DeepSpeed-Ulysses performs four AlltoAll operations around every attention module, and Megatron-SP relies on all-gather and reduce-scatter collectives whose per-device volume does not shrink with the number of devices, whereas DSP needs only two AlltoAll operations per transformer block, reducing the communication volume to $2M/N$. This lean communication strategy keeps latency low and scales well in larger clusters, providing a robust solution for training and inference on exceedingly long sequences.
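
As a rough, hedged back-of-envelope comparison (our assumptions, not the paper's exact accounting: a spatial-temporal block contains one spatial and one temporal attention module, a single AlltoAll over an activation of total size M moves about M/N per device, DeepSpeed-Ulysses needs four AlltoAll per attention module, and DSP needs two per block):

```python
# Per-device communication volume per spatial-temporal transformer block.
# M = total activation size, N = number of devices (illustrative values).
M, N = 1.0, 8

ulysses = 2 * 4 * (M / N)   # 2 attention modules x 4 AlltoAll each -> 8M/N
dsp = 2 * (M / N)           # 2 AlltoAll per block                  -> 2M/N

print(f"Ulysses ~ {ulysses:.3f}*M, DSP ~ {dsp:.3f}*M, ratio = {dsp / ulysses:.0%}")
# Under these assumptions the ratio is 25%, in line with the paper's
# "less than 25% communication volume" claim.
```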

Furthermore, the paper dissects DSP's activation and parameter memory footprints, showing that DSP minimizes activation costs through fewer shape transformations and lower communication overhead. By employing ZeRO for parameter memory management, DSP follows established best practices, keeping memory consumption in check while maintaining scalability.
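
As a simple illustration of the parameter-memory side (our own hypothetical numbers, not figures from the paper), ZeRO-style sharding keeps roughly 1/N of the weights on each device instead of a full replica:

```python
# Hypothetical per-device parameter memory with and without ZeRO-style sharding.
params = 1.5e9        # assumed model size (e.g. a ~1.5B-parameter DiT-style model)
bytes_per_param = 2   # bf16/fp16 weights
N = 8                 # number of devices

replicated_gb = params * bytes_per_param / 1e9     # each rank holds a full copy
sharded_gb = params * bytes_per_param / N / 1e9    # ZeRO partitions weights 1/N per rank

print(f"replicated: {replicated_gb:.2f} GB/device, ZeRO-sharded: {sharded_gb:.2f} GB/device")
```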

Experimental Validation

The empirical evaluation, conducted on NVIDIA H800 GPUs, pits DSP against contemporary sequence parallelism methods across various configurations of spatial-temporal transformer models. The robust throughput improvements, up to 216.8%, underscore DSP's potential to redefine efficiency standards for long-sequence processing with multi-dimensional transformers.

Implications and Future Perspectives

DSP represents a significant step forward in the evolution of sequence parallelism techniques, particularly for applications involving multi-dimensional transformers. By dynamically adjusting the parallelism dimension, DSP not only enhances computational efficiency but also paves the way for tackling previously infeasible long sequence processing tasks across various domains. The methodology's portability and minimal integration overhead further bolster its appeal, potentially accelerating its adoption in both academic and industrial settings.

Looking ahead, the implications of DSP's success could be far-reaching. Future explorations might delve into tailoring DSP's dynamic parallelism approach to other model architectures or investigating its integration with emerging transformer variants. Moreover, the profound communication and memory efficiency gains herald a promising avenue for deploying more complex and larger-scale models, even on constrained hardware environments.

In conclusion, Dynamic Sequence Parallelism emerges as a potent methodology for surmounting the challenges of scaling multi-dimensional transformers for long sequences, showcasing remarkable improvements in throughput and communication efficiency. As the deep learning landscape continues to evolve, techniques like DSP will undoubtedly play a critical role in enabling the next generation of AI applications.