DSP: Dynamic Sequence Parallelism for Multi-Dimensional Transformers (2403.10266v3)
Abstract: Scaling multi-dimensional transformers to long sequences is indispensable across various domains. However, the large memory footprint and slow speed of processing such sequences necessitate sequence parallelism. All existing approaches fall under the category of embedded sequence parallelism, which is limited to sharding along a single sequence dimension and thereby introduces significant communication overhead. Yet the nature of multi-dimensional transformers involves independent computations across multiple sequence dimensions. To this end, we propose Dynamic Sequence Parallelism (DSP), a novel abstraction of sequence parallelism. DSP dynamically switches the parallel dimension among all sequence dimensions according to the computation stage, using an efficient resharding strategy. DSP offers significant reductions in communication cost, adaptability across modules, and ease of implementation with minimal constraints. Experimental evaluations demonstrate DSP's superiority over state-of-the-art embedded sequence parallelism methods, with throughput improvements ranging from 32.2% to 10x while requiring less than 25% of their communication volume.
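To make the core idea concrete, the following is a minimal sketch (not the authors' implementation) of switching the sharded sequence dimension between computation stages of a hypothetical two-dimensional (temporal x spatial) transformer block. The function and module names (`reshard`, `spatial_attention`, `temporal_attention`) and the [T, S, D] tensor layout are assumptions for illustration; each switch uses a single all-to-all over the activations.

```python
# Illustrative sketch of dynamic sequence parallelism for a 2-D transformer.
# Assumes torch.distributed is already initialized across `p` ranks and that
# both sequence dimensions are divisible by p. All names are hypothetical.
import torch
import torch.distributed as dist


def reshard(x, shard_dim, gather_dim, group=None):
    """Switch the sharded dimension: split `shard_dim` across ranks and
    gather the currently sharded `gather_dim`, via one all-to-all."""
    world = dist.get_world_size(group)
    # Split the dimension we are about to shard into one chunk per rank.
    chunks = [c.contiguous() for c in x.chunk(world, dim=shard_dim)]
    out = [torch.empty_like(chunks[0]) for _ in range(world)]
    dist.all_to_all(out, chunks, group=group)
    # Received chunks are ordered by rank, so concatenation along the
    # gathered dimension restores it in full.
    return torch.cat(out, dim=gather_dim)


def block(x, spatial_attention, temporal_attention):
    # Layout 1: temporal dim sharded, x has local shape [T/p, S, D],
    # so spatial attention runs entirely locally.
    x = spatial_attention(x)
    # Switch the parallel dimension: gather T, shard S (one all-to-all).
    x = reshard(x, shard_dim=1, gather_dim=0)      # -> [T, S/p, D]
    # Layout 2: spatial dim sharded, temporal attention runs locally.
    x = temporal_attention(x)
    # Switch back for the next block.
    x = reshard(x, shard_dim=0, gather_dim=1)      # -> [T/p, S, D]
    return x
```

Under these assumptions, each block needs only two all-to-alls on the activations to keep every attention stage fully local, which is one way to see how the communication volume can stay well below that of embedded sequence parallelism schemes that all-gather or ring-exchange the full sequence inside attention.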