PaReprop: Fast Parallelized Reversible Backpropagation (2306.09342v1)
Abstract: The growing size of datasets and deep learning models has made faster and more memory-efficient training crucial. Reversible transformers have recently been introduced as an exciting new method for extremely memory-efficient training, but they come with the additional computational overhead of activation re-computation in the backpropagation phase. We present PaReprop, a fast Parallelized Reversible Backpropagation algorithm that parallelizes this activation re-computation with the gradient computation itself during the backpropagation phase. We demonstrate the effectiveness of PaReprop through extensive benchmarking across model families (ViT, MViT, Swin, and RoBERTa), data modalities (vision and NLP), model sizes (from small to giant), and training batch sizes. Our empirical results show that PaReprop achieves up to 20% higher training throughput than vanilla reversible training, largely mitigating the theoretical 25% throughput overhead of activation re-computation in reversible training. Project page: https://tylerzhu.com/pareprop.
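The core idea, overlapping the activation re-computation for one block with the gradient computation for another, can be sketched with two CUDA streams. The snippet below is a minimal illustration under assumed names (`ReversibleBlock`, `pareprop_backward`) and a simple additive two-stream coupling; it is not the authors' reference implementation, and a production version would also need stream-aware memory handling.

```python
import torch
import torch.nn as nn


class ReversibleBlock(nn.Module):
    """Two-stream reversible coupling: y1 = x1 + F(x2), y2 = x2 + G(y1)."""

    def __init__(self, f: nn.Module, g: nn.Module):
        super().__init__()
        self.f, self.g = f, g

    def forward(self, x1, x2):
        # Forward pass; intermediate activations need not be cached.
        y1 = x1 + self.f(x2)
        y2 = x2 + self.g(y1)
        return y1, y2

    def invert(self, y1, y2):
        # Exactly reconstruct the block's inputs from its outputs.
        with torch.no_grad():
            x2 = y2 - self.g(y1)
            x1 = y1 - self.f(x2)
        return x1, x2


def pareprop_backward(blocks, y1, y2, dy1, dy2):
    """Reverse-order backward pass that overlaps re-computation of the
    previous block's inputs (side stream) with the current block's
    gradient computation (default stream)."""
    side = torch.cuda.Stream()
    # Recompute the last block's inputs up front, as in vanilla reversible backprop.
    x1, x2 = blocks[-1].invert(y1, y2)
    for i in reversed(range(len(blocks))):
        blk = blocks[i]
        # (1) Side stream: start reconstructing block i-1's inputs from its
        #     outputs (which are block i's inputs). This is the re-computation
        #     that gets hidden behind the gradient work below.
        if i > 0:
            side.wait_stream(torch.cuda.current_stream())
            with torch.cuda.stream(side):
                prev_x1, prev_x2 = blocks[i - 1].invert(x1, x2)
        # (2) Default stream: rebuild the local graph for block i and
        #     backpropagate the incoming output gradients through it.
        x1_req = x1.detach().requires_grad_(True)
        x2_req = x2.detach().requires_grad_(True)
        z1, z2 = blk(x1_req, x2_req)
        torch.autograd.backward((z1, z2), (dy1, dy2))
        dy1, dy2 = x1_req.grad, x2_req.grad
        # Wait for the side stream before consuming the reconstructed inputs.
        torch.cuda.current_stream().wait_stream(side)
        if i > 0:
            x1, x2 = prev_x1, prev_x2
    return dy1, dy2
```

In vanilla reversible backpropagation the `invert` call and the gradient computation run sequentially on a single stream; issuing the earlier block's `invert` on a side stream is what hides the roughly 25% re-computation overhead mentioned in the abstract.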
Authors: Tyler Zhu, Karttikeya Mangalam