Reversible Vision Transformers

Published 9 Feb 2023 in cs.CV and cs.AI | (2302.04869v1)

Abstract: We present Reversible Vision Transformers, a memory efficient architecture design for visual recognition. By decoupling the GPU memory requirement from the depth of the model, Reversible Vision Transformers enable scaling up architectures with efficient memory usage. We adapt two popular models, namely Vision Transformer and Multiscale Vision Transformers, to reversible variants and benchmark extensively across both model sizes and tasks of image classification, object detection and video classification. Reversible Vision Transformers achieve a reduced memory footprint of up to 15.5x at roughly identical model complexity, parameters and accuracy, demonstrating the promise of reversible vision transformers as an efficient backbone for hardware resource limited training regimes. Finally, we find that the additional computational burden of recomputing activations is more than overcome for deeper models, where throughput can increase up to 2.3x over their non-reversible counterparts. Full code and trained models are available at https://github.com/facebookresearch/slowfast. A simpler, easy to understand and modify version is also available at https://github.com/karttikeya/minREV

Authors (7)

Summary

The paper presents a novel reversible vision transformer design that achieves up to 15.5× memory reduction without increasing model complexity.
The paper reports up to 2.3× higher throughput for deeper models, despite the extra cost of recomputing activations.
The paper introduces a two-residual-stream architecture and lighter-augmentation training recipes that account for the stronger inherent regularization of reversible models.

Reversible Vision Transformers: An Overview

The paper introduces Reversible Vision Transformers (Rev-ViT), a memory-efficient architecture for visual recognition that decouples GPU memory requirements from model depth, so that architectures can be scaled up without a matching growth in memory use. The authors adapt two prominent models, the Vision Transformer (ViT) and Multiscale Vision Transformers (MViT), into reversible variants and benchmark them across model sizes and three tasks: image classification, object detection, and video classification.

Key Contributions and Findings

Memory Efficiency: Reversible Vision Transformers reduce the memory footprint by up to 15.5× at roughly identical model complexity, parameter count, and accuracy. This makes Rev-ViTs a candidate backbone for training regimes with limited hardware resources.
Throughput Improvement: Deeper Rev-ViT models achieve up to 2.3× higher throughput than their non-reversible counterparts, which more than offsets the added cost of recomputing activations.
Architecture Design: Rev-ViT uses a two-residual-stream architecture without internal skip connections, which the authors find important for training deeper models without convergence instabilities.
Training Recipes: The study observes that reversible transformers have stronger inherent regularization than standard networks. The authors therefore develop new training recipes with lighter augmentation, which allow the reversible models to match or exceed the performance of their non-reversible counterparts.
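The two-residual-stream idea behind these points can be illustrated with a minimal sketch. It assumes the standard reversible residual formulation (y1 = x1 + F(x2), y2 = x2 + G(y1)), which this summary does not spell out. The functions F and G are toy stand-ins for the attention and MLP sub-blocks, and every name below is hypothetical rather than taken from the released code.

```python
import math

# Toy stand-ins for the two sub-blocks (assumption: any deterministic
# function works here; a real model would use attention and an MLP).
def F(x):
    return [math.tanh(v) for v in x]

def G(x):
    return [0.5 * math.sin(v) for v in x]

def add(a, b):
    return [p + q for p, q in zip(a, b)]

def sub(a, b):
    return [p - q for p, q in zip(a, b)]

def forward(x1, x2):
    # Two residual streams: each stream is updated from the other one.
    y1 = add(x1, F(x2))
    y2 = add(x2, G(y1))
    return y1, y2

def inverse(y1, y2):
    # Inputs are recovered from outputs alone, so intermediate
    # activations do not have to be cached for the backward pass.
    x2 = sub(y2, G(y1))
    x1 = sub(y1, F(x2))
    return x1, x2

def run_stack(x1, x2, depth):
    # Apply `depth` reversible blocks, keeping only the latest pair.
    for _ in range(depth):
        x1, x2 = forward(x1, x2)
    return x1, x2

def recover_inputs(y1, y2, depth):
    # Walk the stack backwards, recomputing each block's inputs.
    for _ in range(depth):
        y1, y2 = inverse(y1, y2)
    return y1, y2
```

In this sketch only the final pair of streams is kept after the forward pass, and the inputs of every block are recomputed on the way back. That is the sense in which activation memory stops depending on depth, at the price of the extra recomputation that the throughput figures account for. Recovery is exact up to floating-point rounding.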

Implications and Future Directions

Reversible Vision Transformers open a path toward more resource-efficient deep learning models, which is particularly useful where GPU memory is the limiting factor. This matters more as models scale and their demand for computational resources grows. On the theoretical side, the work raises questions about the trade-off between recomputation and memory, and about how network depth should be weighed against memory management strategies.

Speculatively, these results could inform distributed and parallel training strategies, as well as model optimization for edge computing and real-time applications. Further work could explore integrating reversible structures into other neural architectures to see whether similar efficiencies carry over to other domains.

The paper provides a foundation for further work on model efficiency, including research into how widely reversible designs apply, and how they perform, within and beyond vision transformers.
