An Inverse Scaling Law for CLIP Training

Published 11 May 2023 in cs.CV | (2305.07017v2)

Abstract: CLIP, one of the pioneering foundation models that connect images and text, has enabled many recent breakthroughs in computer vision. However, its associated training cost is prohibitively high, imposing a significant barrier to its widespread exploration. In this paper, we present a surprising finding that there exists an inverse scaling law for CLIP training, whereby the larger the image/text encoders used, the shorter the sequence length of image/text tokens that can be applied in training. Moreover, we showcase that the strategy for reducing image/text token length plays a crucial role in determining the quality of this scaling law. As a result of this finding, we are able to successfully train CLIP even with limited computational resources. For example, using 8 A100 GPUs, our CLIP models achieve zero-shot top-1 ImageNet-1k accuracies of 63.2% in ~2 days, 67.8% in ~3 days, and 69.3% in ~4 days. Our method also works well when scaling up -- with G/14, we register a new record of 83.0% ImageNet-1k zero-shot accuracy, and meanwhile accelerate the training by ~33x compared to its OpenCLIP counterpart. By reducing the computation barrier associated with CLIP, we hope to inspire more research in this field, particularly from academics. Our code is available at https://github.com/UCSC-VLAA/CLIPA.




Summary

The paper introduces an inverse scaling law, showing that larger encoders can use shorter token sequences with minimal performance loss.
It details token reduction strategies, such as image resizing and syntax masking, that preserve semantic information during training.
The proposed CLIPA framework reaches 69.3% zero-shot top-1 ImageNet-1k accuracy in about four days on eight A100 GPUs, a large gain in training efficiency.

Overview of "An Inverse Scaling Law for CLIP Training"

The paper "An Inverse Scaling Law for CLIP Training" reports an inverse scaling law for Contrastive Language–Image Pre-training (CLIP): the larger the image and text encoders, the shorter the token sequences that can be used during training. The result bears on the ongoing discussion of the computational cost of training large-scale models, since it offers a way to ease resource constraints without significantly compromising performance.

CLIP connects images and text and has enabled many advances in zero-shot learning. Its training cost, however, has been a barrier to broader research. The inverse scaling law shows how the training process can be adjusted to reduce that cost.

Key Findings

Inverse Scaling Law: Larger image/text encoders tolerate shorter image/text token sequences during CLIP training with minimal impact on performance. This runs against the usual expectation in model scaling, where larger models typically require more resources.
Token Reduction Strategies: The authors compare several strategies for reducing image and text tokens. Strategies that preserve semantic information, such as image resizing for images and syntax masking for text, yield the best scaling behavior.
Improvements in Training Efficiency: The proposed CLIPA framework applies the inverse scaling law to make CLIP training feasible on constrained hardware such as eight A100 GPUs. On that setup, the abstract reports zero-shot top-1 ImageNet-1k accuracies of 63.2% after ~2 days and 67.8% after ~3 days of training.
Significant Results: The CLIPA framework achieves a zero-shot top-1 ImageNet-1k accuracy of 69.3% using eight A100 GPUs in about four days, a substantial saving compared to training regimes that demand hundreds of GPUs over extended periods. At larger scale, a G/14 model reaches 83.0% zero-shot ImageNet-1k accuracy while training ~33x faster than its OpenCLIP counterpart.
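To make the effect of image resizing concrete, the following minimal sketch counts the patch tokens a Vision Transformer would process at different input resolutions. The patch size of 16 and the listed resolutions are illustrative assumptions, not settings taken from the paper, and the quadratic cost estimate covers only the self-attention term.

```python
def vit_token_count(image_size: int, patch_size: int) -> int:
    """Patch tokens for a square image (class token not counted)."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    side = image_size // patch_size
    return side * side


def relative_attention_cost(tokens: int, baseline_tokens: int) -> float:
    """Self-attention cost grows roughly with the square of sequence length."""
    return (tokens / baseline_tokens) ** 2


PATCH_SIZE = 16  # assumed for illustration
baseline = vit_token_count(224, PATCH_SIZE)

# Illustrative resolutions, not the paper's training settings.
for size in (224, 160, 112, 64):
    n = vit_token_count(size, PATCH_SIZE)
    cost = relative_attention_cost(n, baseline)
    print(f"{size}x{size}: {n} tokens, attention cost x{cost:.3f}")
```

Halving the side length from 224 to 112 cuts the token count by a factor of four, which is why resizing is such an effective lever on training cost.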

Implications and Future Directions

The findings offer practical strategies for making foundation-model training more accessible and efficient, and they prompt a re-evaluation of resource allocation in AI research. The inverse scaling law suggests that larger models can achieve competitive performance with significantly fewer computational resources, which lowers the barrier to research and development in this space.
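The budgets quoted in the abstract can be put into GPU-days with simple arithmetic. The sketch below uses only the figures reported there (eight A100 GPUs; 63.2%, 67.8%, and 69.3% after roughly 2, 3, and 4 days); because the durations are approximate, the derived numbers are rough estimates.

```python
NUM_GPUS = 8

# (approximate training days, zero-shot top-1 ImageNet-1k accuracy in %)
reported = [(2, 63.2), (3, 67.8), (4, 69.3)]


def gpu_days(num_gpus: int, days: int) -> int:
    """Total budget as GPUs multiplied by wall-clock days."""
    return num_gpus * days


budgets = [gpu_days(NUM_GPUS, days) for days, _ in reported]

previous_acc = None
for (days, acc), budget in zip(reported, budgets):
    line = f"~{days} days on {NUM_GPUS} GPUs = ~{budget} GPU-days -> {acc}%"
    if previous_acc is not None:
        # Gain from the extra day of training (8 more GPU-days).
        line += f" (+{round(acc - previous_acc, 1)} points)"
    print(line)
    previous_acc = acc
```

The gain per additional day shrinks as the budget grows, the usual pattern of diminishing returns.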

The ability to shorten input token sequences with minimal performance degradation opens avenues for further exploration of adaptive training methodologies. Future research could explore the boundary conditions of this scaling law, extend these findings to other foundation models, or investigate hybrid strategies that combine several token reduction and resizing techniques.
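As a rough illustration of how the choice of text reduction strategy matters, the sketch below contrasts plain truncation with a priority-based mask that keeps nouns first, then adjectives, then other words. This is an assumed reading of "syntax masking", not the authors' implementation: the part-of-speech table is a hand-written toy (a real pipeline would use a tagger), the names `TOY_POS`, `PRIORITY`, and `priority_mask` are hypothetical, and the code works on whitespace-split words rather than tokenizer tokens.

```python
# Hypothetical, hand-written part-of-speech tags for illustration only.
TOY_POS = {
    "dog": "NOUN", "frisbee": "NOUN", "park": "NOUN",
    "brown": "ADJ", "sunny": "ADJ",
}
# Lower value = kept first (assumed ordering: nouns, adjectives, the rest).
PRIORITY = {"NOUN": 0, "ADJ": 1}
OTHER_PRIORITY = 2


def truncate(words, max_len):
    """Baseline: keep the first max_len words."""
    return words[:max_len]


def priority_mask(words, max_len, pos=TOY_POS):
    """Keep max_len words by part-of-speech priority, preserving word order."""
    def rank(i):
        tag = pos.get(words[i].lower(), "OTHER")
        return (PRIORITY.get(tag, OTHER_PRIORITY), i)

    kept = sorted(sorted(range(len(words)), key=rank)[:max_len])
    return [words[i] for i in kept]


caption = "a brown dog catches a frisbee in the sunny park".split()
print(truncate(caption, 4))       # ['a', 'brown', 'dog', 'catches']
print(priority_mask(caption, 4))  # ['brown', 'dog', 'frisbee', 'park']
```

With the same four-word budget, truncation spends two slots on an article and a verb, while the priority mask keeps the content-bearing words.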

In a rapidly evolving landscape where computational constraints often limit research, the insights and methods introduced here could catalyze broader participation and innovation. This work potentially paves the way for more sustainable and inclusive advancements in AI.
