EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters

(arXiv:2402.04252)
Published Feb 6, 2024 in cs.CV

Abstract

Scaling up contrastive language-image pretraining (CLIP) is critical for empowering both vision and multimodal models. We present EVA-CLIP-18B, the largest and most powerful open-source CLIP model to date, with 18-billion parameters. With only 6-billion training samples seen, EVA-CLIP-18B achieves an exceptional 80.7% zero-shot top-1 accuracy averaged across 27 widely recognized image classification benchmarks, outperforming its forerunner EVA-CLIP (5-billion parameters) and other open-source CLIP models by a large margin. Remarkably, we observe a consistent performance improvement with the model size scaling of EVA-CLIP, despite maintaining a constant training dataset of 2-billion image-text pairs from LAION-2B and COYO-700M. This dataset is openly available and much smaller than the in-house datasets (e.g., DFN-5B, WebLI-10B) employed in other state-of-the-art CLIP models. EVA-CLIP-18B demonstrates the potential of EVA-style weak-to-strong visual model scaling. With our model weights made publicly available, we hope to facilitate future research in vision and multimodal foundation models.

Figure: comparison of zero-shot classification performance across benchmarks, showing EVA-CLIP's improvement over the state of the art as model size scales.

Overview

  • EVA-CLIP-18B is an 18-billion parameter model based on the EVA scaling philosophy, improving the performance of CLIP models in vision and multimodal tasks.

  • Despite training on a smaller, openly available dataset than other state-of-the-art CLIP models, EVA-CLIP-18B achieves 80.7% average zero-shot top-1 accuracy across 27 image classification benchmarks, a significant advance over previous open-source models.

  • The model is robust, with high recall in image-text retrieval and minimal accuracy drop against adversarial ImageNet variants, showcasing resilience to distributional shifts.

  • The paper details the model's training process, ablation studies, and offers insights applicable to future research in the scaling of vision models.

Introduction

The quest for advancing the capabilities of Contrastive Language-Image Pretraining (CLIP) models has led to significant developments in the field of AI. CLIP models have become a cornerstone for both vision and multimodal tasks by establishing robust and transferable visual representations that can be effectively paired with textual data. A recent stride in this area is the development of EVA-CLIP-18B, an 18-billion parameter CLIP model, built on the EVA scaling philosophy. This model represents an open-source milestone, not only due to its sheer scale but also due to its remarkable zero-shot learning performance on a diverse range of benchmarks.
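
As a rough illustration of how such a model is used zero-shot, the sketch below classifies images by comparing their embeddings against embeddings of class-prompt texts. The encoders, embedding dimension, and temperature here are placeholders, not EVA-CLIP-18B's actual components.

```python
# Minimal sketch of CLIP-style zero-shot classification (hypothetical encoders).
import torch
import torch.nn.functional as F

def zero_shot_classify(image_features, class_text_features, temperature=0.01):
    """Assign each image to the class whose prompt embedding it is closest to.

    image_features:      [N, D] tensor of image embeddings
    class_text_features: [C, D] tensor of text embeddings, one per class prompt
    """
    img = F.normalize(image_features, dim=-1)       # unit-norm image embeddings
    txt = F.normalize(class_text_features, dim=-1)  # unit-norm text embeddings
    logits = img @ txt.T / temperature              # scaled cosine similarities
    return logits.argmax(dim=-1)                    # predicted class index per image

# Toy usage with random embeddings standing in for real encoder outputs.
preds = zero_shot_classify(torch.randn(8, 1024), torch.randn(27, 1024))
```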

Scaling Vision Models

EVA-CLIP-18B exemplifies a weak-to-strong scaling approach, initially distilled from a 5-billion-parameter EVA-CLIP teacher model; the EVA philosophy encourages progressive scaling, with smaller models providing supervision for larger ones. Training relied on a dataset smaller than those used by competing models, roughly 2 billion image-text pairs from LAION-2B and COYO-700M, and the model saw only 6 billion samples in total. Despite this, EVA-CLIP-18B surpasses its forerunner and other open-source models with an 80.7% average zero-shot top-1 accuracy across 27 image classification benchmarks.
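
For context, CLIP-style pretraining on image-text pairs optimizes a symmetric contrastive objective over batches of paired embeddings. The sketch below shows this standard InfoNCE-style loss in PyTorch; it is illustrative of the general recipe, not EVA-CLIP-18B's exact training code.

```python
# Minimal sketch of the symmetric image-text contrastive (InfoNCE) loss used in
# CLIP-style pretraining; actual EVA-CLIP-18B training details may differ.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, logit_scale=100.0):
    """image_features, text_features: [B, D] embeddings of matched pairs."""
    img = F.normalize(image_features, dim=-1)
    txt = F.normalize(text_features, dim=-1)
    logits = logit_scale * img @ txt.T                       # [B, B] similarity matrix
    targets = torch.arange(img.size(0), device=img.device)   # diagonal entries are matches
    loss_i = F.cross_entropy(logits, targets)                # image -> text direction
    loss_t = F.cross_entropy(logits.T, targets)              # text -> image direction
    return 0.5 * (loss_i + loss_t)

# Toy usage with random embeddings in place of encoder outputs.
loss = clip_contrastive_loss(torch.randn(4, 512), torch.randn(4, 512))
```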

Performance and Robustness Analysis

Comprehensive evaluations show that EVA-CLIP-18B's performance improves consistently with scale, with no sign of saturation. The model performs strongly across assessments ranging from zero-shot image and video classification to image-text retrieval, reaching an average recall of 87.8% across retrieval benchmarks and outperforming its closest open-source rival by 1.5% and the largest existing CLIP model by 2.7% on average. It also shows strong robustness: accuracy drops by only 0.2% when evaluated on adversarial ImageNet variants, indicating resilience to distributional shifts in visual data.
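
Retrieval recall of this kind is typically computed from an image-text similarity matrix. The sketch below computes recall@K under the simplifying assumption of one caption per image; real benchmarks such as COCO pair each image with several captions and need an index mapping instead.

```python
# Sketch of recall@K for image-to-text retrieval, assuming a one-to-one
# pairing where the true caption of image i sits in column i.
import torch

def recall_at_k(similarity, k=1):
    """similarity: [N_images, N_texts] matrix of image-text similarity scores."""
    topk = similarity.topk(k, dim=-1).indices                 # [N, k] retrieved text ids
    targets = torch.arange(similarity.size(0)).unsqueeze(1)   # [N, 1] ground-truth ids
    return (topk == targets).any(dim=-1).float().mean().item()

# Toy usage with random similarity scores.
sim = torch.randn(100, 100)
print(recall_at_k(sim, k=1), recall_at_k(sim, k=5))
```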

Ablation Studies and Training Insights

The authors also conducted ablation studies, in particular on the influence of image transformations on model evaluation. Direct resizing of images was found to yield considerable performance variability across tasks, underscoring the nuanced effects of preprocessing choices on large-scale model evaluation. The paper additionally details the training settings and optimizations used, including mixed-precision training, layer-wise learning rate decay, and DeepSpeed's ZeRO optimization for efficient use of computational resources.
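
As a concrete illustration of layer-wise learning rate decay, the sketch below builds optimizer parameter groups whose learning rates shrink geometrically from the last transformer block toward the first. The toy model, the `blocks` attribute, and the decay factor are illustrative assumptions rather than the paper's exact configuration, which would also need groups for embeddings, heads, and other parameters.

```python
# Sketch of layer-wise learning rate decay: the last transformer block keeps the
# base LR, and each earlier block's LR is scaled down geometrically.
import torch
import torch.nn as nn

def layerwise_lr_param_groups(blocks, base_lr=1e-4, decay=0.9):
    num_layers = len(blocks)
    groups = []
    for i, block in enumerate(blocks):
        scale = decay ** (num_layers - 1 - i)   # earlier blocks get smaller LRs
        groups.append({"params": list(block.parameters()), "lr": base_lr * scale})
    return groups

# Toy usage: a stack of 4 "blocks" standing in for a ViT's transformer layers.
blocks = nn.ModuleList([nn.Linear(16, 16) for _ in range(4)])
optimizer = torch.optim.AdamW(layerwise_lr_param_groups(blocks), weight_decay=0.05)
```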

Future Scope and Contributions

EVA-CLIP-18B not only sets a benchmark for CLIP model scaling but also shows that state-of-the-art results can be reached without resorting to extraordinarily large proprietary datasets. Its open-source availability paves the way for future research on ever stronger vision and multimodal foundation models, and the paper's training strategies and ablation findings offer practical guidance for future work on scaling vision models in a well-founded, empirically driven manner.
