
Abstract

Recent breakthroughs in vision-language models (VLMs) open a new chapter for the vision community. Thanks to training on large-scale Internet image-text pairs, VLMs provide stronger and more generalizable feature embeddings than ImageNet-pretrained models. However, despite these achievements, vanilla Vision Transformers (ViTs) remain the default choice for the image encoder. Although the pure Transformer has proven its effectiveness for text encoding, it remains questionable whether the same holds for image encoding, especially considering that the many network designs proposed on the ImageNet benchmark are rarely studied in VLMs. Because of the smaller data and model scales, the original conclusions about model design drawn on ImageNet can be limited and biased. In this paper, we aim to build an evaluation protocol for vision models in the vision-language era under the contrastive language-image pretraining (CLIP) framework. We provide a comprehensive way to benchmark different vision models, covering their zero-shot performance and scalability in both model and training data sizes. To this end, we introduce ViTamin, a new vision model tailored for VLMs. ViTamin-L significantly outperforms ViT-L by 2.0% in ImageNet zero-shot accuracy when using the same publicly available DataComp-1B dataset and the same OpenCLIP training scheme. ViTamin-L also presents promising results on 60 diverse benchmarks, including classification, retrieval, open-vocabulary detection and segmentation, and large multi-modal models. When further scaling up the model size, our ViTamin-XL, with only 436M parameters, attains 82.9% ImageNet zero-shot accuracy, surpassing the 82.0% achieved by EVA-E, which has ten times more parameters (4.4B).

Figure: The ViTamin architecture combines a convolutional stem, MBConv blocks, and Transformer Blocks, and outputs a feature map with stride 16.
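
To make the stage layout in the figure concrete, below is a minimal PyTorch sketch of the macro design described by the caption: a convolutional stem, convolutional (MBConv-style) stages, and a final Transformer stage operating on the stride-16 feature map. The channel widths, depths, and block internals here are illustrative placeholders, not ViTamin's actual configuration or released code.

```python
# Minimal sketch of a three-stage backbone: conv stem -> conv stages -> Transformer stage.
# All widths/depths are placeholder assumptions for illustration only.
import torch
import torch.nn as nn


class ConvStem(nn.Module):
    """Two stride-2 convolutions take the input from stride 1 to stride 4."""
    def __init__(self, in_ch=3, out_ch=64):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(in_ch, out_ch // 2, 3, stride=2, padding=1),
            nn.GELU(),
            nn.Conv2d(out_ch // 2, out_ch, 3, stride=2, padding=1),
        )

    def forward(self, x):
        return self.stem(x)  # (B, out_ch, H/4, W/4)


def conv_stage(ch, depth):
    """Placeholder convolutional stage: depthwise + pointwise conv blocks."""
    blocks = []
    for _ in range(depth):
        blocks.append(nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1, groups=ch),  # depthwise 3x3
            nn.GELU(),
            nn.Conv2d(ch, ch, 1),                        # pointwise 1x1
        ))
    return nn.Sequential(*blocks)


class ThreeStageBackbone(nn.Module):
    """Stem (stride 4) -> conv stages (stride 4, 8) -> Transformer stage at stride 16."""
    def __init__(self, dims=(64, 128, 256), depths=(2, 2, 4)):
        super().__init__()
        self.stem = ConvStem(3, dims[0])
        self.stage1 = conv_stage(dims[0], depths[0])                      # stride 4
        self.down1 = nn.Conv2d(dims[0], dims[1], 3, stride=2, padding=1)  # -> stride 8
        self.stage2 = conv_stage(dims[1], depths[1])
        self.down2 = nn.Conv2d(dims[1], dims[2], 3, stride=2, padding=1)  # -> stride 16
        layer = nn.TransformerEncoderLayer(
            d_model=dims[2], nhead=8, dim_feedforward=4 * dims[2], batch_first=True)
        self.stage3 = nn.TransformerEncoder(layer, num_layers=depths[2])

    def forward(self, x):
        x = self.stage1(self.stem(x))          # (B, C1, H/4,  W/4)
        x = self.stage2(self.down1(x))         # (B, C2, H/8,  W/8)
        x = self.down2(x)                      # (B, C3, H/16, W/16)
        b, c, h, w = x.shape
        tokens = self.stage3(x.flatten(2).transpose(1, 2))  # global self-attention
        return tokens.transpose(1, 2).reshape(b, c, h, w)   # stride-16 feature map


feat = ThreeStageBackbone()(torch.randn(1, 3, 224, 224))
print(feat.shape)  # torch.Size([1, 256, 14, 14])
```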

Overview

  • ViTamin introduces a novel architecture for vision-language models (VLMs), aiming to optimize vision models with scalable and high-performance solutions under the contrastive language-image pretraining (CLIP) framework.

  • The paper reevaluates existing vision models, including Vision Transformers (ViTs), ConvNets, and Hybrid architectures, establishing a comprehensive benchmark for VLMs.

  • ViTamin integrates the strengths of ConvNets and Transformers, featuring Mobile Convolution Blocks (MBConv) and Transformer Blocks (TFB) for improved efficiency and performance.

  • The study reveals ViTamin’s superior performance in zero-shot ImageNet accuracy and its scalability, setting a foundation for future research in vision-language tasks.

ViTamin: Advancing Vision Models for Vision-Language Tasks with New Architectures and Training Protocols

Introduction

The paper introduces ViTamin, a novel architecture designed for vision-language models (VLMs), aiming to optimize vision models in the context of large-scale image-text pair training. Distinct from the prevalent use of vanilla Vision Transformers (ViTs) as the default image encoder in VLMs, ViTamin proposes a tailored solution to address scalability and performance under the contrastive language-image pretraining (CLIP) framework. The study meticulously reevaluates existing vision models, including ViTs, ConvNets, and hybrid architectures, across different scales of model parameters and training data sizes. It culminates in the development of ViTamin, showcasing remarkable improvements over existing models in zero-shot classification tasks and proposing a comprehensive benchmark for future vision model assessments in VLM tasks.

Reevaluating Vision Models in the CLIP Setting

The paper starts by challenging the status quo of employing vanilla ViTs for image encoding in VLMs. It argues that, despite the effectiveness of ViTs, the ever-growing datasets for VLMs necessitate a reassessment of architectural choices, including ConvNets and hybrid models. The authors establish a new benchmarking protocol under the CLIP framework, analyzing model performance across various scales of parameters and training data; a minimal sketch of the zero-shot evaluation at the heart of this protocol is given after the list below. Key findings from their comprehensive analysis indicate:

  • Scaling up the training data improves performance for all architectures and model sizes, with ViTs scaling slightly better than the other architectures with respect to model parameters.
  • Higher feature resolution from smaller patch sizes or fine-grained convolutions contributes positively to model performance.
  • Hybrid models, exemplified by CoAtNet, showcase superior performance to pure ConvNet or Transformer architectures, although scalability challenges arise with the largest CoAtNet variant.
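
The core measurement in this protocol is zero-shot classification with a contrastively trained image-text model. The sketch below uses the open_clip library (which also provides the training scheme referenced in the paper); the model name, pretrained tag, prompt template, and class list are placeholder assumptions. It encodes one text prompt per class, encodes each image, and predicts the class whose text embedding has the highest cosine similarity with the image embedding.

```python
# Minimal sketch of CLIP-style zero-shot classification with open_clip.
# Model name, pretrained tag, prompts, and class names are placeholders.
import torch
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

class_names = ["golden retriever", "tabby cat", "sports car"]  # placeholder classes
prompts = [f"a photo of a {name}" for name in class_names]

with torch.no_grad():
    # One text embedding per class, L2-normalized so dot products are cosine similarities.
    text_feats = model.encode_text(tokenizer(prompts))
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)


def zero_shot_predict(pil_image):
    """Return the index of the class whose text embedding best matches the image."""
    image = preprocess(pil_image).unsqueeze(0)
    with torch.no_grad():
        img_feat = model.encode_image(image)
        img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
        logits = img_feat @ text_feats.T  # shape (1, num_classes)
    return logits.argmax(dim=-1).item()
```

Zero-shot accuracy on a benchmark is then simply the fraction of images whose predicted class matches the ground-truth label; the paper applies this style of evaluation across its full benchmark suite and model/data scales.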

ViTamin: Design and Highlights

Building on these insights, ViTamin introduces a strategic architectural design that integrates the strengths of ConvNets and Transformers. The model is structured into a three-stage network with an initial convolutional stem, followed by Mobile Convolution Blocks (MBConv) in the early stages for local feature extraction, and culminating in Transformer Blocks (TFB) for global context modeling. Key innovations in ViTamin include:

  • MBConv-LN and TFB-GeGLU Blocks: At the micro level, ViTamin refines the MBConv and TFB blocks for better performance and efficiency. MBConv-LN simplifies the conventional MBConv block by using a single LayerNorm, while TFB-GeGLU employs Gated Linear Units in its FFN, improving accuracy with fewer parameters (a rough sketch of both block designs follows this list).
  • Scalability with Simplified Design: ViTamin demonstrates significant scalability both in terms of data volume and model size. Its design allows for effective performance improvement with increased training data and supports straightforward scaling rules for creating larger model variants.
  • Superior Performance: ViTamin notably outperforms its ViT counterparts in zero-shot ImageNet accuracy and demonstrates robust performance across 60 diverse benchmarks. Impressively, ViTamin-XL, with significantly fewer parameters, achieves higher ImageNet zero-shot accuracy than a much larger EVA-E model.
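
As a rough illustration of the two block designs named above, the PyTorch sketch below shows an inverted-bottleneck convolution block normalized by a single LayerNorm, and a pre-norm Transformer block whose FFN uses a GeGLU gate in place of the usual MLP. Expansion ratios, hidden widths, and the exact placement of normalization are assumptions for illustration, not a faithful reproduction of ViTamin's blocks.

```python
# Illustrative sketches of an MBConv block with a single LayerNorm (MBConv-LN)
# and a Transformer block with a GeGLU FFN (TFB-GeGLU). Ratios are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MBConvLN(nn.Module):
    """Inverted bottleneck: LN -> 1x1 expand -> 3x3 depthwise -> 1x1 project, with residual."""
    def __init__(self, dim, expand=4):
        super().__init__()
        hidden = dim * expand
        self.norm = nn.LayerNorm(dim)                 # the single LayerNorm
        self.expand = nn.Conv2d(dim, hidden, 1)
        self.dw = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)
        self.project = nn.Conv2d(hidden, dim, 1)
        self.act = nn.GELU()

    def forward(self, x):                             # x: (B, C, H, W)
        y = self.norm(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)  # LN over channels
        y = self.act(self.expand(y))
        y = self.act(self.dw(y))
        return x + self.project(y)                    # residual connection


class GeGLUFFN(nn.Module):
    """FFN with a gated linear unit: out = W_out( GELU(W_gate x) * (W_value x) )."""
    def __init__(self, dim, hidden):
        super().__init__()
        self.gate = nn.Linear(dim, hidden)
        self.value = nn.Linear(dim, hidden)
        self.out = nn.Linear(hidden, dim)

    def forward(self, x):                             # x: (B, N, C) token sequence
        return self.out(F.gelu(self.gate(x)) * self.value(x))


class TFBGeGLU(nn.Module):
    """Pre-norm Transformer block: self-attention followed by a GeGLU FFN."""
    def __init__(self, dim, heads=8, ffn_hidden=None):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = GeGLUFFN(dim, ffn_hidden or dim * 3)  # narrower than a 4x MLP

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.ffn(self.norm2(x))


print(TFBGeGLU(256)(torch.randn(2, 196, 256)).shape)      # torch.Size([2, 196, 256])
print(MBConvLN(256)(torch.randn(2, 256, 14, 14)).shape)   # torch.Size([2, 256, 14, 14])
```

Using a narrower hidden width in the gated FFN keeps its parameter count at or below that of a standard 4x MLP, which is consistent with the paper's claim of improved accuracy with fewer FFN parameters.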

Implications and Future Directions

The introduction of ViTamin and its promising results prompt a reassessment of architectural preferences in the development of VLMs. The findings encourage exploring beyond the ViT archetype, considering hybrid models that leverage both convolutional and transformer strengths. Additionally, the scalability of ViTamin, both in data and model size, underscores the potential for more resource-efficient yet highly performant VLM architectures. As the paper proposes a new suite of benchmarks for VLMs, it sets a foundation for future research to build upon, aiming for models that excel in a broader range of vision-language tasks, including open-vocabulary detection and segmentation and large multi-modal models.

In conclusion, ViTamin marks a significant step forward in optimizing vision models within the VLM paradigm. Its architectural innovations, coupled with the comprehensive benchmarking effort, not only advance the state of the art but also broaden the horizon for future explorations of AI's visual and linguistic capabilities.
