
Abstract

Recent breakthroughs in vision-language models (VLMs) open a new chapter for the vision community. Thanks to training on large-scale Internet image-text pairs, VLMs provide stronger and more generalizable feature embeddings than ImageNet-pretrained models. However, despite these achievements, vanilla Vision Transformers (ViTs) remain the default choice for the image encoder. Although the pure Transformer has proven its effectiveness for text encoding, it remains questionable whether the same holds for image encoding, especially considering that the many network designs proposed on the ImageNet benchmark are rarely studied in VLMs. Because of the smaller data and model scales, the original conclusions about model design drawn on ImageNet can be limited and biased. In this paper, we aim to build an evaluation protocol for vision models in the vision-language era under the contrastive language-image pretraining (CLIP) framework. We provide a comprehensive way to benchmark different vision models, covering their zero-shot performance and scalability in both model and training data sizes. To this end, we introduce ViTamin, a new vision model tailored for VLMs. ViTamin-L significantly outperforms ViT-L by 2.0% in ImageNet zero-shot accuracy when using the same publicly available DataComp-1B dataset and the same OpenCLIP training scheme. ViTamin-L also presents promising results on 60 diverse benchmarks, including classification, retrieval, open-vocabulary detection and segmentation, and large multi-modal models. When further scaling up the model size, our ViTamin-XL, with only 436M parameters, attains 82.9% ImageNet zero-shot accuracy, surpassing the 82.0% achieved by EVA-E, which has ten times more parameters (4.4B).

Figure: The ViTamin architecture combines a convolutional stem, MBConv blocks, and Transformer Blocks, and outputs a feature map with stride 16.
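
To make the stage layout in the figure concrete, below is a minimal PyTorch sketch of the macro design described by the caption: a convolutional stem, convolutional (MBConv-style) stages, and a final Transformer stage operating on the stride-16 feature map. The channel widths, depths, and block internals here are illustrative placeholders, not ViTamin's actual configuration or released code.

```python
# Minimal sketch of a three-stage backbone: conv stem -> conv stages -> Transformer stage.
# All widths/depths are placeholder assumptions for illustration only.
import torch
import torch.nn as nn


class ConvStem(nn.Module):
    """Two stride-2 convolutions take the input from stride 1 to stride 4."""
    def __init__(self, in_ch=3, out_ch=64):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(in_ch, out_ch // 2, 3, stride=2, padding=1),
            nn.GELU(),
            nn.Conv2d(out_ch // 2, out_ch, 3, stride=2, padding=1),
        )

    def forward(self, x):
        return self.stem(x)  # (B, out_ch, H/4, W/4)


def conv_stage(ch, depth):
    """Placeholder convolutional stage: depthwise + pointwise conv blocks."""
    blocks = []
    for _ in range(depth):
        blocks.append(nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1, groups=ch),  # depthwise 3x3
            nn.GELU(),
            nn.Conv2d(ch, ch, 1),                        # pointwise 1x1
        ))
    return nn.Sequential(*blocks)


class ThreeStageBackbone(nn.Module):
    """Stem (stride 4) -> conv stages (stride 4, 8) -> Transformer stage at stride 16."""
    def __init__(self, dims=(64, 128, 256), depths=(2, 2, 4)):
        super().__init__()
        self.stem = ConvStem(3, dims[0])
        self.stage1 = conv_stage(dims[0], depths[0])                      # stride 4
        self.down1 = nn.Conv2d(dims[0], dims[1], 3, stride=2, padding=1)  # -> stride 8
        self.stage2 = conv_stage(dims[1], depths[1])
        self.down2 = nn.Conv2d(dims[1], dims[2], 3, stride=2, padding=1)  # -> stride 16
        layer = nn.TransformerEncoderLayer(
            d_model=dims[2], nhead=8, dim_feedforward=4 * dims[2], batch_first=True)
        self.stage3 = nn.TransformerEncoder(layer, num_layers=depths[2])

    def forward(self, x):
        x = self.stage1(self.stem(x))          # (B, C1, H/4,  W/4)
        x = self.stage2(self.down1(x))         # (B, C2, H/8,  W/8)
        x = self.down2(x)                      # (B, C3, H/16, W/16)
        b, c, h, w = x.shape
        tokens = self.stage3(x.flatten(2).transpose(1, 2))  # global self-attention
        return tokens.transpose(1, 2).reshape(b, c, h, w)   # stride-16 feature map


feat = ThreeStageBackbone()(torch.randn(1, 3, 224, 224))
print(feat.shape)  # torch.Size([1, 256, 14, 14])
```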

Overview

  • ViTamin introduces a novel architecture for vision-language models (VLMs), aiming to optimize vision models with scalable and high-performance solutions under the contrastive language-image pretraining (CLIP) framework.

  • The paper reevaluates existing vision models, including Vision Transformers (ViTs), ConvNets, and Hybrid architectures, establishing a comprehensive benchmark for VLMs.

  • ViTamin integrates the strengths of ConvNets and Transformers, featuring Mobile Convolution Blocks (MBConv) and Transformer Blocks (TFB) for improved efficiency and performance.

  • The study reveals ViTamin’s superior performance in zero-shot ImageNet accuracy and its scalability, setting a foundation for future research in vision-language tasks.

ViTamin: Advancing Vision Models for Vision-Language Tasks with New Architectures and Training Protocols

Introduction

The paper introduces ViTamin, a novel architecture designed for vision-language models (VLMs), aiming to optimize vision models in the context of large-scale image-text pair training. Distinct from the prevalent use of vanilla Vision Transformers (ViTs) as the default image encoder in VLMs, ViTamin proposes a tailored solution to address scalability and performance under the contrastive language-image pretraining (CLIP) framework. The study meticulously reevaluates existing vision models, including ViTs, ConvNets, and hybrid architectures, across different scales of model parameters and training data sizes. It culminates in the development of ViTamin, showcasing remarkable improvements over existing models in zero-shot classification tasks and proposing a comprehensive benchmark for future vision model assessments in VLM tasks.

Reevaluating Vision Models in the CLIP Setting

The paper starts by challenging the status quo of employing vanilla ViTs for image encoding in VLMs. It argues that, despite the effectiveness of ViTs, the ever-growing datasets for VLMs necessitate a reassessment of architectural choices, including ConvNets and hybrid models. The authors establish a new benchmarking protocol under the CLIP framework, analyzing model performance across various scales of parameters and training data; a minimal sketch of the zero-shot evaluation at the heart of this protocol is given after the list below. Key findings from their comprehensive analysis indicate:

  • Scaling up the training data improves performance for all architectures and model sizes, with ViTs scaling slightly better than the other architectures with respect to model parameters.
  • Higher feature resolution from smaller patch sizes or fine-grained convolutions contributes positively to model performance.
  • Hybrid models, exemplified by CoAtNet, showcase superior performance to pure ConvNet or Transformer architectures, although scalability challenges arise with the largest CoAtNet variant.
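
The core measurement in this protocol is zero-shot classification with a contrastively trained image-text model. The sketch below uses the open_clip library (which also provides the training scheme referenced in the paper); the model name, pretrained tag, prompt template, and class list are placeholder assumptions. It encodes one text prompt per class, encodes each image, and predicts the class whose text embedding has the highest cosine similarity with the image embedding.

```python
# Minimal sketch of CLIP-style zero-shot classification with open_clip.
# Model name, pretrained tag, prompts, and class names are placeholders.
import torch
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

class_names = ["golden retriever", "tabby cat", "sports car"]  # placeholder classes
prompts = [f"a photo of a {name}" for name in class_names]

with torch.no_grad():
    # One text embedding per class, L2-normalized so dot products are cosine similarities.
    text_feats = model.encode_text(tokenizer(prompts))
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)


def zero_shot_predict(pil_image):
    """Return the index of the class whose text embedding best matches the image."""
    image = preprocess(pil_image).unsqueeze(0)
    with torch.no_grad():
        img_feat = model.encode_image(image)
        img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
        logits = img_feat @ text_feats.T  # shape (1, num_classes)
    return logits.argmax(dim=-1).item()
```

Zero-shot accuracy on a benchmark is then simply the fraction of images whose predicted class matches the ground-truth label; the paper applies this style of evaluation across its full benchmark suite and model/data scales.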

ViTamin: Design and Highlights

Building on these insights, ViTamin introduces a strategic architectural design that integrates the strengths of ConvNets and Transformers. The model is structured into a three-stage network with an initial convolutional stem, followed by Mobile Convolution Blocks (MBConv) in the early stages for local feature extraction, and culminating in Transformer Blocks (TFB) for global context modeling. Key innovations in ViTamin include:

  • MBConv-LN and TFB-GeGLU Blocks: At the micro level, ViTamin refines the MBConv and TFB blocks for better performance and efficiency. MBConv-LN simplifies the conventional MBConv block by using a single LayerNorm, while TFB-GeGLU employs Gated Linear Units in its FFN, improving accuracy with fewer parameters (a rough sketch of both block designs follows this list).
  • Scalability with Simplified Design: ViTamin demonstrates significant scalability both in terms of data volume and model size. Its design allows for effective performance improvement with increased training data and supports straightforward scaling rules for creating larger model variants.
  • Superior Performance: ViTamin notably outperforms its ViT counterparts in zero-shot ImageNet accuracy and demonstrates robust performance across 60 diverse benchmarks. Impressively, ViTamin-XL, with significantly fewer parameters, achieves higher ImageNet zero-shot accuracy than a much larger EVA-E model.
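
As a rough illustration of the two block designs named above, the PyTorch sketch below shows an inverted-bottleneck convolution block normalized by a single LayerNorm, and a pre-norm Transformer block whose FFN uses a GeGLU gate in place of the usual MLP. Expansion ratios, hidden widths, and the exact placement of normalization are assumptions for illustration, not a faithful reproduction of ViTamin's blocks.

```python
# Illustrative sketches of an MBConv block with a single LayerNorm (MBConv-LN)
# and a Transformer block with a GeGLU FFN (TFB-GeGLU). Ratios are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MBConvLN(nn.Module):
    """Inverted bottleneck: LN -> 1x1 expand -> 3x3 depthwise -> 1x1 project, with residual."""
    def __init__(self, dim, expand=4):
        super().__init__()
        hidden = dim * expand
        self.norm = nn.LayerNorm(dim)                 # the single LayerNorm
        self.expand = nn.Conv2d(dim, hidden, 1)
        self.dw = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)
        self.project = nn.Conv2d(hidden, dim, 1)
        self.act = nn.GELU()

    def forward(self, x):                             # x: (B, C, H, W)
        y = self.norm(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)  # LN over channels
        y = self.act(self.expand(y))
        y = self.act(self.dw(y))
        return x + self.project(y)                    # residual connection


class GeGLUFFN(nn.Module):
    """FFN with a gated linear unit: out = W_out( GELU(W_gate x) * (W_value x) )."""
    def __init__(self, dim, hidden):
        super().__init__()
        self.gate = nn.Linear(dim, hidden)
        self.value = nn.Linear(dim, hidden)
        self.out = nn.Linear(hidden, dim)

    def forward(self, x):                             # x: (B, N, C) token sequence
        return self.out(F.gelu(self.gate(x)) * self.value(x))


class TFBGeGLU(nn.Module):
    """Pre-norm Transformer block: self-attention followed by a GeGLU FFN."""
    def __init__(self, dim, heads=8, ffn_hidden=None):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = GeGLUFFN(dim, ffn_hidden or dim * 3)  # narrower than a 4x MLP

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.ffn(self.norm2(x))


print(TFBGeGLU(256)(torch.randn(2, 196, 256)).shape)      # torch.Size([2, 196, 256])
print(MBConvLN(256)(torch.randn(2, 256, 14, 14)).shape)   # torch.Size([2, 256, 14, 14])
```

Using a narrower hidden width in the gated FFN keeps its parameter count at or below that of a standard 4x MLP, which is consistent with the paper's claim of improved accuracy with fewer FFN parameters.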

Implications and Future Directions

The introduction of ViTamin and its promising results prompt a reassessment of architectural preferences in the development of VLMs. The findings encourage exploring beyond the ViT archetype, considering hybrid models that leverage both convolutional and transformer strengths. Additionally, the scalability of ViTamin, both in data and model size, underscores the potential for more resource-efficient yet highly performant VLM architectures. As the paper proposes a new suite of benchmarks for VLMs, it sets a foundation for future research to build upon, aiming for models that excel in a broader range of vision-language tasks, including open-vocabulary detection and segmentation and large multi-modal models.

In conclusion, ViTamin marks a significant step forward in optimizing vision models within the VLM paradigm. Its architectural innovations, coupled with the comprehensive benchmarking effort, not only advance the state of the art but also broaden the horizon for future explorations of AI's visual and linguistic capabilities.
