
Accessing Vision Foundation Models at ImageNet-level Costs

(arXiv:2407.10366)

Published Jul 15, 2024 in cs.CV, cs.AI, and cs.LG

Abstract

Vision foundation models are renowned for their generalization ability, owing to massive training data. Nevertheless, they demand tremendous training resources, and their training data is often inaccessible (e.g., CLIP, DINOv2), posing great challenges to developing derivatives that could advance research in this field. In this work, we offer a very simple and general solution, named Proteus, to distill foundation models into smaller equivalents on ImageNet-1K without access to the original training data. Specifically, we remove the designs in conventional knowledge distillation settings that result in dataset bias and present three levels of training objectives, i.e., token, patch, and feature, to maximize the efficacy of knowledge transfer. In this manner, Proteus is trained at ImageNet-level costs with surprising generalization ability, making the training of foundation models more accessible to the broader research community. Leveraging DINOv2-g/14 as the teacher, Proteus-L/14 matches the performance of the oracle method DINOv2-L/14 (142M training images) across 15 benchmarks and outperforms other vision foundation models, including CLIP-L/14 (400M), OpenCLIP-L/14 (400M/2B), and SynCLR-L/14 (600M).

Proteus-S/14 outperforms OpenCLIP-B/32 on 15 benchmarks, closely trailing DINOv2-S/14.

Overview

  • The paper introduces 'Proteus', a framework for distilling vision foundation models into smaller derivatives without requiring access to the original training data, thus democratizing these models for the broader research community.

  • Proteus employs a knowledge distillation approach that minimizes dataset bias and incorporates training objectives at three levels (token, patch, and feature) to enhance knowledge transfer and generalization.

  • Empirical validation shows that Proteus matches or surpasses existing foundation models across 15 benchmarks, demonstrating its effectiveness and scalability and making it a valuable tool for resource-limited researchers.

Accessing Vision Foundation Models at ImageNet-level Costs

The paper, "Accessing Vision Foundation Models at ImageNet-level Costs," addresses a critical issue within the domain of vision foundation models: the significant demand for extensive resources and inaccessible training data. These challenges pose substantial barriers for researchers aiming to develop derivative models. The proposed solution, termed "Proteus," leverages a simple and general framework for distilling foundation models into smaller equivalents without access to the original training data, making the process more accessible to the broader research community.

Introduction

Vision foundation models such as CLIP, DINOv2, and SynCLR have demonstrated exceptional generalization across various computer vision tasks, attributable to pre-training on vast and diverse datasets. However, the primary obstacles to wider adoption and further development of these models are their immense computational requirements and inaccessible training datasets. Once-standard datasets such as ImageNet-1K are now considered relatively small in scale and are therefore used less frequently in the era of massive foundation models.

Methodology

Proteus addresses these challenges by proposing a knowledge distillation framework that operates effectively on smaller datasets like ImageNet-1K. The key innovations in Proteus are the elimination of conventional knowledge distillation designs that introduce dataset bias and the introduction of multi-level training objectives—namely token-level, patch-level, and feature-level—to enhance the efficacy of knowledge transfer.
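
At a high level, the three objectives are combined into a single distillation loss. A sketch, assuming simple summation (the paper may weight the individual terms differently):

```latex
\mathcal{L}_{\mathrm{Proteus}} = \mathcal{L}_{\mathrm{token}} + \mathcal{L}_{\mathrm{patch}} + \mathcal{L}_{\mathrm{feat}}
```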

Proxy Dataset Selection

A notable strength of Proteus is its robustness to the choice of proxy dataset. Even when training on a publicly available resource such as ImageNet-1K, the framework mitigates dataset bias, enabling the distilled models to generalize to unseen data. This is achieved by discarding the one-hot, label-based cross-entropy loss in favor of distillation on intermediate features, which reduces the risk of overfitting to a specific dataset distribution.
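
As a schematic contrast (the notation here is ours, not the paper's): the conventional objective depends on the proxy dataset's labels $y$, while the distillation objective depends only on teacher features, so no label bias from the proxy dataset enters training. Here $h_s$ and $h_t$ denote student and teacher intermediate features and $W$ a learned projection:

```latex
\underbrace{\mathcal{L}_{\mathrm{CE}} = -\textstyle\sum_{c} y_c \log p_c}_{\text{dropped: depends on proxy labels}}
\qquad\longrightarrow\qquad
\underbrace{\mathcal{L}_{\mathrm{feat}} = \lVert h_s(x)\,W - h_t(x) \rVert_2^2}_{\text{kept: depends only on the teacher}}
```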

Proxy Task Construction

Proteus constructs a comprehensive proxy task by incorporating learning objectives at three levels (a code sketch combining them follows the list):

  1. Token-level Objective: This involves aligning the classification token of the student model with that of the teacher through L2 distance minimization.
  2. Patch-level Objective: Inspired by masked image modeling, this objective masks a subset of input patches and trains the student to reconstruct the teacher's features at the masked positions, encouraging more generalized representations.
  3. Feature-level Objective: To support dense prediction tasks, the framework minimizes the L2 distance between the intermediate features of the teacher and student models.
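
A minimal PyTorch-style sketch of the three objectives. The function signature, projection heads, masking scheme, and equal loss weights are our assumptions for illustration; the paper's exact heads, masking ratio, and weighting may differ, and for brevity this sketch reuses one student forward pass rather than separate masked and unmasked passes.

```python
import torch
import torch.nn.functional as F

def proteus_losses(student_cls, student_patches,
                   teacher_cls, teacher_patches,
                   mask, proj_cls, proj_patch):
    """Three-level distillation loss (hypothetical signature, not the paper's API).

    student_cls:     (B, D_s)    student [CLS] token
    student_patches: (B, N, D_s) student patch tokens
    teacher_cls:     (B, D_t)    teacher [CLS] token
    teacher_patches: (B, N, D_t) teacher patch tokens
    mask:            (B, N) bool True at masked patch positions
    proj_cls, proj_patch: linear layers mapping D_s -> D_t
    """
    # Token-level: align the classification tokens via L2 distance.
    loss_token = F.mse_loss(proj_cls(student_cls), teacher_cls)

    # Patch-level: reconstruct the teacher's features at masked positions,
    # in the spirit of masked image modeling.
    loss_patch = F.mse_loss(proj_patch(student_patches)[mask],
                            teacher_patches[mask])

    # Feature-level: match intermediate features at all positions,
    # which supports dense prediction tasks such as segmentation.
    loss_feat = F.mse_loss(proj_patch(student_patches), teacher_patches)

    return loss_token + loss_patch + loss_feat

# Toy usage with random tensors (shapes are illustrative only).
B, N, Ds, Dt = 2, 196, 384, 1536
proj_cls, proj_patch = torch.nn.Linear(Ds, Dt), torch.nn.Linear(Ds, Dt)
loss = proteus_losses(torch.randn(B, Ds), torch.randn(B, N, Ds),
                      torch.randn(B, Dt), torch.randn(B, N, Dt),
                      torch.rand(B, N) < 0.5, proj_cls, proj_patch)
```

Using plain L2 feature matching throughout keeps the recipe simple and label-free, which is consistent with the paper's goal of avoiding label-induced dataset bias.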

Empirical Validation

The strength of Proteus is empirically validated under various experimental setups:

  1. ImageNet Linear Evaluation: Under linear probing, Proteus surpasses CLIP and a range of self-supervised models, including ones pre-trained on significantly larger datasets.
  2. Fine-grained Classification: On 12 fine-grained classification benchmarks, Proteus matches or outperforms models trained on extensive datasets.
  3. Dense Prediction Tasks: The framework shows superior or comparable performance on tasks like semantic segmentation and depth estimation.

Scalability and Generalization

Proteus exhibits remarkable scalability, performing effectively even as model size increases. When distilling from larger teachers such as DINOv2-L and DINOv2-g, Proteus maintains competitive accuracy and generalization, validating its design principles.

Moreover, when leveraging foundation models trained with different learning objectives and on varied datasets—such as CLIP and SynCLR—Proteus achieves comparable performance, underscoring its robustness and adaptability.

Practical and Theoretical Implications

Proteus carries significant implications for both practical applications and theoretical explorations in AI:

  • Practical Accessibility: It enables researchers with limited resources to leverage state-of-the-art vision foundation models effectively.
  • Enhanced Generalizability: By mitigating dataset bias and incorporating multi-level learning objectives, Proteus-derived models generalize well across various tasks and datasets.
  • Foundation Model Comprehension: The framework provides insights into the nature of knowledge transfer from large foundation models to smaller, task-specific models.

Conclusion

The Proteus framework presents a compelling solution for democratizing access to vision foundation models. By circumventing the need for vast datasets and extensive computational resources, it broadens the research community's ability to build on state-of-the-art models. Future research may extend these principles to other modalities, such as natural language processing and multimodal models, further fostering the development and accessibility of foundation models across research domains.
