Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture

(2301.08243)
Published Jan 19, 2023 in cs.CV, cs.AI, cs.LG, and eess.IV

Abstract

This paper demonstrates an approach for learning highly semantic image representations without relying on hand-crafted data-augmentations. We introduce the Image-based Joint-Embedding Predictive Architecture (I-JEPA), a non-generative approach for self-supervised learning from images. The idea behind I-JEPA is simple: from a single context block, predict the representations of various target blocks in the same image. A core design choice to guide I-JEPA towards producing semantic representations is the masking strategy; specifically, it is crucial to (a) sample target blocks with sufficiently large scale (semantic), and to (b) use a sufficiently informative (spatially distributed) context block. Empirically, when combined with Vision Transformers, we find I-JEPA to be highly scalable. For instance, we train a ViT-Huge/14 on ImageNet using 16 A100 GPUs in under 72 hours to achieve strong downstream performance across a wide range of tasks, from linear classification to object counting and depth prediction.

Figure: The I-JEPA predictor captures high-level objects with correct pose while discarding low-level details and background information.

Overview

  • The paper introduces the Image-based Joint-Embedding Predictive Architecture (I-JEPA), a novel self-supervised learning method that learns high-level semantic image representations without hand-crafted data augmentations.

  • I-JEPA uses a predictive framework with a multi-block masking strategy to predict the representations of various target blocks within an image from a single context block, enhancing its ability to learn semantic features.

  • Empirical evaluations show that I-JEPA outperforms pixel-reconstruction methods on high-level semantic tasks and performs strongly on low-level vision tasks such as object counting and depth prediction, while being more computationally efficient and scalable.

An Analysis of the Image-based Joint-Embedding Predictive Architecture (I-JEPA)

The paper presents a novel approach to self-supervised learning from images by introducing the Image-based Joint-Embedding Predictive Architecture (I-JEPA). The primary contribution of this work is to learn high-level semantic image representations without the need for hand-crafted data augmentations, a typical requirement in existing methods.

Key Contributions

  1. Innovative Predictive Framework: I-JEPA aims to predict the representations of various target blocks within an image from a single context block. This differentiates it from generative methods that predict in pixel space and enhances the ability to learn semantic features.
  2. Masking Strategy: The authors propose a multi-block masking strategy in which the context block is sufficiently informative and spatially distributed, while the target blocks are sampled at a large enough scale to encapsulate semantic information (see the sampler sketch after this list).
  3. Scalability and Efficiency: When coupled with Vision Transformers (ViTs), I-JEPA demonstrates impressive scalability. For instance, a ViT-Huge/14 model can be trained on ImageNet using 16 A100 GPUs in under 72 hours, outperforming prior methods in terms of computational efficiency.
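
A minimal sketch of how such a multi-block sampler could look is given below; the block counts, scale ranges, and aspect-ratio ranges are illustrative assumptions rather than the paper's exact hyperparameters.

```python
import random

def sample_block(grid_h, grid_w, scale_range, aspect_range):
    """Sample a rectangular block of patch indices on a (grid_h x grid_w) patch grid."""
    scale = random.uniform(*scale_range)     # fraction of the image covered by the block
    aspect = random.uniform(*aspect_range)   # height / width ratio of the block
    area = scale * grid_h * grid_w
    h = max(1, min(grid_h, round((area * aspect) ** 0.5)))
    w = max(1, min(grid_w, round((area / aspect) ** 0.5)))
    top = random.randint(0, grid_h - h)
    left = random.randint(0, grid_w - w)
    return {r * grid_w + c for r in range(top, top + h) for c in range(left, left + w)}

def sample_masks(grid_h=14, grid_w=14, num_targets=4):
    # Several moderately large target blocks, big enough to carry semantic content ...
    targets = [sample_block(grid_h, grid_w, (0.15, 0.2), (0.75, 1.5)) for _ in range(num_targets)]
    # ... and one large, spatially distributed context block, with any patches that
    # overlap a target removed so the context never "sees" what it must predict.
    context = sample_block(grid_h, grid_w, (0.85, 1.0), (1.0, 1.0))
    context -= set().union(*targets)
    return context, targets
```

In the method itself, the patches indexed by `context` are fed to the context encoder, while each target set indexes representations produced by the target encoder.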

Empirical Evaluations

The empirical evaluations highlight several strengths of I-JEPA:

  • Semantic Representations: I-JEPA outperforms pixel-reconstruction methods such as Masked Autoencoders (MAE) in linear probing, semi-supervised learning, and transfer tasks. It avoids the computational overhead and biases introduced by hand-crafted augmentations.
  • Versatility: Besides performing well on high-level semantic tasks, I-JEPA also exhibits superior performance on low-level vision tasks like object counting and depth prediction, proving its broader applicability across diverse tasks.
  • Efficiency: Pre-training with I-JEPA on a ViT-H/14 model requires less than 1,200 GPU hours, making it over 2.5 times faster than iBOT and over 10 times more efficient than MAE.

Detailed Analysis

Joint-Embedding Predictive Architectures (JEPAs) vs. Other Architectures

  1. Joint-Embedding Architectures (JEAs):

    • Aim to produce similar embeddings for compatible inputs; they are prone to representation collapse, which must be managed through architectural asymmetry or additional regularization techniques.
  2. Generative Architectures:

    • Focus on reconstructing a signal from a compatible one, often in pixel space. Although straightforward, these methods do not generally produce highly semantic representations.
  3. JEPAs:

    • Learn to predict embeddings in representation space rather than reconstructing the signal directly. Computing the loss in representation space lets the model discard irrelevant low-level detail while still producing semantically rich predictions, and asymmetry between the context and target branches keeps representations from collapsing.
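
Schematically, with notation that is ours rather than the paper's, both predictive objectives compare a prediction against a target but differ in the space where the error is measured. For a context view $x_{\text{ctx}}$ and a target $x_{\text{tgt}}$, a generative model decodes back to the signal, whereas a JEPA predicts the target's embedding:

$$\mathcal{L}_{\text{gen}} = \big\lVert\, d_\phi\big(f_\theta(x_{\text{ctx}})\big) - x_{\text{tgt}} \,\big\rVert^2, \qquad \mathcal{L}_{\text{JEPA}} = \big\lVert\, g_\phi\big(f_\theta(x_{\text{ctx}})\big) - \bar f_\theta(x_{\text{tgt}}) \,\big\rVert^2,$$

where $f_\theta$ is the context encoder, $d_\phi$ a decoder back to pixel space, $g_\phi$ the predictor, and $\bar f_\theta$ the target encoder.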

Method

The core methodological contributions include a Vision Transformer-based architecture with distinct components:

  • Context Encoder: Processes the patches of the context block, with any patches overlapping the target blocks excluded.
  • Predictor: Utilizes the output of the context encoder and positional mask tokens to predict target block representations.
  • Target Encoder: Encodes the entire image; masking its outputs yields the target-block representations that serve as the predictor's targets.

The training involves minimizing the average $L_2$ distance between predicted and actual representations of image blocks, with the target encoder’s parameters updated via an exponential moving average.
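
A minimal, self-contained sketch of this training step is shown below (not the authors' implementation): the encoders and predictor are stand-in MLPs rather than Vision Transformers, and the context is pooled crudely into a single summary vector, but the loss in representation space and the exponential-moving-average update of the target encoder follow the description above.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

dim, num_patches = 128, 196   # e.g. a 14x14 patch grid for a 224px image with 16px patches

def mlp(in_dim, out_dim):
    return nn.Sequential(nn.Linear(in_dim, out_dim), nn.GELU(), nn.Linear(out_dim, out_dim))

context_encoder = mlp(dim, dim)                   # encodes the visible context patches
predictor = mlp(2 * dim, dim)                     # predicts targets from context + position
target_encoder = copy.deepcopy(context_encoder)   # EMA copy; receives no gradients
for p in target_encoder.parameters():
    p.requires_grad_(False)

pos_embed = nn.Parameter(torch.randn(num_patches, dim))  # positional tokens for masked targets
opt = torch.optim.AdamW(
    list(context_encoder.parameters()) + list(predictor.parameters()) + [pos_embed], lr=1e-3
)

def training_step(patches, context_idx, target_idx, ema=0.996):
    """patches: (B, num_patches, dim) patch embeddings for a batch of images."""
    # 1) Encode only the context block (patches overlapping the targets are excluded upstream).
    ctx = context_encoder(patches[:, context_idx])
    ctx_summary = ctx.mean(dim=1, keepdim=True)           # crude pooling, just for this sketch

    # 2) Predict each target patch representation from the context plus its position token.
    pos = pos_embed[target_idx].unsqueeze(0).expand(patches.size(0), -1, -1)
    pred = predictor(torch.cat([ctx_summary.expand_as(pos), pos], dim=-1))

    # 3) Targets come from the EMA target encoder applied to the full image.
    with torch.no_grad():
        tgt = target_encoder(patches)[:, target_idx]

    # 4) Average (squared) L2 distance between predicted and target representations.
    loss = F.mse_loss(pred, tgt)
    opt.zero_grad()
    loss.backward()
    opt.step()

    # 5) Exponential moving average update of the target encoder's parameters.
    with torch.no_grad():
        for pt, pc in zip(target_encoder.parameters(), context_encoder.parameters()):
            pt.mul_(ema).add_(pc, alpha=1 - ema)
    return loss.item()

# Example usage with random patch embeddings and hand-picked index sets:
# patches = torch.randn(8, num_patches, dim)
# training_step(patches, context_idx=list(range(98)), target_idx=list(range(150, 170)))
```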

Results on Image Classification

  1. ImageNet-1K:

    • I-JEPA significantly improves linear probing performance over prior approaches like MAE and data2vec. ViT-H/16 models trained with I-JEPA match or surpass such methods without relying on hand-crafted augmentations.
  2. Low-Shot ImageNet-1K:

    • Demonstrates superior performance using just 1% of the available labels, outperforming methods requiring extensive data augmentations.
  3. Transfer Learning:

    • On transfer to other image classification benchmarks, I-JEPA outperforms pixel-reconstruction methods such as MAE without relying on hand-crafted augmentations.

Local Prediction Tasks

I-JEPA also excels at object counting and depth prediction, outperforming generative and view-invariance-based methods. This highlights its ability to capture both high-level semantics and low-level structural detail across a diverse set of vision tasks.

Scalability

The architecture’s efficiency and scalability are prominently showcased:

  • Computational Efficiency: Converges in fewer pretraining epochs than MAE and, unlike multi-view methods such as iBOT, processes only a single view of each image.
  • Data and Model Scaling: Benefits from pretraining on larger datasets (e.g., ImageNet-22K) and larger model sizes (e.g., ViT-G/16), leading to improved performance across tasks.

Implications and Future Directions

I-JEPA’s ability to produce semantic image representations without hand-crafted augmentations may considerably impact the future of self-supervised learning. Its efficiency and versatility suggest it can be adapted for various modalities beyond images, like audio and text. Future research may explore leveraging this architecture across larger, more heterogeneous datasets, and fine-tuning for specific downstream applications in computer vision and potentially other fields.

In conclusion, I-JEPA presents a significant advancement in self-supervised learning, offering a scalable, efficient, and versatile framework that reduces reliance on manual data augmentations while delivering superior semantic representations.
