Learning and Leveraging World Models in Visual Representation Learning (2403.00504v1)
Abstract: Joint-Embedding Predictive Architecture (JEPA) has emerged as a promising self-supervised approach that learns by leveraging a world model. While previously limited to predicting missing parts of an input, we explore how to generalize the JEPA prediction task to a broader set of corruptions. We introduce Image World Models (IWM), an approach that goes beyond masked image modeling and learns to predict the effect of global photometric transformations in latent space. We study the recipe for learning performant IWMs and show that it relies on three key aspects: conditioning, prediction difficulty, and capacity. Additionally, we show that the predictive world model learned by IWM can be adapted through fine-tuning to solve diverse tasks; a fine-tuned IWM world model matches or surpasses the performance of previous self-supervised methods. Finally, we show that learning with an IWM allows one to control the abstraction level of the learned representations, yielding invariant representations, as in contrastive methods, or equivariant representations, as in masked image modeling.
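To make the objective concrete, below is a minimal, self-contained PyTorch sketch of the kind of training step the abstract describes: an online encoder embeds a photometrically corrupted view, and a predictor, conditioned on the transformation parameters, must regress the latents an EMA target encoder produces for the clean view. All module names, sizes, and the placeholder corruption are illustrative assumptions for exposition, not the paper's exact architecture or augmentation recipe.

```python
# Hypothetical IWM-style training step (illustrative only; the paper's actual
# encoder, predictor, and transformation set are not reproduced here).
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class PatchEncoder(nn.Module):
    """Toy ViT-style encoder: conv patchify + a few transformer blocks."""
    def __init__(self, dim=128, patch=16, depth=2):
        super().__init__()
        self.patchify = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.blocks = nn.Sequential(*[
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
            for _ in range(depth)])

    def forward(self, x):                                 # x: (B, 3, H, W)
        z = self.patchify(x).flatten(2).transpose(1, 2)   # (B, N, D) tokens
        return self.blocks(z)

class ConditionedPredictor(nn.Module):
    """Predicts clean-view latents from corrupted-view latents, conditioned
    on an embedding of the photometric transformation parameters."""
    def __init__(self, dim=128, cond_dim=4, depth=2):
        super().__init__()
        self.cond = nn.Linear(cond_dim, dim)              # embed aug params
        self.blocks = nn.Sequential(*[
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
            for _ in range(depth)])

    def forward(self, z_src, aug_params):                 # aug_params: (B, cond_dim)
        # Append the transformation embedding as an extra token so the
        # predictor knows which corruption it must undo in latent space.
        c = self.cond(aug_params).unsqueeze(1)            # (B, 1, D)
        out = self.blocks(torch.cat([z_src, c], dim=1))
        return out[:, :-1]                                # drop the cond token

def iwm_step(enc, ema_enc, pred, x_clean, x_corrupt, aug_params):
    with torch.no_grad():                                 # EMA target, no grads
        z_tgt = ema_enc(x_clean)
    z_hat = pred(enc(x_corrupt), aug_params)
    return F.smooth_l1_loss(z_hat, z_tgt)                 # regress target latents

# Usage on random tensors (stand-ins for a clean image, a corrupted view, and
# e.g. four scalars describing the applied photometric transformation):
enc = PatchEncoder(); ema_enc = copy.deepcopy(enc).requires_grad_(False)
pred = ConditionedPredictor()
x = torch.randn(2, 3, 64, 64)
x_aug = x + 0.1 * torch.randn_like(x)   # placeholder "photometric" corruption
params = torch.randn(2, 4)
loss = iwm_step(enc, ema_enc, pred, x, x_aug, params)
loss.backward()
```

Under this reading, the conditioning token is what lets the predictor act as an equivariant world model: removing it would push the encoder toward transformation-invariant representations, which is the control over abstraction level the abstract highlights.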