PeCo: Perceptual Codebook for BERT Pre-training of Vision Transformers

Published 24 Nov 2021 in cs.CV and cs.LG | (2111.12710v3)

Abstract: This paper explores a better prediction target for BERT pre-training of vision transformers. We observe that current prediction targets disagree with human perception judgment. This contradiction motivates us to learn a perceptual prediction target. We argue that perceptually similar images should stay close to each other in the prediction target space. We surprisingly find one simple yet effective idea: enforcing perceptual similarity during the dVAE training. Moreover, we adopt a self-supervised transformer model for deep feature extraction and show that it works well for calculating perceptual similarity. We demonstrate that such learned visual tokens indeed exhibit better semantic meanings, and help pre-training achieve superior transfer performance in various downstream tasks. For example, we achieve 84.5% Top-1 accuracy on ImageNet-1K with ViT-B backbone, outperforming the competitive method BEiT by +1.3% under the same pre-training epochs. Our approach also gets significant improvement on object detection and segmentation on COCO and semantic segmentation on ADE20K. Equipped with a larger backbone ViT-H, we achieve the state-of-the-art ImageNet accuracy (88.3%) among methods using only ImageNet-1K data.


Authors (10)

Citations (222)


Summary

The paper introduces a perceptual codebook strategy that leverages deep perceptual features to generate semantically rich visual tokens.
It refines traditional VQ-VAE pre-training by integrating a perceptual loss to minimize feature discrepancies rather than pixel-wise errors.
Empirical evaluations show significant performance gains on benchmarks like ImageNet-1K, COCO, and ADE20K, demonstrating enhanced model efficacy.

An Expert Review of "PeCo: Perceptual Codebook for BERT Pre-training of Vision Transformers"

The paper "PeCo: Perceptual Codebook for BERT Pre-training of Vision Transformers" addresses self-supervised pre-training of vision transformers. It proposes a perceptual codebook to improve masked image modeling (MIM) within BERT-like frameworks, with the aim of aligning prediction targets with human perceptual judgments.

Core Contributions

The principal contribution is a perceptually guided way to derive the discrete visual tokens that serve as prediction targets for vision transformer pre-training. The authors identify a mismatch between current MIM prediction targets and human perception, which they attribute to pixel-wise loss functions that fail to capture structured visual content. To address this, they use deep perceptual features from a self-supervised Transformer model to define a perceptual loss for training the Vector Quantized Variational Autoencoder (VQ-VAE) that produces the tokens. The goal is for perceptually similar images to stay close to each other in the prediction target space.
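To make "discrete visual tokens" concrete, the following is a minimal sketch of the nearest-neighbour codebook lookup that a VQ-VAE-style tokenizer performs. It is an illustration under stated assumptions, not the authors' implementation: the function names, the two-dimensional embeddings, and the three-entry codebook are all hypothetical, and a real tokenizer would use a learned encoder and a far larger codebook.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def quantize(patch_embeddings, codebook):
    """Map each patch embedding to the index of its nearest codebook entry.

    The returned indices play the role of the discrete visual tokens.
    """
    tokens = []
    for embedding in patch_embeddings:
        distances = [squared_distance(embedding, entry) for entry in codebook]
        tokens.append(distances.index(min(distances)))
    return tokens


# Hypothetical toy codebook with three 2-D entries (illustration only).
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]

# Hypothetical encoder outputs for four image patches.
patch_embeddings = [[0.1, -0.2], [0.9, 1.2], [-0.8, 1.7], [0.6, 0.7]]

tokens = quantize(patch_embeddings, codebook)
print(tokens)
```

Each patch is replaced by a single integer, so the pre-training target for an image is a short sequence of codebook indices rather than raw pixels.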

Methodological Insight

The method extends the traditional VQ-VAE framework with a perceptual loss computed on multi-scale features extracted from a self-supervised ViT-B model. The key idea is to minimize feature-wise (as opposed to pixel-wise) discrepancies between the input and its reconstruction, which yields visual tokens with more semantic content than those learned from typical reconstruction losses alone.
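The idea of adding a feature-wise term to a pixel-wise reconstruction loss can be sketched as follows. This is a toy illustration, not the paper's loss: `toy_multiscale_features` is a hypothetical stand-in for the frozen deep feature extractor, `perceptual_weight` is an assumed name for the balancing coefficient, images are flat lists of numbers, and the codebook and commitment terms of a full VQ-VAE objective are omitted.

```python
def mse(a, b):
    """Mean squared error between two equal-length lists of numbers."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def toy_multiscale_features(image):
    """Hypothetical stand-in for a frozen deep feature extractor.

    Returns features at two "scales": means of adjacent pixel pairs and
    the global mean. A real system would use activations from several
    layers of a pretrained network instead.
    """
    pair_means = [(image[i] + image[i + 1]) / 2 for i in range(0, len(image) - 1, 2)]
    global_mean = [sum(image) / len(image)]
    return [pair_means, global_mean]


def tokenizer_loss(image, reconstruction, perceptual_weight=1.0):
    """Pixel-wise loss plus a weighted feature-wise (perceptual) loss."""
    pixel_loss = mse(image, reconstruction)
    perceptual_loss = sum(
        mse(f, g)
        for f, g in zip(
            toy_multiscale_features(image),
            toy_multiscale_features(reconstruction),
        )
    )
    return pixel_loss + perceptual_weight * perceptual_loss


image = [0.0, 1.0, 0.0, 1.0]
reconstruction = [0.0, 1.0, 1.0, 1.0]

pixel_only = tokenizer_loss(image, reconstruction, perceptual_weight=0.0)
with_perceptual = tokenizer_loss(image, reconstruction, perceptual_weight=1.0)
print(pixel_only, with_perceptual)
```

Setting `perceptual_weight` to zero recovers a purely pixel-wise objective, which makes the weight a natural knob for trading off feature agreement against pixel agreement.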

This perceptual tokenization strategy is used within a masked image modeling paradigm in which the network learns to predict the perceptually derived visual tokens of the masked portions of an image. The paper applies the approach across different model sizes and reports superior performance on various downstream tasks, which supports the scalability of PeCo.
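The prediction objective described above can be sketched as a cross-entropy over codebook indices, averaged over the masked patches only. This is a minimal sketch under assumptions: the function names, the three-entry vocabulary, and the hand-written logits are hypothetical, and in practice the logits would come from the vision transformer being pre-trained.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_sum for x in logits]


def masked_token_loss(logits_per_patch, target_tokens, mask):
    """Average cross-entropy between predictions and target tokens,
    computed over masked patches only."""
    losses = [
        -log_softmax(logits)[target]
        for logits, target, is_masked in zip(logits_per_patch, target_tokens, mask)
        if is_masked
    ]
    if not losses:
        raise ValueError("at least one patch must be masked")
    return sum(losses) / len(losses)


# Hypothetical targets from a tokenizer with a three-entry codebook.
target_tokens = [0, 1, 2, 1]
mask = [False, True, False, True]

# Hypothetical model outputs: uniform predictions on the masked patches,
# and a confidently wrong prediction on an unmasked patch (ignored).
logits_per_patch = [
    [0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
    [5.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
]

loss = masked_token_loss(logits_per_patch, target_tokens, mask)
print(loss)
```

With uniform predictions over three tokens, the loss on each masked patch is log 3; the unmasked patches contribute nothing, regardless of what the model outputs there.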

Empirical Results

The empirical evaluation reports gains over existing methods, including BEiT and MAE, on ImageNet-1K classification, COCO object detection, and ADE20K semantic segmentation. Specifically, PeCo achieves 84.5% top-1 accuracy on ImageNet-1K with a ViT-B backbone, an improvement of +1.3% over BEiT under the same pre-training epochs, and 88.3% with a larger ViT-H backbone. Similar improvements are reported for object detection and segmentation, which the paper attributes to perceptual tokenization.

Analysis and Implications

The paper also analyzes the impact of different perceptual feature sources and finds that CNN-based and Transformer-based deep features yield comparable gains. It further shows that while the perceptual loss clearly increases the semantic richness of the visual tokens, weighting it too heavily can reduce the fidelity of local detail, so the perceptual and pixel-wise terms need to be balanced.

From a theoretical perspective, this work argues for a shift in the design of self-supervised vision models toward perceptually coherent prediction targets. Practically, the findings encourage the integration of perceptual metrics into pre-training, with the aim of producing vision transformers that are more semantically aware.

Future Directions

The paper opens several avenues for future exploration. Two prominent ones are extending perceptually guided tokenization to other modalities and multimodal datasets, and studying how perceptually pre-trained models behave when deployed on real-world downstream computer vision tasks. Adaptive perceptual metrics tailored to different data types and architectures might also improve model adaptability and performance.

In conclusion, "PeCo: Perceptual Codebook for BERT Pre-training of Vision Transformers" advances vision transformer pre-training by aligning prediction targets with human perception. Its method and empirical results support each other well, making it a significant contribution to self-supervised visual representation learning.