
Denoising Vision Transformers

(2401.02957)
Published Jan 5, 2024 in cs.CV

Abstract

We delve into a nuanced but significant challenge inherent to Vision Transformers (ViTs): feature maps of these models exhibit grid-like artifacts, which hurt the performance of ViTs in downstream tasks. Our investigations trace this fundamental issue down to the positional embeddings at the input stage. To address this, we propose a novel noise model, which is universally applicable to all ViTs. Specifically, the noise model dissects ViT outputs into three components: a semantics term free from noise artifacts and two artifact-related terms that are conditioned on pixel locations. Such a decomposition is achieved by enforcing cross-view feature consistency with neural fields on a per-image basis. This per-image optimization process extracts artifact-free features from raw ViT outputs, providing clean features for offline applications. Expanding the scope of our solution to support online functionality, we introduce a learnable denoiser to predict artifact-free features directly from unprocessed ViT outputs, which shows remarkable generalization capabilities to novel data without the need for per-image optimization. Our two-stage approach, termed Denoising Vision Transformers (DVT), does not require re-training existing pre-trained ViTs and is immediately applicable to any Transformer-based architecture. We evaluate our method on a variety of representative ViTs (DINO, MAE, DeiT-III, EVA02, CLIP, DINOv2, DINOv2-reg). Extensive evaluations demonstrate that our DVT consistently and significantly improves existing state-of-the-art general-purpose models in semantic and geometric tasks across multiple datasets (e.g., +3.84 mIoU). We hope our study will encourage a re-evaluation of ViT design, especially regarding the naive use of positional embeddings.

DVT's denoiser improves object discovery by enhancing feature norms and reducing artifacts.

Overview

  • Vision Transformers (ViTs) have advanced image-related AI tasks but suffer from persistent noise artifacts.

  • The study identifies positional embeddings as the source of noise artifacts in ViTs.

  • A Denoising Vision Transformers (DVT) approach is proposed, utilizing a two-stage solution for artifact removal.

  • DVT significantly improves performance across various tasks without requiring retraining of ViTs.

  • The research calls for a re-evaluation of ViT design and offers immediate improvements for existing models.

Introduction

In the ever-evolving landscape of artificial intelligence, Vision Transformers (ViTs) have risen as a prominent architecture for image-related tasks. Despite their state-of-the-art performance, a recent study underlines a critical issue with these models: the presence of persistent noise artifacts in their outputs. These artifacts not only degrade the visual quality of the feature maps but also hurt the models' performance on downstream tasks by disturbing feature interpretability and semantic coherence.

Uncovering the Issue

A deep dive into the cause of these artifacts reveals their association with positional embeddings incorporated at the initial stages of the ViT architecture. These embeddings are meant to provide the model with spatial cues but unfortunately contribute to the artifact problem. Through an analytical approach, the researchers show how ViTs carry these artifacts in their outputs, regardless of input variations, and establish a consistent presence of these issues across numerous pre-trained models.
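
One way to see the artifact directly is to extract patch features from a frozen, pre-trained ViT and visualize their per-patch norms: a grid-like pattern shows up even when the input itself carries essentially no spatial structure. The sketch below is an illustrative probe, not the paper's own diagnostic code; it assumes the public DINOv2 torch.hub entry point and its get_intermediate_layers helper.

```python
# Illustrative probe for grid-like artifacts in ViT feature maps (a sketch,
# assuming the public DINOv2 torch.hub entry point; any pre-trained ViT that
# exposes patch features could be substituted).
import torch
import matplotlib.pyplot as plt

model = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
model.eval()

# A constant (all-gray) image: any spatial structure in the resulting feature
# map must come from the model itself, e.g. from its positional embeddings.
x = torch.full((1, 3, 518, 518), 0.5)

with torch.no_grad():
    # Returns a (1, C, H_p, W_p) grid of patch features from the last block.
    feats = model.get_intermediate_layers(x, n=1, reshape=True)[0]

norms = feats.norm(dim=1)[0]  # (H_p, W_p) per-patch feature norms
plt.imshow(norms.cpu(), cmap="viridis")
plt.title("Per-patch feature norm for a constant input")
plt.colorbar()
plt.show()
```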

A Novel Denoising Approach

To address this challenge, the researchers propose an innovative two-stage solution, known as Denoising Vision Transformers (DVT). The first stage introduces a universal noise model for ViT outputs, which decomposes the output into three components: an artifact-free semantics term and two artifact terms conditioned on pixel locations. The decomposition is recovered with neural fields that enforce cross-view consistency of features on a per-image basis. Offline applications can use the artifact-free features produced by this per-image optimization directly.
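
Concretely, the noise model treats the raw feature at a patch location roughly as semantics plus a position-dependent artifact (plus a smaller residual interaction term), and recovers the semantics per image by requiring that features of the same image location agree across different crops. The following is a simplified, hypothetical rendition of that per-image objective: the tiny coordinate MLP, the omission of the residual term, and the way crop and image coordinates are supplied are all assumptions for illustration, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class CoordField(nn.Module):
    """Tiny coordinate MLP: maps normalized (x, y) positions to a feature vector.
    Stands in for the neural fields of the per-image optimization (hypothetical)."""
    def __init__(self, dim_out, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, dim_out),
        )

    def forward(self, coords):  # coords: (N, 2) in [0, 1]
        return self.net(coords)

def fit_per_image(crops, image_coords, patch_coords, vit_features, dim, steps=1000):
    """crops: list of crop indices; for each crop i we are given
       image_coords[i]: (N, 2) locations of its patches in the full image,
       patch_coords[i]: (N, 2) locations of its patches inside the crop,
       vit_features[i]: (N, dim) raw ViT patch features for that crop."""
    semantics = CoordField(dim)  # shared across crops -> artifact-free term
    artifact = CoordField(dim)   # depends only on within-crop position
    opt = torch.optim.Adam(
        list(semantics.parameters()) + list(artifact.parameters()), lr=1e-3
    )

    for _ in range(steps):
        loss = 0.0
        for i in crops:
            # The same image location must map to the same semantics in every crop;
            # the artifact term is tied to the patch position within the crop.
            pred = semantics(image_coords[i]) + artifact(patch_coords[i])
            loss = loss + ((pred - vit_features[i]) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return semantics  # query it densely to obtain clean, artifact-free features
```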

Moving to the second stage, for applications that demand online functionality, a lightweight denoiser model is trained to predict these clean features from the raw ViT outputs directly. The denoiser is a single Transformer block and can be smoothly integrated into existing ViTs without the need for retraining. This stage enables real-time applications and ensures the denoiser's efficacy on new, unseen data.
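
In spirit, the second stage bolts a small learnable module onto the frozen backbone and trains it to regress the clean features recovered in the first stage. The sketch below uses a standard nn.TransformerEncoderLayer as a stand-in for the single-block denoiser; the exact block design, positional handling, and training recipe in the paper may differ.

```python
import torch
import torch.nn as nn

class FeatureDenoiser(nn.Module):
    """Lightweight denoiser: one Transformer block applied to raw ViT patch tokens.
    A stand-in sketch; the paper's block design and training details may differ."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, dim_feedforward=4 * dim,
            batch_first=True, norm_first=True,
        )
        self.out = nn.Linear(dim, dim)

    def forward(self, tokens):  # tokens: (B, N, dim) raw ViT patch features
        return self.out(self.block(tokens))

# Training step (schematic): the frozen ViT provides noisy tokens, and the targets
# are the artifact-free features produced by the per-image optimization in stage one.
def train_step(denoiser, optimizer, noisy_tokens, clean_targets):
    pred = denoiser(noisy_tokens)
    loss = nn.functional.mse_loss(pred, clean_targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

At inference time the denoiser runs once on the backbone's output tokens, so the added cost is roughly that of one extra Transformer block per forward pass.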

Efficacy and Applications

The proposed DVT approach has undergone extensive evaluation on a variety of ViTs, highlighting its ability to notably enhance their performance across semantic and geometric tasks without re-training. The gains in performance metrics are substantial, for example a reported improvement of +3.84 mIoU in semantic segmentation. In practice, DVT's utility is evident as it can be readily applied to any existing Transformer-based architecture, extending its promise to various applications in image processing and computer vision.

Conclusion

This study motivates a re-assessment of the design choices in ViT architectures, especially the naïve implementation of positional embeddings. It provides a robust framework for extracting artifact-free features from ViT outputs and improves the quality and reliability of features used in downstream vision tasks. The adaptive nature of DVT paves the way for immediate enhancements in pre-trained models and promises an artifact-free future for Vision Transformers.
