Emergent Mind

FiT: Flexible Vision Transformer for Diffusion Model

(2402.12376)
Published Feb 19, 2024 in cs.CV

Abstract

Nature is infinitely resolution-free. In the context of this reality, existing diffusion models, such as Diffusion Transformers, often face challenges when processing image resolutions outside of their trained domain. To overcome this limitation, we present the Flexible Vision Transformer (FiT), a transformer architecture specifically designed for generating images with unrestricted resolutions and aspect ratios. Unlike traditional methods that perceive images as static-resolution grids, FiT conceptualizes images as sequences of dynamically-sized tokens. This perspective enables a flexible training strategy that effortlessly adapts to diverse aspect ratios during both training and inference phases, thus promoting resolution generalization and eliminating biases induced by image cropping. Enhanced by a meticulously adjusted network structure and the integration of training-free extrapolation techniques, FiT exhibits remarkable flexibility in resolution extrapolation generation. Comprehensive experiments demonstrate the exceptional performance of FiT across a broad range of resolutions, showcasing its effectiveness both within and beyond its training resolution distribution. Repository available at https://github.com/whlzy/FiT.

Figure: Overview of the flexible training and inference pipelines, and the FiT block.

Overview

  • The Flexible Vision Transformer (FiT) treats images as sequences of dynamically-sized tokens rather than fixed-resolution grids, enabling image generation at unrestricted resolutions and aspect ratios.

  • FiT combines a flexible training pipeline, a transformer architecture built on 2D Rotary Positional Embedding (RoPE) and the Swish-Gated Linear Unit (SwiGLU), and a training-free resolution extrapolation method.

  • In experimental evaluations, FiT outperforms existing state-of-the-art models across a range of resolutions and aspect ratios, with the largest gains at resolutions outside its training distribution.

  • FiT's adaptable architecture and training-free extrapolation methods improve image synthesis quality and provide a foundation for future research in generative image synthesis across different domains.

Flexible Vision Transformer for Unrestricted Resolution Image Generation

Introduction

In the evolving landscape of image generation, models that generalize across arbitrary resolutions remain an open challenge. The recently introduced Flexible Vision Transformer (FiT) is a significant advance in this direction, changing how images are represented during training and generation. By conceptualizing images as sequences of dynamically-sized tokens, FiT moves past the fixed-dimensionality assumption of prior diffusion transformers and enables resolution-independent image synthesis.

Core Contributions

FiT introduces several innovative design elements, each contributing to its exceptional performance:

  • Flexible Training Pipeline: This approach allows for the preservation of original image aspect ratios by dynamically resizing images to fit within a predefined token limit, thereby eliminating the need for cropping or disproportionate scaling.
  • Novel Transformer Architecture: At its core, FiT incorporates 2D Rotary Positional Embedding (RoPE) and Swish-Gated Linear Unit (SwiGLU), enabling the model to adeptly handle variable image sizes and maintain efficiency across varying resolutions.
  • Resolution Extrapolation Method: Leveraging techniques from LLMs, FiT introduces a training-free extrapolation method, allowing for the generation of images at resolutions beyond those encountered during training.
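The flexible training pipeline above can be sketched in a few lines. The following is a minimal, hypothetical illustration, not FiT's actual implementation: the patch size, token budget, and function names are illustrative assumptions. The idea is to downscale an image only as much as needed for its patch grid to fit the token budget, preserving the aspect ratio, then pad the resulting variable-length sequence to a fixed length with an attention mask.

```python
import math

def resize_to_token_limit(h, w, patch=16, max_tokens=256):
    """Shrink (h, w) uniformly, preserving aspect ratio, until the
    patch grid fits within max_tokens; snap sizes to patch multiples.
    (patch and max_tokens are illustrative values, not FiT's.)"""
    hp, wp = math.ceil(h / patch), math.ceil(w / patch)
    if hp * wp > max_tokens:
        scale = math.sqrt(max_tokens / (hp * wp))
        hp = max(1, math.floor(hp * scale))
        wp = max(1, math.floor(wp * scale))
    return hp * patch, wp * patch, hp * wp  # new size and token count

def pad_and_mask(tokens, max_tokens=256, dim=4):
    """Pad a variable-length token sequence to max_tokens; return the
    padded sequence and a mask (1 = real token, 0 = padding)."""
    n = len(tokens)
    padded = tokens + [[0.0] * dim] * (max_tokens - n)
    mask = [1] * n + [0] * (max_tokens - n)
    return padded, mask
```

Because the mask marks padding, batches can mix images of different shapes without cropping any of them, which is the bias the flexible pipeline is designed to avoid.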

Experimental Insights

FiT exhibits strong performance across a broad spectrum of resolutions, as evidenced by extensive experimental evaluations. Its advantages are most pronounced at resolutions and aspect ratios substantially different from the training distribution, where it outperforms state-of-the-art models by significant margins. For instance, in class-conditional image generation on the ImageNet dataset, FiT achieved leading FID scores at various resolutions, setting new benchmarks for image synthesis quality.

Architectural Innovations

A key aspect of FiT's success is its architectural improvements over predecessors. The replacement of standard multi-head self-attention (MHSA) with Masked MHSA, which ignores padded positions in variable-length sequences, the transition from the MLP feed-forward block to SwiGLU, and the adoption of 2D RoPE collectively enhance the model's flexibility and efficiency. These choices enable FiT to manage variable-length sequences and generate high-quality images across a diverse range of resolutions and aspect ratios.
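Two of these components can be sketched concretely. Below is a simplified reading of 2D RoPE (apply 1D RoPE with the row coordinate to one half of the channels and with the column coordinate to the other half) and of the SwiGLU feed-forward block; it is a minimal sketch under these assumptions, not FiT's exact implementation, and the function names are our own.

```python
import numpy as np

def rope_1d(x, pos, base=10000.0):
    """Standard 1D RoPE: rotate consecutive feature pairs by
    position-dependent angles."""
    d = x.shape[-1]
    inv_freq = base ** (-np.arange(0, d, 2) / d)   # (d/2,)
    ang = pos[:, None] * inv_freq[None, :]         # (n, d/2)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def rope_2d(x, rows, cols):
    """2D RoPE (simplified): first half of channels encodes the row
    coordinate, second half the column coordinate."""
    d = x.shape[-1] // 2
    return np.concatenate([rope_1d(x[..., :d], rows),
                           rope_1d(x[..., d:], cols)], axis=-1)

def swiglu(x, w1, w2, w3):
    """SwiGLU feed-forward: (SiLU(x @ w1) * (x @ w3)) @ w2."""
    gate = x @ w1
    return (gate / (1.0 + np.exp(-gate)) * (x @ w3)) @ w2
```

Because RoPE is a pure rotation, it preserves token norms while making attention scores depend on relative (row, column) offsets, which is what lets the same weights serve grids of any shape.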

Extrapolation Capabilities

FiT's resolution extrapolation process facilitates image generation beyond the confines of the training distribution. Adapting interpolation methods from the LLM context-extension literature, FiT introduces VisionNTK and VisionYaRN, which extend NTK-aware interpolation and YaRN to two spatial dimensions. With these training-free techniques, FiT synthesizes images at resolutions never seen during training, producing images with arbitrary resolutions and aspect ratios, a feat not readily achievable by previous models.
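For intuition, the NTK-aware trick from the LLM literature rescales the RoPE frequency base when the target sequence exceeds the training length, so low frequencies are interpolated while high frequencies extrapolate. The sketch below shows that base rescaling applied independently per spatial axis; this is a simplified reading of the idea behind VisionNTK under our own assumptions, not the paper's exact formulation.

```python
def ntk_scaled_base(base, scale, dim):
    """NTK-aware RoPE extrapolation (LLM-style): enlarge the frequency
    base by scale ** (dim / (dim - 2)) when extending context by
    `scale`. `dim` is the per-axis rotary dimension (illustrative)."""
    return base * scale ** (dim / (dim - 2))

# Illustrative: extrapolating from a 16x16 training grid to a 32x32
# grid rescales the base used for each spatial axis independently.
row_base = ntk_scaled_base(10000.0, 32 / 16, 64)
col_base = ntk_scaled_base(10000.0, 32 / 16, 64)
```

Because only the base changes, no weights are retrained: the same positional scheme simply stretches to cover the larger grid, which is what "training-free extrapolation" refers to.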

Future Directions

The introduction of FiT represents a significant step forward in the domain of image generation, particularly in the context of resolution and aspect ratio flexibility. Looking ahead, FiT's versatile architecture and innovative methodologies offer a promising foundation for further research and development. Potential future directions include the exploration of FiT's applicability to other domains beyond image generation, refinement of its extrapolation methods for even greater efficiency, and adaptation to leverage emerging computational paradigms.

In summary, the FiT model substantiates the feasibility of generating high-quality images across a vast spectrum of resolutions and aspect ratios, effectively addressing a longstanding challenge in the field. Its comprehensive design, coupled with exceptional performance, positions FiT as a pivotal model for future explorations in generative image synthesis.
