Emergent Mind

Abstract

Diffusion models have achieved great success in image generation, with the backbone evolving from U-Net to Vision Transformers. However, the computational cost of Transformers is quadratic in the number of tokens, leading to significant challenges when dealing with high-resolution images. In this work, we propose Diffusion Mamba (DiM), which combines the efficiency of Mamba, a sequence model based on State Space Models (SSM), with the expressive power of diffusion models for efficient high-resolution image synthesis. To address the challenge that Mamba cannot generalize to 2D signals, we make several architecture designs including multi-directional scans, learnable padding tokens at the end of each row and column, and lightweight local feature enhancement. Our DiM architecture achieves inference-time efficiency for high-resolution images. In addition, to further improve training efficiency for high-resolution image generation with DiM, we investigate a "weak-to-strong" training strategy that pretrains DiM on low-resolution images ($256\times 256$) and then finetunes it on high-resolution images ($512 \times 512$). We further explore training-free upsampling strategies to enable the model to generate higher-resolution images (e.g., $1024\times 1024$ and $1536\times 1536$) without further fine-tuning. Experiments demonstrate the effectiveness and efficiency of our DiM.

Figure overview: noisy image/latent inputs are processed through convolution and Mamba blocks to predict noise.

Overview

  • Diffusion Mamba (DiM) addresses the high computational cost of Vision Transformers in high-resolution image synthesis by leveraging the efficiency of the Mamba sequence model, which is rooted in State Space Models (SSM).

  • DiM introduces key architectural adjustments like multi-directional scans, learnable padding tokens, and local feature enhancement to make Mamba effective for image generation, and employs a 'weak-to-strong' training strategy to reduce training time.

  • Experimental results show that DiM performs comparably to state-of-the-art models on datasets like CIFAR-10 and ImageNet, and its linear scalability gives it an inference-speed advantage at resolutions above 1280x1280 pixels.

Diffusion Mamba (DiM): An Efficient Approach to High-Resolution Image Synthesis

Introduction

Diffusion models have been making waves in the realm of image generation, especially with the advances brought about by Vision Transformers. However, the computational cost of using transformers scales quadratically with the number of tokens, which becomes a significant bottleneck when dealing with high-resolution images. Enter Diffusion Mamba (DiM), a new approach to high-resolution image synthesis that leverages the efficiency of Mamba, a sequence model rooted in State Space Models (SSM). This article breaks down the key ideas and innovations presented in the DiM approach, along with their practical implications and experimental results.

The Problem with High-Resolution Transformers

While Vision Transformers have proven effective in image generation, their computational cost is an issue. Transformers operate on image tokens, and the self-attention mechanism within these models scales quadratically with the number of tokens. As image resolution increases, so does the number of tokens, making high-resolution tasks computationally expensive.
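To make the scaling concrete, the snippet below estimates how token count and relative self-attention cost grow with resolution. The 8x latent downsampling and patch size of 2 are illustrative assumptions (typical of latent-diffusion transformer setups), not numbers taken from the paper:

```python
# Illustrative only: token count and relative self-attention cost vs. resolution,
# assuming an 8x-downsampled latent and a patch size of 2 (both assumptions).

def num_tokens(resolution, latent_downsample=8, patch=2):
    """Number of tokens for a square image at the given pixel resolution."""
    side = (resolution // latent_downsample) // patch
    return side * side

def attention_cost(resolution):
    """Self-attention cost grows as O(n^2) in the token count n."""
    n = num_tokens(resolution)
    return n * n

base = attention_cost(256)
for res in (256, 512, 1024, 1536):
    print(f"{res}x{res}: {num_tokens(res)} tokens, "
          f"attention cost x{attention_cost(res) / base:.0f}")
```

Under these assumptions, moving from 256x256 to 1536x1536 multiplies the token count by 36 but the attention cost by roughly 1300, which is the bottleneck DiM targets.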

Introducing Mamba for Image Synthesis

Mamba, a model based on SSMs, offers a solution: its computational complexity is linear in the number of tokens, compared to the quadratic complexity of transformers. Mamba has shown efficiency across various domains, such as language and audio, and it brings the same promise to high-resolution image generation.
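The source of this linear complexity is the SSM recurrence: each token is processed in constant time given the previous hidden state. The toy scan below shows the idea for a single scalar channel; real Mamba uses multi-channel, input-dependent (selective) parameters, so this is a conceptual sketch, not the paper's implementation:

```python
# Toy single-channel SSM scan: h_t = a*h_{t-1} + b*x_t, y_t = c*h_t.
# One left-to-right pass => O(n) in sequence length (vs. O(n^2) attention).
# The coefficients a, b, c are fixed here; Mamba makes them input-dependent.

def ssm_scan(inputs, a=0.9, b=0.5, c=1.0):
    h = 0.0
    outputs = []
    for x in inputs:      # constant work per token
        h = a * h + b * x # hidden state carries context from earlier tokens
        outputs.append(c * h)
    return outputs

ys = ssm_scan([1.0, 0.0, 0.0])
```

Note how the impulse at the first position decays through later outputs: the hidden state propagates context along the sequence without any pairwise token comparisons.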

Key Innovations in Diffusion Mamba (DiM)

To make Mamba effective for image generation, several architectural adjustments were proposed:

  1. Multi-Directional Scans: To overcome the unidirectional limitations of Mamba in handling 2D data, DiM employs multi-directional scans. This ensures that each token can access a global receptive field, improving the model's ability to capture spatial information across the entire image.
  2. Learnable Padding Tokens: To address discontinuities caused by flattening the image into a 1D sequence, DiM uses learnable padding tokens. These tokens help the model recognize the boundaries within the image, preserving the spatial structure.
  3. Local Feature Enhancement: DiM incorporates lightweight depth-wise convolution layers at the input and output layers, enhancing the local coherence of the generated images.
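The first two designs can be sketched on a toy token grid: a row-major and a column-major scan produce complementary 1D orderings, and a padding marker appended at the end of each row or column tells the model where 2D boundaries fall in the flattened sequence. The `PAD` marker and helper names below are illustrative stand-ins for the learnable padding tokens, not the paper's actual code:

```python
# Sketch of multi-directional flattening of a 3x3 token grid, with a
# (hypothetical) PAD marker playing the role of DiM's learnable padding token.

PAD = "<pad>"

def row_scan_with_padding(grid):
    seq = []
    for row in grid:
        seq.extend(row)
        seq.append(PAD)       # padding token marks the end of each row
    return seq

def col_scan_with_padding(grid):
    seq = []
    for col in zip(*grid):    # transpose: iterate columns instead of rows
        seq.extend(col)
        seq.append(PAD)       # padding token marks the end of each column
    return seq

grid = [[0, 1, 2],
        [3, 4, 5],
        [6, 7, 8]]
print(row_scan_with_padding(grid))
print(col_scan_with_padding(grid))
```

Scanning in multiple directions means tokens that are far apart in one ordering (e.g., vertical neighbors in a row-major scan) are adjacent in another, which is how each token gains a global receptive field.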

Training Strategies for High-Resolution Images

High-resolution training is resource-intensive. DiM introduces a "weak-to-strong" training strategy: pretraining on low-resolution images (256x256), then fine-tuning on higher resolutions (512x512). This approach significantly reduces training time and computational costs.

Additionally, DiM explores training-free upsampling strategies, allowing the model to generate even higher resolutions (up to 1536x1536) without additional fine-tuning, utilizing techniques like upsample guidance during the initial diffusion steps.

Experimental Results

The effectiveness and efficiency of DiM were validated through extensive experiments on datasets like CIFAR-10 and ImageNet:

  • On CIFAR-10: DiM-Small achieved an impressive FID score of 2.92, demonstrating its capability even with constrained resources.
  • On ImageNet: DiM-Huge, pretrained at 256x256 resolution and fine-tuned at 512x512, showed comparable performance to other state-of-the-art transformer-based models. Notably, it achieved this with significantly fewer iterations and less training data.

Moreover, DiM demonstrated the capability to generate high-quality, high-resolution images (1024x1024 and 1536x1536) using training-free upsampling.

Efficiency Analysis

A comparative analysis of inference speed reveals that while DiM may be slightly slower at lower resolutions, it outperforms transformer-based models at very high resolutions (above 1280x1280). The linear scalability of Mamba makes DiM more efficient for those large-scale tasks, crucial for practical applications demanding high-resolution outputs.

Practical Implications and Future Directions

DiM offers a promising approach to high-resolution image synthesis, balancing computational efficiency and model performance. Its architectural innovations and training strategies provide a framework that can adapt to increasingly higher resolutions without proportional increases in computational cost.

Future directions may explore further optimization of the Mamba backbone and the potential integration with other efficient model architectures. Additionally, extending DiM's principles to new generative tasks beyond image synthesis could broaden its application.

Conclusion

Diffusion Mamba represents a step forward in making high-resolution image synthesis more computationally feasible. By marrying the efficient Mamba backbone with innovative architectural designs, DiM achieves impressive performance, particularly in high-resolution scenarios. This work not only showcases a practical solution to a pressing problem but also opens up new avenues for efficient generative models in the future.
