SnapFusion: Text-to-Image Diffusion Model on Mobile Devices within Two Seconds

Published 1 Jun 2023 in cs.CV, cs.AI, and cs.LG | (2306.00980v3)

Abstract: Text-to-image diffusion models can create stunning images from natural language descriptions that rival the work of professional artists and photographers. However, these models are large, with complex network architectures and tens of denoising iterations, making them computationally expensive and slow to run. As a result, high-end GPUs and cloud-based inference are required to run diffusion models at scale. This is costly and has privacy implications, especially when user data is sent to a third party. To overcome these challenges, we present a generic approach that, for the first time, unlocks running text-to-image diffusion models on mobile devices in less than 2 seconds. We achieve this by introducing an efficient network architecture and improving step distillation. Specifically, we propose an efficient UNet by identifying the redundancy of the original model and reducing the computation of the image decoder via data distillation. Further, we enhance the step distillation by exploring training strategies and introducing regularization from classifier-free guidance. Our extensive experiments on MS-COCO show that our model with 8 denoising steps achieves better FID and CLIP scores than Stable Diffusion v1.5 with 50 steps. Our work democratizes content creation by bringing powerful text-to-image diffusion models to the hands of users.

Summary

The paper introduces SnapFusion, a mobile-optimized text-to-image diffusion model that generates high-quality images in under two seconds.
It employs an efficient UNet architecture, a compressed VAE decoder, and CFG-aware step distillation to minimize latency and preserve image fidelity.
Experimental results on the MS-COCO dataset show that SnapFusion, using only 8 denoising steps, achieves better FID and CLIP scores than Stable Diffusion v1.5 with 50 steps.

SnapFusion: A Mobile-Optimized Text-to-Image Diffusion Model

The paper presents SnapFusion, a text-to-image diffusion model engineered to run on mobile devices. By generating an image in under two seconds on-device, SnapFusion addresses the computational cost and privacy concerns of conventional text-to-image diffusion models, which typically require high-end GPUs and cloud-based inference.

Contributions and Methodology

The study introduces significant architectural optimizations and novel strategies for step distillation to facilitate swift on-device inference. The central contributions of the paper are outlined below:

Efficient UNet Architecture: The authors identify and remove redundancy in the original UNet, the backbone of the diffusion model, using a robust training and evaluation procedure. The resulting UNet has substantially lower latency while maintaining image generation quality.
Network Architecture Evolving Framework: A framework is proposed to evolve the network architecture systematically. It couples robust stochastic training with an evolutionary algorithm that prunes redundant parts of the architecture, improving inference speed.
Compressed VAE Decoder: To accelerate image decoding, the VAE decoder is compressed via data distillation, with negligible impact on visual quality. The distillation pipeline uses synthetic latent-image pairs to keep computational overhead low.
CFG-Aware Step Distillation: Step distillation is enhanced with regularization from classifier-free guidance (CFG), reducing the number of denoising iterations while preserving image fidelity. This is central to cutting latency: the distilled model needs only 8 denoising steps to perform comparably to its 50-step counterpart.
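To make the CFG-aware idea more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. The guidance formula (unconditional prediction plus a scaled difference) is the standard classifier-free guidance rule. Everything else is an assumption made for illustration: the function names, comparing raw noise predictions as flat lists, the mixing probability `cfg_prob`, and the guidance range `w_range`.

```python
import random

def cfg_combine(eps_cond, eps_uncond, w):
    """Standard classifier-free guidance: eps_uncond + w * (eps_cond - eps_uncond)."""
    return [u + w * (c - u) for c, u in zip(eps_cond, eps_uncond)]

def mse(a, b):
    """Mean squared error between two equally sized lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def vanilla_distill_loss(teacher, student):
    """Match only the conditional predictions (no guidance applied)."""
    return mse(student[0], teacher[0])

def cfg_aware_distill_loss(teacher, student, w):
    """Apply the same guidance scale to teacher and student, then compare.

    teacher and student are (eps_cond, eps_uncond) pairs of lists.
    """
    guided_teacher = cfg_combine(teacher[0], teacher[1], w)
    guided_student = cfg_combine(student[0], student[1], w)
    return mse(guided_student, guided_teacher)

def mixed_distill_loss(teacher, student, cfg_prob=0.5, w_range=(2.0, 14.0), rng=random):
    """Hypothetical mixing rule: use the CFG-aware loss with probability cfg_prob.

    cfg_prob and w_range are illustrative values, not taken from the paper.
    """
    if rng.random() < cfg_prob:
        w = rng.uniform(w_range[0], w_range[1])
        return cfg_aware_distill_loss(teacher, student, w)
    return vanilla_distill_loss(teacher, student)

# Toy predictions standing in for UNet outputs.
teacher = ([1.0, 1.0], [0.0, 0.0])
student = ([0.0, 0.0], [0.0, 0.0])
print(vanilla_distill_loss(teacher, student))
print(cfg_aware_distill_loss(teacher, student, w=2.0))
```

In this toy setting the guided loss grows with the guidance scale, which hints at why matching guided outputs may act as a stronger regularizer on image-text alignment than matching unguided outputs alone.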

Numerical Outcomes

Experimental validation on the MS-COCO dataset indicates that SnapFusion's performance exceeds that of Stable Diffusion v1.5, achieving superior FID and CLIP scores despite being executed with reduced computational resources. Notably, with just 8 denoising steps, SnapFusion outperforms the baseline 50-step configuration in terms of image-text alignment as quantified by the CLIP score.
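For readers unfamiliar with the metric, a CLIP score is based on the cosine similarity between an image embedding and a text embedding produced by a CLIP model. The sketch below shows only that similarity step, with short hand-written vectors standing in for real embeddings. The function names and the clipping of negative values to zero are assumptions for illustration; the exact CLIP variant and scaling used in the paper's evaluation are not stated in this summary.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def clip_style_score(image_emb, text_emb):
    """Similarity-based alignment score, with negative values clipped to zero.

    image_emb and text_emb stand in for CLIP image and text embeddings.
    """
    return max(0.0, cosine_similarity(image_emb, text_emb))

# Toy embeddings: a well-aligned pair scores higher than a poorly aligned one.
aligned = clip_style_score([3.0, 4.0], [3.0, 4.0])
misaligned = clip_style_score([1.0, 0.0], [0.0, 1.0])
print(aligned, misaligned)
```

A higher score means the generated image lies closer to its prompt in the shared embedding space, which is the sense in which the CLIP score measures image-text alignment.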

Implications and Future Directions

SnapFusion represents a leap forward in democratizing creative content generation by delivering powerful diffusion models to the user’s palm. The implications for practical applications span various domains, including interactive digital content and real-time artistic rendering on consumer devices. The paper paves the way for additional inquiries into efficient architecture search and distillation methodologies, with potential extensions to other domains such as video synthesis or 3D content creation.

Future research could explore the further miniaturization of these models to fit diverse mobile hardware or enhance model adaptability for varied stylistic attributes. As the demand for efficient, high-quality on-device AI models intensifies, SnapFusion provides a blueprint for effectively overcoming the latency constraints of large-scale machine learning models, ensuring broad accessibility without compromising data privacy.
