MobileDiffusion: Instant Text-to-Image Generation on Mobile Devices

Published 28 Nov 2023 in cs.CV | (2311.16567v2)

Abstract: The deployment of large-scale text-to-image diffusion models on mobile devices is impeded by their substantial model size and slow inference speed. In this paper, we propose MobileDiffusion, a highly efficient text-to-image diffusion model obtained through extensive optimizations in both architecture and sampling techniques. We conduct a comprehensive examination of model architecture design to reduce redundancy, enhance computational efficiency, and minimize the model's parameter count, while preserving image generation quality. Additionally, we employ distillation and diffusion-GAN finetuning techniques on MobileDiffusion to achieve 8-step and 1-step inference respectively. Empirical studies, conducted both quantitatively and qualitatively, demonstrate the effectiveness of our proposed techniques. MobileDiffusion achieves a remarkable sub-second inference speed for generating a $512\times512$ image on mobile devices, establishing a new state of the art.

Summary

The paper introduces an optimized UNet architecture that reduces redundancy and computational cost for mobile inference.
The paper implements advanced sampling techniques, including progressive distillation and a diffusion-GAN hybrid, to achieve sub-second image generation.
The empirical validation, with an FID of 9.01 at eight sampling steps and sub-second inference on an iPhone 15 Pro, demonstrates the practical feasibility of text-to-image generation on resource-constrained devices.

MobileDiffusion: Subsecond Text-to-Image Generation on Mobile Devices

In "MobileDiffusion: Subsecond Text-to-Image Generation on Mobile Devices," Zhao et al. address a central obstacle to running large-scale text-to-image diffusion models on mobile devices: their substantial model size and slow inference speed. The proposed model, MobileDiffusion, is a text-to-image diffusion model made efficient through optimizations to both its architecture and its sampling procedure. The paper shows how text-to-image generation can be brought within the constraints of mobile computing environments.

Summary of Contributions

The paper provides multiple key contributions:

Efficient Model Architecture: The authors investigate and optimize the UNet-based architecture commonly used in diffusion models. They introduce modifications to reduce redundancy, enhance computational efficiency, and minimize model parameters.
Advanced Sampling Techniques: The paper applies distillation and diffusion-GAN finetuning to cut the number of sampling steps required for image generation to eight and to one, respectively.
Empirical Validation: Through extensive empirical studies, both quantitative and qualitative, the authors demonstrate that MobileDiffusion achieves sub-second inference speeds for generating high-quality images on mobile devices.

Architecture Optimization

The inefficiency of text-to-image diffusion models stems from the need for iterative denoising and the complex network architecture involving a high number of parameters. The authors address these issues with a detailed examination of the UNet architecture. Key optimizations include:

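Transformer and Convolutional Block Reorganization: They investigate the role of transformer blocks and advocate selectively removing self-attention layers at high resolutions while retaining cross-attention. Because the cost of self-attention grows quadratically with the number of spatial tokens, dropping it where the feature maps are largest saves the most computation, and the authors report that model performance is maintained.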
Activation and Parameter Sharing: Replacing $\mathsf{gelu}$ with $\mathsf{swish}$ and sharing parameters between attention layers reduces computational costs without quality degradation.
Lightweight Convolutions: Adopting separable convolutions in deeper network sections further reduces parameter count and enhances runtime efficiency.

These optimizations culminate in a model architecture boasting fewer than 400 million parameters and substantial gains in computational efficiency.
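
To make the savings from two of these optimizations concrete, the following is a minimal sketch that counts parameters for a standard versus a depthwise-separable convolution and compares the size of the self-attention matrix at two resolutions. The channel width (1280) and the 64x64 and 16x16 feature-map sizes are illustrative assumptions, not figures taken from the paper, and the helper names are hypothetical.

```python
def conv_params(c_in, c_out, k=3, bias=True):
    """Parameter count of a standard k x k convolution."""
    return c_in * c_out * k * k + (c_out if bias else 0)


def separable_conv_params(c_in, c_out, k=3, bias=True):
    """Parameter count of a depthwise k x k conv followed by a pointwise 1 x 1 conv."""
    depthwise = c_in * k * k + (c_in if bias else 0)
    pointwise = c_in * c_out + (c_out if bias else 0)
    return depthwise + pointwise


def attention_matrix_entries(height, width):
    """Entries in the token-by-token self-attention matrix for a single head."""
    tokens = height * width
    return tokens * tokens


# Assumed values for illustration only (not from the paper).
channels = 1280
standard = conv_params(channels, channels)
separable = separable_conv_params(channels, channels)
print(f"standard conv:  {standard:,} parameters")
print(f"separable conv: {separable:,} parameters")
print(f"reduction:      {standard / separable:.1f}x")

high_res = attention_matrix_entries(64, 64)  # assumed high-resolution feature map
low_res = attention_matrix_entries(16, 16)   # assumed low-resolution feature map
print(f"self-attention matrix, 64x64 map: {high_res:,} entries")
print(f"self-attention matrix, 16x16 map: {low_res:,} entries")
```

Under these assumptions the attention matrix at the higher resolution is 256 times larger than at the lower one, which illustrates why removing self-attention at high resolutions is an attractive target.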

Sampling Efficiency

To further enhance the model's deployment feasibility on mobile devices, the authors implement:

Progressive Distillation: By recursively applying distillation techniques, MobileDiffusion reduces the required sampling steps to as few as eight, preserving image quality and reducing inference time.
Diffusion-GAN Hybrid: Following the UFOGen approach, the model is finetuned with a hybrid diffusion-GAN objective, enabling single-step inference without significant quality loss.
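
The recursive structure of progressive distillation can be sketched as follows, assuming the common scheme in which each round trains a student to reproduce two teacher steps in one, halving the step count. The starting count of 32 steps and the toy Euler update are assumptions for illustration; they are not the paper's sampler or training setup, and all function names are hypothetical.

```python
def distillation_schedule(teacher_steps, target_steps):
    """Step counts after each distillation round, halving until the target is reached."""
    if target_steps < 1 or teacher_steps < target_steps:
        raise ValueError("need teacher_steps >= target_steps >= 1")
    schedule = [teacher_steps]
    steps = teacher_steps
    while steps > target_steps:
        steps = max(target_steps, steps // 2)
        schedule.append(steps)
    return schedule


def teacher_step(x, dt):
    """Toy stand-in for one sampler step: explicit Euler on dx/dt = -x."""
    return x - dt * x


def two_step_target(x, dt):
    """Regression target for the student: two teacher half-steps covering dt."""
    return teacher_step(teacher_step(x, dt / 2), dt / 2)


# Hypothetical starting point of 32 steps, distilled down to 8.
print(distillation_schedule(32, 8))

# The student would be trained so that a single step over dt matches this value.
print(two_step_target(1.0, 0.5))
```

In an actual training loop the student network, not a closed-form update, would be fitted to the two-step target, and the trained student would become the teacher for the next round.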

Empirical Results

Empirical validation demonstrates MobileDiffusion's capabilities. With eight sampling steps, the model achieves a Fréchet Inception Distance (FID) of 9.01, comparable to larger and slower models. CLIP scores, which measure text-image alignment, and visual inspection further support the effectiveness of the architectural and sampling optimizations.
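
For reference, FID is commonly defined as the Fréchet distance between two Gaussians fitted to Inception features of real and generated images. The notation below is the standard one rather than the paper's own: $\mu_r, \Sigma_r$ and $\mu_g, \Sigma_g$ denote the feature mean and covariance for real and generated images, respectively.

$$\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \mathrm{Tr}\left(\Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2}\right)$$

Lower values indicate that the generated feature distribution is closer to the real one.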

Quantitative comparisons with other state-of-the-art text-to-image models underscore MobileDiffusion's efficiency. The demonstration on mobile devices, specifically achieving sub-second inference on an iPhone 15 Pro, establishes a new benchmark in mobile text-to-image generation.

Practical and Theoretical Implications

In practical terms, this research offers a pathway for deploying high-quality generative models on resource-constrained devices. This opens up applications ranging from real-time image editing and augmented reality to personalization features in mobile applications. Theoretically, the approach sets a precedent for future research on optimizing large-scale generative models for edge devices, highlighting the trade-offs between architectural complexity, parameter count, and inference efficiency.

Future Directions

Anticipated future developments include extending these optimizations to pixel-based models and exploring more advanced distillation and finetuning techniques. Continued research could also investigate integrating these models with other on-device functionalities to enhance user experience further.

In conclusion, Zhao et al.'s "MobileDiffusion" delivers significant advancements in making high-quality text-to-image generation feasible on mobile devices. The comprehensive architectural redesign and innovative sampling techniques highlight the potential for deploying sophisticated AI models on constrained hardware, paving the way for broader accessibility and utility of AI-driven applications.
