EdgeFusion: On-Device Text-to-Image Generation

(arXiv 2404.11925)
Published Apr 18, 2024 in cs.LG, cs.AI, and cs.CV

Abstract

The intensive computational burden of Stable Diffusion (SD) for text-to-image generation poses a significant hurdle for its practical application. To tackle this challenge, recent research focuses on methods to reduce sampling steps, such as the Latent Consistency Model (LCM), and on employing architectural optimizations, including pruning and knowledge distillation. Diverging from existing approaches, we uniquely start with a compact SD variant, BK-SDM. We observe that directly applying LCM to BK-SDM with commonly used crawled datasets yields unsatisfactory results. This leads us to develop two strategies: (1) leveraging high-quality image-text pairs from leading generative models and (2) designing an advanced distillation process tailored for LCM. Through our thorough exploration of quantization, profiling, and on-device deployment, we achieve rapid generation of photo-realistic, text-aligned images in just two steps, with latency under one second on resource-limited edge devices.

Figure: EdgeFusion produces high-quality images from challenging prompts using improved training data and only a few denoising steps.

Overview

  • EdgeFusion optimizes Stable Diffusion models for edge devices by integrating architectural refinement, advanced model distillation, and data quality optimization.

  • The methodology includes advanced distillation using the BK-SDM-Tiny model and an LCM scheduler, enhanced data quality control through manual and synthetic methods, and deployment strategies like Model-level Tiling and mixed-precision quantization.

  • Experimental results show a drastic reduction in inference time to under a second per image while still maintaining high image quality on resource-constrained devices.

  • The research sets a foundation for future model optimization on edge devices and could significantly impact real-world applications in mobile and embedded systems.

Enhancing Stable Diffusion Models for Edge Deployments with Advanced Distillation and Optimized Data Strategies

Introduction to the Research

The research paper presents an innovative approach, termed EdgeFusion, aimed at addressing the significant computational challenges associated with deploying Stable Diffusion (SD) models on resource-constrained edge devices. The authors propose solutions that integrate architectural refinement, advanced model distillation, and tailored optimization of image-text data quality to significantly reduce inference time while maintaining high-quality text-to-image generation capabilities.

Proposed Methodology

Advanced Distillation for LCM

EdgeFusion builds on a compact SD variant, BK-SDM-Tiny, and uses the Latent Consistency Model (LCM) to cut the number of sampling steps. The central challenge is that directly applying LCM to such compact models with commonly used crawled datasets yields unsatisfactory results. The researchers address this with a two-phase training process:

  1. Initial Training: Utilizing advanced "teacher" models to perform feature-level knowledge distillation.
  2. Fine-Tuning: Employing an LCM scheduler to refine the model further, ensuring a robust reduction in denoising steps (a minimal sketch follows this list).
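
To make the distillation objective concrete, the sketch below shows a generic feature-level knowledge-distillation loss in PyTorch. The names, tensor shapes, and loss weights are illustrative placeholders rather than the paper's actual implementation, which follows the BK-SDM recipe.

```python
# Illustrative two-term KD loss: the tensors stand in for UNet noise
# predictions and intermediate block activations; weights are placeholders.
import torch
import torch.nn.functional as F

def kd_loss(student_noise, teacher_noise, student_feats, teacher_feats,
            w_out=1.0, w_feat=1.0):
    # Output-level KD: match the teacher's predicted noise.
    loss = w_out * F.mse_loss(student_noise, teacher_noise)
    # Feature-level KD: match intermediate activations pairwise.
    for s, t in zip(student_feats, teacher_feats):
        loss = loss + w_feat * F.mse_loss(s, t)
    return loss

# Toy usage with random tensors shaped like SD latents/features.
student_noise = torch.randn(2, 4, 64, 64, requires_grad=True)
teacher_noise = torch.randn(2, 4, 64, 64)
student_feats = [torch.randn(2, 320, 32, 32, requires_grad=True)]
teacher_feats = [torch.randn(2, 320, 32, 32)]
kd_loss(student_noise, teacher_noise, student_feats, teacher_feats).backward()
```

In the second phase, the same student is fine-tuned under an LCM scheduler so that only a couple of denoising steps suffice at inference time.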

Enhanced Data Quality

A significant portion of this study is dedicated to optimizing the input data quality, which includes:

  • Data Preprocessing: Techniques such as deduplication and optimized cropping improve the existing real-world dataset (a toy deduplication sketch follows this list).
  • Synthetic Data Generation: To overcome the limitations of real-world data, the team uses leading generative models to produce synthetic image-text pairs, giving tighter control over data quality and diversity.
  • Manual Data Curation: Even with these automated pipelines, manual curation is shown to further improve data quality, yielding better training outcomes.
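
As a concrete illustration of the deduplication step, the sketch below greedily filters near-duplicates given precomputed image embeddings (e.g., from CLIP). The threshold and greedy strategy are assumptions for illustration; the paper does not spell out its exact filtering rules.

```python
# Greedy near-duplicate filter over precomputed embeddings; the 0.95
# cosine-similarity threshold is an illustrative choice, not the paper's.
import torch
import torch.nn.functional as F

def deduplicate(embeddings: torch.Tensor, threshold: float = 0.95) -> list[int]:
    emb = F.normalize(embeddings, dim=-1)  # unit-norm rows -> dot = cosine sim
    kept: list[int] = []
    for i in range(emb.shape[0]):
        if all(torch.dot(emb[i], emb[j]).item() < threshold for j in kept):
            kept.append(i)
    return kept

# Toy usage: 100 random 512-d "image embeddings".
print(f"kept {len(deduplicate(torch.randn(100, 512)))} of 100")
```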

Deployment on Edge Devices

The method includes specific adaptations for deployment on Neural Processing Units (NPUs):

  • Model-level Tiling (MLT): This strategy manages the limited memory of edge devices by partitioning the model so that intermediate data moves efficiently between on-chip and external memory during execution.
  • Quantization: Mixed-precision quantization adapts the model to the target hardware, balancing computational cost against output quality (a toy precision-assignment sketch follows this list).
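
A toy version of the precision-assignment idea: fake-quantize each layer's weights to INT8, measure the reconstruction error, and fall back to FP16 where the error exceeds a budget. The error metric, budget, and layer selection are all assumptions for illustration; the paper's actual mixed-precision recipe is hardware-specific.

```python
# Toy mixed-precision planner: INT8 where weight quantization error is small,
# FP16 otherwise. The budget and error metric are illustrative assumptions.
import torch
import torch.nn as nn

def fake_quant_int8(w: torch.Tensor) -> torch.Tensor:
    scale = w.abs().max().clamp(min=1e-8) / 127.0
    return torch.clamp((w / scale).round(), -128, 127) * scale

@torch.no_grad()
def assign_precision(model: nn.Module, err_budget: float = 1e-6) -> dict:
    plan = {}
    for name, mod in model.named_modules():
        if isinstance(mod, (nn.Linear, nn.Conv2d)):
            err = (mod.weight - fake_quant_int8(mod.weight)).pow(2).mean().item()
            plan[name] = "int8" if err <= err_budget else "fp16"
    return plan

# Toy usage on a small stand-in network.
net = nn.Sequential(nn.Conv2d(3, 8, 3), nn.ReLU(), nn.Conv2d(8, 8, 3))
print(assign_precision(net))
```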

Experimental Setup and Data

The experimental setup spans several stages of training and deployment, using high-performance GPUs for model training and edge NPUs for deployment evaluation. On the data side, high-quality synthetic datasets and manually curated subsets are central to refining the training process.

Results and Observations

The EdgeFusion method demonstrated promising results:

  • Inference Efficiency: The model generates images in under one second on resource-constrained devices, a drastic latency reduction (the few-step sampling pattern is sketched after this list).
  • Image Quality: The research provides substantial empirical evidence showing that the image quality remains high even with the reduced computational overhead.
  • Comparative Analysis: When compared with previous models, EdgeFusion shows a significant advancement in reducing inference steps while maintaining or enhancing the text-image alignment and image realism.
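
For readers who want to try the few-step sampling pattern themselves, the sketch below uses diffusers with a stock SD 1.5 checkpoint and the public LCM-LoRA as stand-ins, since EdgeFusion's own weights are not assumed to be available; only the two-step LCM sampling setup mirrors the paper.

```python
# Two-step LCM sampling with stand-in weights (not EdgeFusion's own model).
# Requires a CUDA GPU and the diffusers library.
import torch
from diffusers import DiffusionPipeline, LCMScheduler

pipe = DiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
pipe.load_lora_weights("latent-consistency/lcm-lora-sdv1-5")

image = pipe(
    "a photo of an astronaut riding a horse",
    num_inference_steps=2,  # few-step generation, as in the paper
    guidance_scale=1.0,     # LCM typically uses little or no CFG
).images[0]
image.save("out.png")
```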

Implications and Future Work

The implications of this research are vast for real-world applications, especially in areas where computing resources are limited, such as mobile devices and embedded systems. The ability to deploy powerful generative models on such platforms could transform various industries, including mobile photography, augmented reality, and real-time visual content generation.

Looking forward, the strategies developed in this research could set a foundational framework for further explorations into model optimization for edge devices. Future work could explore the integration of these approaches with other AI-driven tasks, expanding the utility and efficiency of generative models in practical applications. Additionally, continuous improvements in dataset quality and distillation methods might lead to even faster and more efficient model deployments.

In summary, EdgeFusion represents a significant step forward in making sophisticated text-to-image models more accessible on devices with limited computational capacity, opening up new avenues for both academic research and practical applications in the field of generative AI.
