Stretching Each Dollar: Diffusion Training from Scratch on a Micro-Budget

(2407.15811)
Published Jul 22, 2024 in cs.CV, cs.AI, and cs.LG

Abstract

As scaling laws in generative AI push performance, they also simultaneously concentrate the development of these models among actors with large computational resources. With a focus on text-to-image (T2I) generative models, we aim to address this bottleneck by demonstrating very low-cost training of large-scale T2I diffusion transformer models. As the computational cost of transformers increases with the number of patches in each image, we propose to randomly mask up to 75% of the image patches during training. We propose a deferred masking strategy that preprocesses all patches using a patch-mixer before masking, thus significantly reducing the performance degradation with masking, making it superior to model downscaling in reducing computational cost. We also incorporate the latest improvements in transformer architecture, such as the use of mixture-of-experts layers, to improve performance and further identify the critical benefit of using synthetic images in micro-budget training. Finally, using only 37M publicly available real and synthetic images, we train a 1.16 billion parameter sparse transformer with only $1,890 economical cost and achieve a 12.7 FID in zero-shot generation on the COCO dataset. Notably, our model achieves competitive FID and high-quality generations while incurring 118× lower cost than stable diffusion models and 14× lower cost than the current state-of-the-art approach that costs $28,400. We aim to release our end-to-end training pipeline to further democratize the training of large-scale diffusion models on micro-budgets.

Figure: Performance of patch masking strategies, showing improved image generation at a training cost similar to MaskDiT.

Overview

  • The paper introduces a cost-efficient training methodology for text-to-image (T2I) diffusion transformers, significantly reducing computational resources without major performance degradation.

  • Key strategies include deferred patch masking, mixture-of-experts (MoE) layers, and the use of synthetic data, which collectively enable training on a much lower budget.

  • The model achieves a competitive FID score of 12.7 on the COCO dataset, with substantial cost and time savings compared to state-of-the-art methods.

The paper tackles the problem of constrained access in the development of advanced generative models, particularly text-to-image (T2I) diffusion transformers, by proposing a cost-efficient training methodology. The approach combines several strategies that significantly reduce the computational resources required for training without substantial performance degradation, thereby democratizing the ability to train large-scale diffusion models.

Methodology and Contributions

The authors introduce a deferred masking strategy that preprocesses image patches with a lightweight patch-mixer before masking. This mitigates the performance degradation that commonly accompanies high masking ratios: with standard masking, performance already declines noticeably at masking ratios around 50%, whereas the deferred strategy supports masking up to 75% of patches because the patch-mixer propagates semantic information from soon-to-be-masked patches into the retained ones. The result is substantially cheaper training with performance comparable to training on all patches.
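To make the mechanism concrete, here is a minimal PyTorch-style sketch of deferred patch masking, assuming illustrative `patch_mixer` and `backbone` modules (this is not the authors' released implementation; module names, shapes, and the conditioning interface are assumptions):

```python
import torch
import torch.nn as nn


class DeferredMasking(nn.Module):
    """Sketch of deferred patch masking: mix all patches first, then drop most of them."""

    def __init__(self, patch_mixer: nn.Module, backbone: nn.Module, mask_ratio: float = 0.75):
        super().__init__()
        self.patch_mixer = patch_mixer  # small transformer that sees *all* patches
        self.backbone = backbone        # large diffusion transformer, sees only kept patches
        self.mask_ratio = mask_ratio    # fraction of patches dropped (up to 0.75 in the paper)

    def forward(self, patches: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # patches: (batch, num_patches, dim) token embeddings of the noisy latent image
        # cond: illustrative text-conditioning tensor passed through to the backbone
        b, n, d = patches.shape

        # 1) Every patch passes through the cheap patch-mixer, so even patches that
        #    will be masked contribute context to the ones that are kept.
        mixed = self.patch_mixer(patches)

        # 2) Randomly keep a (1 - mask_ratio) subset of patches per sample.
        num_keep = max(1, int(n * (1.0 - self.mask_ratio)))
        keep_idx = torch.rand(b, n, device=patches.device).argsort(dim=1)[:, :num_keep]
        kept = torch.gather(mixed, 1, keep_idx.unsqueeze(-1).expand(-1, -1, d))

        # 3) Only the kept patches go through the expensive backbone; the training
        #    loss is then computed on these unmasked patches.
        return self.backbone(kept, cond)
```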

The key contributions highlighted include:

  1. Deferred Patch Masking: By utilizing a patch mixer prior to masking, the model retains more semantic context, thereby allowing for higher masking ratios.
  2. Architectural Optimizations: The incorporation of mixture-of-experts (MoE) layers and layer-wise scaling in the transformer architecture significantly boosts performance while maintaining cost-efficiency (see the sketch after this list).
  3. Synthetic Data Integration: The authors demonstrate the critical advantage of incorporating synthetic images into the training dataset, which substantially improves image quality and alignment.
  4. Low-Cost Training Pipeline: The combination of these techniques allows for the training of a 1.16 billion parameter sparse transformer model using only $1,890, which is considerably lower than other state-of-the-art approaches.
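As referenced in item 2, the following is a minimal, generic sketch of a mixture-of-experts feed-forward block with top-1 token routing. It illustrates the general idea of sparse expert layers (more parameters at roughly constant per-token compute) and is not claimed to match the paper's exact routing scheme or its layer-wise scaling; all names and hyperparameters here are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoEFeedForward(nn.Module):
    """Generic top-1 routed mixture-of-experts MLP (illustrative; not the paper's exact layer)."""

    def __init__(self, dim: int, hidden_dim: int, num_experts: int = 8):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)  # routing logits per token
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, hidden_dim), nn.GELU(), nn.Linear(hidden_dim, dim))
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim) -> flatten to (batch * tokens, dim) for routing
        b, t, d = x.shape
        tokens = x.reshape(-1, d)

        gate = F.softmax(self.router(tokens), dim=-1)  # (b*t, num_experts)
        weight, expert_idx = gate.max(dim=-1)          # top-1 expert per token

        out = torch.zeros_like(tokens)
        for i, expert in enumerate(self.experts):
            mask = expert_idx == i
            if mask.any():
                # Weight each expert's output by its gate value so routing stays differentiable.
                out[mask] = weight[mask].unsqueeze(-1) * expert(tokens[mask])

        return out.reshape(b, t, d)
```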

Results and Performance

The paper reports that the trained model achieves a competitive Fréchet Inception Distance (FID) of 12.7 in zero-shot generation on the COCO dataset. This represents a 118-fold cost reduction compared to the training cost of Stable Diffusion models and a 14-fold reduction compared to the current state-of-the-art low-cost training approach. Moreover, training completed in just 2.6 days on a single 8xH100 GPU machine, underscoring the efficiency of the proposed strategies.

Performance metrics include:

  • FID Score: 12.7 on the COCO dataset.
  • Cost Efficiency: $1,890, compared to $28,400 for the current state-of-the-art low-cost approach.
  • Computational Efficiency: 2.6 training days on an 8xH100 GPU machine (see the back-of-envelope check below).
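The reported time and dollar figures can be reconciled with a quick back-of-envelope calculation; the per-GPU-hour rate below is implied by the numbers above, not quoted from the paper:

```python
# Back-of-envelope check of the quoted training cost, using only the figures above.
gpus = 8                      # 8xH100 machine
days = 2.6                    # reported training duration
total_cost_usd = 1890         # reported training cost

gpu_hours = gpus * days * 24  # 499.2 H100 GPU-hours
implied_rate = total_cost_usd / gpu_hours

print(f"GPU-hours: {gpu_hours:.1f}")                      # ~499 GPU-hours
print(f"Implied rate: ${implied_rate:.2f} per GPU-hour")  # ~$3.79 per H100-hour (implied, not quoted)
```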

Theoretical and Practical Implications

Theoretically, the deferred masking strategy and employment of sparse transformers with MoE layers open new avenues for efficient training of large-scale models. The approach challenges the notion that high computational resources and proprietary datasets are essential for training high-performing diffusion models.

Practically, this democratized training methodology has the potential to substantially lower the entry barriers for smaller research institutions and independent researchers. This approach could further spur innovation and progress in generative AI by making the training of advanced models accessible to a broader audience.

Future Directions

Future research could extend this work in several directions:

  1. Exploration of Further Architectural Enhancements: Investigating other model architecture improvements that can synergize with deferred masking to yield better performance.
  2. Extending to Other Modalities: Applying the deferred masking and micro-budget training strategies to other generative models beyond T2I, such as text-to-video or text-to-audio models.
  3. Optimization Beyond Algorithmic Strategies: Integrating software and hardware stack optimizations, such as 8-bit precision training and optimized data loading, to further reduce training costs.

In conclusion, the paper presents a compelling case for cost-efficient training of large-scale diffusion models. Through deferred masking, MoE layers, and strategic use of synthetic data, the authors significantly lower training overheads while maintaining competitive performance, thus moving towards democratizing the development of advanced generative models.
