
Make a Cheap Scaling: A Self-Cascade Diffusion Model for Higher-Resolution Adaptation (2402.10491v2)

Published 16 Feb 2024 in cs.CV

Abstract: Diffusion models have proven to be highly effective in image and video generation; however, they encounter challenges in the correct composition of objects when generating images of varying sizes due to single-scale training data. Adapting large pre-trained diffusion models to higher resolution demands substantial computational and optimization resources, yet achieving generation capabilities comparable to low-resolution models remains challenging. This paper proposes a novel self-cascade diffusion model that leverages the knowledge gained from a well-trained low-resolution image/video generation model, enabling rapid adaptation to higher-resolution generation. Building on this, we employ the pivot replacement strategy to facilitate a tuning-free version by progressively leveraging reliable semantic guidance derived from the low-resolution model. We further propose to integrate a sequence of learnable multi-scale upsampler modules for a tuning version capable of efficiently learning structural details at a new scale from a small amount of newly acquired high-resolution training data. Compared to full fine-tuning, our approach achieves a $5\times$ training speed-up and requires only 0.002M tuning parameters. Extensive experiments demonstrate that our approach can quickly adapt to higher-resolution image and video synthesis by fine-tuning for just $10k$ steps, with virtually no additional inference time.

Summary

  • The paper introduces a self-cascade diffusion model that leverages low-resolution pre-trained models for efficient high-resolution adaptation.
  • It employs a pivot-guided noise re-scheduling strategy combined with time-aware feature upsampling to minimize training overhead.
  • Extensive experiments demonstrate high-quality image and video synthesis at new resolutions while adapting over 5x faster than full fine-tuning.

Novel Self-Cascade Diffusion Model for Efficient High-Resolution Adaptation

Introduction

Recent developments in diffusion models have brought significant progress in high-quality image and video generation. A critical open challenge is adapting these models to generate content at higher resolutions efficiently: full fine-tuning of large pre-trained models for higher-resolution generation incurs substantial computational overhead and optimization difficulty. This paper introduces a self-cascade diffusion model that leverages the knowledge of a well-trained low-resolution model to enable rapid adaptation to higher-resolution tasks. The approach combines pivot-guided noise re-scheduling with time-aware feature upsampling modules, making the model adaptable to higher resolutions while requiring minimal fine-tuning.

This work emerges from a rich line of research on diffusion models, which are noted for their effectiveness across generative tasks. Existing strategies for scaling these models to higher resolutions typically involve either extensive retraining or progressive training schemes, both of which demand considerable computational resources. Tuning-free methods reduce computational cost but often struggle to maintain fidelity at higher resolutions. Cascaded super-resolution pipelines built on diffusion models offer another line of attack, yet they too fall short in balancing parameter efficiency with generative performance.

Methodology

The proposed self-cascade diffusion model couples a pivot-guided noise re-scheduling strategy, which enables tuning-free adaptation, with optional trainable upsampler modules that further refine output quality. The method adds only a negligible number of trainable parameters (0.002M) and achieves more than a 5x training speed-up over full fine-tuning.

  • Pivot-Guided Noise Re-Scheduling: At its core, this strategy cyclically reuses the low-resolution model to generate reliable baseline content, which is then carried to higher resolutions scale by scale through re-noising and continued denoising (see the first sketch below).
  • Time-Aware Feature Upsampler: When light tuning is acceptable for additional quality gains, a sequence of learnable upsampler modules adapts the features extracted by the frozen base model to the higher-resolution domain, guided by a small amount of high-resolution training data (see the second sketch below).
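
For concreteness, here is a minimal sketch of the tuning-free pivot-guided path, written against a Hugging Face diffusers-style UNet and scheduler. The function name, the 2x scale factor, the bilinear upsampling, and the choice of `pivot_step` are illustrative assumptions rather than the paper's exact settings, and the full method cascades through several scales rather than the single step shown here.

```python
import torch
import torch.nn.functional as F


@torch.no_grad()
def pivot_guided_upscale(unet, scheduler, z_low, cond, pivot_step=600, steps=50):
    """Reuse a clean low-resolution sample as a semantic 'pivot' for
    higher-resolution denoising, with no extra training.

    unet/scheduler: a frozen diffusers-style denoiser and its noise scheduler.
    z_low: clean low-resolution latent; cond: text-conditioning embeddings.
    """
    # 1) Upsample the low-resolution latent to the target scale.
    z = F.interpolate(z_low, scale_factor=2, mode="bilinear")

    # 2) Re-noise the upsampled pivot to an intermediate timestep via the
    #    standard forward process q(z_t | z_0).
    alpha_bar = scheduler.alphas_cumprod[pivot_step].to(z.device)
    z = alpha_bar.sqrt() * z + (1 - alpha_bar).sqrt() * torch.randn_like(z)

    # 3) Resume reverse diffusion from the pivot timestep at the new
    #    resolution, reusing the frozen low-resolution denoiser unchanged.
    scheduler.set_timesteps(steps)
    for t in scheduler.timesteps:
        if t > pivot_step:  # the pivot replaces these early, noisier steps
            continue
        eps = unet(z, t, encoder_hidden_states=cond).sample
        z = scheduler.step(eps, t, z).prev_sample
    return z
```

The key point is that the noisiest portion of the reverse trajectory is replaced by the re-noised low-resolution pivot, so the frozen model only has to synthesize high-frequency detail at the new scale rather than composing the image from scratch.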
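For the tuning version, a rough sketch of what a time-aware upsampler module might look like is given below. The FiLM-style timestep modulation, the depthwise convolution, and the zero initialization are design assumptions made for illustration; the paper specifies only that the modules are lightweight (0.002M parameters in total) and adapt features in a time-aware manner.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TimeAwareUpsampler(nn.Module):
    """Learnable feature upsampler conditioned on the diffusion timestep.
    Only modules like this are trained; the base denoiser stays frozen."""

    def __init__(self, channels: int):
        super().__init__()
        # Normalized timestep -> per-channel scale and shift (FiLM-style).
        self.to_scale_shift = nn.Linear(1, 2 * channels)
        # Depthwise conv keeps the added parameter count in the low thousands.
        self.conv = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)
        # Zero-init so the module starts as plain nearest-neighbour upsampling
        # and cannot disturb the frozen base model early in tuning.
        for layer in (self.to_scale_shift, self.conv):
            nn.init.zeros_(layer.weight)
            nn.init.zeros_(layer.bias)

    def forward(self, feat: torch.Tensor, t_norm: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W) feature map; t_norm: (B, 1) timestep in [0, 1].
        feat = F.interpolate(feat, scale_factor=2, mode="nearest")
        scale, shift = self.to_scale_shift(t_norm).chunk(2, dim=-1)
        feat = feat * (1 + scale[..., None, None]) + shift[..., None, None]
        return feat + self.conv(feat)  # residual refinement of the upsampling
```

For 128-channel features this adds roughly 1.8k parameters, which is on the same order as the 0.002M the paper reports across all modules combined.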

Experimental Results

The effectiveness of the proposed method is demonstrated through extensive experiments on image and video synthesis tasks, showing strong performance in both tuning-free and fine-tuned settings across multiple resolution scales. Notably, the model adapts to higher resolutions after only about 10k fine-tuning steps, a small fraction of what conventional methods require, and with virtually no additional inference time.

Implications and Future Work

The introduction of a self-cascade diffusion model represents a significant advancement in the efficient generation of high-resolution images and videos. It opens new avenues for research, particularly in exploring the balance between training efficiency and output fidelity. Future investigations could optimize the architecture of the time-aware upsampling modules to further reduce computational demands, or extend the model's applicability to generative tasks beyond image and video synthesis.

Conclusion

This paper sets a new benchmark in the adaptive generation of higher-resolution content from diffusion models. By strategically leveraging the capabilities of well-trained low-resolution models and introducing minimal yet effective fine-tuning mechanisms, it presents a highly efficient and scalable solution to a longstanding challenge in the field of generative models.