AdaDiff: Adaptive Step Selection for Fast Diffusion (2311.14768v1)

Published 24 Nov 2023 in cs.CV and cs.AI

Abstract: Diffusion models, a type of generative model, have achieved impressive results in generating images and videos conditioned on text. However, their generation process involves denoising for dozens of steps to produce photorealistic images or videos, which is computationally expensive. Unlike previous methods that design "one-size-fits-all" approaches to speed up sampling, we argue that the number of denoising steps should be sample-specific, conditioned on the richness of the input text. To this end, we introduce AdaDiff, a lightweight framework designed to learn instance-specific step-usage policies, which the diffusion model then uses for generation. AdaDiff is optimized with a policy gradient method to maximize a carefully designed reward function that balances inference time and generation quality. Experiments on three image generation and two video generation benchmarks demonstrate that our approach matches the visual quality of a baseline using a fixed 50 denoising steps while reducing inference time by 33% to 40%. Furthermore, our qualitative analysis shows that our method allocates more steps to more informative text conditions and fewer steps to simpler ones.
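The abstract describes learning a step-usage policy with policy gradients against a reward that trades off quality and inference cost. The following is a minimal, self-contained sketch of that idea, not the paper's implementation: the step budgets, the toy quality function, the cost weight `lam`, and the unconditioned per-choice logits are all hypothetical stand-ins (a real policy would condition on the text embedding and score samples with an actual quality model).

```python
import math
import random

random.seed(0)

# Hypothetical candidate denoising-step budgets the policy may pick from.
STEP_CHOICES = [10, 20, 30, 40, 50]

# Toy "policy network": one logit per choice, shared across all prompts.
logits = [0.0] * len(STEP_CHOICES)

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def fake_quality(steps):
    # Stand-in for a generation-quality score (e.g., a learned reward
    # model): quality rises with step count but saturates.
    return 1.0 - math.exp(-steps / 15.0)

def reward(quality, steps, lam=0.02):
    # Reward balancing generation quality against inference cost,
    # in the spirit of the paper's reward design.
    return quality - lam * steps

lr = 0.5
baseline = 0.0  # running mean reward, used as a variance-reducing baseline
for _ in range(2000):
    probs = softmax(logits)
    # Sample a step budget from the current policy.
    i = random.choices(range(len(STEP_CHOICES)), weights=probs)[0]
    r = reward(fake_quality(STEP_CHOICES[i]), STEP_CHOICES[i])
    advantage = r - baseline
    baseline += 0.05 * (r - baseline)
    # REINFORCE update: grad of log pi(i) w.r.t. logits is one_hot(i) - probs.
    for j in range(len(logits)):
        grad = (1.0 if j == i else 0.0) - probs[j]
        logits[j] += lr * advantage * grad

probs = softmax(logits)
best = STEP_CHOICES[probs.index(max(probs))]
print("preferred step budget:", best)
```

Under this toy reward, mid-range budgets score best (quality saturates while cost keeps growing), so the policy learns to concentrate probability mass on a small step count, mirroring the paper's observation that simpler conditions need fewer steps.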

Authors (5)
  1. Hui Zhang (405 papers)
  2. Zuxuan Wu (144 papers)
  3. Zhen Xing (25 papers)
  4. Jie Shao (53 papers)
  5. Yu-Gang Jiang (223 papers)
Citations (5)
