SinFusion: Training Diffusion Models on a Single Image or Video (2211.11743v3)

Published 21 Nov 2022 in cs.CV and cs.LG

Abstract: Diffusion models have exhibited tremendous progress in image and video generation, exceeding GANs in both quality and diversity. However, they are usually trained on very large datasets and are not naturally adapted to manipulating a given input image or video. In this paper we show how this can be resolved by training a diffusion model on a single input image or video. Our image/video-specific diffusion model (SinFusion) learns the appearance and dynamics of the single image or video, while utilizing the conditioning capabilities of diffusion models. It can solve a wide array of image/video-specific manipulation tasks. In particular, our model can learn from a few frames the motion and dynamics of a single input video. It can then generate diverse new video samples of the same dynamic scene, extrapolate short videos into long ones (both forward and backward in time), and perform video upsampling. Most of these tasks are not realizable by current video-specific generation methods.
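The core idea the abstract describes — training a standard DDPM-style denoiser on a single input rather than a large dataset — can be sketched in a few lines. The snippet below is a minimal, hedged illustration, not the paper's actual method: it uses a random array as a stand-in "single image", random crops as training examples (single-image generative models typically train on patches or crops of the one input), the standard DDPM closed-form forward process, and a toy linear denoiser in place of the paper's fully-convolutional backbone. All names and hyperparameters here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical single "image": a random 32x32 array standing in for the
# one training image a SinFusion-style model would be fitted to.
image = rng.standard_normal((32, 32))

# Standard DDPM linear beta schedule (Ho et al., 2020).
T = 100
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)  # strictly decreasing in t

def sample_crop(img, size=16):
    """Random crop: single-image models train on crops of the one input."""
    y = rng.integers(0, img.shape[0] - size + 1)
    x = rng.integers(0, img.shape[1] - size + 1)
    return img[y:y + size, x:x + size]

def add_noise(x0, t):
    """Closed-form forward process: x_t = sqrt(a_bar_t) x_0 + sqrt(1 - a_bar_t) eps."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return xt, eps

# Toy denoiser: a single linear map trained by SGD on the standard
# epsilon-prediction objective 0.5 * ||W x_t - eps||^2. A real model
# would use a convolutional network conditioned on t.
d = 16 * 16
W = np.zeros((d, d))
lr = 1e-3
for step in range(200):
    x0 = sample_crop(image).reshape(-1)
    t = int(rng.integers(0, T))
    xt, eps = add_noise(x0, t)
    pred = W @ xt
    W -= lr * np.outer(pred - eps, xt)  # gradient of the squared error w.r.t. W

print("final per-pixel loss:", np.mean((W @ xt - eps) ** 2))
```

Sampling would then run the learned denoiser backwards from pure noise, exactly as in standard DDPM; because the model only ever saw crops of one input, its samples are variations of that single image.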

Citations (56)
