Cinemo: Consistent and Controllable Image Animation with Motion Diffusion Models (2407.15642v2)
Abstract: Diffusion models have achieved great progress in image animation due to their powerful generative capabilities. However, maintaining spatio-temporal consistency with the detailed information of the input static image over time (e.g., its style, background, and objects) and ensuring smoothness in animated video narratives guided by textual prompts remain challenging. In this paper, we introduce Cinemo, a novel image animation approach that achieves better motion controllability, as well as stronger temporal consistency and smoothness. Specifically, we propose three effective strategies spanning the training and inference stages of Cinemo. At the training stage, Cinemo focuses on learning the distribution of motion residuals via a motion diffusion model, rather than directly predicting subsequent frames. Additionally, a strategy based on the structural similarity index (SSIM) is proposed to give Cinemo finer control over motion intensity. At the inference stage, a noise refinement technique based on the discrete cosine transform is introduced to mitigate sudden motion changes. Together, these three strategies enable Cinemo to produce highly consistent, smooth, and motion-controllable results. Compared to previous methods, Cinemo offers simpler and more precise user control. Extensive experiments against several state-of-the-art methods, including both commercial tools and research approaches, across multiple metrics, demonstrate the effectiveness and superiority of our proposed approach.
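The abstract describes the three strategies only at a high level, so the NumPy sketch below is purely illustrative of the ideas rather than the paper's implementation: the function names, the windowless (global) SSIM variant, the 2-D DCT over the spatial axes, and the `cutoff` frequency band are all assumptions, not details taken from the paper.

```python
import numpy as np
from scipy.fft import dctn, idctn

# --- Training: diffuse motion residuals instead of raw frames ---------------
def motion_residuals(frames: np.ndarray) -> np.ndarray:
    """frames: (T, C, H, W). Returns per-frame residuals w.r.t. the first
    frame; a motion diffusion model trained to denoise these residuals keeps
    the static appearance of the input image by construction (frame 0's
    residual is exactly zero)."""
    return frames - frames[:1]

# --- Training: SSIM-derived motion-intensity condition ----------------------
def global_ssim(x: np.ndarray, y: np.ndarray,
                c1: float = 0.01 ** 2, c2: float = 0.03 ** 2) -> float:
    """Simplified windowless SSIM between two images in [0, 1]."""
    mx, my = x.mean(), y.mean()
    cov = ((x - mx) * (y - my)).mean()
    num = (2 * mx * my + c1) * (2 * cov + c2)
    den = (mx ** 2 + my ** 2 + c1) * (x.var() + y.var() + c2)
    return float(num / den)

def motion_intensity(frames: np.ndarray) -> float:
    """Average SSIM between the first frame and every later frame; lower SSIM
    means larger motion. A scalar like this could be fed to the model as an
    extra condition so motion strength can be dialed at inference."""
    return float(np.mean([global_ssim(frames[0], f) for f in frames[1:]]))

# --- Inference: DCT-based refinement of the initial noise -------------------
def dct_noise_refine(noise: np.ndarray, reference: np.ndarray,
                     cutoff: float = 0.25) -> np.ndarray:
    """Swap the low-frequency DCT band of the sampled noise for that of a
    reference signal (e.g., the noised latent of the input image). Anchoring
    low frequencies while keeping the random high frequencies is one way to
    suppress abrupt layout changes early in sampling."""
    n_dct = dctn(noise, axes=(-2, -1), norm="ortho")
    r_dct = dctn(reference, axes=(-2, -1), norm="ortho")
    h, w = noise.shape[-2:]
    mask = np.zeros((h, w))
    mask[: int(h * cutoff), : int(w * cutoff)] = 1.0  # low-frequency corner
    mixed = r_dct * mask + n_dct * (1.0 - mask)
    return idctn(mixed, axes=(-2, -1), norm="ortho")
```

The residual parameterization is the key design choice here: because the model predicts deviations from the input frame rather than whole frames, appearance drift cannot accumulate in the static content, which is consistent with the consistency claims in the abstract.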