MotionMaster: Training-free Camera Motion Transfer For Video Generation (2404.15789v2)
Abstract: The emergence of diffusion models has greatly propelled progress in image and video generation. Recently, efforts have been made toward controllable video generation, including text-to-video generation and video motion control, among which camera motion control is an important topic. However, existing camera motion control methods rely on training a temporal camera module and demand substantial computational resources due to the large number of parameters in video generation models. Moreover, existing methods pre-define camera motion types during training, which limits their flexibility in camera control. Therefore, to reduce training costs and achieve flexible camera control, we propose COMD, a novel training-free video motion transfer model that disentangles camera motions and object motions in source videos and transfers the extracted camera motions to new videos. We first propose a one-shot camera motion disentanglement method to extract camera motion from a single source video: it separates the moving objects from the background and estimates the camera motion in the moving-object regions from the background motion by solving a Poisson equation. Furthermore, we propose a few-shot camera motion disentanglement method to extract the common camera motion from multiple videos with similar camera motions, which employs a window-based clustering technique to extract the common features in the temporal attention maps of multiple videos. Finally, we propose a motion combination method that combines different types of camera motions, enabling more controllable and flexible camera control. Extensive experiments demonstrate that our training-free approach can effectively decouple camera and object motion and apply the decoupled camera motion to a wide range of controllable video generation tasks, achieving flexible and diverse camera motion control.
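The one-shot disentanglement step fills in camera motion inside moving-object regions from the surrounding background motion by solving a Poisson equation. Below is a minimal sketch of that idea, assuming the camera flow is harmonic (zero Laplacian) inside the object mask with Dirichlet boundary values taken from the background flow; the function name and the Jacobi iteration are illustrative choices, not the paper's implementation.

```python
import numpy as np

def complete_camera_flow(flow, object_mask, n_iters=2000, tol=1e-6):
    """Fill in camera motion inside object_mask by solving Laplace's equation
    (a Poisson equation with zero source term), with Dirichlet boundary values
    taken from the background flow around the mask.
    flow: (H, W, 2) optical-flow field; object_mask: (H, W) bool, assumed
    interior to the frame (np.roll wraps at image borders)."""
    out = flow.astype(np.float64)
    mask = object_mask.astype(bool)
    for c in range(2):  # solve the u and v flow channels independently
        u = out[..., c].copy()
        for _ in range(n_iters):
            # Jacobi update: each masked pixel becomes the mean of its 4 neighbors
            nb = 0.25 * (np.roll(u, 1, 0) + np.roll(u, -1, 0)
                         + np.roll(u, 1, 1) + np.roll(u, -1, 1))
            new_u = np.where(mask, nb, u)  # background pixels stay fixed
            if np.max(np.abs(new_u - u)) < tol:
                u = new_u
                break
            u = new_u
        out[..., c] = u
    return out
```

Since object motion has been masked out, the unknown camera flow in those regions is assumed to vary smoothly, so harmonic interpolation from the background boundary serves as a reasonable stand-in for the paper's Poisson-equation solve.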
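The few-shot variant extracts the camera motion that several videos share by clustering, window by window, features derived from their temporal attention maps and keeping the dominant cluster. A minimal sketch of one window's step, assuming DBSCAN as the clustering algorithm and one feature vector per video; the feature layout and hyperparameters are illustrative, not the paper's exact procedure.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def common_motion_feature(window_feats, eps=0.5, min_samples=2):
    """Estimate the camera-motion feature shared by several videos for one
    spatial window. window_feats: (n_videos, d) array with one temporal-
    attention feature vector per video. Returns the mean of the largest
    cluster, treating outlier videos (DBSCAN label -1) as noise."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(window_feats)
    valid = labels[labels >= 0]
    if valid.size == 0:                      # no consensus: fall back to the mean
        return window_feats.mean(axis=0)
    majority = np.bincount(valid).argmax()   # most populated cluster
    return window_feats[labels == majority].mean(axis=0)
```

Treating low-density videos as noise lets a window dominated by object motion in one video be discarded rather than contaminate the shared camera-motion estimate for that window.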
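Finally, the motion combination step merges different camera motions into a new one. The toy sketch below phrases this in terms of flow fields and assumes two simple composition modes, a global blend (e.g., mixing a pan and a zoom into a diagonal dolly) and a region-wise assembly that applies different motions in different windows; the paper itself operates on temporal attention maps, so flow fields are used here only to make the combination concrete.

```python
import numpy as np

def combine_camera_motions(flow_a, flow_b, region_mask=None, alpha=0.5):
    """Combine two camera flow fields (H, W, 2) into one.
    With region_mask (H, W) bool: apply flow_a inside the region and flow_b
    outside (e.g., zoom in one window, pan elsewhere). Without a mask:
    blend globally with weight alpha."""
    if region_mask is not None:
        return np.where(region_mask[..., None], flow_a, flow_b)
    return alpha * flow_a + (1.0 - alpha) * flow_b
```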
Authors: Teng Hu, Jiangning Zhang, Ran Yi, Yating Wang, Hongrui Huang, Jieyu Weng, Yabiao Wang, Lizhuang Ma