I2VEdit: First-Frame-Guided Video Editing via Image-to-Video Diffusion Models (2405.16537v1)
Abstract: The remarkable generative capabilities of diffusion models have motivated extensive research in both image and video editing. Compared to video editing, which faces additional challenges along the temporal dimension, image editing has seen the development of more diverse, high-quality approaches and more capable software, such as Photoshop. In light of this gap, we introduce a novel and generic solution that extends the applicability of image editing tools to videos by propagating edits from a single frame to the entire video using a pre-trained image-to-video model. Our method, dubbed I2VEdit, adaptively preserves the visual and motion integrity of the source video depending on the extent of the edits, effectively handling global edits, local edits, and moderate shape changes, which existing methods cannot fully achieve. At the core of our method are two main processes: Coarse Motion Extraction, which aligns basic motion patterns with the original video, and Appearance Refinement, which makes precise adjustments using fine-grained attention matching. We also incorporate a skip-interval strategy to mitigate quality degradation from auto-regressive generation across multiple video clips. Experimental results demonstrate our framework's superior performance in fine-grained video editing, producing high-quality, temporally consistent outputs.
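To make the pipeline concrete, below is a minimal toy sketch of the two-stage structure the abstract describes: edit the first frame with any image tool, extract a coarse motion representation from the source clip, then propagate the edit along that motion. All function names and the frame-difference motion representation here are illustrative placeholders, not the paper's implementation; in the actual method, propagation is performed by a pre-trained image-to-video diffusion model with fine-grained attention matching.

```python
import torch

# Hypothetical stand-ins for the two stages named in the abstract.
# These are illustrative placeholders only, not the paper's components.

def extract_coarse_motion(source_clip: torch.Tensor) -> torch.Tensor:
    """Stage 1 (Coarse Motion Extraction): derive a coarse motion signal
    from the source video. Placeholder: frame-to-frame differences."""
    return source_clip[1:] - source_clip[:-1]

def refine_appearance(edited_first_frame: torch.Tensor,
                      coarse_motion: torch.Tensor) -> torch.Tensor:
    """Stage 2 (Appearance Refinement): propagate the edited first frame
    along the coarse motion. Placeholder: cumulative addition of motion
    residuals; the paper instead uses an image-to-video diffusion model
    with fine-grained attention matching."""
    frames = [edited_first_frame]
    for delta in coarse_motion:
        frames.append(frames[-1] + delta)
    return torch.stack(frames)

# Toy usage: a random 8-frame RGB clip at 64x64 resolution.
source = torch.rand(8, 3, 64, 64)
edited_first = source[0] * 0.5  # stand-in for a Photoshop-style edit
edited_clip = refine_appearance(edited_first, extract_coarse_motion(source))
print(edited_clip.shape)  # torch.Size([8, 3, 64, 64])
```

The skip-interval strategy the abstract mentions would operate one level above this sketch, when chaining multiple generated clips auto-regressively.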