Slicedit: Zero-Shot Video Editing With Text-to-Image Diffusion Models Using Spatio-Temporal Slices (2405.12211v1)
Abstract: Text-to-image (T2I) diffusion models achieve state-of-the-art results in image synthesis and editing. However, leveraging such pretrained models for video editing is considered a major challenge. Many existing works attempt to enforce temporal consistency in the edited video through explicit correspondence mechanisms, either in pixel space or between deep features. These methods, however, struggle with strong nonrigid motion. In this paper, we introduce a fundamentally different approach, which is based on the observation that spatiotemporal slices of natural videos exhibit characteristics similar to those of natural images. Thus, the same T2I diffusion model that is normally used only as a prior on video frames can also serve as a strong prior for enhancing temporal consistency by applying it to spatiotemporal slices. Based on this observation, we present Slicedit, a method for text-based video editing that utilizes a pretrained T2I diffusion model to process both spatial and spatiotemporal slices. Our method generates videos that retain the structure and motion of the original video while adhering to the target text. Through extensive experiments, we demonstrate Slicedit's ability to edit a wide range of real-world videos, confirming its clear advantages over existing competing methods. Webpage: https://matankleiner.github.io/slicedit/
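To make the central observation concrete, the sketch below shows what a spatiotemporal slice of a video is: fixing a spatial row (or column) and stacking it over time yields a 2D "image" whose vertical axis is time. This is a minimal illustration with NumPy on a toy random tensor; the array names and shapes are illustrative assumptions, not taken from the paper's code.

```python
import numpy as np

# Hypothetical toy video: T frames of H x W RGB (shapes are illustrative).
T, H, W, C = 16, 64, 48, 3
video = np.random.rand(T, H, W, C)

# A spatial slice is an ordinary frame: fix t, keep the (H, W) plane.
frame = video[0]                   # shape (H, W, C)

# A horizontal spatiotemporal slice: fix one row y and stack it over time.
# The result is a (T, W) "image" whose first axis is time.
xt_slice = video[:, H // 2, :, :]  # shape (T, W, C)

# A vertical spatiotemporal slice: fix one column x instead.
yt_slice = video[:, :, W // 2, :]  # shape (T, H, C)

print(frame.shape, xt_slice.shape, yt_slice.shape)
```

The paper's observation is that for natural videos these x-t and y-t slices look statistically similar to natural images, so the same pretrained T2I denoiser can act as a prior on them, encouraging temporal consistency without explicit pixel or feature correspondences.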