VidToMe: Video Token Merging for Zero-Shot Video Editing (2312.10656v2)
Abstract: Diffusion models have made significant advances in generating high-quality images, but their application to video generation has remained challenging due to the complexity of temporal motion. Zero-shot video editing offers a solution by using pre-trained image diffusion models to translate source videos into new ones. Nevertheless, existing methods struggle to maintain strict temporal consistency while keeping memory consumption low. In this work, we propose a novel approach to enhance temporal consistency in generated videos by merging self-attention tokens across frames. By aligning and compressing temporally redundant tokens across frames, our method improves temporal coherence and reduces memory consumption in self-attention computations. The merging strategy matches and aligns tokens according to the temporal correspondence between frames, facilitating natural temporal consistency in generated video frames. To manage the complexity of video processing, we divide videos into chunks and develop intra-chunk local token merging and inter-chunk global token merging, ensuring both short-term video continuity and long-term content consistency. Our video editing approach seamlessly extends the advancements in image editing to video editing, achieving better temporal consistency than state-of-the-art methods.
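To make the merging idea concrete, below is a minimal, hypothetical PyTorch sketch of cross-frame token merging, not the authors' implementation. It follows the abstract's description: each token of a frame is matched to its most similar token in a shared target frame via cosine similarity, the most redundant fraction is averaged into the target, and the match indices are kept so merged tokens can be copied back ("unmerged") after self-attention. The function name, the choice of the chunk's first frame as the target, and the similarity-based merging heuristic are all assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def merge_frame_into_target(frame_tokens, target_tokens, merge_ratio=0.5):
    """Hypothetical sketch: merge redundant tokens of one frame into a target frame.

    frame_tokens:  (N, C) self-attention tokens of the current frame.
    target_tokens: (M, C) tokens of the chunk's target frame (an assumption here:
                   the chunk's first frame serves as the shared target).
    Returns the reduced token set and the indices needed to unmerge afterwards.
    """
    # Match every source token to its most similar target token (cosine similarity).
    sim = F.normalize(frame_tokens, dim=-1) @ F.normalize(target_tokens, dim=-1).T
    best_sim, dst = sim.max(dim=-1)

    # Tokens with the highest similarity to the target are the temporally
    # redundant ones; merge a fixed fraction of them.
    num_merge = int(frame_tokens.shape[0] * merge_ratio)
    src = best_sim.topk(num_merge).indices
    keep = torch.ones(frame_tokens.shape[0], dtype=torch.bool)
    keep[src] = False

    # Average each merged source token into its matched target token.
    pooled = target_tokens.clone()
    counts = torch.ones(target_tokens.shape[0], 1)
    pooled.index_add_(0, dst[src], frame_tokens[src])
    counts.index_add_(0, dst[src], torch.ones(num_merge, 1))
    pooled = pooled / counts

    # Self-attention then runs over the kept tokens plus the shared pool;
    # (src, dst, keep) lets merged tokens be copied back after attention.
    merged = torch.cat([frame_tokens[keep], pooled], dim=0)
    return merged, (src, dst[src], keep)

# Toy usage: 8 frames, 256 tokens of width 64 per frame.
frames = torch.randn(8, 256, 64)
merged, info = merge_frame_into_target(frames[1], frames[0])
print(merged.shape)  # torch.Size([384, 64]): 128 kept + 256 pooled target tokens
```

Because matched tokens across frames collapse into shared target tokens, attention over the merged set is computed once for content that repeats across frames, which is what yields both the temporal-coherence and memory benefits described above.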