
Investigating the Effectiveness of Cross-Attention to Unlock Zero-Shot Editing of Text-to-Video Diffusion Models (2404.05519v1)

Published 8 Apr 2024 in cs.CV and cs.LG

Abstract: With recent advances in image and video diffusion models for content creation, a plethora of techniques have been proposed for customizing their generated content. In particular, manipulating the cross-attention layers of Text-to-Image (T2I) diffusion models has shown great promise in controlling the shape and location of objects in the scene. Transferring image-editing techniques to the video domain, however, is extremely challenging as object motion and temporal consistency are difficult to capture accurately. In this work, we take a first look at the role of cross-attention in Text-to-Video (T2V) diffusion models for zero-shot video editing. While one-shot models have shown potential in controlling motion and camera movement, we demonstrate zero-shot control over object shape, position and movement in T2V models. We show that despite the limitations of current T2V models, cross-attention guidance can be a promising approach for editing videos.
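The abstract describes steering generation by manipulating cross-attention between latent video tokens and prompt tokens. The sketch below is a minimal, hypothetical illustration of one flavor of cross-attention guidance: an energy defined on a chosen prompt token's attention map penalizes attention mass outside a target spatial region, and its gradient with respect to the latent nudges the sample before the next denoising step. All module and function names (`ToyCrossAttention`, `layout_energy`), tensor shapes, and the guidance scale are illustrative assumptions, not the paper's implementation; a real T2V model would expose these maps inside the cross-attention layers of its denoising network.

```python
import torch

# Toy stand-in for one cross-attention layer of a diffusion backbone.
# In a real T2V model the queries come from spatio-temporal latent tokens
# and the keys/values from the text encoder's token embeddings.
class ToyCrossAttention(torch.nn.Module):
    def __init__(self, latent_dim=64, text_dim=64):
        super().__init__()
        self.to_q = torch.nn.Linear(latent_dim, latent_dim, bias=False)
        self.to_k = torch.nn.Linear(text_dim, latent_dim, bias=False)
        self.to_v = torch.nn.Linear(text_dim, latent_dim, bias=False)

    def forward(self, x, text):
        q, k, v = self.to_q(x), self.to_k(text), self.to_v(text)
        attn = torch.softmax(q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5, dim=-1)
        self.last_attn = attn  # (frames*H*W, n_text_tokens), kept for guidance
        return attn @ v


def layout_energy(attn, token_idx, target_mask):
    """Penalize attention mass of one prompt token that falls outside a target region."""
    token_attn = attn[..., token_idx]                    # attention of every latent position to the token
    token_attn = token_attn / (token_attn.sum() + 1e-8)  # normalize to a distribution over positions
    return ((1.0 - target_mask.flatten()) * token_attn).sum()


# --- one hypothetical guidance step on a random "latent video" ---
frames, h, w = 4, 8, 8
latents = torch.randn(frames * h * w, 64, requires_grad=True)
text_emb = torch.randn(6, 64)                 # 6 prompt tokens (toy embeddings)
target_mask = torch.zeros(frames, h, w)
target_mask[:, :, : w // 2] = 1.0             # ask the object to occupy the left half of every frame

layer = ToyCrossAttention()
_ = layer(latents, text_emb)                  # forward pass records the cross-attention map
energy = layout_energy(layer.last_attn, token_idx=2, target_mask=target_mask)
grad = torch.autograd.grad(energy, latents)[0]
guidance_scale = 10.0
latents_edited = latents - guidance_scale * grad  # steer the latent before the next denoising step
print(f"layout energy before guidance: {energy.item():.4f}")
```

In practice such a step would be interleaved with the sampler (applied to the noisy latent at selected timesteps rather than once), and the target mask could vary per frame, which is one way spatial control of this kind could extend to object movement as the abstract suggests.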
