Object-Centric Diffusion for Efficient Video Editing (2401.05735v3)
Abstract: Diffusion-based video editing has reached impressive quality and can transform the global style, local structure, or attributes of given video inputs, following textual edit prompts. However, such solutions typically incur heavy memory and computational costs to generate temporally coherent frames, either in the form of diffusion inversion and/or cross-frame attention. In this paper, we analyze such inefficiencies and suggest simple yet effective modifications that allow significant speed-ups while maintaining quality. Moreover, we introduce Object-Centric Diffusion to fix generation artifacts and further reduce latency by allocating more computation to foreground edited regions, which are arguably more important for perceptual quality. We achieve this through two novel proposals: i) Object-Centric Sampling, which decouples the diffusion steps spent on salient and background regions and spends most of them on the former, and ii) Object-Centric Token Merging, which reduces the cost of cross-frame attention by fusing redundant tokens in unimportant background regions. Both techniques are readily applicable to a given video editing model without retraining and can drastically reduce its memory and computational cost. We evaluate our proposals on inversion-based and control-signal-based editing pipelines, and show a latency reduction of up to 10x at comparable synthesis quality. Project page: qualcomm-ai-research.github.io/object-centric-diffusion.
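The two proposals lend themselves to compact sketches. First, Object-Centric Sampling: the salient (edited) region receives the full schedule of denoising steps, the background is updated on a sparser sub-schedule, and the two latents are re-blended with the saliency mask. The sketch below is a minimal illustration of that control flow, assuming a PyTorch setting; `denoise_step`, its toy dynamics, and the `bg_stride` parameter are hypothetical stand-ins for illustration, not the authors' sampler.

```python
import torch

def denoise_step(latent: torch.Tensor, t: int, total_steps: int) -> torch.Tensor:
    """Toy stand-in for one reverse-diffusion update (e.g. a DDIM step)."""
    return latent * (1.0 - 1.0 / (total_steps - t + 1))  # placeholder dynamics

def object_centric_sampling(latent, fg_mask, total_steps=50, bg_stride=5):
    """latent: (C, H, W) noisy latent; fg_mask: (1, H, W) saliency mask in {0, 1}."""
    fg, bg = latent.clone(), latent.clone()
    for t in range(total_steps):
        fg = denoise_step(fg, t, total_steps)   # every step spent on the foreground
        if t % bg_stride == 0:                  # only sparse steps on the background
            bg = denoise_step(bg, t, total_steps)
    return fg_mask * fg + (1 - fg_mask) * bg    # re-blend by saliency mask

latent = torch.randn(4, 64, 64)
mask = (torch.rand(1, 64, 64) > 0.7).float()
print(object_centric_sampling(latent, mask).shape)  # torch.Size([4, 64, 64])
```

Second, Object-Centric Token Merging: cross-frame attention is made cheaper by fusing redundant tokens, but only background tokens are candidates for fusion, so the edited object keeps full token resolution. The sketch below follows a bipartite-matching recipe in the spirit of token merging (ToMe; Bolya et al.) restricted to background tokens; the merge count `r` and the simple pairwise averaging rule are illustrative assumptions, not the paper's exact procedure.

```python
import torch
import torch.nn.functional as F

def object_centric_token_merge(tokens, fg_mask, r=8):
    """tokens: (N, D); fg_mask: (N,) bool, True for salient tokens; r: pairs to fuse."""
    bg = torch.nonzero(~fg_mask).squeeze(1)       # only background tokens may merge
    if bg.numel() < 2:
        return tokens                             # nothing to fuse
    a, b = bg[0::2], bg[1::2]                     # bipartite split, ToMe-style
    na, nb = F.normalize(tokens[a], dim=-1), F.normalize(tokens[b], dim=-1)
    sim = na @ nb.T                               # cosine similarity between the sets
    best_sim, best_b = sim.max(dim=1)             # most similar partner for each a-token
    idx = best_sim.topk(min(r, a.numel())).indices  # the r most redundant pairs
    src, dst = a[idx], b[best_b[idx]]             # (duplicate dst: last write wins)
    merged = tokens.clone()
    merged[dst] = 0.5 * (tokens[src] + tokens[dst])  # fuse each pair into its partner
    keep = torch.ones(tokens.shape[0], dtype=torch.bool)
    keep[src] = False                             # drop the fused-away source tokens
    return merged[keep]

tokens = torch.randn(256, 64)
fg = torch.zeros(256, dtype=torch.bool); fg[:64] = True  # first 64 tokens = object
print(object_centric_token_merge(tokens, fg, r=32).shape)  # torch.Size([224, 64])
```

Because both routines operate only on latents, masks, and token sets, they can in principle be dropped into an existing editing pipeline at inference time, consistent with the abstract's claim that no retraining is required.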
- Kumara Kahatapitiya
- Adil Karjauv
- Davide Abati
- Fatih Porikli
- Amirhossein Habibian
- Yuki M. Asano