Object-Centric Diffusion for Efficient Video Editing (2401.05735v3)
Abstract: Diffusion-based video editing has reached impressive quality and can transform the global style, local structure, or attributes of given video inputs, following textual edit prompts. However, such solutions typically incur heavy memory and computational costs to generate temporally coherent frames, either in the form of diffusion inversion and/or cross-frame attention. In this paper, we analyze such inefficiencies and suggest simple yet effective modifications that allow significant speed-ups while maintaining quality. Moreover, we introduce Object-Centric Diffusion to fix generation artifacts and further reduce latency by allocating more computation to foreground edited regions, which are arguably more important for perceptual quality. We achieve this through two novel proposals: i) Object-Centric Sampling, which decouples the diffusion steps spent on salient and background regions and spends most of them on the former, and ii) Object-Centric Token Merging, which reduces the cost of cross-frame attention by fusing redundant tokens in unimportant background regions. Both techniques are readily applicable to a given video editing model without retraining and can drastically reduce its memory and computational cost. We evaluate our proposals on inversion-based and control-signal-based editing pipelines, and show a latency reduction of up to 10x at comparable synthesis quality. Project page: qualcomm-ai-research.github.io/object-centric-diffusion.
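The two proposals lend themselves to compact sketches. First, Object-Centric Sampling: the salient (edited) region receives the full schedule of denoising steps, the background is updated on a sparser sub-schedule, and the two latents are re-blended with the saliency mask. The sketch below is a minimal illustration of that control flow, assuming a PyTorch setting; `denoise_step`, its toy dynamics, and the `bg_stride` parameter are hypothetical stand-ins for illustration, not the authors' sampler.

```python
import torch

def denoise_step(latent: torch.Tensor, t: int, total_steps: int) -> torch.Tensor:
    """Toy stand-in for one reverse-diffusion update (e.g. a DDIM step)."""
    return latent * (1.0 - 1.0 / (total_steps - t + 1))  # placeholder dynamics

def object_centric_sampling(latent, fg_mask, total_steps=50, bg_stride=5):
    """latent: (C, H, W) noisy latent; fg_mask: (1, H, W) saliency mask in {0, 1}."""
    fg, bg = latent.clone(), latent.clone()
    for t in range(total_steps):
        fg = denoise_step(fg, t, total_steps)   # every step spent on the foreground
        if t % bg_stride == 0:                  # only sparse steps on the background
            bg = denoise_step(bg, t, total_steps)
    return fg_mask * fg + (1 - fg_mask) * bg    # re-blend by saliency mask

latent = torch.randn(4, 64, 64)
mask = (torch.rand(1, 64, 64) > 0.7).float()
print(object_centric_sampling(latent, mask).shape)  # torch.Size([4, 64, 64])
```

Second, Object-Centric Token Merging: cross-frame attention is made cheaper by fusing redundant tokens, but only background tokens are candidates for fusion, so the edited object keeps full token resolution. The sketch below follows a bipartite-matching recipe in the spirit of token merging (ToMe; Bolya et al.) restricted to background tokens; the merge count `r` and the simple pairwise averaging rule are illustrative assumptions, not the paper's exact procedure.

```python
import torch
import torch.nn.functional as F

def object_centric_token_merge(tokens, fg_mask, r=8):
    """tokens: (N, D); fg_mask: (N,) bool, True for salient tokens; r: pairs to fuse."""
    bg = torch.nonzero(~fg_mask).squeeze(1)       # only background tokens may merge
    if bg.numel() < 2:
        return tokens                             # nothing to fuse
    a, b = bg[0::2], bg[1::2]                     # bipartite split, ToMe-style
    na, nb = F.normalize(tokens[a], dim=-1), F.normalize(tokens[b], dim=-1)
    sim = na @ nb.T                               # cosine similarity between the sets
    best_sim, best_b = sim.max(dim=1)             # most similar partner for each a-token
    idx = best_sim.topk(min(r, a.numel())).indices  # the r most redundant pairs
    src, dst = a[idx], b[best_b[idx]]             # (duplicate dst: last write wins)
    merged = tokens.clone()
    merged[dst] = 0.5 * (tokens[src] + tokens[dst])  # fuse each pair into its partner
    keep = torch.ones(tokens.shape[0], dtype=torch.bool)
    keep[src] = False                             # drop the fused-away source tokens
    return merged[keep]

tokens = torch.randn(256, 64)
fg = torch.zeros(256, dtype=torch.bool); fg[:64] = True  # first 64 tokens = object
print(object_centric_token_merge(tokens, fg, r=32).shape)  # torch.Size([224, 64])
```

Because both routines operate only on latents, masks, and token sets, they can in principle be dropped into an existing editing pipeline at inference time, consistent with the abstract's claim that no retraining is required.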
- Kumara Kahatapitiya
- Adil Karjauv
- Davide Abati
- Fatih Porikli
- Amirhossein Habibian
- Yuki M. Asano