- The paper presents a novel video synthesis approach, Boximator, that leverages hard and soft box constraints for precise motion control in generated videos.
- It integrates with existing video diffusion models via spatial attention blocks and employs self-tracking to align object bounding boxes with imposed constraints.
- Empirical evaluations demonstrate significant improvements in video quality and motion precision, with lower FVD and higher AP scores.
Overview of Boximator
The paper introduces a novel video synthesis approach called Boximator, which focuses on generating rich, controllable motion in videos. Designed to integrate with existing video diffusion models, Boximator employs two types of constraints, hard boxes and soft boxes, that let users control the position, shape, and movement of objects in a video with varying levels of precision.
Methodology
The Boximator framework functions as a plug-in inserted into the spatial attention blocks of a video diffusion model, leaving the base model's parameters frozen. Rather than relying on explicit textual instructions, Boximator learns to correlate box constraints with visual elements during training. Each box constraint is encoded from its coordinates, an object ID, and a hard/soft flag, and the resulting control tokens are processed by newly added self-attention layers.
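The constraint encoding described above can be sketched roughly as follows. This is an illustrative reconstruction, not the paper's implementation: the layer sizes, the module name `BoxEncoder`, and the use of learned embeddings for the object ID and hard/soft flag are assumptions; only the three input components (coordinates, object ID, hard/soft flag) come from the summary.

```python
import torch
import torch.nn as nn

class BoxEncoder(nn.Module):
    """Hypothetical sketch: fuse a box's coordinates, object ID, and
    hard/soft flag into one control token that the spatial self-attention
    layers can attend to. Dimensions are illustrative assumptions."""

    def __init__(self, dim: int = 320, max_objects: int = 16):
        super().__init__()
        self.coord_proj = nn.Linear(4, dim)          # (x1, y1, x2, y2), normalized to [0, 1]
        self.id_embed = nn.Embedding(max_objects, dim)  # one learned embedding per tracked object
        self.flag_embed = nn.Embedding(2, dim)          # 0 = hard box, 1 = soft box
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.SiLU(), nn.Linear(dim, dim))

    def forward(self, coords, obj_id, soft_flag):
        # Sum the three component embeddings, then refine with a small MLP.
        tok = (self.coord_proj(coords)
               + self.id_embed(obj_id)
               + self.flag_embed(soft_flag))
        return self.mlp(tok)  # one control token per box, per frame

enc = BoxEncoder()
coords = torch.tensor([[0.10, 0.20, 0.45, 0.80]])  # one normalized box
token = enc(coords, torch.tensor([3]), torch.tensor([1]))
print(token.shape)  # torch.Size([1, 320])
```

In a full model, one such token per constrained object per frame would be concatenated to the visual tokens inside each spatial attention block.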
Training Innovations
A key training technique introduced is self-tracking: during training, the model is taught to generate the bounding boxes of constrained objects in every frame and to keep them aligned with the imposed constraints. This auxiliary task simplifies learning and significantly improves the model's control precision. Although box generation is switched off at inference time, the model retains the alignment ability through the internal representation it learned.
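The joint objective implied by self-tracking can be sketched as a standard diffusion denoising loss plus an auxiliary box-prediction loss. This is a minimal illustration under stated assumptions: the `(noise, boxes)` model interface, the L1 box loss, and the `box_weight` knob are all hypothetical, not the paper's actual API or loss.

```python
import torch
import torch.nn.functional as F

def self_tracking_loss(model, noisy_latents, timesteps, text_emb, box_tokens,
                       target_noise, target_boxes, box_weight=1.0):
    """Hypothetical joint objective for self-tracking.

    Assumption: during training the model predicts both the denoising
    target and per-frame bounding boxes for each constrained object;
    at inference the box head is simply disabled.
    """
    pred_noise, pred_boxes = model(noisy_latents, timesteps, text_emb, box_tokens)
    denoise_loss = F.mse_loss(pred_noise, target_noise)   # standard diffusion objective
    track_loss = F.l1_loss(pred_boxes, target_boxes)      # align predicted boxes with constraints
    return denoise_loss + box_weight * track_loss

# Toy check with a stand-in model that predicts zeros everywhere.
dummy = lambda x, t, e, b: (torch.zeros_like(x), torch.zeros(2, 4))
x = torch.ones(2, 3)
loss = self_tracking_loss(dummy, x, None, None, None,
                          target_noise=torch.zeros(2, 3),
                          target_boxes=torch.ones(2, 4))
print(loss.item())  # 0.0 (denoise) + 1.0 (tracking) = 1.0
```

The auxiliary term gives the network a direct supervisory signal tying control tokens to object locations, which is why the alignment survives even after the box head is turned off.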
Results and Significance
Empirically, Boximator delivers robust, quantifiable gains over baseline models. Video quality improves markedly: Fréchet Video Distance (FVD) drops when box constraints are added (PixelDance: 237 to 174; ModelScope: 239 to 216). Bounding box alignment confirms its motion controllability, with significant boosts in average precision (AP). User studies corroborate these findings, revealing a strong preference for Boximator's outputs in both video quality and motion precision. Ablation studies underscore the crucial roles of soft boxes and self-tracking in realizing these outcomes. Through this research, Boximator stands out as a powerful tool in the generative AI space, offering fine-grained motion control while remaining compatible with rapidly evolving base video diffusion models.