- The paper presents a novel video synthesis approach, Boximator, that leverages hard and soft box constraints for precise motion control in generated videos.
- It integrates with existing video diffusion models via spatial attention blocks and employs self-tracking to align object bounding boxes with imposed constraints.
- Empirical evaluations demonstrate significant improvements in video quality and motion precision, with lower FVD and higher AP scores.
Overview of Boximator
The paper introduces a novel video synthesis approach called Boximator, which focuses on generating rich, controllable motion in videos. Designed to integrate with existing video diffusion models, Boximator employs two types of constraints, hard boxes and soft boxes, that let users control the position, shape, and movement of objects in a video with varying levels of precision.
Methodology
The Boximator framework functions as a plug-in inserted into the spatial attention blocks of a video diffusion model, leaving the base model's parameters frozen. Rather than relying on explicit textual instructions, Boximator learns to correlate box constraints with visual elements during training. Each box constraint is encoded from its coordinates, an object ID, and a hard/soft flag, and the resulting control tokens are processed by newly added self-attention layers.
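The constraint encoding described above can be sketched roughly as follows. This is an illustrative reconstruction, not the paper's implementation: the layer sizes, the module name `BoxEncoder`, and the use of learned embeddings for the object ID and hard/soft flag are assumptions; only the three input components (coordinates, object ID, hard/soft flag) come from the summary.

```python
import torch
import torch.nn as nn

class BoxEncoder(nn.Module):
    """Hypothetical sketch: fuse a box's coordinates, object ID, and
    hard/soft flag into one control token that the spatial self-attention
    layers can attend to. Dimensions are illustrative assumptions."""

    def __init__(self, dim: int = 320, max_objects: int = 16):
        super().__init__()
        self.coord_proj = nn.Linear(4, dim)          # (x1, y1, x2, y2), normalized to [0, 1]
        self.id_embed = nn.Embedding(max_objects, dim)  # one learned embedding per tracked object
        self.flag_embed = nn.Embedding(2, dim)          # 0 = hard box, 1 = soft box
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.SiLU(), nn.Linear(dim, dim))

    def forward(self, coords, obj_id, soft_flag):
        # Sum the three component embeddings, then refine with a small MLP.
        tok = (self.coord_proj(coords)
               + self.id_embed(obj_id)
               + self.flag_embed(soft_flag))
        return self.mlp(tok)  # one control token per box, per frame

enc = BoxEncoder()
coords = torch.tensor([[0.10, 0.20, 0.45, 0.80]])  # one normalized box
token = enc(coords, torch.tensor([3]), torch.tensor([1]))
print(token.shape)  # torch.Size([1, 320])
```

In a full model, one such token per constrained object per frame would be concatenated to the visual tokens inside each spatial attention block.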
Training Innovations
A key training technique introduced is self-tracking: during training, the model is taught to generate the bounding boxes of constrained objects in every frame and to keep them aligned with the imposed constraints. This auxiliary task simplifies learning and significantly improves the model's control precision. Although box generation is switched off at inference time, the model retains the alignment ability through the internal representation it learned.
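The joint objective implied by self-tracking can be sketched as a standard diffusion denoising loss plus an auxiliary box-prediction loss. This is a minimal illustration under stated assumptions: the `(noise, boxes)` model interface, the L1 box loss, and the `box_weight` knob are all hypothetical, not the paper's actual API or loss.

```python
import torch
import torch.nn.functional as F

def self_tracking_loss(model, noisy_latents, timesteps, text_emb, box_tokens,
                       target_noise, target_boxes, box_weight=1.0):
    """Hypothetical joint objective for self-tracking.

    Assumption: during training the model predicts both the denoising
    target and per-frame bounding boxes for each constrained object;
    at inference the box head is simply disabled.
    """
    pred_noise, pred_boxes = model(noisy_latents, timesteps, text_emb, box_tokens)
    denoise_loss = F.mse_loss(pred_noise, target_noise)   # standard diffusion objective
    track_loss = F.l1_loss(pred_boxes, target_boxes)      # align predicted boxes with constraints
    return denoise_loss + box_weight * track_loss

# Toy check with a stand-in model that predicts zeros everywhere.
dummy = lambda x, t, e, b: (torch.zeros_like(x), torch.zeros(2, 4))
x = torch.ones(2, 3)
loss = self_tracking_loss(dummy, x, None, None, None,
                          target_noise=torch.zeros(2, 3),
                          target_boxes=torch.ones(2, 4))
print(loss.item())  # 0.0 (denoise) + 1.0 (tracking) = 1.0
```

The auxiliary term gives the network a direct supervisory signal tying control tokens to object locations, which is why the alignment survives even after the box head is turned off.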
Results and Significance
Empirically, Boximator delivers robust, quantifiable gains over baseline models. Video quality improves markedly: Fréchet Video Distance (FVD) drops when box constraints are added (PixelDance: 237 to 174; ModelScope: 239 to 216). Bounding box alignment confirms its motion controllability, with significant boosts in average precision (AP). User studies corroborate these findings, revealing a strong preference for Boximator's outputs in both video quality and motion precision. Ablation studies underscore the crucial roles of soft boxes and self-tracking in realizing these outcomes. Through this research, Boximator stands out as a powerful tool in the generative AI space, offering fine-grained motion control while remaining compatible with rapidly evolving base video diffusion models.