Emergent Mind

Boximator: Generating Rich and Controllable Motions for Video Synthesis

(2402.01566)
Published Feb 2, 2024 in cs.CV and cs.AI

Abstract

Generating rich and controllable motion is a pivotal challenge in video synthesis. We propose Boximator, a new approach for fine-grained motion control. Boximator introduces two constraint types: hard box and soft box. Users select objects in the conditional frame using hard boxes and then use either type of box to roughly or rigorously define the object's position, shape, or motion path in future frames. Boximator functions as a plug-in for existing video diffusion models. Its training process preserves the base model's knowledge by freezing the original weights and training only the control module. To address training challenges, we introduce a novel self-tracking technique that greatly simplifies the learning of box-object correlations. Empirically, Boximator achieves state-of-the-art video quality (FVD) scores, improving on two base models, and these scores are further enhanced after incorporating box constraints. Its robust motion controllability is validated by drastic increases in the bounding box alignment metric. Human evaluation also shows that users favor Boximator generation results over the base model.

Boximator controls motion in generated videos, including animal movement and object trajectories, using constraint boxes.

Overview

  • The paper introduces Boximator, a method for generating videos with controllable and dynamic motion using hard and soft boxes for precision.

  • Boximator is designed as a plug-in for video diffusion models, using spatial attention blocks and novel training strategies without altering the base model parameters.

  • The self-tracking training concept helps align generated object-bounding boxes with constraints, improving predictive precision.

  • Boximator significantly outperforms baseline models, improving Frechet Video Distance (FVD) and average precision (AP) scores, as confirmed by user studies.

  • Ablation studies highlight the importance of soft boxes and self-tracking in achieving high-quality, controllable motion synthesis.

Overview of Boximator

The paper introduces a novel video synthesis approach called Boximator, which focuses on rendering controllable and dynamic motion in generated videos. Designed to integrate with existing video diffusion models, Boximator employs two distinctive types of constraints known as hard boxes and soft boxes. These constraints allow users to select and influence the position, shape, and movement of objects within videos with varying levels of precision.

Methodology

The Boximator framework distinguishes itself by functioning as a plug-in, inserted into the spatial attention blocks of video diffusion models. To control motion without explicit textual instructions, Boximator learns to correlate box constraints with visual elements during training, leaving the original diffusion model's parameters unaltered. The mechanism encodes each box constraint as a combination of coordinates, an object ID, and a hard/soft flag, and the resulting control tokens are processed by self-attention layers trained with novel strategies.
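The constraint encoding described above can be sketched as follows. This is a hypothetical illustration, not the paper's implementation: the embedding dimensions, the MLP fusion, and the layer choices are all assumptions; only the three ingredients (coordinates, object ID, hard/soft flag) come from the paper.

```python
import torch
import torch.nn as nn

class BoxConstraintEncoder(nn.Module):
    """Illustrative sketch of encoding one box constraint as a control token.

    The paper fuses box coordinates, an object ID, and a hard/soft flag
    into tokens consumed by self-attention layers inside the frozen
    model's spatial attention blocks. All dimensions here are assumed.
    """

    def __init__(self, dim=128, num_objects=16):
        super().__init__()
        self.coord_proj = nn.Linear(4, dim)         # normalized (x1, y1, x2, y2)
        self.obj_embed = nn.Embedding(num_objects, dim)
        self.flag_embed = nn.Embedding(2, dim)      # 0 = soft box, 1 = hard box
        self.fuse = nn.Sequential(                  # assumed fusion MLP
            nn.Linear(3 * dim, dim), nn.GELU(), nn.Linear(dim, dim)
        )

    def forward(self, coords, obj_id, hard_flag):
        parts = torch.cat(
            [self.coord_proj(coords),
             self.obj_embed(obj_id),
             self.flag_embed(hard_flag)], dim=-1)
        return self.fuse(parts)                     # one control token per box

# Usage: encode a single hard-box constraint for object 3.
enc = BoxConstraintEncoder()
coords = torch.tensor([[0.1, 0.2, 0.5, 0.8]])       # box in [0, 1] coordinates
token = enc(coords, torch.tensor([3]), torch.tensor([1]))
print(token.shape)  # torch.Size([1, 128])
```

In a full model, tokens like this would be concatenated with (or attended to by) the visual tokens inside each spatial attention block, which is what lets the frozen base model be steered without weight changes.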

Training Innovations

A key training innovation is self-tracking: the model is concurrently trained to generate the bounding boxes of controlled objects and to align them with the imposed constraints throughout the video frames. This effectively simplifies the learning task and significantly enhances the model's predictive precision. Although visible bounding-box generation is disabled after training, the model retains the alignment capability through the robust internal representation it developed.
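The self-tracking idea above can be sketched as a simple auxiliary loss. This is a simplified assumption of the objective, not the paper's exact formulation: the L1 penalty, the fixed soft-box slack, and the tensor shapes are all illustrative, capturing only the core idea that hard boxes are matched exactly while soft boxes permit tolerance.

```python
import torch

def self_tracking_loss(pred_boxes, target_boxes, hard_mask, soft_slack=0.1):
    """Hypothetical self-tracking objective (shapes and slack are assumed).

    pred_boxes, target_boxes: (frames, objects, 4) normalized coordinates,
    where pred_boxes are boxes the model itself generates during training.
    hard_mask: (frames, objects) bool, True where the constraint is a hard box.
    Hard boxes are penalized for any deviation; soft boxes only once the
    prediction strays beyond the tolerance band.
    """
    err = (pred_boxes - target_boxes).abs()
    soft_err = torch.clamp(err - soft_slack, min=0.0)   # free inside the band
    per_coord = torch.where(hard_mask.unsqueeze(-1), err, soft_err)
    return per_coord.mean()

# Usage: two frames, one object; frame 0 has a hard box, frame 1 a soft box.
pred = torch.zeros(2, 1, 4)
target = torch.full((2, 1, 4), 0.05)
hard = torch.tensor([[True], [False]])
loss = self_tracking_loss(pred, target, hard)
```

Because the soft-box deviation (0.05) sits inside the assumed tolerance, only the hard-box frame contributes to the loss, which is exactly the asymmetry that lets users trade precision for flexibility per frame.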

Results and Significance

Empirically, Boximator delivers robust and quantifiable enhancements over baseline models. It shows marked improvements in video quality, as evidenced by Fréchet Video Distance (FVD) scores that improve further when box constraints are supplied (PixelDance: 237 to 174, ModelScope: 239 to 216). The bounding box alignment metric further confirms its motion controllability, with significant boosts in average precision (AP) scores. User studies corroborate these findings, revealing a strong preference for Boximator's outputs in terms of both video quality and motion precision. Ablation studies underscore the crucial roles of soft boxes and self-tracking in realizing these outcomes. Through this research, Boximator stands out as a powerful tool in the generative AI space, offering fine-grained motion control while retaining compatibility with rapidly evolving base video diffusion models.
