Adding Multi-modal Controls to Whole-body Human Motion Generation

(2407.21136)
Published Jul 30, 2024 in cs.CV

Abstract

Whole-body multi-modal motion generation, controlled by text, speech, or music, has numerous applications including video generation and character animation. However, employing a unified model to accomplish various generation tasks with different condition modalities presents two main challenges: motion distribution drifts across different generation scenarios and the complex optimization of mixed conditions with varying granularity. Furthermore, inconsistent motion formats in existing datasets further hinder effective multi-modal motion generation. In this paper, we propose ControlMM, a unified framework to Control whole-body Multi-modal Motion generation in a plug-and-play manner. To effectively learn and transfer motion knowledge across different motion distributions, we propose ControlMM-Attn, for parallel modeling of static and dynamic human topology graphs. To handle conditions with varying granularity, ControlMM employs a coarse-to-fine training strategy, including stage-1 text-to-motion pre-training for semantic generation and stage-2 multi-modal control adaptation for conditions of varying low-level granularity. To address existing benchmarks' varying motion format limitations, we introduce ControlMM-Bench, the first publicly available multi-modal whole-body human motion generation benchmark based on the unified whole-body SMPL-X format. Extensive experiments show that ControlMM achieves state-of-the-art performance across various standard motion generation tasks. Our website is at https://yxbian23.github.io/ControlMM.

ControlMM: A two-stage transformer-based diffusion model for cross-scenario motion learning with advanced attention mechanisms.

Overview

  • The paper presents ControlMM, a unified framework aimed at generating whole-body human motions under multi-modal control conditions, achieving superior performance in tasks like text-to-motion, speech-to-gesture, and music-to-dance generation.

  • To address challenges such as motion distribution drift and the complex optimization of mixed conditions, ControlMM employs a novel attention mechanism (ControlMM-Attn) and a two-stage coarse-to-fine training strategy, ensuring flexibility and robustness across different control signals.

  • The introduction of ControlMM-Bench provides a standardized benchmark for evaluating multi-modal whole-body human motion generation using the SMPL-X format, facilitating more accurate and consistent assessments of model performance.

Adding Multi-modal Controls to Whole-body Human Motion Generation

The paper "Adding Multi-modal Controls to Whole-body Human Motion Generation" introduces ControlMM, a novel unified framework designed to generate whole-body human motions under multi-modal control conditions. This framework addresses key challenges in the domain by enabling robust generalization across different generation scenarios and efficient optimization for control signals with varying levels of granularity. ControlMM achieves state-of-the-art performance in several fundamental tasks, including text-to-motion, speech-to-gesture, and music-to-dance generation.

Problem Statement and Contributions

The paper identifies two central challenges in multi-modal, whole-body motion generation:

  1. Motion Distribution Drift: Different control signals (e.g., text, speech, music) lead to substantially varying motion distributions, complicating the transfer of motion knowledge across different scenarios.
  2. Optimization Under Mixed Conditions: Mixed control signals with varying granularities (high-level semantic versus low-level frame-wise instructions) complicate the learning process, leading to optimization issues.

To tackle these challenges, the authors introduce ControlMM, a unified framework designed to control whole-body multi-modal motion generation in a plug-and-play manner. Its notable contributions are:

  1. ControlMM-Attn: A novel attention mechanism that models both static and dynamic human topology graphs in parallel.
  2. Two-stage Coarse-to-fine Training Strategy: The framework initially undergoes a coarse-grained text-to-motion pre-training phase followed by fine-grained multi-modal control adaptation.
  3. ControlMM-Bench: The first publicly available multi-modal whole-body human motion generation benchmark based on the unified SMPL-X format, facilitating the evaluation of various generative tasks on a consistent format.

Methodology

ControlMM Framework

The ControlMM framework uses a dual-phase training approach to handle control-signal diversity effectively (a minimal sketch of the two stages follows the list below):

  • Stage 1: Text-to-Motion Semantic Pre-training: In this phase, the model employs text as a universal conditioning modality. The backbone model, a transformer-based diffusion model with ControlMM-Attn, learns the spatial properties of human motion across various datasets.
  • Stage 2: Multi-modal Low-level Control Adaptation: After freezing the pre-trained backbone model, the framework introduces separate branches for different control modalities (e.g., speech and music). These branches include plug-and-play capabilities for low-level controls, allowing the model to specialize in various scenarios without introducing optimization confusion.
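
As a rough illustration of this schedule, the PyTorch-style sketch below pre-trains a backbone on text conditioning and then freezes it before optimizing a modality-specific branch. All module names (MotionBackbone, ControlBranch) and dimensions are hypothetical stand-ins, not the paper's released code:

```python
import torch
import torch.nn as nn

class MotionBackbone(nn.Module):
    """Stand-in for the transformer-based diffusion backbone with ControlMM-Attn."""
    def __init__(self, motion_dim=212, text_dim=512, hidden=512):  # placeholder dims
        super().__init__()
        self.motion_proj = nn.Linear(motion_dim, hidden)
        self.text_proj = nn.Linear(text_dim, hidden)
        layer = nn.TransformerEncoderLayer(hidden, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=8)
        self.out = nn.Linear(hidden, motion_dim)

    def forward(self, noisy_motion, text_emb, branch_feats=None):
        # noisy_motion: (B, T, motion_dim); text_emb: (B, text_dim)
        h = self.motion_proj(noisy_motion) + self.text_proj(text_emb).unsqueeze(1)
        if branch_feats is not None:       # stage-2 low-level control features
            h = h + branch_feats
        return self.out(self.blocks(h))    # denoised motion prediction

class ControlBranch(nn.Module):
    """Plug-and-play branch mapping a frame-wise signal (speech/music) to features."""
    def __init__(self, cond_dim=128, hidden=512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(cond_dim, hidden), nn.GELU(),
                                 nn.Linear(hidden, hidden))

    def forward(self, cond):               # cond: (B, T, cond_dim)
        return self.net(cond)

# Stage 1: text-to-motion semantic pre-training -- the backbone is trainable.
backbone = MotionBackbone()
stage1_opt = torch.optim.AdamW(backbone.parameters(), lr=1e-4)

# Stage 2: freeze the backbone; only the new modality-specific branch is optimized.
for p in backbone.parameters():
    p.requires_grad_(False)
speech_branch = ControlBranch(cond_dim=128)
stage2_opt = torch.optim.AdamW(speech_branch.parameters(), lr=1e-4)
```

Keeping the backbone frozen in stage 2 is what allows each control branch to be attached or detached without disturbing the semantic knowledge learned during pre-training.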

ControlMM-Attn Architecture

ControlMM-Attn, the core innovation of the framework, specializes in parallel modeling of static and dynamic human topology graphs (a simplified sketch follows the list below):

  • Static-skeleton Graph Learner: Initializes with a static adjacency matrix capturing the fundamental human skeletal structure, which accelerates convergence and enhances generalization.
  • Dynamic-topology Relationship Graph Learner: Uses dynamic, input-dependent adjacency matrices, allowing the model to adapt to specific control signals and motion configurations.
  • Temporal Attention Module: Captures frame-wise dependencies, enabling sequentially consistent motion generation aligned with the textual, auditory, or musical control signals.
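
A highly simplified sketch of such a parallel static/dynamic graph layer is shown below, assuming joints are represented as per-frame node features; all names and shapes are illustrative rather than the authors' implementation:

```python
import torch
import torch.nn as nn

class ParallelGraphAttention(nn.Module):
    """Illustrative layer combining a fixed skeletal graph with an
    input-dependent (dynamic) adjacency, plus temporal self-attention."""
    def __init__(self, feat_dim, skeleton_adj):
        super().__init__()
        # Static-skeleton graph: fixed adjacency from the kinematic tree.
        self.register_buffer("static_adj", skeleton_adj)   # (J, J)
        self.static_proj = nn.Linear(feat_dim, feat_dim)
        # Dynamic-topology graph: adjacency predicted from the input itself.
        self.q = nn.Linear(feat_dim, feat_dim)
        self.k = nn.Linear(feat_dim, feat_dim)
        self.dynamic_proj = nn.Linear(feat_dim, feat_dim)
        # Temporal attention over frames for each joint.
        self.temporal = nn.MultiheadAttention(feat_dim, num_heads=4,
                                              batch_first=True)

    def forward(self, x):
        # x: (batch, frames, joints, feat_dim)
        B, T, J, D = x.shape
        # Static branch: propagate features along the fixed skeleton.
        static_out = torch.einsum("ij,btjd->btid", self.static_adj,
                                  self.static_proj(x))
        # Dynamic branch: data-dependent joint-joint affinities per frame.
        attn = torch.softmax(
            torch.einsum("btid,btjd->btij", self.q(x), self.k(x)) / D ** 0.5,
            dim=-1)
        dynamic_out = torch.einsum("btij,btjd->btid", attn,
                                   self.dynamic_proj(x))
        h = x + static_out + dynamic_out    # parallel fusion (residual)
        # Temporal attention: treat each joint's frame sequence independently.
        h_t = h.permute(0, 2, 1, 3).reshape(B * J, T, D)
        h_t, _ = self.temporal(h_t, h_t, h_t)
        return h + h_t.reshape(B, J, T, D).permute(0, 2, 1, 3)

# Example usage with a 22-joint skeleton and 64-dim joint features.
# (An identity matrix stands in for a real kinematic-tree adjacency.)
layer = ParallelGraphAttention(feat_dim=64, skeleton_adj=torch.eye(22))
out = layer(torch.randn(2, 30, 22, 64))    # (batch=2, frames=30, joints=22, dim=64)
```

In this reading, the static adjacency encodes the fixed human skeletal structure, while the dynamic adjacency is recomputed for every input, which is what lets the layer adapt to different control signals and motion configurations.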

ControlMM-Bench

ControlMM-Bench standardizes the data representations of various motion generation tasks into the unified SMPL-X format. This consistency allows for more accurate comparisons and assessments of the generated motions’ quality, diversity, and control fidelity.
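
As a rough illustration of what such a unified sample looks like, the per-frame parameter groups below follow the public SMPL-X model layout (the field names and sizes are standard SMPL-X conventions, not a schema specified by ControlMM-Bench; eye poses are omitted for brevity):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class SMPLXFrame:
    """One motion frame in the SMPL-X parameterization (axis-angle, radians)."""
    transl: np.ndarray          # (3,)   global root translation
    global_orient: np.ndarray   # (3,)   global root rotation
    body_pose: np.ndarray       # (63,)  21 body joints x 3
    left_hand_pose: np.ndarray  # (45,)  15 finger joints x 3
    right_hand_pose: np.ndarray # (45,)  15 finger joints x 3
    jaw_pose: np.ndarray        # (3,)   jaw articulation
    expression: np.ndarray      # (10,)  facial expression coefficients
    betas: np.ndarray           # (10,)  body shape, shared across a sequence

def to_vector(frame: SMPLXFrame) -> np.ndarray:
    """Flatten a frame into a single feature vector for a motion model."""
    return np.concatenate([
        frame.transl, frame.global_orient, frame.body_pose,
        frame.left_hand_pose, frame.right_hand_pose,
        frame.jaw_pose, frame.expression, frame.betas,
    ])
```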

Experimental Evaluation

The paper presents extensive evaluations across multiple scenarios:

  • Text-to-Motion: ControlMM achieves superior semantic relevance, fidelity, and diversity compared to existing methods, with robust performance improvements on the HumanML3D and ControlMM-Bench datasets.
  • Speech-to-Gesture: The model demonstrates significant improvements in both quality and diversity, following the speech rhythm and generating contextually appropriate gestures and facial expressions.
  • Music-to-Dance: ControlMM effectively aligns generated dances with the music beats, producing diverse and natural dance sequences.

Ablation Studies

The ablation studies validate several critical aspects of the model design:

  • The combination of static and dynamic topology graphs substantially improves performance, verifying the necessity of parallel modeling.
  • Freezing the body-wise encoder and decoder during the second training stage helps retain generalizable motion topology knowledge.
  • Model performance scales with size, but the benefits reach a plateau without corresponding increases in data quality and quantity.

Conclusion

ControlMM sets a new benchmark in whole-body human motion generation with its multi-modal control capabilities. By addressing the key challenges of motion distribution drift and granular optimization, ControlMM offers a robust, generalizable solution for generating natural human motions across various scenarios. ControlMM-Bench further contributes to the field by providing a unified evaluation standard, setting the stage for future research and development in multi-modal human motion generation.
