Adding Multi-modal Controls to Whole-body Human Motion Generation

(2407.21136)
Published Jul 30, 2024 in cs.CV

Abstract

Whole-body multi-modal motion generation, controlled by text, speech, or music, has numerous applications including video generation and character animation. However, employing a unified model to accomplish various generation tasks with different condition modalities presents two main challenges: motion distribution drifts across different generation scenarios and the complex optimization of mixed conditions with varying granularity. Furthermore, inconsistent motion formats in existing datasets further hinder effective multi-modal motion generation. In this paper, we propose ControlMM, a unified framework to Control whole-body Multi-modal Motion generation in a plug-and-play manner. To effectively learn and transfer motion knowledge across different motion distributions, we propose ControlMM-Attn, for parallel modeling of static and dynamic human topology graphs. To handle conditions with varying granularity, ControlMM employs a coarse-to-fine training strategy, including stage-1 text-to-motion pre-training for semantic generation and stage-2 multi-modal control adaptation for conditions of varying low-level granularity. To address existing benchmarks' varying motion format limitations, we introduce ControlMM-Bench, the first publicly available multi-modal whole-body human motion generation benchmark based on the unified whole-body SMPL-X format. Extensive experiments show that ControlMM achieves state-of-the-art performance across various standard motion generation tasks. Our website is at https://yxbian23.github.io/ControlMM.

ControlMM: A two-stage transformer-based diffusion model for cross-scenario motion learning with advanced attention mechanisms.

Overview

  • The paper presents ControlMM, a unified framework aimed at generating whole-body human motions under multi-modal control conditions, achieving superior performance in tasks like text-to-motion, speech-to-gesture, and music-to-dance generation.

  • To address challenges such as motion distribution drift and the complex optimization of mixed conditions, ControlMM employs a novel attention mechanism (ControlMM-Attn) and a two-stage coarse-to-fine training strategy, ensuring flexibility and robustness across different control signals.

  • The introduction of ControlMM-Bench provides a standardized benchmark for evaluating multi-modal whole-body human motion generation using the SMPL-X format, facilitating more accurate and consistent assessments of model performance.

Adding Multi-modal Controls to Whole-body Human Motion Generation

The paper "Adding Multi-modal Controls to Whole-body Human Motion Generation" introduces ControlMM, a novel unified framework designed to generate whole-body human motions under multi-modal control conditions. This framework addresses key challenges in the domain by enabling robust generalization across different generation scenarios and efficient optimization for control signals with varying levels of granularity. ControlMM achieves state-of-the-art performance in several fundamental tasks, including text-to-motion, speech-to-gesture, and music-to-dance generation.

Problem Statement and Contributions

The paper identifies two central challenges in multi-modal, whole-body motion generation:

  1. Motion Distribution Drift: Different control signals (e.g., text, speech, music) lead to substantially varying motion distributions, complicating the transfer of motion knowledge across different scenarios.
  2. Optimization Under Mixed Conditions: Mixed control signals with varying granularities (high-level semantic versus low-level frame-wise instructions) complicate the learning process, leading to optimization issues.

To tackle these challenges, the authors introduce ControlMM, a unified framework designed to control whole-body multi-modal motion generation in a plug-and-play manner. Its notable contributions are:

  1. ControlMM-Attn: A novel attention mechanism that models both static and dynamic human topology graphs in parallel.
  2. Two-stage Coarse-to-fine Training Strategy: The framework initially undergoes a coarse-grained text-to-motion pre-training phase followed by fine-grained multi-modal control adaptation.
  3. ControlMM-Bench: The first publicly available multi-modal whole-body human motion generation benchmark based on the unified SMPL-X format, facilitating the evaluation of various generative tasks on a consistent format.

Methodology

ControlMM Framework

The ControlMM framework uses a dual-phase training approach to handle control-signal diversity effectively (a minimal sketch of the two stages follows the list below):

  • Stage 1: Text-to-Motion Semantic Pre-training: In this phase, the model employs text as a universal conditioning modality. The backbone model, a transformer-based diffusion model with ControlMM-Attn, learns the spatial properties of human motion across various datasets.
  • Stage 2: Multi-modal Low-level Control Adaptation: After freezing the pre-trained backbone model, the framework introduces separate branches for different control modalities (e.g., speech and music). These branches include plug-and-play capabilities for low-level controls, allowing the model to specialize in various scenarios without introducing optimization confusion.
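
As a rough illustration of this schedule, the PyTorch-style sketch below pre-trains a backbone on text conditioning and then freezes it before optimizing a modality-specific branch. All module names (MotionBackbone, ControlBranch) and dimensions are hypothetical stand-ins, not the paper's released code:

```python
import torch
import torch.nn as nn

class MotionBackbone(nn.Module):
    """Stand-in for the transformer-based diffusion backbone with ControlMM-Attn."""
    def __init__(self, motion_dim=212, text_dim=512, hidden=512):  # placeholder dims
        super().__init__()
        self.motion_proj = nn.Linear(motion_dim, hidden)
        self.text_proj = nn.Linear(text_dim, hidden)
        layer = nn.TransformerEncoderLayer(hidden, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=8)
        self.out = nn.Linear(hidden, motion_dim)

    def forward(self, noisy_motion, text_emb, branch_feats=None):
        # noisy_motion: (B, T, motion_dim); text_emb: (B, text_dim)
        h = self.motion_proj(noisy_motion) + self.text_proj(text_emb).unsqueeze(1)
        if branch_feats is not None:       # stage-2 low-level control features
            h = h + branch_feats
        return self.out(self.blocks(h))    # denoised motion prediction

class ControlBranch(nn.Module):
    """Plug-and-play branch mapping a frame-wise signal (speech/music) to features."""
    def __init__(self, cond_dim=128, hidden=512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(cond_dim, hidden), nn.GELU(),
                                 nn.Linear(hidden, hidden))

    def forward(self, cond):               # cond: (B, T, cond_dim)
        return self.net(cond)

# Stage 1: text-to-motion semantic pre-training -- the backbone is trainable.
backbone = MotionBackbone()
stage1_opt = torch.optim.AdamW(backbone.parameters(), lr=1e-4)

# Stage 2: freeze the backbone; only the new modality-specific branch is optimized.
for p in backbone.parameters():
    p.requires_grad_(False)
speech_branch = ControlBranch(cond_dim=128)
stage2_opt = torch.optim.AdamW(speech_branch.parameters(), lr=1e-4)
```

Keeping the backbone frozen in stage 2 is what allows each control branch to be attached or detached without disturbing the semantic knowledge learned during pre-training.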

ControlMM-Attn Architecture

ControlMM-Attn, the core innovation of the framework, specializes in parallel modeling of static and dynamic human topology graphs (a simplified sketch follows the list below):

  • Static-skeleton Graph Learner: Initializes with a static adjacency matrix capturing the fundamental human skeletal structure, which accelerates convergence and enhances generalization.
  • Dynamic-topology Relationship Graph Learner: Uses dynamic, input-dependent adjacency matrices, allowing the model to adapt to specific control signals and motion configurations.
  • Temporal Attention Module: Captures frame-wise dependencies, enabling sequentially consistent motion generation aligned with the textual, auditory, or musical control signals.
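
A highly simplified sketch of such a parallel static/dynamic graph layer is shown below, assuming joints are represented as per-frame node features; all names and shapes are illustrative rather than the authors' implementation:

```python
import torch
import torch.nn as nn

class ParallelGraphAttention(nn.Module):
    """Illustrative layer combining a fixed skeletal graph with an
    input-dependent (dynamic) adjacency, plus temporal self-attention."""
    def __init__(self, feat_dim, skeleton_adj):
        super().__init__()
        # Static-skeleton graph: fixed adjacency from the kinematic tree.
        self.register_buffer("static_adj", skeleton_adj)   # (J, J)
        self.static_proj = nn.Linear(feat_dim, feat_dim)
        # Dynamic-topology graph: adjacency predicted from the input itself.
        self.q = nn.Linear(feat_dim, feat_dim)
        self.k = nn.Linear(feat_dim, feat_dim)
        self.dynamic_proj = nn.Linear(feat_dim, feat_dim)
        # Temporal attention over frames for each joint.
        self.temporal = nn.MultiheadAttention(feat_dim, num_heads=4,
                                              batch_first=True)

    def forward(self, x):
        # x: (batch, frames, joints, feat_dim)
        B, T, J, D = x.shape
        # Static branch: propagate features along the fixed skeleton.
        static_out = torch.einsum("ij,btjd->btid", self.static_adj,
                                  self.static_proj(x))
        # Dynamic branch: data-dependent joint-joint affinities per frame.
        attn = torch.softmax(
            torch.einsum("btid,btjd->btij", self.q(x), self.k(x)) / D ** 0.5,
            dim=-1)
        dynamic_out = torch.einsum("btij,btjd->btid", attn,
                                   self.dynamic_proj(x))
        h = x + static_out + dynamic_out    # parallel fusion (residual)
        # Temporal attention: treat each joint's frame sequence independently.
        h_t = h.permute(0, 2, 1, 3).reshape(B * J, T, D)
        h_t, _ = self.temporal(h_t, h_t, h_t)
        return h + h_t.reshape(B, J, T, D).permute(0, 2, 1, 3)

# Example usage with a 22-joint skeleton and 64-dim joint features.
# (An identity matrix stands in for a real kinematic-tree adjacency.)
layer = ParallelGraphAttention(feat_dim=64, skeleton_adj=torch.eye(22))
out = layer(torch.randn(2, 30, 22, 64))    # (batch=2, frames=30, joints=22, dim=64)
```

In this reading, the static adjacency encodes the fixed human skeletal structure, while the dynamic adjacency is recomputed for every input, which is what lets the layer adapt to different control signals and motion configurations.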

ControlMM-Bench

ControlMM-Bench standardizes the data representations of various motion generation tasks into the unified SMPL-X format. This consistency allows for more accurate comparisons and assessments of the generated motions’ quality, diversity, and control fidelity.
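
As a rough illustration of what such a unified sample looks like, the per-frame parameter groups below follow the public SMPL-X model layout (the field names and sizes are standard SMPL-X conventions, not a schema specified by ControlMM-Bench; eye poses are omitted for brevity):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class SMPLXFrame:
    """One motion frame in the SMPL-X parameterization (axis-angle, radians)."""
    transl: np.ndarray          # (3,)   global root translation
    global_orient: np.ndarray   # (3,)   global root rotation
    body_pose: np.ndarray       # (63,)  21 body joints x 3
    left_hand_pose: np.ndarray  # (45,)  15 finger joints x 3
    right_hand_pose: np.ndarray # (45,)  15 finger joints x 3
    jaw_pose: np.ndarray        # (3,)   jaw articulation
    expression: np.ndarray      # (10,)  facial expression coefficients
    betas: np.ndarray           # (10,)  body shape, shared across a sequence

def to_vector(frame: SMPLXFrame) -> np.ndarray:
    """Flatten a frame into a single feature vector for a motion model."""
    return np.concatenate([
        frame.transl, frame.global_orient, frame.body_pose,
        frame.left_hand_pose, frame.right_hand_pose,
        frame.jaw_pose, frame.expression, frame.betas,
    ])
```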

Experimental Evaluation

The paper presents extensive evaluations across multiple scenarios:

  • Text-to-Motion: ControlMM achieves superior semantic relevance, fidelity, and diversity compared to existing methods, with robust performance improvements on the HumanML3D and ControlMM-Bench datasets.
  • Speech-to-Gesture: The model demonstrates significant improvements in both quality and diversity, following the speech rhythm and generating contextually appropriate gestures and facial expressions.
  • Music-to-Dance: ControlMM effectively aligns generated dances with the music beats, producing diverse and natural dance sequences.

Ablation Studies

The ablation studies validate several critical aspects of the model design:

  • The combination of static and dynamic topology graphs substantially improves performance, verifying the necessity of parallel modeling.
  • Freezing the body-wise encoder and decoder during the second training stage helps retain generalizable motion topology knowledge.
  • Model performance scales with size, but the benefits reach a plateau without corresponding increases in data quality and quantity.

Conclusion

ControlMM sets a new benchmark in whole-body human motion generation with its multi-modal control capabilities. By addressing the key challenges of motion distribution drift and granular optimization, ControlMM offers a robust, generalizable solution for generating natural human motions across various scenarios. ControlMM-Bench further contributes to the field by providing a unified evaluation standard, setting the stage for future research and development in multi-modal human motion generation.
