
Multi-Track Timeline Control for Text-Driven 3D Human Motion Generation

(2401.08559)
Published Jan 16, 2024 in cs.CV, cs.GR, and cs.LG

Abstract

Recent advances in generative modeling have led to promising progress on synthesizing 3D human motion from text, with methods that can generate character animations from short prompts and specified durations. However, using a single text prompt as input lacks the fine-grained control needed by animators, such as composing multiple actions and defining precise durations for parts of the motion. To address this, we introduce the new problem of timeline control for text-driven motion synthesis, which provides an intuitive, yet fine-grained, input interface for users. Instead of a single prompt, users can specify a multi-track timeline of multiple prompts organized in temporal intervals that may overlap. This enables specifying the exact timings of each action and composing multiple actions in sequence or at overlapping intervals. To generate composite animations from a multi-track timeline, we propose a new test-time denoising method. This method can be integrated with any pre-trained motion diffusion model to synthesize realistic motions that accurately reflect the timeline. At every step of denoising, our method processes each timeline interval (text prompt) individually, subsequently aggregating the predictions with consideration for the specific body parts engaged in each action. Experimental comparisons and ablations validate that our method produces realistic motions that respect the semantics and timing of given text prompts. Our code and models are publicly available at https://mathis.petrovich.fr/stmc.

The framework generalizes single-prompt text-to-motion synthesis and temporal/spatial composition to multi-track timeline control over complex actions.

Overview

  • The paper presents a novel framework for text-driven 3D human motion synthesis that gives animators fine-grained control over the resulting animation.

  • It addresses the lack of fine-grained control when creating complex 3D animations from textual instructions.

  • The research introduces a new test-time denoising method that integrates with any pre-trained motion diffusion model and produces seamless transitions between actions.

  • Experimental results show that this framework outperforms existing methods in creating realistic and text-aligned motion sequences.

  • The code and models have been made publicly available, making it easier to create intricate, multi-action 3D animations.

Introduction

This article examines a recent innovation in the field of 3D human motion synthesis driven by textual descriptions. Traditional approaches for synthesizing human motion from text have made notable progress, but they often lack the intricate control desired by animators and content creators. Addressing this limitation, the authors introduce a novel framework tailored for more granular and multifaceted control over the animation process.

The Problem of Fine-Grained Control

Creating 3D animations that follow specific text instructions usually involves a single text prompt governing the entire motion. This setup falls short when multiple actions must be composed or when specific timings are required for different parts of the motion. The proposed method lifts this constraint by accepting a multi-track timeline of text prompts, in which intervals can overlap or follow one another in sequence, as sketched below.
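
As a concrete illustration, the following sketch shows one plausible way to represent such a timeline as a list of prompt intervals. The class and field names are hypothetical stand-ins for illustration, not the authors' actual data structures.

```python
# A minimal sketch of a multi-track timeline: each entry pairs a text
# prompt with the frame interval it should cover. The names here
# (TimelineInterval, start_frame, end_frame) are illustrative only.
from dataclasses import dataclass

@dataclass
class TimelineInterval:
    prompt: str       # text describing the action
    start_frame: int  # first frame covered by the prompt (inclusive)
    end_frame: int    # frame at which the prompt stops applying (exclusive)

# Two overlapping tracks: a lower-body action and an upper-body action.
timeline = [
    TimelineInterval("walk in a circle", start_frame=0, end_frame=120),
    TimelineInterval("wave with the right hand", start_frame=60, end_frame=100),
]
```

Because the second interval overlaps the first, the generator must combine both actions on the shared frames, which is exactly what the denoising procedure described next handles.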

Test-Time Denoising Method

To generate animations from the multi-track timeline, the researchers propose a new test-time denoising method that can be combined with any pre-trained motion diffusion model. At each denoising step, the method processes each text prompt individually and then aggregates the predicted motions, taking into account the body parts involved in each action segment. This technique is particularly adept at ensuring seamless transitions between actions in both space and time.
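
The sketch below illustrates the general shape of one such step, under stated assumptions: `model` stands in for a pre-trained motion diffusion denoiser and `body_part_mask` for a lookup from a prompt to the body-part features it engages. Both are hypothetical, and the simple masked averaging is a simplified stand-in for the paper's spatio-temporal stitching.

```python
# A minimal sketch of one timeline-controlled denoising step, reusing the
# TimelineInterval structure above. `model(x_t, t, prompt)` is assumed to
# predict the clean motion for a noisy crop, and `body_part_mask(prompt)`
# to return a 0/1 vector over motion features; both are hypothetical.
import torch

def denoise_step(x_t, t, timeline, model, body_part_mask):
    # x_t: noisy motion of shape (num_frames, num_features).
    pred = torch.zeros_like(x_t)
    weight = torch.zeros_like(x_t)
    for interval in timeline:
        s, e = interval.start_frame, interval.end_frame
        # Denoise the crop covered by this prompt on its own.
        crop_pred = model(x_t[s:e], t, interval.prompt)
        # Keep only the features of the body parts this action engages.
        mask = body_part_mask(interval.prompt)  # shape (num_features,)
        pred[s:e] += crop_pred * mask
        weight[s:e] += mask
    # Average wherever prompts overlap on the same frames and body parts.
    return pred / weight.clamp(min=1.0)
```

Repeating this aggregation at every diffusion step keeps overlapping predictions consistent, so smooth transitions emerge from the denoising process itself rather than from post-hoc blending.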

Results and Availability

Extensive experiments demonstrate that this method outperforms established baselines in generating realistic and textually aligned motion sequences. As a notable side contribution, the authors also adapt motion diffusion models to support the SMPL body representation, streamlining the synthesis process. For those interested in further exploration or application, the code and models are publicly accessible online.

The ability to intricately control and synchronize multi-action 3D human animations through text opens up new possibilities for animators and further democratizes content creation. With this advancement, animations are not only more accessible but also richer and more nuanced than ever before.
