
Multi-Track Timeline Control for Text-Driven 3D Human Motion Generation

(2401.08559)
Published Jan 16, 2024 in cs.CV, cs.GR, and cs.LG

Abstract

Recent advances in generative modeling have led to promising progress on synthesizing 3D human motion from text, with methods that can generate character animations from short prompts and specified durations. However, using a single text prompt as input lacks the fine-grained control needed by animators, such as composing multiple actions and defining precise durations for parts of the motion. To address this, we introduce the new problem of timeline control for text-driven motion synthesis, which provides an intuitive, yet fine-grained, input interface for users. Instead of a single prompt, users can specify a multi-track timeline of multiple prompts organized in temporal intervals that may overlap. This enables specifying the exact timings of each action and composing multiple actions in sequence or at overlapping intervals. To generate composite animations from a multi-track timeline, we propose a new test-time denoising method. This method can be integrated with any pre-trained motion diffusion model to synthesize realistic motions that accurately reflect the timeline. At every step of denoising, our method processes each timeline interval (text prompt) individually, subsequently aggregating the predictions with consideration for the specific body parts engaged in each action. Experimental comparisons and ablations validate that our method produces realistic motions that respect the semantics and timing of given text prompts. Our code and models are publicly available at https://mathis.petrovich.fr/stmc.

The framework generalizes single-prompt text-to-motion synthesis and temporal/spatial composition to multi-track timeline control over complex actions.

Overview

  • The paper presents a novel framework for text-driven 3D human motion synthesis that gives animators fine-grained control over the resulting animation.

  • It addresses the lack of fine-grained control when creating complex 3D animations from textual instructions.

  • The research introduces a new test-time denoising method that integrates with any pre-trained motion diffusion model and produces seamless transitions between actions.

  • Experimental results show that this framework outperforms existing methods in creating realistic and text-aligned motion sequences.

  • The code and models have been made publicly available, making it easier to create intricate, multi-action 3D animations.

Introduction

This article examines a recent innovation in the field of 3D human motion synthesis driven by textual descriptions. Traditional approaches for synthesizing human motion from text have made notable progress, but they often lack the intricate control desired by animators and content creators. Addressing this limitation, the authors introduce a novel framework tailored for more granular and multifaceted control over the animation process.

The Problem of Fine-Grained Control

Creating 3D animations that follow specific text instructions usually involves a single text prompt governing the entire motion. This setup falls short when multiple actions must be composed or when specific timings are required for different parts of the motion. The proposed method lifts this constraint by accepting a multi-track timeline of text prompts, in which intervals can overlap or follow one another in sequence, as sketched below.
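
As a concrete illustration, the following sketch shows one plausible way to represent such a timeline as a list of prompt intervals. The class and field names are hypothetical stand-ins for illustration, not the authors' actual data structures.

```python
# A minimal sketch of a multi-track timeline: each entry pairs a text
# prompt with the frame interval it should cover. The names here
# (TimelineInterval, start_frame, end_frame) are illustrative only.
from dataclasses import dataclass

@dataclass
class TimelineInterval:
    prompt: str       # text describing the action
    start_frame: int  # first frame covered by the prompt (inclusive)
    end_frame: int    # frame at which the prompt stops applying (exclusive)

# Two overlapping tracks: a lower-body action and an upper-body action.
timeline = [
    TimelineInterval("walk in a circle", start_frame=0, end_frame=120),
    TimelineInterval("wave with the right hand", start_frame=60, end_frame=100),
]
```

Because the second interval overlaps the first, the generator must combine both actions on the shared frames, which is exactly what the denoising procedure described next handles.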

Test-Time Denoising Method

To generate animations from the multi-track timeline, the researchers propose a new test-time denoising method that can be combined with any pre-trained motion diffusion model. At each denoising step, the method processes each text prompt individually and then aggregates the predicted motions, taking into account the body parts involved in each action segment. This technique is particularly adept at ensuring seamless transitions between actions in both space and time.
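
The sketch below illustrates the general shape of one such step, under stated assumptions: `model` stands in for a pre-trained motion diffusion denoiser and `body_part_mask` for a lookup from a prompt to the body-part features it engages. Both are hypothetical, and the simple masked averaging is a simplified stand-in for the paper's spatio-temporal stitching.

```python
# A minimal sketch of one timeline-controlled denoising step, reusing the
# TimelineInterval structure above. `model(x_t, t, prompt)` is assumed to
# predict the clean motion for a noisy crop, and `body_part_mask(prompt)`
# to return a 0/1 vector over motion features; both are hypothetical.
import torch

def denoise_step(x_t, t, timeline, model, body_part_mask):
    # x_t: noisy motion of shape (num_frames, num_features).
    pred = torch.zeros_like(x_t)
    weight = torch.zeros_like(x_t)
    for interval in timeline:
        s, e = interval.start_frame, interval.end_frame
        # Denoise the crop covered by this prompt on its own.
        crop_pred = model(x_t[s:e], t, interval.prompt)
        # Keep only the features of the body parts this action engages.
        mask = body_part_mask(interval.prompt)  # shape (num_features,)
        pred[s:e] += crop_pred * mask
        weight[s:e] += mask
    # Average wherever prompts overlap on the same frames and body parts.
    return pred / weight.clamp(min=1.0)
```

Repeating this aggregation at every diffusion step keeps overlapping predictions consistent, so smooth transitions emerge from the denoising process itself rather than from post-hoc blending.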

Results and Availability

Extensive experiments demonstrate that this method outperforms established baselines in generating realistic and textually aligned motion sequences. As a notable side contribution, the authors also adapt motion diffusion models to support the SMPL body representation, streamlining the synthesis process. For those interested in further exploration or application, the code and models are publicly accessible online.

The ability to intricately control and synchronize multi-action 3D human animations through text opens up new possibilities for animators and further democratizes content creation. With this advancement, animations are not only more accessible but also richer and more nuanced than ever before.
