Emergent Mind

InterDreamer: Zero-Shot Text to 3D Dynamic Human-Object Interaction

(2403.19652)
Published Mar 28, 2024 in cs.CV and cs.AI

Abstract

Text-conditioned human motion generation has experienced significant advancements with diffusion models trained on extensive motion capture data and corresponding textual annotations. However, extending such success to 3D dynamic human-object interaction (HOI) generation faces notable challenges, primarily due to the lack of large-scale interaction data and comprehensive descriptions that align with these interactions. This paper takes the initiative and showcases the potential of generating human-object interactions without direct training on text-interaction pair data. Our key insight in achieving this is that interaction semantics and dynamics can be decoupled. Being unable to learn interaction semantics through supervised training, we instead leverage pre-trained large models, synergizing knowledge from a large language model and a text-to-motion model. While such knowledge offers high-level control over interaction semantics, it cannot grasp the intricacies of low-level interaction dynamics. To overcome this issue, we further introduce a world model designed to comprehend simple physics, modeling how human actions influence object motion. By integrating these components, our novel framework, InterDreamer, is able to generate text-aligned 3D HOI sequences in a zero-shot manner. We apply InterDreamer to the BEHAVE and CHAIRS datasets, and our comprehensive experimental analysis demonstrates its capability to generate realistic and coherent interaction sequences that seamlessly align with the text directives.

A framework in which LLMs analyze textual descriptions and guide a model that translates text into human actions.

Overview

  • InterDreamer is a novel framework aimed at generating 3D dynamic human-object interaction sequences from textual descriptions, utilizing pre-trained LLMs and a world model.

  • It employs semantic analysis to extract interaction goals from text descriptions and uses a text-to-motion model and an interaction retrieval model for initial human and object state generation.

  • A novel world model predicts future states of objects influenced by human interaction, enabling realistic interaction dynamics.

  • InterDreamer outperforms baseline methods at zero-shot text-to-HOI generation on the BEHAVE and CHAIRS datasets, indicating its potential for applications such as virtual reality and animation.

InterDreamer: Zero-Shot Text to 3D Dynamic Human-Object Interaction Generation Framework

Introduction

The synthesis of human motion conditioned on textual descriptions has made strides with diffusion models trained on extensive motion capture datasets paired with textual annotations. Despite this advancement, the leap towards generating 3D dynamic human-object interactions (HOIs) from textual inputs remains a challenge, primarily due to the scarcity of large-scale interaction data fully annotated with detailed descriptions. "InterDreamer" is a novel framework designed to generate text-aligned 3D HOI sequences. By decoupling interaction semantics from dynamics, it combines pre-trained LLMs with a world model that comprehends simple physics, enabling the generation of realistic interactions without direct training on text-to-interaction paired data.

Methodology

High-Level Planning

InterDreamer begins with semantic analysis of a textual description using LLMs to extract high-level interaction goals, including the targeted object and the nature of the interaction. This process reformulates the textual description to close the distributional gap between free-form text and the inputs the downstream models expect, so that the generated human motion and object interaction align more closely with the textual guidance.
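The planning stage above can be sketched as a prompt-and-parse loop. The prompt wording, JSON schema, and the example LLM reply below are illustrative assumptions, not the paper's actual prompts:

```python
import json

def build_planning_prompt(description: str) -> str:
    """Compose a prompt asking an LLM to extract high-level interaction goals."""
    return (
        "Extract the target object and the action from the description below.\n"
        'Answer only with JSON of the form {"object": ..., "action": ...}.\n\n'
        f"Description: {description}"
    )

def parse_plan(llm_reply: str) -> dict:
    """Parse the LLM's structured reply into an interaction goal."""
    plan = json.loads(llm_reply)
    return {"object": plan["object"], "action": plan["action"]}

# The reply below is a fabricated example of what the LLM might return.
reply = '{"object": "chair", "action": "lift"}'
goal = parse_plan(reply)  # → {"object": "chair", "action": "lift"}
```

The extracted goal then conditions the text-to-motion and retrieval components described next.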

Low-Level Control

For generating initial human motion and object interaction states, InterDreamer integrates a text-to-motion model and an interaction retrieval model. These components process semantic information derived from LLMs to produce initial human poses and object states that are semantically aligned with the target interaction text, setting the stage for dynamic interaction rollouts.
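One plausible way to realize the retrieval component is nearest-neighbor lookup in a text-embedding space over a database of captured interaction states. The data layout and cosine-similarity criterion here are assumptions for illustration, not the paper's specific retrieval model:

```python
import numpy as np

def retrieve_initial_state(query_emb, database):
    """Return the state whose text embedding is most similar to the query.

    `database` is a list of (embedding, state) pairs, where each `state`
    holds an initial human pose and object placement from captured data.
    """
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    best_emb, best_state = max(database, key=lambda entry: cosine(query_emb, entry[0]))
    return best_state

# Toy usage with 2-D stand-in embeddings.
db = [
    (np.array([1.0, 0.0]), {"pose": "sit", "object": "chair"}),
    (np.array([0.0, 1.0]), {"pose": "lift", "object": "box"}),
]
state = retrieve_initial_state(np.array([0.9, 0.1]), db)  # closest to "sit"
```

The retrieved state seeds the rollout; the text-to-motion model independently supplies the human motion conditioned on the LLM-refined description.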

World Model

InterDreamer introduces a novel world model that, using dynamics decoupled from semantics and learned from motion capture data, predicts future object states under the influence of human interaction. Notably, the interaction dynamics are steered by vertex-level control over sampled contact regions on the human body, allowing the dynamics model to forecast object motion by focusing on the areas of actual human-object contact.

Experimental Results

InterDreamer is extensively evaluated on the BEHAVE and CHAIRS datasets, demonstrating its ability to generate coherent and realistic interaction sequences that closely follow the textual directives. The framework's efficacy in zero-shot text-to-HOI generation is compared against various baselines, showing substantial improvements in capturing the nuances of realistic interactions.

Implications and Future Directions

InterDreamer represents a leap towards intuitive and expressive methods for generating dynamic 3D human-object interactions directly from textual descriptions. Its innovative approach to decoupling interaction semantics and dynamics has broader implications for the development of more generalized and robust AI systems capable of understanding and interacting with the physical world in a human-like manner. The framework opens avenues for future research into more complex interactions, the integration of multi-modal data, and the exploration of advanced training strategies to further enhance the quality and diversity of generated interactions.

In conclusion, InterDreamer sets a new precedent for text-guided human-object interaction generation, paving the way for advancements in interactive applications, virtual reality, and animation, among other fields, by enabling more natural and intuitive creation of complex interactive scenes directly from textual descriptions. Its successful leveraging of existing models and novel world model architecture for zero-shot learning showcases the untapped potential of AI in understanding and mimicking complex real-world interactions.
