Controllable Human-Object Interaction Synthesis

(2312.03913)
Published Dec 6, 2023 in cs.CV

Abstract

Synthesizing semantic-aware, long-horizon, human-object interaction is critical to simulate realistic human behaviors. In this work, we address the challenging problem of generating synchronized object motion and human motion guided by language descriptions in 3D scenes. We propose Controllable Human-Object Interaction Synthesis (CHOIS), an approach that generates object motion and human motion simultaneously using a conditional diffusion model given a language description, initial object and human states, and sparse object waypoints. While language descriptions inform style and intent, waypoints ground the motion in the scene and can be effectively extracted using high-level planning methods. Naively applying a diffusion model fails to predict object motion aligned with the input waypoints and cannot ensure the realism of interactions that require precise hand-object contact and appropriate contact grounded by the floor. To overcome these problems, we introduce an object geometry loss as additional supervision to improve the matching between generated object motion and input object waypoints. In addition, we design guidance terms to enforce contact constraints during the sampling process of the trained diffusion model.

Overview

  • The paper introduces Controllable Human-Object Interaction Synthesis (CHOIS) for generating synchronized human and object motion from natural language in 3D scenes.

  • CHOIS uses a conditional diffusion model conditioned on a language description, initial human and object states, and sparse object waypoints to guide realistic interactions.

  • Key innovations are an object geometry loss that aligns generated object motion with the input waypoints, and guidance terms that enforce plausible human-object contact during sampling.

  • The technique employs a transformer-based denoising network, with guidance functions applied at inference to improve hand-object contact realism.

  • CHOIS synthesizes long-horizon, environment-aware human-object interactions and outperforms adapted baselines.

Introduction

Synthesizing realistic human behaviors that interact with objects in 3D environments is pivotal for applications spanning computer graphics, AI, and robotics. This work addresses the complex problem of generating simultaneous human and object motion from natural language descriptions while respecting constraints imposed by initial states and environmental geometry.

Human-Object Interaction Synthesis

This approach, termed Controllable Human-Object Interaction Synthesis (CHOIS), couples human and object motion through a conditional diffusion model. The model is conditioned on a language description that conveys the intent and style of the interaction, the initial states of the human and the object, and a set of sparse object waypoints that steer the motion through the scene. While language specifies the action to perform, the waypoints anchor that action spatially. A key component of CHOIS is an object geometry loss, which refines the generated object motion to follow the input waypoints more accurately. In addition, guidance terms are applied during the sampling process of the trained diffusion model, enforcing realistic contact between the human and the object and plausible interaction amid environmental clutter.
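To make the object geometry loss concrete, the following PyTorch sketch supervises object surface points transformed by the predicted per-frame rotation and translation against their ground-truth counterparts. The tensor shapes, the L1 penalty, and the names (`pred_rot`, `pred_trans`, `rest_points`) are illustrative assumptions rather than the authors' exact implementation.

```python
import torch

def object_geometry_loss(pred_rot, pred_trans, gt_rot, gt_trans, rest_points):
    """Illustrative object geometry loss (assumed form, not the paper's exact code).

    pred_rot / gt_rot:     (T, 3, 3) per-frame object rotation matrices
    pred_trans / gt_trans: (T, 3)    per-frame object translations
    rest_points:           (P, 3)    points sampled on the object's rest-pose surface
    """
    # Pose the rest-pose surface points with the predicted and ground-truth
    # transforms: for each frame t, rotate every point p, then translate.
    pred_pts = torch.einsum('tij,pj->tpi', pred_rot, rest_points) + pred_trans[:, None, :]
    gt_pts = torch.einsum('tij,pj->tpi', gt_rot, rest_points) + gt_trans[:, None, :]

    # Penalize per-point deviation, averaged over frames and points.
    return (pred_pts - gt_pts).abs().mean()
```

Because the loss is computed on posed surface points, rotation and translation errors are penalized jointly in the same space as the waypoints, which is plausibly why it tightens the match between generated object motion and the input waypoints.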

Technique Details

CHOIS encodes object geometry using a Basis Point Set (BPS) representation and combines it with a masked motion condition vector that encodes the initial states and, at selected frames, 2D/3D object waypoint positions. A transformer-based denoising network consumes these conditions together with the noised motion sequence and predicts synchronized human and object motion. To bolster hand-object contact realism, guidance functions are applied during the sampling phase: gradients of contact-based cost terms perturb the denoised predictions, avoiding additional training-time loss functions that are typically expensive to compute and difficult to balance.
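This guidance mechanism can be read as classifier-guidance-style sampling: at each denoising step, the gradient of a differentiable contact cost with respect to the noisy sample nudges the prediction toward satisfying the constraint, with no retraining. The sketch below assumes a denoiser that predicts the clean motion and a hypothetical feature layout in which certain channels hold hand-joint and object positions; CHOIS's exact guidance terms and weighting differ.

```python
import torch

# Hypothetical feature layout for illustration: per-frame channels 0-5 hold two
# 3D hand-joint positions and channels 6-8 hold the object translation.
HAND_SLICE = slice(0, 6)
OBJ_SLICE = slice(6, 9)

def contact_cost(x0_pred):
    """Assumed contact term: mean hand-to-object distance over the sequence."""
    hands = x0_pred[..., HAND_SLICE].reshape(*x0_pred.shape[:-1], 2, 3)
    obj = x0_pred[..., OBJ_SLICE].unsqueeze(-2)  # broadcast against both hands
    return (hands - obj).norm(dim=-1).mean()

def guided_denoise_step(model, x_t, t, cond, guidance_weight=1.0):
    """One denoising step with gradient guidance (a sketch, not CHOIS's sampler)."""
    with torch.enable_grad():
        x_t = x_t.detach().requires_grad_(True)
        x0_pred = model(x_t, t, cond)  # denoiser predicts the clean motion
        grad = torch.autograd.grad(contact_cost(x0_pred), x_t)[0]
    # Pull the prediction toward low contact cost before sampling x_{t-1} from it.
    return (x0_pred - guidance_weight * grad).detach()
```

Because the cost only touches the sampler, it can be added, removed, or reweighted at inference time, which matches the paper's point that guidance avoids extra training-time losses that are expensive to compute and hard to balance.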

Evaluation and Applications

Assessed on datasets featuring diverse human-object interactions, CHOIS outperforms adapted baselines and generates realistic actions from textual descriptions across objects of varying sizes. An ablation study highlights the contribution of the guidance terms to contact accuracy and motion fidelity. Integrated into a larger pipeline, CHOIS synthesizes continuous, long-horizon, environment-aware human-object interactions from language inputs and full 3D scenes. Applications of the method demonstrate that it adheres to language prompts, adapts to different objects, handles sparse waypoints, and navigates environments effectively.

In summary, CHOIS represents a significant step forward in generating dynamic human-object interactions in virtual environments, offering a promising tool for building systems that emulate human actions and decision-making with accuracy and context awareness.
