THOR: Text to Human-Object Interaction Diffusion via Relation Intervention (2403.11208v1)

Published 17 Mar 2024 in cs.CV

Abstract: This paper addresses new methodologies to deal with the challenging task of generating dynamic Human-Object Interactions from textual descriptions (Text2HOI). While most existing works assume interactions with limited body parts or static objects, our task involves addressing the variation in human motion, the diversity of object shapes, and the semantic vagueness of object motion simultaneously. To tackle this, we propose a novel Text-guided Human-Object Interaction diffusion model with Relation Intervention (THOR). THOR is a cohesive diffusion model equipped with a relation intervention mechanism. In each diffusion step, we initiate text-guided human and object motion and then leverage human-object relations to intervene in object motion. This intervention enhances the spatial-temporal relations between humans and objects, with human-centric interaction representation providing additional guidance for synthesizing consistent motion from text. To achieve more reasonable and realistic results, interaction losses are introduced at different levels of motion granularity. Moreover, we construct Text-BEHAVE, a Text2HOI dataset that seamlessly integrates textual descriptions with the currently largest publicly available 3D HOI dataset. Both quantitative and qualitative experiments demonstrate the effectiveness of our proposed model.

Summary

The paper proposes THOR, a diffusion model that enhances HOI generation by refining object motion through relation intervention.
It features separate pathways for rotation and translation to accurately model human-object kinematics in a spatial-temporal context.
Empirical evaluations on the enriched Text-BEHAVE dataset demonstrate improved interaction realism over baseline models.

Exploring Text to Human-Object Interaction Diffusion with Relation Intervention

Introduction

Generating dynamic Human-Object Interactions (HOI) from textual descriptions, a task termed Text2HOI, is challenging because it requires modeling human motion, object motion, and the way the two interact in a shared space. The paper introduces THOR (Text-guided Human-Object Interaction diffusion model with Relation Intervention). THOR's core idea is to refine generated object motion through human-object relation intervention, which strengthens the spatial-temporal relations needed for realistic interaction synthesis.

Proposition of THOR

At the heart of THOR is a single diffusion model equipped with a relation intervention mechanism. Generating object motion directly from text is ambiguous, because a description rarely specifies how the object should move; the abstract refers to this as the semantic vagueness of object motion. THOR addresses the problem in each diffusion step: it first produces text-guided human and object motion, then uses human-object relations to intervene in the object motion so that the object stays consistent with the human-centric context of the interaction.
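The two-stage step described above can be sketched roughly as follows. This is only an illustration of the control flow, not the authors' implementation: `predict_initial`, `intervene`, the toy stand-ins, and the array shapes are all hypothetical placeholders for the learned networks.

```python
import numpy as np

def intervention_step(noisy_human, noisy_object, text_emb, predict_initial, intervene):
    """One denoising step: text-guided initial estimate, then object refinement.

    predict_initial and intervene are hypothetical callables standing in for
    learned networks; nothing here reproduces the paper's architecture.
    """
    # Stage 1: text-guided estimate of both human and object motion.
    human_init, object_init = predict_initial(noisy_human, noisy_object, text_emb)
    # Stage 2: use the human-object relation to correct the object motion only.
    object_refined = intervene(human_init, object_init)
    return human_init, object_refined

def toy_predict(noisy_human, noisy_object, text_emb):
    # Toy stand-in for a denoiser: shrink the noisy inputs toward zero.
    return 0.5 * noisy_human, 0.5 * noisy_object

def toy_intervene(human, obj):
    # Toy stand-in for the intervention: pull the object translation
    # halfway toward the human root (assumed to be the first 3 channels).
    return obj + 0.5 * (human[:, :3] - obj)

# Usage: 4 frames, 6 human channels, 3 object translation channels.
human, obj = intervention_step(
    np.ones((4, 6)), np.zeros((4, 3)), None, toy_predict, toy_intervene
)
```

The point of the structure is that the human motion passes through unchanged in the second stage, while the object motion is corrected using information from the human estimate.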

Strategic Implementation of THOR

Human-Object Relation Intervention: THOR models human-object kinematic relations and handles rotation and translation through separate intervention pathways. Keeping the two pathways separate preserves the distinct nature of each transformation and allows richer, more context-aware motion generation.
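One plausible way to picture a human-centric relation with separate rotation and translation components is to express the object pose in the human root frame. The sketch below assumes rotation matrices and a single root joint; the function names are hypothetical, and the exact representation used in the paper is not specified in this summary.

```python
import numpy as np

def to_human_frame(R_h, t_h, R_o, t_o):
    """Express an object pose (R_o, t_o) relative to a human root pose (R_h, t_h)."""
    R_rel = R_h.T @ R_o          # relative rotation, handled on its own
    t_rel = R_h.T @ (t_o - t_h)  # relative translation, handled on its own
    return R_rel, t_rel

def from_human_frame(R_h, t_h, R_rel, t_rel):
    """Invert to_human_frame: recover the world-frame object pose."""
    R_o = R_h @ R_rel
    t_o = R_h @ t_rel + t_h
    return R_o, t_o

def rot_z(theta):
    """Rotation matrix for an angle theta (radians) about the z axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

# Usage: a human rotated 0.3 rad and an object rotated 1.0 rad about z.
R_h, t_h = rot_z(0.3), np.array([1.0, 2.0, 0.0])
R_o, t_o = rot_z(1.0), np.array([2.0, 2.0, 1.0])
R_rel, t_rel = to_human_frame(R_h, t_h, R_o, t_o)
```

Because the rotation and translation components are computed independently, a model could in principle correct each with its own pathway before mapping back to world coordinates with `from_human_frame`.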

Multi-level Interaction Supervision: To keep generated interactions realistic, THOR applies supervision at several levels of motion granularity. It introduces interaction losses that capture both the kinematic relations and the geometric distance between humans and objects, encouraging interactions that are diverse yet physically plausible.
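As a rough illustration of what losses at two levels of granularity might look like, the sketch below pairs a coarse relation term (object center relative to the human root) with a finer geometric term (joint-to-object-point distances). The function names, tensor shapes, and the choice of L1 and squared-error penalties are assumptions made for this example, not the paper's definitions.

```python
import numpy as np

def pairwise_distances(joints, obj_points):
    """Distances between every joint and every object point, per frame.

    joints: (T, J, 3), obj_points: (T, P, 3) -> result: (T, J, P)
    """
    diff = joints[:, :, None, :] - obj_points[:, None, :, :]
    return np.linalg.norm(diff, axis=-1)

def distance_loss(pred_joints, pred_obj, gt_joints, gt_obj):
    """Fine level: mean absolute error between predicted and true distance maps."""
    d_pred = pairwise_distances(pred_joints, pred_obj)
    d_gt = pairwise_distances(gt_joints, gt_obj)
    return float(np.mean(np.abs(d_pred - d_gt)))

def relation_loss(pred_root, pred_center, gt_root, gt_center):
    """Coarse level: squared error on the object offset from the human root.

    All inputs have shape (T, 3).
    """
    rel_pred = pred_center - pred_root
    rel_gt = gt_center - gt_root
    return float(np.mean(np.sum((rel_pred - rel_gt) ** 2, axis=-1)))

def interaction_loss(coarse, fine, w_coarse=1.0, w_fine=1.0):
    """Weighted sum of the two terms; the weights are hypothetical."""
    return w_coarse * coarse + w_fine * fine
```

Both terms are zero when the prediction reproduces the ground-truth spatial relationship, even if absolute positions were to differ by a shared offset, which is one reason relation-based supervision can complement a plain reconstruction loss.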

Text-BEHAVE Dataset

To support training and evaluation, the authors construct Text-BEHAVE, which adds textual descriptions to the currently largest publicly available 3D HOI dataset. The resulting dataset reflects the complexity and diversity of human-object interactions and serves as a benchmark for Text2HOI tasks.

Empirical Evaluations and Future Perspectives

THOR outperforms baseline models in both quantitative and qualitative experiments. Its generated interactions are reported to be diverse and plausible while remaining consistent with the textual prompts.

The research highlights certain limitations, such as the handling of intricate object shapes and the generation of long-term interactions, setting the stage for future explorations. Potential avenues include enriching datasets with more comprehensive HOI sequences and incorporating fine-grained control over generated interactions. Furthermore, the robust treatment of dexterous hand motions presents an exciting frontier for enhancing the verisimilitude of generated human-object interactions.

On a Concluding Note

The THOR framework marks a significant stride in the text-guided synthesis of human-object interactions. Through its novel intervention mechanism and dedicated focus on relational dynamics, it presents a compelling solution to the nuanced challenge of Text2HOI. As the field advances, the insights and methodologies proposed by this research hold promise for fostering more interactive, intuitive, and immersive human-computer interaction paradigms.