GIMO: Gaze-Informed Human Motion Prediction in Context

Published 20 Apr 2022 in cs.CV | (2204.09443v2)

Abstract: Predicting human motion is critical for assistive robots and AR/VR applications, where the interaction with humans needs to be safe and comfortable. Meanwhile, an accurate prediction depends on understanding both the scene context and human intentions. Even though many works study scene-aware human motion prediction, the latter is largely underexplored due to the lack of ego-centric views that disclose human intent and the limited diversity in motion and scenes. To reduce the gap, we propose a large-scale human motion dataset that delivers high-quality body pose sequences, scene scans, as well as ego-centric views with the eye gaze that serves as a surrogate for inferring human intent. By employing inertial sensors for motion capture, our data collection is not tied to specific scenes, which further boosts the motion dynamics observed from our subjects. We perform an extensive study of the benefits of leveraging the eye gaze for ego-centric human motion prediction with various state-of-the-art architectures. Moreover, to realize the full potential of the gaze, we propose a novel network architecture that enables bidirectional communication between the gaze and motion branches. Our network achieves the top performance in human motion prediction on the proposed dataset, thanks to the intent information from eye gaze and the denoised gaze feature modulated by the motion. Code and data can be found at https://github.com/y-zheng18/GIMO.


Authors (8)

Citations (57)


Summary

The paper introduces a gaze-informed motion prediction model that uses eye gaze as a proxy for human intent across diverse scenes.
The novel bidirectional architecture lets the gaze and motion branches inform each other, and it achieves the top performance on the proposed dataset among the state-of-the-art architectures evaluated.
A new large-scale dataset with detailed pose sequences and scene scans underpins comprehensive evaluation of human motion prediction systems.


Predicting human motion is central to assistive robotics and augmented/virtual reality (AR/VR), where interaction with humans must be safe and comfortable. The paper "GIMO: Gaze-Informed Human Motion Prediction in Context" argues that accurate prediction depends on understanding both the scene context and human intention. The authors note that many works study scene-aware motion prediction, but intention remains largely underexplored, owing to the lack of ego-centric views that disclose intent and the limited diversity of motion and scenes in existing data. This research seeks to narrow that gap by using eye gaze as a proxy for human intention.

Contributions and Dataset

A central contribution is a large-scale dataset that provides high-quality body pose sequences, scene scans, and ego-centric views with eye gaze. Because motion is captured with inertial sensors, data collection is not tied to specific scenes, which increases the diversity of both the motion dynamics and the scene contexts. This makes it possible to study how much eye gaze helps human motion prediction across a range of architectures.

Methodology

The paper proposes a network architecture that enables bidirectional communication between the gaze and motion branches. Rather than processing motion and gaze independently, the architecture integrates eye gaze into the prediction pipeline through cross-modal attention, so that each modality can condition the other. The exchange runs in both directions: the motion branch gains intent information from the gaze, and the gaze features are denoised through modulation by the motion.
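The bidirectional exchange described above can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes plain scaled dot-product attention with a residual connection, omits the learned query/key/value projections a real network would have, and every name in it (`cross_attend`, `bidirectional_fusion`, the feature sizes) is hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, context):
    """Let each query row attend over all context rows.

    Scaled dot-product attention without learned projections
    (an illustrative simplification), followed by a residual add.
    """
    dim = len(queries[0])
    updated = []
    for q in queries:
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
            for k in context
        ]
        weights = softmax(scores)
        attended = [
            sum(w * k[d] for w, k in zip(weights, context))
            for d in range(dim)
        ]
        updated.append([qi + ai for qi, ai in zip(q, attended)])
    return updated


def bidirectional_fusion(motion, gaze):
    """Update each modality using the other (hypothetical helper)."""
    motion_updated = cross_attend(motion, gaze)  # motion queries gaze
    gaze_updated = cross_attend(gaze, motion)    # gaze queries motion
    return motion_updated, gaze_updated


# Toy per-frame features: 4 time steps, 8 channels per modality (assumed sizes).
random.seed(0)
T, D = 4, 8
motion = [[random.gauss(0, 1) for _ in range(D)] for _ in range(T)]
gaze = [[random.gauss(0, 1) for _ in range(D)] for _ in range(T)]
motion_out, gaze_out = bidirectional_fusion(motion, gaze)
```

Both outputs keep the shape of their inputs, so a block like this could be stacked. The point of running attention in both directions is that neither modality is treated as a fixed conditioning signal.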

Key Findings and Results

The authors report that their network achieves the top performance in human motion prediction on the proposed dataset, outperforming the state-of-the-art architectures they evaluate. They attribute the gain to two factors: the intent information carried by eye gaze, and the denoised gaze features obtained by modulating gaze with motion.
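This summary does not state which metrics the authors report. As an illustration only, a metric commonly used to compare motion prediction models is the mean per-joint position error (MPJPE), the average Euclidean distance between predicted and ground-truth joint positions. The sketch below assumes poses are given as lists of frames, each a list of (x, y, z) joint coordinates; the function name and toy data are hypothetical.

```python
import math


def mpjpe(predicted, target):
    """Mean per-joint position error over all frames and joints.

    Both arguments are lists of frames; each frame is a list of
    (x, y, z) joint positions in the same units.
    """
    total = 0.0
    count = 0
    for pred_frame, target_frame in zip(predicted, target):
        for pred_joint, target_joint in zip(pred_frame, target_frame):
            total += math.dist(pred_joint, target_joint)
            count += 1
    return total / count


# Toy example: 2 frames, 2 joints each; one joint is off by (3, 4, 0).
target = [
    [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)],
]
predicted = [
    [(3.0, 4.0, 0.0), (0.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)],
]
error = mpjpe(predicted, target)  # one error of 5.0 averaged over 4 joints
```

Lower values mean the predicted poses lie closer to the ground truth.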

Implications and Future Directions

The implications of this research are both practical and theoretical. Practically, integrating gaze data into motion prediction systems could make human-robot interaction more intuitive and more aware of what the user intends to do. Theoretically, the study contributes to the understanding of multi-modal data fusion and opens avenues for further work on intention prediction models.

Future developments may include refinements in network architectures to better handle sparse gaze data and further improvements in the fidelity of gaze-motion integration. Additionally, extended explorations into other modalities such as voice or physiological signals could offer further insights into the scope of intention prediction.

Overall, the paper represents a meaningful advance in human motion prediction, showing how gaze data can be used to predict human motion more accurately and with greater awareness of context.
