
Abstract

We propose TRAM, a two-stage method to reconstruct a human's global trajectory and motion from in-the-wild videos. TRAM robustifies SLAM to recover the camera motion in the presence of dynamic humans and uses the scene background to derive the motion scale. Using the recovered camera as a metric-scale reference frame, we introduce a video transformer model (VIMO) to regress the kinematic body motion of a human. By composing the two motions, we achieve accurate recovery of 3D humans in the world space, reducing global motion errors by 60% from prior work. https://yufu-wang.github.io/tram4d/

TRAM reconstructs 3D human motion from video, recovering both the global trajectory and the local body movement across a variety of scenarios.

Overview

  • TRAM introduces a two-stage method to extract 3D human trajectory and motion from in-the-wild videos by making SLAM robust to dynamic humans and using the scene background to derive metric scale.

  • The methodology includes robustifying SLAM for camera trajectory estimation and creating a video transformer model, VIMO, for kinematic body motion recovery.

  • TRAM reduces global motion error by 60% relative to prior work, improving the accuracy of both human trajectory and body motion estimation.

  • The research opens new pathways for applications in VR, AR, sports analytics, and clinical gait analysis, and sets a foundation for future advancements in human motion analysis.

TRAM: A Novel Two-Stage Approach for 3D Human Trajectory and Motion Reconstruction from In-the-Wild Videos

Introduction

Extracting 3D human trajectory and motion from casual, in-the-wild videos is a notable challenge in computer vision. The primary hurdle is disentangling the motion of the subject from the motion of the camera. To address this challenge, Yufu Wang and collaborators propose TRAM, a method that makes Simultaneous Localization and Mapping (SLAM) robust to dynamic humans and uses scene background cues to derive the motion scale. By decoupling camera trajectory estimation from kinematic body motion recovery, TRAM solves each subproblem effectively from monocular video input.

TRAM Methodology

TRAM operates in two distinct stages: robustifying SLAM for camera trajectory estimation, and introducing a novel video transformer model, VIMO, for kinematic body motion recovery. By separating each video into a static background and dynamic human regions, TRAM sidesteps the usual pitfalls of motion estimation in dynamic environments. The two recovered motions are then composed to place the human in world coordinates, as sketched below.
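To make that composition step concrete, here is a minimal sketch (not the authors' code) of how a metric-scale camera trajectory from SLAM combines with camera-frame body motion from a human pose model to produce a global human trajectory. The array shapes and variable names are illustrative assumptions.

```python
import numpy as np

def compose_global_motion(R_wc, t_wc, R_ch, t_ch):
    """Compose per-frame camera poses with camera-frame body poses.

    R_wc: (T, 3, 3) camera-to-world rotations from SLAM (metric scale)
    t_wc: (T, 3)    camera positions in the world frame
    R_ch: (T, 3, 3) body root orientations in the camera frame
    t_ch: (T, 3)    body root translations in the camera frame
    Returns the body root orientations and positions in the world frame.
    """
    # Rotate each camera-frame orientation into the world frame.
    R_wh = np.einsum('tij,tjk->tik', R_wc, R_ch)
    # Transform each camera-frame translation into world coordinates.
    t_wh = np.einsum('tij,tj->ti', R_wc, t_ch) + t_wc
    return R_wh, t_wh
```

Note that the camera trajectory must be recovered at metric scale for the composed human trajectory to be metric as well, which is what the scale-recovery step below provides.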

Robustifying SLAM

For robust camera trajectory estimation, dynamic human regions within the video frames are masked, directing the SLAM process to rely solely on the static background. This masking significantly reduces the inaccuracies caused by moving subjects in the frame. Further, to anchor the camera's motion to a real-world scale, TRAM uses depth cues extracted from the scene background, enabling the conversion of the camera trajectory from arbitrary units to metric scale.
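One way to realize this scale recovery is to align the SLAM system's up-to-scale depth with metric depth from a monocular depth network over background pixels. The sketch below illustrates that idea under those assumptions; the paper's exact estimator and depth sources may differ.

```python
import numpy as np

def recover_metric_scale(slam_depth, metric_depth, human_mask):
    """Estimate a global factor mapping SLAM's up-to-scale depth
    to metric units, using only static background pixels.

    slam_depth:   (H, W) up-to-scale depth from the SLAM system
    metric_depth: (H, W) metric depth from a monocular depth network
    human_mask:   (H, W) boolean mask, True on detected humans
    """
    valid = (~human_mask) & (slam_depth > 0) & (metric_depth > 0)
    # The median of per-pixel ratios is robust to depth-prediction
    # outliers and to residual dynamic pixels near the mask boundary.
    return float(np.median(metric_depth[valid] / slam_depth[valid]))
```

Multiplying the estimated camera translations by this factor converts the whole trajectory to metric scale.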

Kinematic Body Motion Recovery

For capturing the human body's kinematic motion, TRAM introduces VIMO, a video transformer model built on a pre-trained image transformer. VIMO extends this backbone with two additional temporal transformers: one that propagates features across video frames, and one that decodes a temporally coherent body motion. This fully transformer-based architecture achieves state-of-the-art performance in reconstructing smooth and natural human body motion.
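The following PyTorch sketch shows the general shape of such an architecture: per-frame features from an image transformer, refined by two temporal transformers. Layer counts, dimensions, and the output parameterization are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class VideoMotionSketch(nn.Module):
    """Sketch of a VIMO-style model: per-frame features from a
    pre-trained image transformer are passed through two temporal
    transformers before regressing body-model parameters."""

    def __init__(self, feat_dim=1024, n_heads=8, n_layers=2, out_dim=144):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=n_heads, batch_first=True)
        # Temporal transformer 1: fuses features across frames.
        self.temporal_feats = nn.TransformerEncoder(layer, n_layers)
        # Temporal transformer 2: decodes a coherent motion sequence.
        self.temporal_motion = nn.TransformerEncoder(layer, n_layers)
        # Output head, e.g. per-frame SMPL pose in a 6D rotation format.
        self.head = nn.Linear(feat_dim, out_dim)

    def forward(self, frame_feats):
        # frame_feats: (B, T, feat_dim) features from the image backbone
        x = self.temporal_feats(frame_feats)  # cross-frame feature fusion
        x = self.temporal_motion(x)           # temporal motion decoding
        return self.head(x)                   # per-frame body parameters
```

In practice the per-frame features would come from the pre-trained image backbone, and the regressed parameters would drive a body model such as SMPL.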

Evaluation and Findings

TRAM's efficacy is underscored by a 60% reduction in global motion error compared to prior work, demonstrating substantially more accurate 3D human motion estimation from in-the-wild videos. The method shows consistent improvements across benchmarks for both camera motion robustness and human trajectory and motion accuracy. Notably, deriving the motion scale from the scene, rather than from studio motion-capture data, paves the way for more generalized applications beyond controlled capture settings.

Theoretical and Practical Implications

This research contributes significantly to the fields of computer vision and human motion analysis by providing a novel, scalable, and generalized method for 3D human motion reconstruction from any monocular video footage. From a practical standpoint, TRAM opens new avenues in numerous applications including virtual reality, augmented reality, sports analytics, and clinical gait analysis.

Future Directions and Avenues for Exploration

The TRAM framework sets a solid foundation for future research, including the exploration of integrating more diverse and complex scene semantics for scale estimation and further enhancements to the temporal transformers for even more nuanced understanding of human motion. The modular nature of TRAM's design invites the possibility of incorporating advancements in depth estimation and transformer models, suggesting an exciting trajectory for future developments in 3D human motion estimation technology.

In conclusion, TRAM represents a significant advancement in the automated analysis of human motion from video data, demonstrating not only a marked improvement in accuracy and robustness but also paving the way for new applications and research within the domain of computer vision.
