Decoupling Human and Camera Motion from Videos in the Wild (2302.12827v2)

Published 24 Feb 2023 in cs.CV

Abstract: We propose a method to reconstruct global human trajectories from videos in the wild. Our optimization method decouples the camera and human motion, which allows us to place people in the same world coordinate frame. Most existing methods do not model the camera motion; methods that rely on the background pixels to infer 3D human motion usually require a full scene reconstruction, which is often not possible for in-the-wild videos. However, even when existing SLAM systems cannot recover accurate scene reconstructions, the background pixel motion still provides enough signal to constrain the camera motion. We show that relative camera estimates along with data-driven human motion priors can resolve the scene scale ambiguity and recover global human trajectories. Our method robustly recovers the global 3D trajectories of people in challenging in-the-wild videos, such as PoseTrack. We quantify our improvement over existing methods on the 3D human dataset EgoBody. We further demonstrate that our recovered camera scale allows us to reason about motion of multiple people in a shared coordinate frame, which improves performance of downstream tracking in PoseTrack. Code and video results can be found at https://vye16.github.io/slahmr.

Authors (4)

Vickie Ye
Georgios Pavlakos
Jitendra Malik
Angjoo Kanazawa

Summary

The paper presents SLAHMR, which decouples human and camera motion to reconstruct global human trajectories from videos shot with moving cameras.
It uses a multi-stage optimization that initializes, smooths, and refines trajectories using relative camera estimates from SLAM and a learned human motion prior.
Evaluations on EgoBody show lower World-MPJPE and acceleration error than existing methods, and the recovered camera scale improves downstream tracking on PoseTrack.

Overview of "Decoupling Human and Camera Motion from Videos in the Wild"

The paper introduces a method for reconstructing global human trajectories from videos captured with moving, uncontrolled cameras. The approach, named SLAHMR (Simultaneous Localization And Human Mesh Recovery), combines relative camera motion estimates with learned human motion priors to place both people and cameras in a shared world coordinate frame.

Methodology

The method decouples human motion from camera motion in order to track people in a world frame. Most existing methods do not model camera motion at all, and those that use background pixels usually require a full scene reconstruction, which is often impractical for in-the-wild videos. The authors instead use a SLAM system to estimate relative camera motion between frames, which remains usable even when an accurate scene reconstruction cannot be recovered. Camera translation estimated this way is known only up to an unknown scale factor; combining it with a data-driven human motion prior resolves that scale ambiguity, which is what makes global trajectories recoverable.
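The scale ambiguity can be illustrated with a deliberately simplified one-dimensional sketch. It is not the paper's optimization: the function names, the toy numbers, and the use of a minimum-acceleration criterion as a stand-in for the learned motion prior are all assumptions made for illustration. The person's world position is modeled as `person_in_cam[t] + alpha * cam_unscaled[t]`, and `alpha` is chosen in closed form so that the resulting world trajectory is as smooth as possible.

```python
def second_differences(xs):
    """Discrete acceleration of a 1D sequence."""
    return [xs[t + 1] - 2 * xs[t] + xs[t - 1] for t in range(1, len(xs) - 1)]


def solve_scale(person_in_cam, cam_unscaled):
    """Least-squares scale alpha minimizing the squared acceleration of
    world[t] = person_in_cam[t] + alpha * cam_unscaled[t].

    Hypothetical helper: the smoothness criterion stands in for a learned
    human motion prior and is NOT the prior used in the paper.
    """
    a_c = second_differences(person_in_cam)
    a_T = second_differences(cam_unscaled)
    denom = sum(a * a for a in a_T)
    if denom == 0:
        # A camera moving at constant velocity gives this toy prior no signal.
        raise ValueError("scale is unobservable under this toy prior")
    return -sum(ac * at for ac, at in zip(a_c, a_T)) / denom


# Toy data (assumed): camera positions in arbitrary SLAM units, and the
# person's position relative to the camera in metric units.
cam_unscaled = [0.0, 0.5, 1.5, 2.0, 3.5]
person_in_cam = [5.0, 4.5, 3.0, 2.5, 0.0]

alpha = solve_scale(person_in_cam, cam_unscaled)
world = [c + alpha * T for c, T in zip(person_in_cam, cam_unscaled)]
print(alpha, world)
```

In this toy setup the person walks at constant speed in the world, so only one value of `alpha` makes the recovered world trajectory free of acceleration. With any other scale, the camera's jerky motion leaks into the human trajectory, which is the intuition behind using a motion prior to pin down scale.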

The method involves a multi-stage optimization process:

Initialization: The process begins by estimating per-frame poses and generating initial world-coordinate trajectories.
Smoothing: Kinematic motion is smoothed to warm-start the joint optimization.
Joint Optimization: The final stage adds a learned transition-based human motion prior (HuMoR). The optimization fits the human trajectories to the visual observations in the video while keeping the motion plausible under the prior.
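The staged structure above can be sketched as a warm-started optimization in which each stage adds a loss term and starts from the previous stage's result. This is a toy illustration under stated assumptions, not the authors' implementation: the trajectory is one-dimensional, the observations are made up, the weights are arbitrary, and `prior_term` is a simple acceleration penalty standing in for the learned HuMoR prior.

```python
def data_term(x, obs):
    """Squared distance between the trajectory and per-frame observations."""
    return sum((xi - oi) ** 2 for xi, oi in zip(x, obs))


def smooth_term(x):
    """Penalty on frame-to-frame displacement."""
    return sum((x[t + 1] - x[t]) ** 2 for t in range(len(x) - 1))


def prior_term(x):
    """Stand-in for a learned motion prior (assumption): penalize acceleration."""
    return sum((x[t + 1] - 2 * x[t] + x[t - 1]) ** 2 for t in range(1, len(x) - 1))


def numerical_grad(loss, x, eps=1e-6):
    """Central-difference gradient; keeps the sketch free of hand-derived gradients."""
    grad = []
    for i in range(len(x)):
        hi = x[:i] + [x[i] + eps] + x[i + 1:]
        lo = x[:i] + [x[i] - eps] + x[i + 1:]
        grad.append((loss(hi) - loss(lo)) / (2 * eps))
    return grad


def minimize(loss, x0, lr=0.01, steps=200):
    """Plain gradient descent, warm-started from x0."""
    x = list(x0)
    for _ in range(steps):
        g = numerical_grad(loss, x)
        x = [xi - lr * gi for xi, gi in zip(x, g)]
    return x


# Hypothetical noisy per-frame root positions along one axis.
obs = [0.0, 1.2, 1.9, 3.3, 3.8, 5.1]


def stage2_loss(x):
    return data_term(x, obs) + 0.5 * smooth_term(x)


def stage3_loss(x):
    return stage2_loss(x) + 1.0 * prior_term(x)


x_init = list(obs)                         # Stage 1: per-frame initialization
x_smooth = minimize(stage2_loss, x_init)   # Stage 2: smoothing
x_final = minimize(stage3_loss, x_smooth)  # Stage 3: add the prior term
print(x_final)
```

The design point is the warm start: the final stage begins from an already smooth trajectory rather than from raw per-frame estimates, which is the role the summary attributes to the smoothing stage.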

Numerical Results and Analysis

The paper evaluates on EgoBody, which provides moving-camera footage with 3D global ground truth, and on PoseTrack, which contains challenging in-the-wild scenes. On EgoBody, SLAHMR reports lower World-MPJPE (joint position error in world coordinates) and lower acceleration error than existing methods such as GLAMR, indicating better global trajectory accuracy and smoother, more realistic motion. On PoseTrack, the recovered camera scale lets multiple people be placed in a shared coordinate frame, which improves downstream tracking.
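For readers unfamiliar with these metrics, the sketch below shows one common way to compute them. It is an assumption-laden simplification: the function names are hypothetical, any trajectory alignment step that an evaluation protocol may apply before measuring error is omitted, and acceleration is left in units per frame squared rather than normalized by frame rate. Inputs are sequences indexed as `seq[frame][joint]`, each entry an `(x, y, z)` tuple.

```python
import math


def world_mpjpe(pred, gt):
    """Mean Euclidean distance between predicted and ground-truth joints,
    measured directly in world coordinates (no alignment applied here)."""
    dists = [
        math.dist(p, g)
        for pred_frame, gt_frame in zip(pred, gt)
        for p, g in zip(pred_frame, gt_frame)
    ]
    return sum(dists) / len(dists)


def acceleration(seq):
    """Per-joint second finite difference over time."""
    return [
        [
            tuple(n - 2 * c + p
                  for p, c, n in zip(seq[t - 1][j], seq[t][j], seq[t + 1][j]))
            for j in range(len(seq[t]))
        ]
        for t in range(1, len(seq) - 1)
    ]


def accel_error(pred, gt):
    """Mean distance between predicted and ground-truth joint accelerations."""
    return world_mpjpe(acceleration(pred), acceleration(gt))


# Toy example (assumed data): one joint moving along x over four frames.
gt = [[(0.0, 0.0, 0.0)], [(1.0, 0.0, 0.0)], [(2.0, 0.0, 0.0)], [(3.0, 0.0, 0.0)]]
pred = [[(0.0, 0.0, 0.0)], [(1.0, 0.0, 0.0)], [(3.0, 0.0, 0.0)], [(3.0, 0.0, 0.0)]]

print(world_mpjpe(pred, gt))  # 0.25
print(accel_error(pred, gt))  # 1.5
```

The example shows why the two metrics are reported together: a single misplaced frame barely moves the position error but produces a large acceleration error, so the second metric is sensitive to jitter that the first can hide.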

Practical and Theoretical Implications

This research is relevant to applications such as autonomous navigation, surveillance, and human-robot interaction, where knowing global human trajectories is critical. Because the method does not need a controlled capture setup, it extends human motion analysis to unconstrained real-world footage.

Theoretically, this work opens avenues for more sophisticated modeling of human dynamics against dynamic backgrounds, encouraging further exploration of disentangling multiple sources of motion within visual data.

Future Directions

This method lays the groundwork for future improvements to simultaneous localization and mapping (SLAM) systems, particularly in using human dynamics to refine camera motion estimates. Evaluation on broader datasets, integration with physics-based simulation, and progress toward real-time operation could further extend the method's utility.

In conclusion, SLAHMR advances video-based human trajectory reconstruction by handling dynamic scenes filmed with moving cameras. Its combination of geometric constraints and learned priors is a promising direction for both applied and theoretical work in video analysis.

GitHub

GitHub - vye16/slahmr (503 stars)