- The paper introduces MHFormer, a novel transformer-based model that generates multiple hypotheses to tackle depth ambiguity and occlusion in 3D pose estimation.
- The methodology involves three stages: hypothesis generation, self-refinement using multi-hypothesis self-attention, and cross-hypothesis interaction via cross-attention.
- MHFormer achieves state-of-the-art performance, reducing average MPJPE on Human3.6M by 3% relative to prior methods, indicating practical benefits for VR, HCI, and surveillance applications.
Overview of MHFormer: Multi-Hypothesis Transformer for 3D Human Pose Estimation
The paper presents MHFormer, a novel approach to 3D human pose estimation from monocular video. The task is challenging because of the depth ambiguity and self-occlusion inherent in single-camera setups. Most prior methods attempt to resolve these challenges by leveraging spatial and temporal dependencies while committing to a single solution. This work instead embraces the ill-posed nature of the problem, proposing a Multi-Hypothesis Transformer that estimates multiple feasible pose solutions and lets them inform one another.
Methodology
MHFormer is structured around three primary stages designed to develop and refine multi-hypothesis spatio-temporal features:
- Multi-Hypothesis Generation (MHG): This stage constructs the initial hypothesis representations in the spatial domain, encoding diverse semantic information at different network depths as a starting point for hypothesis creation.
- Self-Hypothesis Refinement (SHR): Each hypothesis is refined independently in this stage. The module comprises multi-hypothesis self-attention (MH-SA) for intra-hypothesis communication and a hypothesis-mixing multi-layer perceptron (MLP) that merges and redistributes information across hypotheses.
- Cross-Hypothesis Interaction (CHI): The final stage models inter-hypothesis communication with multi-hypothesis cross-attention (MH-CA), then applies a hypothesis-mixing MLP to synthesize the features from which the final 3D pose is regressed.
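The three stages above can be sketched in NumPy. The joint count, feature dimension, random stand-in weights, and the averaging readout are illustrative assumptions, not the paper's actual architecture (which uses learned transformer blocks over spatio-temporal tokens); the sketch only shows how intra- and inter-hypothesis attention differ:

```python
import numpy as np

rng = np.random.default_rng(0)
J, D, H = 17, 32, 3  # joints, feature dim, number of hypotheses (illustrative)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Scaled dot-product attention over the joint (token) axis."""
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

# MHG stand-in: three hypothesis feature maps (J joints x D features each).
hyps = [rng.standard_normal((J, D)) for _ in range(H)]

# --- SHR: multi-hypothesis self-attention (MH-SA) ---
# Each hypothesis attends only to itself (with a residual connection).
refined = [h + attention(h, h, h) for h in hyps]

# Hypothesis-mixing MLP sketch: merge features across hypotheses, then split back.
W_mix = rng.standard_normal((H * D, H * D)) * 0.02  # stand-in for learned weights
mixed = np.concatenate(refined, axis=-1) @ W_mix
refined = np.split(mixed, H, axis=-1)

# --- CHI: multi-hypothesis cross-attention (MH-CA) ---
# Queries come from one hypothesis; keys/values from the other hypotheses.
interacted = []
for i, h in enumerate(refined):
    others = np.concatenate([refined[j] for j in range(H) if j != i], axis=0)
    interacted.append(h + attention(h, others, others))

# Final hypothesis-mixing step (here a plain average) yields one feature map,
# which a regression head would map to the 3D pose (J x 3).
final = np.mean(interacted, axis=0)
print(final.shape)  # (17, 32)
```

Note the structural difference: in SHR the key/value set is the hypothesis itself (intra-hypothesis message passing), while in CHI it is the concatenation of the other hypotheses (inter-hypothesis message passing).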
Results
The paper reports that MHFormer achieves state-of-the-art results on the Human3.6M and MPI-INF-3DHP datasets. Notably, it surpasses previous work on Human3.6M by 3% in mean per joint position error (MPJPE), demonstrating accurate 3D pose estimation without additional refinement modules.
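For reference, MPJPE is the mean Euclidean distance between predicted and ground-truth joint positions, conventionally reported in millimetres on Human3.6M. A minimal sketch with a toy 17-joint skeleton (the offset value is illustrative):

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per joint position error: average Euclidean distance between
    predicted and ground-truth 3D joints, in the units of the inputs."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

# Toy example: every predicted joint is offset by 10 mm along one axis.
gt = np.zeros((17, 3))
pred = gt + np.array([10.0, 0.0, 0.0])
print(mpjpe(pred, gt))  # 10.0
```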
Implications
The implications of MHFormer are twofold:
- Practical: By enhancing pose estimation accuracy, MHFormer can benefit applications such as human-computer interaction, virtual reality, and surveillance, where robust 3D pose estimation is critical.
- Theoretical: The multi-hypothesis approach provides insights into effectively handling inverse problems in computer vision, suggesting directions for future research in similar domains.
Future Directions
Future exploration could enhance MHFormer by:
- Reducing computational complexity while maintaining accuracy.
- Expanding its applicability to other domains or tasks that involve inverse problem-solving.
- Adapting the architecture for real-time processing demands, enhancing its applicability in interactive environments.
In conclusion, MHFormer contributes significantly to 3D human pose estimation by addressing the inverse nature of the problem through an innovative multi-hypothesis framework, a strategy that promises robust performance across a range of applications.