3D Human Pose Estimation with Spatial and Temporal Transformers

Published 18 Mar 2021 in cs.CV, cs.AI, and cs.HC | (2103.10455v3)

Abstract: Transformer architectures have become the model of choice in natural language processing and are now being introduced into computer vision tasks such as image classification, object detection, and semantic segmentation. However, in the field of human pose estimation, convolutional architectures still remain dominant. In this work, we present PoseFormer, a purely transformer-based approach for 3D human pose estimation in videos without convolutional architectures involved. Inspired by recent developments in vision transformers, we design a spatial-temporal transformer structure to comprehensively model the human joint relations within each frame as well as the temporal correlations across frames, then output an accurate 3D human pose of the center frame. We quantitatively and qualitatively evaluate our method on two popular and standard benchmark datasets: Human3.6M and MPI-INF-3DHP. Extensive experiments show that PoseFormer achieves state-of-the-art performance on both datasets. Code is available at https://github.com/zczcwh/PoseFormer


Authors (6)

Citations (383)


Summary

The paper introduces PoseFormer, which integrates spatial and temporal transformers to enhance 3D human pose estimation accuracy.
It uses a two-module design in which a spatial transformer captures joint relations within each frame and a temporal transformer models correlations across frames.
State-of-the-art results on Human3.6M and MPI-INF-3DHP support its effectiveness and suggest that transformers may suit other vision tasks as well.

Essay on "3D Human Pose Estimation with Spatial and Temporal Transformers"

The paper "3D Human Pose Estimation with Spatial and Temporal Transformers" presents a transformer-based approach to estimating 3D human poses from video sequences. Transformers became the dominant model in NLP because self-attention lets them capture long-range dependencies, and this paper explores whether the same property carries over to a computer vision task, 3D human pose estimation.

Background and Motivation

Human pose estimation (HPE) localizes body joints in images or videos to construct a skeletal representation. For the 3D case there are two primary approaches: direct estimation, which infers the 3D pose straight from the image, and 2D-to-3D lifting, which first detects a 2D pose and then lifts it to 3D. The latter is considered more promising because it benefits from strong off-the-shelf 2D pose detectors. Lifting from a single frame remains difficult, however, because of depth ambiguity and occlusion, so temporal information from neighboring frames is usually incorporated. This has traditionally been done with CNNs or recurrent neural networks, each of which has limitations: CNNs cover only a limited temporal window, and recurrent networks are constrained to sequential correlations.
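The depth ambiguity mentioned above can be made concrete with a small worked example. The sketch below assumes a simple pinhole camera model (an illustrative assumption, not something specified in the paper); the `project` helper and the joint coordinates are hypothetical. It shows that two different 3D positions on the same viewing ray produce the same 2D observation, which is why a single 2D pose does not determine a unique 3D pose.

```python
def project(point_3d, focal_length=1.0):
    """Pinhole projection of a camera-space point (x, y, z) onto the image plane."""
    x, y, z = point_3d
    if z <= 0:
        raise ValueError("point must lie in front of the camera (z > 0)")
    return (focal_length * x / z, focal_length * y / z)


# Two hypothetical wrist positions on the same viewing ray:
# the second is exactly twice as far from the camera as the first.
near_wrist = (0.2, -0.1, 2.0)
far_wrist = (0.4, -0.2, 4.0)

# Both project to the same 2D location, so the 2D keypoint alone
# cannot tell the two 3D positions apart.
print(project(near_wrist))
print(project(far_wrist))
```

Temporal context helps here because the motion of a joint over several frames constrains which of the candidate 3D positions is plausible.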

Transformers offer a way around these limitations: self-attention relates every element of a sequence to every other element, so global correlations across the entire input sequence can be captured directly.

Proposed Methodology: PoseFormer

The paper introduces PoseFormer, which the authors describe as a purely transformer-based model for 3D human pose estimation in videos, built on the 2D-to-3D lifting paradigm. PoseFormer combines two transformer modules:

Spatial Transformer Module: Encodes the relationships among joints within each frame, capturing kinematic dependencies. Each 2D joint coordinate is embedded as a token, and the spatial transformer encoder derives an expressive feature representation for each frame.
Temporal Transformer Module: Captures global dependencies across frames, modeling the temporal coherence of the sequence. It takes the per-frame features produced by the spatial module as input, and the model then outputs the 3D pose of the center frame.

The two modules run in sequence, which lets PoseFormer model both spatial and temporal information while keeping the computational cost manageable.
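The token layout of the two modules can be sketched in plain Python. This is a minimal illustration under strong simplifying assumptions, not the authors' implementation: it uses single-head attention with identity projections (no learned query/key/value weights), and it omits the joint embedding, positional embeddings, feed-forward layers, and the final regression to a 3D pose. The function names and the toy clip are hypothetical. What it does show is the key structural idea: the spatial stage treats each joint in a frame as a token, while the temporal stage treats each frame as a token.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention with identity projections.

    tokens: list of equal-length feature vectors. Learned weights are
    omitted (an assumption made to keep the sketch short).
    """
    dim = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, tokens)) for d in range(dim)])
    return out


def spatial_then_temporal(clip):
    """clip: list of frames; each frame is a list of (x, y) joint coordinates."""
    # Spatial stage: within each frame, every joint is one token.
    per_frame = [self_attention([list(joint) for joint in frame]) for frame in clip]
    # Flatten each frame's joint features into a single frame-level token.
    frame_tokens = [[v for joint in frame for v in joint] for frame in per_frame]
    # Temporal stage: every frame is one token, attended across the whole clip.
    return self_attention(frame_tokens)


# Toy clip with hypothetical values: 3 frames, 4 joints per frame.
clip = [[(0.1 * f + 0.05 * j, 0.2 * j) for j in range(4)] for f in range(3)]
features = spatial_then_temporal(clip)
print(len(features), len(features[0]))  # 3 frame tokens, each 4 joints * 2 dims = 8
```

Attention cost grows quadratically with the number of tokens, so attending over J joints per frame and then over F frames is cheaper than attending over all F × J joint tokens at once, which is one plausible reading of why the two-stage layout keeps computation manageable.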

Experimental Results

PoseFormer was evaluated on two standard benchmark datasets, Human3.6M and MPI-INF-3DHP. On Human3.6M it achieved a state-of-the-art MPJPE (Mean Per Joint Position Error) of 44.3 mm, outperforming existing models, including prior transformer-based approaches that did not consider temporal consistency. On MPI-INF-3DHP, PoseFormer also led on the PCK (Percentage of Correct Keypoints), AUC (Area Under the Curve), and MPJPE metrics, illustrating its ability to handle diverse pose variations.
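For readers unfamiliar with these metrics, the sketch below shows how MPJPE and PCK are commonly computed for a single pose. It is an illustration rather than the paper's evaluation code: the function names and joint coordinates are hypothetical, any alignment step applied before measuring error is omitted, and the 150 mm PCK threshold is an assumption based on common practice for 3D PCK.

```python
import math


def mpjpe(predicted, target):
    """Mean Euclidean distance between matching joints, in the input's units."""
    if len(predicted) != len(target) or not predicted:
        raise ValueError("poses must be non-empty and have the same number of joints")
    return sum(math.dist(p, t) for p, t in zip(predicted, target)) / len(predicted)


def pck(predicted, target, threshold=150.0):
    """Percentage of joints whose error is within `threshold` (assumed 150 mm)."""
    if len(predicted) != len(target) or not predicted:
        raise ValueError("poses must be non-empty and have the same number of joints")
    errors = [math.dist(p, t) for p, t in zip(predicted, target)]
    return 100.0 * sum(e <= threshold for e in errors) / len(errors)


# Hypothetical 4-joint pose in millimetres; per-joint errors are 50, 0, 100, 250.
target = [(0, 0, 0), (100, 0, 0), (0, 200, 0), (0, 0, 500)]
predicted = [(30, 40, 0), (100, 0, 0), (0, 200, 100), (150, 200, 500)]

print(f"MPJPE: {mpjpe(predicted, target):.1f} mm")  # MPJPE: 100.0 mm
print(f"PCK:   {pck(predicted, target):.1f} %")     # PCK:   75.0 %
```

Lower is better for MPJPE, while higher is better for PCK. AUC summarizes PCK across a range of thresholds rather than a single one.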

The gains are most visible in challenging scenarios, such as complex actions where precise temporal dynamics are critical.

Implications and Future Directions

PoseFormer is a significant contribution to 3D human pose estimation because it shows that transformers can model both spatial and temporal structure effectively without convolutional networks. This encourages work on other vision tasks to explore architectures beyond the conventional convolutional ones.

Future work could explore making transformers effective on smaller datasets, since they typically benefit from pre-training on large-scale data. Adapting PoseFormer to outdoor and heavily occluded scenes could also improve its robustness, addressing a common challenge in real-world applications.

Conclusion

This work successfully applies transformer architectures outside NLP, their original domain, and sets a precedent for their use in complex vision tasks. PoseFormer's design and results highlight the ability of self-attention to capture the intricate spatial and temporal dependencies inherent in HPE, and they suggest promising directions for extending transformers to other areas of AI research.
