Exploiting Spatial-Temporal Context for Interacting Hand Reconstruction on Monocular RGB Video (2308.04074v3)
Abstract: Reconstructing interacting hands from monocular RGB data is challenging because of many interfering factors, e.g., self- and mutual occlusion and similar textures. Previous works leverage information from only a single RGB image and do not model the physically plausible relation between the two hands, which leads to inferior reconstruction results. In this work, we explicitly exploit spatial-temporal information to achieve better interacting hand reconstruction. On the one hand, we leverage temporal context to complement the insufficient information provided by a single frame, and design a novel temporal framework with a temporal constraint for interacting hand motion smoothness. On the other hand, we propose an interpenetration detection module to produce kinetically plausible interacting hands without physical collisions. Extensive experiments validate the effectiveness of the proposed framework, which achieves new state-of-the-art performance on public benchmarks.
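The abstract does not state the exact form of the temporal constraint. As a rough illustration only, one common way to penalize jittery motion is the mean squared second-order finite difference (acceleration) of 3D joint positions across consecutive frames. The sketch below assumes that formulation; the function name `smoothness_penalty` and the nested-list data layout are hypothetical and are not taken from the paper.

```python
from typing import Sequence

Joint = Sequence[float]   # one 3D joint position (x, y, z)
Frame = Sequence[Joint]   # all joints of both hands in one frame


def smoothness_penalty(frames: Sequence[Frame]) -> float:
    """Mean squared second-order difference of joint coordinates.

    Assumed formulation (not the paper's): for each coordinate p_t,
    penalize (p_{t-1} - 2 * p_t + p_{t+1})^2, which is zero for
    constant-velocity motion and grows with frame-to-frame jitter.
    """
    if len(frames) < 3:
        return 0.0  # acceleration needs at least three frames
    total = 0.0
    count = 0
    # Slide a window of three consecutive frames over the sequence.
    for prev, cur, nxt in zip(frames, frames[1:], frames[2:]):
        for p, c, n in zip(prev, cur, nxt):        # matching joints
            for a, b, d in zip(p, c, n):           # matching coordinates
                acc = a - 2.0 * b + d
                total += acc * acc
                count += 1
    return total / count if count else 0.0


# One joint moving at constant velocity along x: no penalty.
steady = [[(float(t), 0.0, 0.0)] for t in range(5)]
# The same joint jumping back and forth: positive penalty.
jitter = [[(0.0, 0.0, 0.0)], [(1.0, 0.0, 0.0)], [(0.0, 0.0, 0.0)]]
print(smoothness_penalty(steady), smoothness_penalty(jitter))
```

In a training setup, a term of this kind would typically be added to the per-frame reconstruction loss with a weighting factor. Whether the paper uses this particular penalty is not specified in the abstract.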