ProxyCap: Real-time Monocular Full-body Capture in World Space via Human-Centric Proxy-to-Motion Learning (2307.01200v3)
Abstract: Learning-based approaches to monocular motion capture have recently shown promising results by learning to regress in a data-driven manner. However, because of difficulties in data collection and network design, existing solutions still struggle to achieve real-time full-body capture that is also accurate in world space. In this work, we introduce ProxyCap, a human-centric proxy-to-motion learning scheme that learns world-space motions from a proxy dataset of 2D skeleton sequences and 3D rotational motions. Such proxy data enables us to build a learning-based network with accurate world-space supervision while also mitigating generalization issues. For more accurate and physically plausible predictions in world space, our network is designed to learn human motions from a human-centric perspective, which enables it to understand the same motion captured with different camera trajectories. Moreover, we propose a contact-aware neural motion descent module so that the network is aware of foot-ground contact and of motion misalignment with the proxy observations. With the proposed learning-based solution, we demonstrate the first real-time monocular full-body capture system with plausible foot-ground contact in world space, even with hand-held moving cameras. Our project page is https://zhangyux15.github.io/ProxyCapV2.
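To make the camera-trajectory point concrete, the following minimal sketch shows how a single world-space pose yields different 2D skeletons under different camera poses. It is an illustration only, not the ProxyCap pipeline: it assumes a plain pinhole camera, and the function names, focal length, principal point, joint positions, and camera poses are all hypothetical values chosen for the example.

```python
import math

def rotation_y(angle_rad):
    """3x3 rotation about the y axis (camera yaw), as nested lists."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return [[c, 0.0, s],
            [0.0, 1.0, 0.0],
            [-s, 0.0, c]]

def world_to_camera(point, R, t):
    """Apply x_cam = R @ x_world + t to a single 3D point."""
    return [sum(R[i][j] * point[j] for j in range(3)) + t[i] for i in range(3)]

def project(point_cam, focal=1000.0, center=(500.0, 500.0)):
    """Pinhole projection of a camera-space point to pixel coordinates.

    The focal length and principal point are assumed values for illustration.
    """
    x, y, z = point_cam
    return (focal * x / z + center[0], focal * y / z + center[1])

def make_proxy_2d(joints_world, R, t):
    """Project a list of world-space joints into one camera view."""
    return [project(world_to_camera(j, R, t)) for j in joints_world]

# A toy three-joint "skeleton" in world space (metres); hypothetical values.
joints_world = [[0.0, 0.0, 0.0], [0.0, -0.5, 0.0], [0.2, -0.9, 0.1]]

# Two hypothetical camera poses observing the same world-space pose.
view_a = make_proxy_2d(joints_world, rotation_y(0.0), [0.0, 0.0, 4.0])
view_b = make_proxy_2d(joints_world, rotation_y(math.radians(30)), [0.3, 0.0, 5.0])

for name, view in (("view_a", view_a), ("view_b", view_b)):
    print(name, [(round(u, 1), round(v, 1)) for u, v in view])
```

The two printed skeletons differ even though the underlying world-space pose is identical. That many-to-one relationship between 2D observations and world-space motion is the ambiguity a method has to resolve when the camera moves.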
Authors: Yuxiang Zhang, Hongwen Zhang, Liangxiao Hu, Jiajun Zhang, Hongwei Yi, Shengping Zhang, Yebin Liu