
ProxyCap: Real-time Monocular Full-body Capture in World Space via Human-Centric Proxy-to-Motion Learning (2307.01200v3)

Published 3 Jul 2023 in cs.CV

Abstract: Learning-based approaches to monocular motion capture have recently shown promising results by learning to regress in a data-driven manner. However, due to the challenges in data collection and network designs, it remains challenging for existing solutions to achieve real-time full-body capture while being accurate in world space. In this work, we introduce ProxyCap, a human-centric proxy-to-motion learning scheme to learn world-space motions from a proxy dataset of 2D skeleton sequences and 3D rotational motions. Such proxy data enables us to build a learning-based network with accurate world-space supervision while also mitigating the generalization issues. For more accurate and physically plausible predictions in world space, our network is designed to learn human motions from a human-centric perspective, which enables the understanding of the same motion captured with different camera trajectories. Moreover, a contact-aware neural motion descent module is proposed in our network so that it can be aware of foot-ground contact and motion misalignment with the proxy observations. With the proposed learning-based solution, we demonstrate the first real-time monocular full-body capture system with plausible foot-ground contact in world space even using hand-held moving cameras. Our project page is https://zhangyux15.github.io/ProxyCapV2.
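The abstract's claim that a human-centric perspective "enables the understanding of the same motion captured with different camera trajectories" can be illustrated with a small geometric sketch. This is not the ProxyCap network or its training scheme. It only shows, under the assumption that the human root pose is known, that joints expressed relative to the root frame come out identical no matter which camera observed them. All names (`rot_y`, `human_centric_view`, the sample joints and cameras) are hypothetical and chosen for illustration.

```python
import math

def rot_y(theta):
    """Rotation matrix about the vertical (y) axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def matvec(m, v):
    return [sum(m[i][k] * v[k] for k in range(3)) for i in range(3)]

def transpose(m):
    return [list(row) for row in zip(*m)]

def to_camera(R_c, t_c, p_world):
    """World-to-camera transform: p_cam = R_c * p_world + t_c."""
    return [a + b for a, b in zip(matvec(R_c, p_world), t_c)]

def human_centric_view(R_c, t_c, R_h, t_h, joints_world):
    """Express joints in the human root frame, going through camera space.

    R_h, t_h: assumed-known root orientation and position in world space.
    """
    R_hc = matmul(R_c, R_h)          # root orientation seen from the camera
    t_hc = to_camera(R_c, t_c, t_h)  # root position seen from the camera
    R_hc_inv = transpose(R_hc)       # inverse of a rotation is its transpose
    out = []
    for p in joints_world:
        p_cam = to_camera(R_c, t_c, p)
        diff = [a - b for a, b in zip(p_cam, t_hc)]
        out.append(matvec(R_hc_inv, diff))
    return out

# Hypothetical root pose, joints, and two different camera poses.
R_h, t_h = rot_y(0.3), [1.0, 0.0, 2.0]
joints_world = [[1.0, 0.9, 2.0], [1.2, 0.1, 2.1], [0.8, 0.1, 1.9]]
cam_a = (rot_y(0.0), [0.0, 0.0, 5.0])
cam_b = (rot_y(1.1), [-2.0, 0.5, 3.0])

# Camera-space coordinates differ between the two cameras...
cam_space_a = [to_camera(*cam_a, p) for p in joints_world]
cam_space_b = [to_camera(*cam_b, p) for p in joints_world]

# ...but the human-centric coordinates agree up to floating-point error.
view_a = human_centric_view(*cam_a, R_h, t_h, joints_world)
view_b = human_centric_view(*cam_b, R_h, t_h, joints_world)
max_diff = max(abs(x - y)
               for pa, pb in zip(view_a, view_b)
               for x, y in zip(pa, pb))
print("max human-centric difference:", max_diff)
```

The camera terms cancel algebraically: the human-centric point reduces to the inverse root rotation applied to the world-space offset from the root, so the camera pose drops out entirely.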

Authors (7)
  1. Yuxiang Zhang (104 papers)
  2. Hongwen Zhang (59 papers)
  3. Liangxiao Hu (3 papers)
  4. Jiajun Zhang (176 papers)
  5. Hongwei Yi (28 papers)
  6. Shengping Zhang (41 papers)
  7. Yebin Liu (115 papers)
