WildAvatar: Web-scale In-the-wild Video Dataset for 3D Avatar Creation (2407.02165v3)
Abstract: Existing human datasets for avatar creation are typically limited to laboratory environments, where high-quality annotations (e.g., SMPL estimation from 3D scans or multi-view images) can be provided under ideal conditions. However, their annotation requirements are impractical for real-world images or videos, which makes it difficult to apply current avatar creation methods in real-world settings. To this end, we propose WildAvatar, a web-scale in-the-wild human avatar creation dataset extracted from YouTube, with $10,000+$ different human subjects and scenes. WildAvatar is at least $10\times$ richer than previous datasets for 3D human avatar creation. We evaluate several state-of-the-art avatar creation methods on our dataset, highlighting the unexplored challenges of real-world avatar creation. We also demonstrate that avatar creation methods have the potential to generalize when provided with data at scale. We publicly release our data source links and annotations to advance 3D human avatar creation and related fields in real-world applications.
- Zihao Huang
- Shoukang Hu
- Guangcong Wang
- Tianqi Liu
- Yuhang Zang
- Zhiguo Cao
- Wei Li
- Ziwei Liu