Abstract

The recent success of LLMs and text-to-image models can be attributed in large part to the driving force of large-scale datasets. In 3D vision, remarkable progress has been made with models trained on large-scale synthetic and real-captured object data such as Objaverse and MVImgNet, yet a similar level of progress has not been observed for human-centric tasks, partly due to the lack of a large-scale human dataset. Existing high-fidelity 3D human capture datasets remain mid-sized because acquiring large-scale, high-quality 3D human data is significantly challenging. To bridge this gap, we present MVHumanNet, a dataset comprising multi-view human action sequences of 4,500 human identities. The primary focus of our work is collecting human data featuring a large number of diverse identities and everyday clothing, using a multi-view human capture system that facilitates easily scalable data collection. The dataset contains 9,000 daily outfits, 60,000 motion sequences, and 645 million frames with extensive annotations, including human masks, camera parameters, 2D and 3D keypoints, SMPL/SMPL-X parameters, and corresponding textual descriptions. To explore the potential of MVHumanNet in various 2D and 3D visual tasks, we conducted pilot studies on view-consistent action recognition, human NeRF reconstruction, text-driven view-unconstrained human image generation, as well as 2D view-unconstrained human image and 3D avatar generation. Extensive experiments demonstrate the performance improvements and effective applications enabled by the scale of MVHumanNet. As the largest-scale 3D human dataset currently available, we hope that the release of MVHumanNet with annotations will foster further innovation in 3D human-centric tasks at scale.

Overview

  • MVHumanNet is a large-scale dataset of 645 million frames of multi-view captures of people in everyday clothing, aimed at advancing human-centric computer vision models.

  • The dataset contains 4,500 human subjects, 9,000 outfits, 60,000 motion sequences, and detailed annotations such as human masks, keypoints, and body models.

  • A multi-view capture system with up to 48 cameras was used, capturing a diverse range of human ages, body types, motions, and clothing styles.

  • Experiments with MVHumanNet show improved performance in tasks like action recognition, human reconstruction with NeRF, and text-driven image generation.

  • MVHumanNet also strengthens generative models that produce textured 3D meshes, marking a step forward for digital human representation.

In computer vision and AI, models that understand and generate 3D human figures are a rapidly advancing area. Large, diverse datasets have proven instrumental in improving AI models across domains, from language processing to image synthesis. A recent addition to this field is MVHumanNet, arguably the most extensive dataset to date focused on 3D captures of humans in everyday clothing.

MVHumanNet operates at considerable scale: 4,500 individual human subjects dressed in 9,000 different outfits and captured in 60,000 motion sequences, totaling 645 million frames of data. What sets the dataset apart is not just its volume but the depth of its annotations, which are essential for nuanced understanding and generation of human figures. Annotations include human masks, camera calibration parameters, 2D and 3D keypoints, SMPL/SMPL-X body model parameters, and corresponding textual descriptions.
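
To make the annotation structure concrete, the sketch below shows one plausible way to organize a single frame's labels in code. The field names, shapes, and types are illustrative assumptions for this summary, not the dataset's actual file layout.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class FrameAnnotation:
    """Hypothetical per-frame record mirroring the annotation types
    described for MVHumanNet; names and shapes are illustrative only."""
    subject_id: str            # one of the 4,500 identities
    outfit_id: str             # one of the 9,000 daily outfits
    sequence_id: str           # one of the 60,000 motion sequences
    camera_id: str             # which of the (up to 48) camera views
    image: np.ndarray          # H x W x 3 RGB frame
    mask: np.ndarray           # H x W binary human segmentation mask
    intrinsics: np.ndarray     # 3 x 3 camera intrinsic matrix K
    extrinsics: np.ndarray     # 4 x 4 world-to-camera transform [R | t]
    keypoints_2d: np.ndarray   # J x 3 (x, y, confidence) in image space
    keypoints_3d: np.ndarray   # J x 3 joint positions in world space
    smpl_params: dict          # SMPL/SMPL-X pose, shape, and translation
    caption: str               # textual description of appearance and clothing
```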

The creation of MVHumanNet involved a multi-view capture system with up to 48 high-resolution cameras, which allowed for efficient data gathering while covering a broad spectrum of human ages, body types, motions, and clothing styles. The diverse range of daily outfits and actions ensures that the dataset emulates real-world human appearances and activities.
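
Because each frame comes with calibrated camera parameters, labels defined in 3D can be related to any of the camera views. The snippet below is a minimal sketch, assuming a standard pinhole model with intrinsics K and a world-to-camera rotation R and translation t; the variable names and dummy values are assumptions for illustration, not dataset fields.

```python
import numpy as np

def project_points(points_3d: np.ndarray, K: np.ndarray,
                   R: np.ndarray, t: np.ndarray) -> np.ndarray:
    """Project world-space 3D points to pixels with a pinhole camera:
    x_cam = R X + t, then u = K x_cam followed by perspective division."""
    cam = R @ points_3d.T + t.reshape(3, 1)   # 3 x J camera-space points
    uv = K @ cam                              # 3 x J homogeneous pixel coords
    return (uv[:2] / uv[2]).T                 # J x 2 pixel coordinates

# Dummy stand-ins for per-frame annotations:
joints_3d = np.random.rand(25, 3)                                   # J x 3 keypoints
K = np.array([[1500., 0., 512.], [0., 1500., 512.], [0., 0., 1.]])  # intrinsics
R, t = np.eye(3), np.array([0., 0., 3.])                            # camera 3 m away
pixels = project_points(joints_3d, K, R, t)   # comparable to the 2D keypoints
```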

To demonstrate the efficacy of MVHumanNet, the authors conducted a series of experiments spanning action recognition, novel view synthesis, and generative modeling. In the action recognition study, models trained on MVHumanNet showed improved accuracy as the number of camera views increased, indicating that more data translates into better recognition. For human reconstruction with Neural Radiance Fields (NeRF), experiments showed that larger training sets improved generalization to novel poses and clothing types, which is critical for accurate and versatile digital human representations.
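
The paper's exact multi-view training and evaluation protocol is not reproduced here, but one common way to exploit additional camera views at inference time is late fusion: averaging class probabilities predicted independently from each view. The sketch below assumes a generic clip classifier (`backbone`) and a (views, time, channels, height, width) tensor layout; both are placeholders rather than the paper's setup.

```python
import torch
import torch.nn.functional as F

def multiview_action_probs(clips: torch.Tensor,
                           backbone: torch.nn.Module) -> torch.Tensor:
    """clips: (V, T, C, H, W) -- the same action observed from V cameras.
    Runs a single-view action classifier on each view and averages the
    per-view class probabilities (late fusion)."""
    per_view = []
    for view in clips:                            # view: (T, C, H, W)
        logits = backbone(view.unsqueeze(0))      # (1, num_classes)
        per_view.append(F.softmax(logits, dim=-1))
    return torch.stack(per_view).mean(dim=0)      # (1, num_classes)
```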

Furthermore, the dataset proved valuable in text-driven image generation tasks, where the ability to generate high-quality human images consistent with SMPL conditions and textual descriptions was significantly enhanced with increased training data scale. The implications of such capabilities extend to creating avatars, fashion modeling, and even virtual try-on applications.
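
MVHumanNet's paired images, SMPL parameters, and captions are exactly the supervision such conditional generators need. As a rough, publicly available analogue of the inference setup, the sketch below drives a diffusers ControlNet pipeline with a text prompt and a pose-conditioning image; the model identifiers, the pose-map file name, and the prompt are assumptions for illustration, and this is not the pipeline used in the paper.

```python
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# Assumed input: a skeleton/pose map rendered from 2D keypoints or a projected
# SMPL body (the file name is hypothetical).
pose_map = Image.open("pose_condition.png").convert("RGB")

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet,
    torch_dtype=torch.float16).to("cuda")

# Text plus pose conditioning, analogous in spirit to SMPL-conditioned generation.
result = pipe("a full-body photo of a person wearing a red hoodie and jeans",
              image=pose_map, num_inference_steps=30).images[0]
result.save("generated_human.png")
```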

Lastly, the dataset enabled generative models that produce textured 3D meshes from high-resolution full-body images. This marks a significant step beyond prior datasets that either relied on synthetic data or were captured under restricted view settings. The results suggest that scaling up data has a marked positive effect on performance, a promising signal for future research on 3D human generation.

In summary, MVHumanNet is a transformative new dataset that stands to accelerate progress significantly in several branches of AI that deal with digital human representation and generation. It exemplifies the power of large-scale, detailed datasets to push the boundaries of what AI models can achieve, providing the groundwork for a future where the virtual representation of humans is as detailed and nuanced as their real-world counterparts.
