
Sapiens: Foundation for Human Vision Models

(arXiv:2408.12569)
Published Aug 22, 2024 in cs.CV

Abstract

We present Sapiens, a family of models for four fundamental human-centric vision tasks - 2D pose estimation, body-part segmentation, depth estimation, and surface normal prediction. Our models natively support 1K high-resolution inference and are extremely easy to adapt for individual tasks by simply fine-tuning models pretrained on over 300 million in-the-wild human images. We observe that, given the same computational budget, self-supervised pretraining on a curated dataset of human images significantly boosts the performance for a diverse set of human-centric tasks. The resulting models exhibit remarkable generalization to in-the-wild data, even when labeled data is scarce or entirely synthetic. Our simple model design also brings scalability - model performance across tasks improves as we scale the number of parameters from 0.3 to 2 billion. Sapiens consistently surpasses existing baselines across various human-centric benchmarks. We achieve significant improvements over the prior state-of-the-art on Humans-5K (pose) by 7.6 mAP, Humans-2K (part-seg) by 17.1 mIoU, Hi4D (depth) by 22.4% relative RMSE, and THuman2 (normal) by 53.5% relative angular error. Project page: https://about.meta.com/realitylabs/codecavatars/sapiens

Fine-tuned Sapiens models perform 2D pose estimation, body-part segmentation, depth estimation, and surface normal prediction.

Overview

  • The paper 'Sapiens: Foundation for Human Vision Models' introduces a suite of vision models specifically designed for human-centric tasks, including 2D pose estimation, body-part segmentation, depth prediction, and surface normal estimation.

  • The models are pretrained with self-supervised learning on a human-specific dataset named Humans-300M and then fine-tuned for each task, yielding high performance within a fixed computational budget.

  • Extensive experiments demonstrate that the Sapiens models outperform existing state-of-the-art methods across multiple benchmarks, establishing new performance standards in the field.


Introduction

The paper "Sapiens: Foundation for Human Vision Models" introduces a family of vision models designed for human-centric tasks. The Sapiens models address four primary tasks: 2D pose estimation, body-part segmentation, depth prediction, and surface normal estimation. The models are pretrained on a large-scale, human-specific dataset, Humans-300M, consisting of over 300 million diverse in-the-wild images, and then fine-tuned for each task. The self-supervised pretraining supports the hypothesis that, within a fixed computational budget, a curated dataset of human images can significantly enhance performance on these tasks.

Methodology

Data Collection and Pretraining

The foundational dataset, Humans-300M, encompasses approximately 300 million images selected through a stringent filtering process aimed at maximizing data quality and relevance to human-centric tasks. Pretraining uses the masked autoencoder (MAE) approach: a large fraction of image patches is masked out, and the model learns to reconstruct them from the visible remainder, producing general-purpose human image representations. This method affords scalability and efficiency, allowing the models to adapt successfully to varying human configurations and contexts.
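The core of MAE pretraining is the random visible/masked patch split. The sketch below shows only that step in plain Python; the patch size (16 px, giving 4096 tokens at 1024 px) and the 0.75 mask ratio are common MAE defaults assumed here for illustration, not values confirmed by this summary.

```python
import random

def split_patches(num_patches, mask_ratio, seed=None):
    """Randomly split patch indices into visible and masked sets, as in
    masked-autoencoder (MAE) pretraining: the encoder processes only the
    visible patches, and the decoder reconstructs the masked ones."""
    rng = random.Random(seed)
    indices = list(range(num_patches))
    rng.shuffle(indices)
    num_masked = int(num_patches * mask_ratio)
    masked, visible = indices[:num_masked], indices[num_masked:]
    return sorted(visible), sorted(masked)

# A 1024x1024 image with hypothetical 16x16 patches gives (1024 // 16) ** 2
# = 4096 tokens; at a 0.75 mask ratio the encoder sees only 1024 of them.
visible, masked = split_patches(4096, mask_ratio=0.75, seed=0)
```

The key efficiency property is visible directly: the encoder's sequence length shrinks by the mask ratio, which is what makes high-resolution pretraining tractable.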

Model Architecture

Four models of increasing size (0.3B to 2B parameters) are defined under the Sapiens architecture. All models are pretrained at a native 1024-pixel input resolution and share a consistent encoder-decoder design for task-specific fine-tuning. The model specifications (number of parameters, hidden size, layer count, and FLOPs) are tailored to balance performance against computational cost.
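The relation between hidden size, depth, and parameter count can be checked with a back-of-envelope formula for a standard transformer encoder. This is a generic estimate, not the paper's exact configuration table; the example config (hidden size 1024, 24 layers, as in ViT-Large) is an assumption chosen to land near the 0.3B scale.

```python
def vit_param_estimate(hidden_size, num_layers, mlp_ratio=4):
    """Rough parameter count for a ViT encoder: each layer has about
    4*d^2 attention weights (Q, K, V, output projection) plus
    2*mlp_ratio*d^2 MLP weights; biases, norms, and the patch
    embedding are ignored."""
    per_layer = 4 * hidden_size ** 2 + 2 * mlp_ratio * hidden_size ** 2
    return num_layers * per_layer

# Hypothetical ViT-Large-like config: 24 * 12 * 1024^2 ≈ 0.3e9 parameters.
print(vit_param_estimate(1024, 24))
```

Because parameters grow quadratically in hidden size and linearly in depth, scaling from 0.3B to 2B is mostly a matter of widening the hidden dimension and adding layers.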

Experimental Results

Pose Estimation

The Sapiens models were evaluated on several benchmarks, including the Humans-5K test set. The keypoint vocabulary spans the full body, from body joints to densely annotated facial keypoints. Even the smallest Sapiens model (0.3B) surpassed previous state-of-the-art models by a significant margin, and Sapiens-2B set a new benchmark for pose estimation.
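The mAP numbers reported for pose estimation are built on a keypoint similarity score. The sketch below implements the standard COCO-style Object Keypoint Similarity (OKS), which such benchmarks typically use; the specific sigmas and protocol for Humans-5K are not given in this summary, so the values here are placeholders.

```python
import math

def oks(pred, gt, sigmas, area):
    """COCO-style Object Keypoint Similarity: each keypoint contributes
    exp(-d^2 / (2 * s * k^2)), where d is the pixel distance, s is the
    object's scale (area), and k is a per-keypoint tolerance constant;
    the scores are averaged over keypoints."""
    scores = []
    for (px, py), (gx, gy), k in zip(pred, gt, sigmas):
        d2 = (px - gx) ** 2 + (py - gy) ** 2
        scores.append(math.exp(-d2 / (2 * area * k ** 2)))
    return sum(scores) / len(scores)

# Placeholder example: two keypoints, uniform tolerance.
gt_kpts = [(10.0, 20.0), (30.0, 40.0)]
print(oks(gt_kpts, gt_kpts, sigmas=[0.05, 0.05], area=100.0))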

Body-Part Segmentation

For segmentation, the models were fine-tuned using an enhanced vocabulary of 28 body-part categories. Evaluations on the Humans-2K test set demonstrated substantial performance gains, with the highest-performing model yielding 81.2% mIoU and 89.4% mAcc, significantly exceeding results from existing segmentation models such as DeepLabV3+ and Mask2Former.
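The mIoU figure quoted above is the standard mean intersection-over-union across the 28 part classes. A minimal reference implementation over flattened label maps, not tied to the paper's evaluation code:

```python
def mean_iou(pred, gt, num_classes):
    """Mean intersection-over-union across classes, computed from flat
    per-pixel label lists. Classes absent from both the prediction and
    the ground truth are skipped rather than counted as zero."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, g in zip(pred, gt) if p == c and g == c)
        union = sum(1 for p, g in zip(pred, gt) if p == c or g == c)
        if union:
            ious.append(inter / union)
    return sum(ious) / len(ious)

# Toy 3-pixel example with two classes.
print(mean_iou([0, 1, 1], [0, 1, 0], num_classes=2))
```

mAcc, the other reported metric, is analogous but averages per-class pixel accuracy instead of IoU.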

Depth Estimation

When tested on datasets like THuman2.0 and Hi4D, which include single-human and multi-human scenes respectively, the Sapiens models achieved lower RMSE scores across all scales. Notably, Sapiens-2B reduced RMSE by roughly 20% relative to the best prior models, reinforcing its efficacy.
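The depth metric behind these comparisons is root-mean-square error over depth values. A generic sketch over flattened pixel lists (the paper's exact protocol, e.g. which pixels are masked in, is not specified in this summary):

```python
import math

def rmse(pred, gt):
    """Root-mean-square error between predicted and ground-truth depth
    values; callers are expected to pass only the pixels they want
    evaluated (e.g. those inside a human mask)."""
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gt)) / len(pred))
```

A 20% relative improvement then simply means the Sapiens RMSE is 0.8 times the baseline's on the same pixel set.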

Surface Normal Estimation

The models were also evaluated for surface normal estimation on the same datasets used for depth. The Sapiens models consistently outperformed state-of-the-art methods such as PIFuHD and ECON; in particular, Sapiens-2B achieved the best results, with a mean angular error of around 12°, highlighting its robust generalization capabilities.
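Mean angular error, the metric quoted here, measures the angle between predicted and ground-truth unit normals at each pixel. A minimal stdlib sketch, assuming both inputs are already unit-length 3-vectors:

```python
import math

def mean_angular_error(pred, gt):
    """Mean angle in degrees between corresponding unit normal vectors,
    computed as acos of their dot product (clamped for numerical safety)."""
    total = 0.0
    for p, g in zip(pred, gt):
        dot = sum(a * b for a, b in zip(p, g))
        dot = max(-1.0, min(1.0, dot))  # guard against rounding past ±1
        total += math.degrees(math.acos(dot))
    return total / len(pred)

# A correct prediction scores 0°; a perpendicular one scores 90°.
up = [(0.0, 0.0, 1.0)]
print(mean_angular_error(up, up))
```

Under this metric, the reported ~12° mean error corresponds to normals that are, on average, within a narrow cone of the ground truth.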

Discussion

The research emphasizes the significance of domain-specific pretraining data, observing that human-centric datasets dramatically improve performance compared to general datasets. Results indicate a direct correlation between the quantity of unique human images used during pretraining and the models' performance.

Despite the promising results, the authors acknowledge that certain complexities such as rare poses, crowded scenes, and severe occlusions challenge the models' performance. Future directions involve extending the models to handle 3D and multi-modal data to further enhance their capabilities.

Conclusion

In conclusion, the Sapiens models present a significant advancement in human-centric vision tasks by leveraging large-scale human-specific pretraining. By scaling vision transformer architectures and incorporating high-quality annotations, Sapiens sets new performance benchmarks across several tasks. This work provides a solid foundation for future developments in human vision models, making substantial contributions to the field.

The comprehensive and well-executed methodology underscores the potential of domain-specific large-scale pretraining. The scalable architecture and stringent data curation yield models that generalize well to real-world scenarios, making Sapiens a strong foundation for a wide range of downstream human-centric applications and a benchmark for future research.
