TexDreamer: Towards Zero-Shot High-Fidelity 3D Human Texture Generation

Published 19 Mar 2024 in cs.CV | (2403.12906v1)

Abstract: Texturing 3D humans with semantic UV maps remains a challenge due to the difficulty of acquiring reasonably unfolded UV. Despite recent text-to-3D advancements in supervising multi-view renderings using large text-to-image (T2I) models, issues persist with generation speed, text consistency, and texture quality, resulting in data scarcity among existing datasets. We present TexDreamer, the first zero-shot multimodal high-fidelity 3D human texture generation model. Utilizing an efficient texture adaptation finetuning strategy, we adapt a large T2I model to a semantic UV structure while preserving its original generalization capability. Leveraging a novel feature translator module, the trained model is capable of generating high-fidelity 3D human textures from either text or image within seconds. Furthermore, we introduce ArTicuLated humAn textureS (ATLAS), the largest high-resolution (1024 × 1024) 3D human texture dataset, which contains 50k high-fidelity textures with text descriptions.

Summary

The paper introduces TexDreamer, a zero-shot method that uses feature translation and semantic UV maps to generate high-fidelity 3D human textures.
It employs texture adaptation fine-tuning with the novel ATLAS dataset, achieving superior text-to-texture consistency and enhanced UV quality over prior methods.
Empirical evaluations, including improved CLIP scores and user studies, underscore TexDreamer’s potential for creating realistic avatars in gaming, VR, and film.

Evaluation of TexDreamer: A Model for High-Fidelity 3D Human Texture Generation

The paper introduces TexDreamer, a zero-shot method for generating high-fidelity 3D human textures laid out as semantic UV maps. To address the difficulty of texturing 3D humans, the authors propose a multimodal model that generates a texture from either a text description or an image. This matters because existing text-to-3D methods, which supervise multi-view renderings with large text-to-image models, remain limited in generation speed, text consistency, and texture quality.

Technological Advancements and Methodology

TexDreamer builds upon the architecture of large text-to-image (T2I) models, adapting their capabilities for creating 3D textures. The process employs texture adaptation fine-tuning along with a feature translator module, which enables the translation of text or image inputs into high-fidelity textures. The model’s strength lies in its ability to seamlessly integrate semantic and positional information from input data, thereby producing coherent and high-quality 3D representations of human textures.
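The summary does not say which parameter-efficient technique the fine-tuning uses. As an illustration only, the sketch below assumes a low-rank adapter: a frozen weight matrix plus a small trainable update, which is one common way to adapt a large model while leaving its pretrained behavior intact at the start of training. The class name `LowRankAdapter` and the `rank` and `alpha` parameters are hypothetical and are not taken from the paper.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


class LowRankAdapter:
    """Frozen weight W plus a trainable low-rank update (alpha / rank) * B @ A.

    Assumption: this is a generic low-rank adaptation sketch, not the
    fine-tuning strategy described in the paper.
    """

    def __init__(self, weight, rank=2, alpha=2.0, seed=0):
        rng = random.Random(seed)
        out_dim, in_dim = len(weight), len(weight[0])
        self.weight = weight  # pretrained weight, kept frozen
        # A starts small and random, B starts at zero, so the update is zero at init.
        self.a = [[rng.gauss(0.0, 0.01) for _ in range(in_dim)] for _ in range(rank)]
        self.b = [[0.0] * rank for _ in range(out_dim)]
        self.scale = alpha / rank

    def forward(self, x):
        base = matvec(self.weight, x)
        update = matvec(self.b, matvec(self.a, x))
        return [y + self.scale * u for y, u in zip(base, update)]

    def trainable_params(self):
        """Count only the adapter parameters (A and B), not the frozen weight."""
        return len(self.a) * len(self.a[0]) + len(self.b) * len(self.b[0])
```

Because `B` is zero at initialization, the adapted layer reproduces the frozen layer exactly before any training step, and only the entries of `A` and `B` would be updated afterwards.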

To support this research, the authors introduce the ATLAS dataset, which they describe as the largest collection of high-resolution 3D human textures paired with text descriptions to date. ATLAS contains 50k textures at $1024 \times 1024$ resolution, each laid out in the semantic SMPL UV space. The dataset is central to training TexDreamer, given the scarcity of high-quality texture data for 3D human models.
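To make the role of a UV texture concrete: each point on the mesh surface carries a (u, v) coordinate, and its color is looked up in the texture image at that coordinate. The sketch below shows a bilinear lookup on a single-channel texture. The function name `sample_texture` and the convention that v = 0 is the top row are assumptions for illustration; real pipelines often flip v and use three color channels.

```python
def sample_texture(texture, u, v):
    """Bilinearly sample a texture (list of rows) at UV coordinates in [0, 1].

    Assumption: v = 0 maps to the first row; many renderers flip this axis.
    """
    height, width = len(texture), len(texture[0])
    # Clamp UV to [0, 1] and map to continuous pixel coordinates.
    x = min(max(u, 0.0), 1.0) * (width - 1)
    y = min(max(v, 0.0), 1.0) * (height - 1)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    fx, fy = x - x0, y - y0
    # Interpolate along x on the two neighboring rows, then along y.
    top = texture[y0][x0] * (1 - fx) + texture[y0][x1] * fx
    bottom = texture[y1][x0] * (1 - fx) + texture[y1][x1] * fx
    return top * (1 - fy) + bottom * fy
```

With a fixed UV layout, the same (u, v) always refers to the same body region, which is what makes a texture "semantic" in this sense.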

Empirical Evaluation

The empirical results indicate that TexDreamer surpasses existing methods in both text-to-texture consistency and UV quality. The authors report quantitative and qualitative analyses, including higher CLIP scores, and show that the model is more efficient than prior approaches such as AvatarCLIP and AvatarCraft, both of which require long processing times and substantial compute. A user study further supports TexDreamer's advantages in text consistency and image quality as judged by human raters.
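A CLIP score is commonly computed as the cosine similarity between the CLIP embedding of a prompt and the CLIP embedding of the corresponding rendered image, averaged over a test set. The sketch below shows only that final step on precomputed embedding vectors. The function names are hypothetical, and the exact protocol the paper uses (which CLIP variant, which views are rendered, any scaling factor) is not stated in this summary.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def mean_clip_style_score(text_embeddings, image_embeddings):
    """Average cosine similarity over paired (prompt, render) embeddings.

    Assumption: the embeddings were produced elsewhere by a CLIP-like encoder.
    """
    scores = [
        cosine_similarity(t, i)
        for t, i in zip(text_embeddings, image_embeddings)
    ]
    return sum(scores) / len(scores)
```

A higher average means the rendered textures sit closer to their prompts in the shared embedding space, which is how the metric serves as a proxy for text consistency.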

Implications and Future Directions

TexDreamer is a meaningful advance for computer graphics, particularly for creating realistic 3D human avatars in film, gaming, and virtual reality. The feature translator module, which lets one model accept either text or image input, suggests a direction for future work on other text-to-3D tasks. The ATLAS dataset, in turn, addresses the scarcity of textured human data and offers a resource for further research and development.

Despite the model's promising results, the authors acknowledge several limitations. Textures generated from real-world images can be misaligned, which suggests a need for better image-to-UV techniques. The paper also raises ethical concerns, particularly the risk that the technology could be misused to create unauthorized digital replicas of real individuals.

Conclusion

TexDreamer stands as a robust contribution to zero-shot high-fidelity 3D texture generation, coupling advanced model adaptation techniques with a novel data resource. Its dual capability to generate textures from both text and image input is poised to significantly impact the production of realistic 3D human models. Future research should focus on enhancing generalization capabilities and addressing ethical challenges, thereby harnessing the full potential of TexDreamer in commercial and creative applications.
