Emergent Mind

WordRobe: Text-Guided Generation of Textured 3D Garments

(2403.17541)
Published Mar 26, 2024 in cs.CV and cs.GR

Abstract

In this paper, we tackle a new and challenging problem of text-driven generation of 3D garments with high-quality textures. We propose "WordRobe", a novel framework for the generation of unposed & textured 3D garment meshes from user-friendly text prompts. We achieve this by first learning a latent representation of 3D garments using a novel coarse-to-fine training strategy and a loss for latent disentanglement, promoting better latent interpolation. Subsequently, we align the garment latent space to the CLIP embedding space in a weakly supervised manner, enabling text-driven 3D garment generation and editing. For appearance modeling, we leverage the zero-shot generation capability of ControlNet to synthesize view-consistent texture maps in a single feed-forward inference step, thereby drastically decreasing the generation time as compared to existing methods. We demonstrate superior performance over current SOTAs for learning 3D garment latent space, garment interpolation, and text-driven texture synthesis, supported by quantitative evaluation and qualitative user study. The unposed 3D garment meshes generated using WordRobe can be directly fed to standard cloth simulation & animation pipelines without any post-processing.

WordRobe generates diverse, high-quality unposed 3D garment meshes through simple text prompts.

Overview

  • WordRobe introduces a novel framework for creating textured 3D garments from text prompts, addressing efficiency issues in 3D content creation for applications like virtual try-ons and gaming.

  • The framework consists of modeling 3D garments using a two-stage encoder-decoder strategy, aligning garment latent space with CLIP embedding for text-driven garment generation, and synthesizing textures efficiently.

  • WordRobe outperforms current state-of-the-art methods in generating high-quality 3D garment geometry and in texture synthesis, producing results faster and in a form better suited to large-scale production.

  • It facilitates the integration of production-ready 3D garments into cloth simulation and animation pipelines, marking an important advance in digital fashion and virtual-world creation.

Text-Guided Generation and Editing of 3D Textured Garments with WordRobe

Introduction

With the surge in 3D content creation driven by applications in virtual try-ons, gaming, and AR/VR, the demand for efficient methods to generate 3D garments has intensified. Traditional techniques rely either on manual design tools or on the digitization of real garments, both of which are resource-intensive and difficult to scale. In contrast, recent advances in text-to-3D generation open avenues for user-friendly garment creation but often fall short of producing high-fidelity, open-surface 3D garments ready for integration into standard graphics pipelines.

WordRobe Framework

WordRobe addresses these challenges by introducing a novel framework for the text-driven generation of textured 3D garments. The framework comprises three main components:

  1. 3D Garment Latent Space: Utilizing a two-stage encoder-decoder strategy to model 3D garments as unsigned distance fields (UDFs), WordRobe learns a rich latent space of unposed garments. It employs a novel disentanglement loss to promote better latent interpolation, facilitating effective manipulation of garment attributes.
  2. CLIP-Guided Garment Generation: By aligning the garment latent space with the CLIP embedding space, WordRobe enables text-driven garment generation. A weakly supervised training scheme for mapping CLIP embeddings to garment latent codes eliminates the need for manually annotated datasets.
  3. Texture Synthesis: Leveraging pre-trained text-to-image models, WordRobe synthesizes photorealistic textures in a single feed-forward step, significantly enhancing efficiency compared to existing state-of-the-art (SOTA) methods. By rendering depth maps in front and back views and passing these to ControlNet, WordRobe ensures view-consistent texture generation.
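The data flow of components 1 and 2 can be sketched in a few lines. The snippet below is a minimal, hypothetical illustration only: the paper's actual mapper architecture, latent dimensionality, and interpolation scheme are not specified in this summary, so a single linear projection stands in for the trained CLIP-to-latent network, and plain linear interpolation stands in for latent-space traversal.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions -- the real sizes are not given in this summary.
CLIP_DIM, LATENT_DIM = 512, 128

# Stand-in for the trained CLIP-to-garment-latent mapping network:
# a single random linear projection, just to show the data flow.
W = rng.standard_normal((CLIP_DIM, LATENT_DIM)) * 0.01

def clip_to_garment_latent(clip_embedding: np.ndarray) -> np.ndarray:
    """Map a (CLIP_DIM,) text embedding to a (LATENT_DIM,) garment latent code."""
    return np.tanh(clip_embedding @ W)

def lerp(z0: np.ndarray, z1: np.ndarray, t: float) -> np.ndarray:
    """Linearly interpolate between two garment latent codes."""
    return (1.0 - t) * z0 + t * z1

# Two prompts -> two latent codes -> a garment "between" them.
z_shirt = clip_to_garment_latent(rng.standard_normal(CLIP_DIM))
z_dress = clip_to_garment_latent(rng.standard_normal(CLIP_DIM))
z_mid = lerp(z_shirt, z_dress, 0.5)
```

In WordRobe the decoder would then turn `z_mid` into a UDF and extract an open-surface mesh; the disentanglement loss is what makes such interpolations land on plausible intermediate garments.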

Performance and Contributions

  • Quantitative Evaluation: WordRobe demonstrates superior performance over current SOTAs in learning 3D garment latent spaces. Specifically, it achieves significantly lower Point-to-Surface distance and Chamfer Distance metrics, indicating high-quality garment geometry.
  • Disentanglement Loss: The introduction of a novel disentanglement loss results in a more structured latent space, conducive to better concept separation and latent interpolation.
  • Texture Synthesis Efficiency: Compared to Text2Tex, WordRobe’s optimization-free texture synthesis method not only provides better view consistency but also operates significantly faster, making it a practical alternative for large-scale 3D garment generation.
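The Chamfer Distance cited above is a standard point-cloud metric; a minimal reference implementation (symmetric, using squared distances, evaluated on sampled surface points) looks like this. The exact variant used in the paper may differ, e.g. in normalization or sample count.

```python
import numpy as np

def chamfer_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Symmetric Chamfer distance between point clouds a (N,3) and b (M,3).

    For each point in one cloud, find the squared distance to its nearest
    neighbor in the other cloud; average both directions and sum.
    """
    # Pairwise squared distances via broadcasting, shape (N, M).
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return float(d2.min(axis=1).mean() + d2.min(axis=0).mean())
```

Point-to-Surface distance is computed analogously, except that the nearest-neighbor query is against the reconstructed mesh surface rather than a sampled point set, so it penalizes off-surface points more precisely.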

Implications and Future Directions

WordRobe's efficient generation of high-quality, textured 3D garments from text prompts has practical implications for content creation in virtual environments. Because the generated garment meshes are production-ready, they integrate directly with cloth simulation and animation pipelines, streamlining workflows in digital fashion and virtual-world creation.

The framework also opens avenues for future research, including the exploration of relighting to retain true albedo under varying lighting conditions and the extension to support layered clothing and material properties.

Conclusion

WordRobe marks a significant advance in the text-driven generation and editing of 3D garments, offering strong efficiency, quality, and practicality. Its contributions to learning a structured garment latent space and to view-consistent texture synthesis set new benchmarks in the field, fueling further research and development in 3D content creation for virtual environments.
