
StructLDM: Structured Latent Diffusion for 3D Human Generation (2404.01241v3)

Published 1 Apr 2024 in cs.CV

Abstract: Recent 3D human generative models have achieved remarkable progress by learning 3D-aware GANs from 2D images. However, existing 3D human generative methods model humans in a compact 1D latent space, ignoring the articulated structure and semantics of human body topology. In this paper, we explore more expressive and higher-dimensional latent space for 3D human modeling and propose StructLDM, a diffusion-based unconditional 3D human generative model, which is learned from 2D images. StructLDM solves the challenges imposed due to the high-dimensional growth of latent space with three key designs: 1) A semantic structured latent space defined on the dense surface manifold of a statistical human body template. 2) A structured 3D-aware auto-decoder that factorizes the global latent space into several semantic body parts parameterized by a set of conditional structured local NeRFs anchored to the body template, which embeds the properties learned from the 2D training data and can be decoded to render view-consistent humans under different poses and clothing styles. 3) A structured latent diffusion model for generative human appearance sampling. Extensive experiments validate StructLDM's state-of-the-art generation performance and illustrate the expressiveness of the structured latent space over the well-adopted 1D latent space. Notably, StructLDM enables different levels of controllable 3D human generation and editing, including pose/view/shape control, and high-level tasks including compositional generations, part-aware clothing editing, 3D virtual try-on, etc. Our project page is at: https://taohuumd.github.io/projects/StructLDM/.


Summary

  • The paper introduces a semantic structured latent space, defined on the dense surface of a statistical human body template, for 3D human generation.
  • It employs a structured 3D-aware auto-decoder with part-specific Neural Radiance Fields (NeRFs) for view-consistent rendering.
  • Experiments report better FID scores than prior methods and demonstrate pose, view, and shape control, part-aware clothing editing, and virtual try-on.

StructLDM: Enhancing 3D Human Generation with Structured Latent Diffusion Models

Introduction

3D human generative models have improved considerably, but most still model the body in a compact 1D latent space, which ignores the articulated structure and semantics of human body topology. This paper introduces StructLDM, a diffusion-based unconditional 3D human generative model learned from 2D images. StructLDM instead uses a higher-dimensional, semantically structured latent space, which gives finer control over generation and editing than a single 1D code.

Challenges in Current 3D Human Modeling Techniques

Existing 3D human generative models, despite their progress, face significant hurdles:

  • They often oversimplify the human body's intricate structure, opting for a compact 1D latent space that limits control and expressiveness.
  • The generative quality, particularly for complex entities like humans, remains suboptimal when compared to simpler subjects such as faces or objects, indicating a need for more robust modeling methods.

StructLDM: Key Innovations and Design

StructLDM introduces three critical innovations to address these challenges:

  1. Semantic Structured Latent Space: By defining a latent space on the dense surface manifold of a statistical human body template, StructLDM captures the articulated nature of the human body, allowing for detailed appearance capture and editing.
  2. Structured 3D-Aware Auto-Decoder: This architecture factorizes the global latent space into body parts represented by conditional structured local Neural Radiance Fields (NeRFs). Such an arrangement enables the rendering of view-consistent humans under various poses and clothing styles.
  3. Structured Latent Diffusion Model: To sample human appearance, StructLDM trains a diffusion model in the structured latent space, using structure-specific normalization. The sampled latents support controllable 3D human generation and editing.
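
To make the third design concrete, here is a minimal sketch of noising a 2D structured latent with the standard DDPM forward process. It is an illustration under stated assumptions, not the authors' implementation: the latent shape, the linear beta schedule, and the per-location normalization (one plausible reading of "structure-specific normalization") are all assumptions, and every function name is hypothetical.

```python
import numpy as np

def linear_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear beta schedule (assumed)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def normalize_per_location(latents, eps=1e-6):
    """Standardize each (channel, row, column) entry across the dataset.

    Assumption: because every latent lives on the same body-template layout,
    statistics can be computed per location rather than per sample.
    """
    mean = latents.mean(axis=0, keepdims=True)
    std = latents.std(axis=0, keepdims=True) + eps
    return (latents - mean) / std, mean, std

def q_sample(x0, t, alpha_bar, rng):
    """DDPM forward step: x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * noise."""
    noise = rng.standard_normal(x0.shape)
    a = alpha_bar[t]
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * noise, noise

rng = np.random.default_rng(0)
# Toy dataset: 8 subjects, each a 4-channel 16x16 latent on the template grid.
latents = rng.standard_normal((8, 4, 16, 16)) * 3.0 + 1.0
z, mean, std = normalize_per_location(latents)
alpha_bar = linear_alpha_bar()
x_t, noise = q_sample(z[0], 500, alpha_bar, rng)
```

A denoising network would then be trained to predict `noise` from `x_t` and `t`; that network is omitted here.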

Demonstrated Capabilities and Applications

Beyond improving generation quality, StructLDM supports several kinds of controllable generation and editing:

  • Pose, view, and shape control for dynamic rendering of digital humans.
  • Compositional generations and part-aware editing without the need for explicit clothing masks.
  • Virtual try-on applications, allowing for realistic simulations of clothing on digital avatars.
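
Because every latent is laid out on the same body template, compositional generation and part-aware editing can be pictured as swapping regions between two latents. The sketch below illustrates that idea only. The part label map, the array shapes, and the function name are hypothetical. The label map is fixed on the template, so no per-image clothing mask is needed.

```python
import numpy as np

def compose_latents(source, donor, part_labels, parts_to_swap):
    """Copy the latent features of selected body parts from donor into source.

    source, donor: (C, H, W) latent maps laid out on the same template grid.
    part_labels:   (H, W) integer map assigning each location to a body part.
    """
    mask = np.isin(part_labels, parts_to_swap)        # (H, W) boolean
    return np.where(mask[None, :, :], donor, source)  # broadcast over channels

# Toy example: a 4x4 grid whose top half is part 0 and bottom half is part 1.
part_labels = np.array([[0] * 4, [0] * 4, [1] * 4, [1] * 4])
source = np.zeros((2, 4, 4))
donor = np.ones((2, 4, 4))
composed = compose_latents(source, donor, part_labels, parts_to_swap=[1])
```

In this toy case the composed latent keeps the source features for part 0 and takes the donor features for part 1; decoding the result would then render the mixed appearance.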

Empirical Validation and Performance

Experiments on three datasets (UBCFashion, RenderPeople, and THUman2.0) support StructLDM's generative performance. The model renders high-quality, view-consistent digital humans with diverse appearances in various poses, and it achieves better FID scores than existing state-of-the-art models, which supports the use of structured latent diffusion for modeling 3D humans.
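
For reference, FID (Fréchet Inception Distance) compares the statistics of Inception-network features extracted from real and generated images. The standard definition is given below. The symbols are introduced here for illustration: \mu_r, \Sigma_r are the feature mean and covariance for real images, and \mu_g, \Sigma_g are those for generated images. Lower values indicate that the two distributions are closer.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```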

Theoretical Implications and Future Directions

StructLDM's use of structured latent spaces and diffusion models for 3D human generation shows the value of building the articulated structure and semantics of the human body into a generative model. Looking forward, the framework's control over digital human representations could support work in virtual and augmented reality, animation, and other immersive digital experiences.

Conclusion

In summary, StructLDM advances 3D human generation by combining a structured latent space with a latent diffusion model. The approach addresses the limitations of 1D latent spaces in previous methods and enables applications such as compositional generation, part-aware editing, and virtual try-on. It offers a useful basis for future work in 3D human modeling.
