StructLDM: Structured Latent Diffusion for 3D Human Generation (2404.01241v3)
Abstract: Recent 3D human generative models have achieved remarkable progress by learning 3D-aware GANs from 2D images. However, existing 3D human generative methods model humans in a compact 1D latent space, ignoring the articulated structure and semantics of human body topology. In this paper, we explore a more expressive, higher-dimensional latent space for 3D human modeling and propose StructLDM, a diffusion-based unconditional 3D human generative model learned from 2D images. StructLDM addresses the challenges posed by the higher dimensionality of the latent space with three key designs: 1) A semantic structured latent space defined on the dense surface manifold of a statistical human body template. 2) A structured 3D-aware auto-decoder that factorizes the global latent space into several semantic body parts, parameterized by a set of conditional structured local NeRFs anchored to the body template; it embeds the properties learned from the 2D training data and can be decoded to render view-consistent humans under different poses and clothing styles. 3) A structured latent diffusion model for generative human appearance sampling. Extensive experiments validate StructLDM's state-of-the-art generation performance and illustrate the expressiveness of the structured latent space over the widely adopted 1D latent space. Notably, StructLDM enables different levels of controllable 3D human generation and editing, including pose/view/shape control, and high-level tasks such as compositional generation, part-aware clothing editing, and 3D virtual try-on. Our project page is at: https://taohuumd.github.io/projects/StructLDM/.
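To illustrate why a latent laid out on the body surface makes part-aware editing natural, here is a minimal sketch of composing two structured latents with a body-part mask. It is not the paper's implementation: the latent shape (H, W, C), the `compose_latents` helper, and the rectangular "upper body" region are all hypothetical and chosen only for illustration.

```python
import numpy as np


def compose_latents(latent_a, latent_b, part_mask):
    """Take the masked region from latent_b and everything else from latent_a.

    latent_a, latent_b: arrays of shape (H, W, C), assumed to be 2D latent
        maps defined over a body template's surface parameterization.
    part_mask: boolean array of shape (H, W) marking one semantic body part.
    """
    if latent_a.shape != latent_b.shape:
        raise ValueError("latents must have the same shape")
    if part_mask.shape != latent_a.shape[:2]:
        raise ValueError("mask must match the spatial size of the latents")
    # Add a channel axis so the mask broadcasts over all C channels.
    mask = part_mask[..., None].astype(latent_a.dtype)
    return mask * latent_b + (1.0 - mask) * latent_a


# Toy example with hypothetical sizes: an 8x8 latent map with 4 channels.
rng = np.random.default_rng(0)
H, W, C = 8, 8, 4
latent_a = rng.standard_normal((H, W, C))
latent_b = rng.standard_normal((H, W, C))

# Hypothetical "upper body" region: the top half of the latent map.
upper_body = np.zeros((H, W), dtype=bool)
upper_body[: H // 2, :] = True

mixed = compose_latents(latent_a, latent_b, upper_body)
```

With a compact 1D latent vector there is no comparable spatial mask to apply, which is one way to read the abstract's claim that the structured latent space supports compositional generation and part-aware editing.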