AvatarBooth: High-Quality and Customizable 3D Human Avatar Generation (2306.09864v1)
Abstract: We introduce AvatarBooth, a novel method for generating high-quality 3D avatars from text prompts or specific images. Unlike previous approaches, which can only synthesize avatars from simple text descriptions, our method enables the creation of personalized avatars from casually captured face or body images, while still supporting text-based model generation and editing. Our key contribution is precise control over avatar generation, achieved by using two fine-tuned diffusion models, one for the human face and one for the body. This enables us to capture intricate details of facial appearance, clothing, and accessories, resulting in highly realistic avatars. Furthermore, we introduce a pose-consistent constraint into the optimization process to enhance the multi-view consistency of the head images synthesized by the diffusion model and thus eliminate interference from uncontrolled human poses. In addition, we present a multi-resolution rendering strategy that facilitates coarse-to-fine supervision of 3D avatar generation, thereby enhancing the performance of the proposed system. The resulting avatar model can be further edited using additional text descriptions and driven by motion sequences. Experiments show that AvatarBooth outperforms previous text-to-3D methods in terms of rendering and geometric quality, whether generating from text prompts or from specific images. Project website: https://zeng-yifei.github.io/avatarbooth_page/.
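The coarse-to-fine supervision mentioned in the abstract can be pictured as a schedule that raises the render resolution as optimization progresses. The sketch below is only one possible reading: the abstract does not say how resolutions are chosen, so the `STAGES` table, its step thresholds and resolutions, and the `render_resolution` helper are all hypothetical names and values introduced for illustration.

```python
from typing import List, Tuple

# Hypothetical schedule: (first training step of the stage, render resolution in pixels).
# These values are illustrative assumptions, not taken from the paper.
STAGES: List[Tuple[int, int]] = [(0, 64), (2000, 128), (6000, 256)]


def render_resolution(step: int, stages: List[Tuple[int, int]] = STAGES) -> int:
    """Return the render resolution for a training step (coarse first, fine later)."""
    if step < 0:
        raise ValueError("step must be non-negative")
    resolution = stages[0][1]
    # Stages are listed in increasing order of start step; keep the last one reached.
    for start_step, res in stages:
        if step >= start_step:
            resolution = res
    return resolution


if __name__ == "__main__":
    for step in (0, 1999, 2000, 7500):
        print(step, render_resolution(step))
```

In a step-based schedule like this, early iterations see cheap low-resolution renders that constrain overall shape, and later iterations see higher-resolution renders that supervise fine detail. The actual system may instead combine several resolutions within a single iteration.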
- Yifei Zeng
- Yuanxun Lu
- Xinya Ji
- Yao Yao
- Hao Zhu
- Xun Cao