Coin3D: Controllable and Interactive 3D Assets Generation with Proxy-Guided Conditioning (2405.08054v1)
Abstract: As humans, we aspire to create media content that is both freely willed and readily controlled. Thanks to the prominent development of generative techniques, we can now easily use 2D diffusion methods to synthesize images controlled by a raw sketch or designated human poses, and even progressively edit or regenerate local regions with masked inpainting. However, similar workflows in 3D modeling tasks remain unavailable due to the lack of controllability and efficiency in 3D generation. In this paper, we present Coin3D, a novel controllable and interactive 3D asset modeling framework. Coin3D allows users to control 3D generation using a coarse geometry proxy assembled from basic shapes, and introduces an interactive generation workflow that supports seamless local part editing while delivering responsive 3D object previews within a few seconds. To this end, we develop several techniques: a 3D adapter that applies volumetric coarse-shape control to the diffusion model, a proxy-bounded editing strategy for precise part editing, a progressive volume cache to support responsive previewing, and volume-SDS to ensure consistent mesh reconstruction. Extensive experiments on interactive generation and editing with diverse shape proxies demonstrate that our method achieves superior controllability and flexibility in the 3D asset generation task.
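To make the proxy conditioning concrete: a coarse proxy assembled from basic shapes can be rasterized into a binary occupancy volume before any volumetric control is applied to the diffusion model. The sketch below is a minimal, hypothetical illustration of that first step only; the function name `voxelize_proxy`, the grid resolution, and the primitive parameters are assumptions, not the paper's actual implementation.

```python
import numpy as np

def voxelize_proxy(res: int = 32) -> np.ndarray:
    """Rasterize a toy proxy (a sphere stacked on a box) into a binary
    occupancy grid on [-1, 1]^3, as a stand-in for a user-assembled proxy."""
    xs = np.linspace(-1.0, 1.0, res)
    X, Y, Z = np.meshgrid(xs, xs, xs, indexing="ij")

    # Basic shape 1: a sphere of radius 0.35 centered at (0, 0.4, 0).
    sphere = (X**2 + (Y - 0.4)**2 + Z**2) <= 0.35**2
    # Basic shape 2: an axis-aligned box centered at (0, -0.3, 0).
    box = (np.abs(X) <= 0.5) & (np.abs(Y + 0.3) <= 0.3) & (np.abs(Z) <= 0.5)

    # The proxy is the union of the assembled primitives.
    return (sphere | box).astype(np.float32)

occ = voxelize_proxy(res=32)
```

Such an occupancy grid is one plausible volumetric encoding of the coarse proxy; the paper's 3D adapter would then condition the diffusion model on a feature volume derived from it.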
Authors: Wenqi Dong, Bangbang Yang, Lin Ma, Xiao Liu, Liyuan Cui, Hujun Bao, Yuewen Ma, Zhaopeng Cui