Sketch3D: Style-Consistent Guidance for Sketch-to-3D Generation (2404.01843v2)
Abstract: Recently, image-to-3D approaches have achieved significant results with a natural image as input. However, such color-rich inputs are not always available in practical applications, where only sketches may be at hand. Existing sketch-to-3D research suffers from limited applicability because sketches lack color information and multi-view content. To overcome these limitations, this paper proposes Sketch3D, a novel generation paradigm that produces realistic 3D assets whose shape is aligned with the input sketch and whose color matches the textual description. Concretely, Sketch3D first instantiates the given sketch as a reference image through a shape-preserving generation process. Second, the reference image is used to deduce a coarse 3D Gaussian prior, and multi-view style-consistent guidance images are generated from the renderings of the 3D Gaussians. Finally, three strategies are designed to optimize the 3D Gaussians: structural optimization via a distribution transfer mechanism, color optimization with a straightforward MSE loss, and sketch similarity optimization with a CLIP-based geometric similarity loss. Extensive visual comparisons and quantitative analyses illustrate the advantage of Sketch3D in generating realistic 3D assets while preserving consistency with the input sketch.
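As a rough illustration of the three optimization strategies named above, the following PyTorch-style sketch combines an MSE color term, a CLIP-based sketch similarity term, and a stand-in for the distribution transfer mechanism. It is a minimal sketch under stated assumptions: the `clip_encode` wrapper, the moment-matching form of the transfer, and the loss weights are illustrative choices, not taken from the paper.

```python
import torch.nn.functional as F

def clip_encode(clip_model, images):
    """Hypothetical wrapper around a CLIP image encoder (e.g., OpenAI CLIP's
    model.encode_image); returns unit-normalized embeddings for cosine similarity."""
    feats = clip_model.encode_image(images)      # (B, D) image embeddings
    return F.normalize(feats.float(), dim=-1)

def color_loss(rendering, guidance_image):
    """Color optimization: plain MSE between a rendered view and the
    corresponding style-consistent guidance image."""
    return F.mse_loss(rendering, guidance_image)

def sketch_similarity_loss(clip_model, rendered_edges, input_sketch):
    """Sketch similarity optimization: 1 - cosine similarity between CLIP
    embeddings of a sketch-like rendering and the input sketch."""
    f_r = clip_encode(clip_model, rendered_edges)
    f_s = clip_encode(clip_model, input_sketch)
    return (1.0 - (f_r * f_s).sum(dim=-1)).mean()

def distribution_transfer(positions, ref_positions):
    """Structural optimization (assumed form): moment matching that maps the
    Gaussian centers' distribution onto the reference distribution's statistics.
    The paper's exact mechanism may differ; this is an illustrative stand-in."""
    mu, sigma = positions.mean(0), positions.std(0) + 1e-8
    mu_r, sigma_r = ref_positions.mean(0), ref_positions.std(0)
    return (positions - mu) / sigma * sigma_r + mu_r

def sketch3d_losses(clip_model, rendering, guidance_image, rendered_edges,
                    input_sketch, positions, ref_positions,
                    w_color=1.0, w_sketch=0.1, w_struct=1.0):
    """Combine the three terms; the weights are placeholders, not paper values."""
    l_color = color_loss(rendering, guidance_image)
    l_sketch = sketch_similarity_loss(clip_model, rendered_edges, input_sketch)
    target = distribution_transfer(positions, ref_positions).detach()
    l_struct = F.mse_loss(positions, target)     # pull centers toward the prior
    return w_color * l_color + w_sketch * l_sketch + w_struct * l_struct
```

In practice the rendered views, the edge extraction that produces `rendered_edges`, and the Gaussian parameterization would all come from the splatting pipeline; the point of the sketch is only how the three loss terms attach to a rendered view, a guidance image, and the Gaussian centers.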