Ctrl-Room: Controllable Text-to-3D Room Meshes Generation with Layout Constraints (2310.03602v4)
Abstract: Text-driven 3D indoor scene generation is useful for gaming, the film industry, and AR/VR applications. However, existing methods cannot faithfully capture the room layout, nor do they allow flexible editing of individual objects in the room. To address these problems, we present Ctrl-Room, which can generate convincing 3D rooms with designer-style layouts and high-fidelity textures from just a text prompt. Moreover, Ctrl-Room enables versatile interactive editing operations such as resizing or moving individual furniture items. Our key insight is to separate the modeling of layout and appearance. Our proposed method consists of two stages: a Layout Generation Stage and an Appearance Generation Stage. The Layout Generation Stage trains a text-conditional diffusion model to learn the layout distribution with our holistic scene code parameterization. Next, the Appearance Generation Stage employs a fine-tuned ControlNet to produce a vivid panoramic image of the room, guided by the 3D scene layout and the text prompt. We thus achieve high-quality 3D room generation with convincing layouts and lively textures. Benefiting from the scene code parameterization, we can easily edit the generated room model through our mask-guided editing module, without expensive edit-specific training. Extensive experiments on the Structured3D dataset demonstrate that our method outperforms existing methods in producing more reasonable, view-consistent, and editable 3D rooms from natural language prompts.
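The two-stage design described above can be sketched in code. This is only an illustrative outline, not the paper's implementation: the scene code is shown as a list of semantic object boxes, and the layout diffusion model, ControlNet panorama generator, and mask-guided editor are replaced by hypothetical stand-in functions (`layout_stage`, `appearance_stage`, `move_object`) to make the data flow concrete.

```python
from dataclasses import dataclass, replace
from typing import List, Tuple

@dataclass(frozen=True)
class ObjectCode:
    """One entry of a holistic scene code: a semantic box for a furniture item
    (category, center position, size, and yaw orientation)."""
    category: str
    center: Tuple[float, float, float]
    size: Tuple[float, float, float]
    angle: float

def layout_stage(prompt: str) -> List[ObjectCode]:
    """Stand-in for the text-conditional layout diffusion model.
    Here it returns a fixed plausible layout purely for illustration."""
    return [
        ObjectCode("bed",        (1.0, 0.0, 0.3), (2.0, 0.5, 1.6), 0.0),
        ObjectCode("nightstand", (2.2, 0.0, 0.2), (0.4, 0.4, 0.4), 0.0),
    ]

def appearance_stage(layout: List[ObjectCode], prompt: str) -> str:
    """Stand-in for the fine-tuned ControlNet panorama generator: in the real
    pipeline the layout is rendered to a semantic panorama that conditions
    image synthesis; here we just produce a textual summary."""
    return f"panorama[{prompt}]: " + ", ".join(o.category for o in layout)

def move_object(layout: List[ObjectCode], category: str,
                delta: Tuple[float, float, float]) -> List[ObjectCode]:
    """Analogue of mask-guided editing: translate one object's box in the
    scene code; appearance is then regenerated for the affected region."""
    return [
        replace(o, center=(o.center[0] + delta[0],
                           o.center[1] + delta[1],
                           o.center[2] + delta[2]))
        if o.category == category else o
        for o in layout
    ]

layout = layout_stage("a cozy bedroom")        # stage 1: text -> layout
pano = appearance_stage(layout, "a cozy bedroom")  # stage 2: layout -> panorama
edited = move_object(layout, "bed", (0.5, 0.0, 0.0))  # interactive edit
```

Because layout and appearance are decoupled, an edit touches only the scene code; the appearance stage is re-run with the updated layout rather than retraining anything.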