Open-Universe Indoor Scene Generation using LLM Program Synthesis and Uncurated Object Databases (2403.09675v1)
Abstract: We present a system for generating indoor scenes in response to text prompts. The prompts are not limited to a fixed vocabulary of scene descriptions, and the objects in generated scenes are not restricted to a fixed set of object categories -- we call this setting open-universe indoor scene generation. Unlike most prior work on indoor scene generation, our system does not require a large training dataset of existing 3D scenes. Instead, it leverages the world knowledge encoded in pre-trained LLMs to synthesize programs in a domain-specific layout language that describe objects and the spatial relations between them. Executing such a program produces a specification of a constraint satisfaction problem, which the system solves using a gradient-based optimization scheme to produce object positions and orientations. To produce object geometry, the system retrieves 3D meshes from a database. Unlike prior work, which uses databases of category-annotated, mutually-aligned meshes, we develop a pipeline using vision-LLMs (VLMs) to retrieve meshes from massive databases of un-annotated, inconsistently-aligned meshes. Experimental evaluations show that our system outperforms generative models trained on 3D data for traditional, closed-universe scene generation tasks; it also outperforms a recent LLM-based layout generation method on open-universe scene generation.
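The abstract's step of turning spatial relations into a constraint satisfaction problem and solving it by gradient-based optimization can be illustrated with a toy sketch. This is not the paper's layout language or solver: the constraint helpers (`distance_penalty`, `inside_room_penalty`), the two-object scene, and the use of finite-difference gradients in place of automatic differentiation are all assumptions made purely for illustration.

```python
import math

# Hypothetical toy constraints (not the paper's DSL). Each factory returns a
# penalty function that is zero when the relation holds.
# Positions are a flat list [x0, y0, x1, y1, ...], one (x, y) pair per object.

def distance_penalty(i, j, target):
    """Penalty when objects i and j are not `target` units apart."""
    def penalty(p):
        d = math.hypot(p[2 * i] - p[2 * j], p[2 * i + 1] - p[2 * j + 1])
        return (d - target) ** 2
    return penalty

def inside_room_penalty(i, width, depth):
    """Penalty when object i lies outside the [0, width] x [0, depth] floor."""
    def penalty(p):
        x, y = p[2 * i], p[2 * i + 1]
        over_x = max(0.0, -x) + max(0.0, x - width)
        over_y = max(0.0, -y) + max(0.0, y - depth)
        return over_x ** 2 + over_y ** 2
    return penalty

def total_loss(p, constraints):
    """Sum of all constraint penalties for the layout p."""
    return sum(c(p) for c in constraints)

def solve(p, constraints, lr=0.05, steps=2000, eps=1e-5):
    """Plain gradient descent using central finite differences."""
    p = list(p)
    for _ in range(steps):
        grad = []
        for k in range(len(p)):
            hi = p[:k] + [p[k] + eps] + p[k + 1:]
            lo = p[:k] + [p[k] - eps] + p[k + 1:]
            grad.append(
                (total_loss(hi, constraints) - total_loss(lo, constraints))
                / (2 * eps)
            )
        p = [v - lr * g for v, g in zip(p, grad)]
    return p

# Object 0 is a "table", object 1 a "chair" (illustrative labels only).
constraints = [
    distance_penalty(0, 1, 1.0),       # chair should sit 1 unit from the table
    inside_room_penalty(0, 4.0, 3.0),  # table must be inside a 4 x 3 room
    inside_room_penalty(1, 4.0, 3.0),  # chair must be inside the room
]
start = [5.0, 1.0, 1.0, 1.0]           # table starts outside the room
layout = solve(start, constraints)
final_loss = total_loss(layout, constraints)
```

In this sketch every relation is a differentiable penalty, so satisfying all constraints amounts to driving the summed loss toward zero. A real system would also optimize orientations and handle many more relation types, which this example leaves out.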