SceneTeller: Language-to-3D Scene Generation

(arXiv:2407.20727)
Published Jul 30, 2024 in cs.CV

Abstract

Designing high-quality indoor 3D scenes is important in many practical applications, such as room planning or game development. Conventionally, this has been a time-consuming process that requires both artistic skill and familiarity with professional software, making it hardly accessible to lay users. However, recent advances in generative AI have established a solid foundation for democratizing 3D design. In this paper, we propose a pioneering approach for text-based 3D room design. Given a prompt in natural language describing the object placement in the room, our method produces a high-quality 3D scene corresponding to it. With an additional text prompt the users can change the appearance of the entire scene or of individual objects in it. Built using in-context learning, CAD model retrieval and 3D-Gaussian-Splatting-based stylization, our turnkey pipeline produces state-of-the-art 3D scenes, while being easy to use even for novices. Our project page is available at https://sceneteller.github.io/.

Figure: Comparison of SceneTeller and state-of-the-art methods in generating high-quality, geometrically accurate 3D scenes.

Overview

  • Öcal et al. introduce SceneTeller, a framework designed for converting natural language descriptions into realistic 3D scenes, leveraging generative AI advancements like LLMs and 3D Gaussian Splatting (3DGS).

  • The paper highlights a turnkey pipeline for 3D scene generation, an LLM-based module for precise 3D layout generation, and demonstrates SceneTeller's superior performance through extensive quantitative and qualitative evaluations.

  • SceneTeller simplifies 3D scene creation for both novices and experts, with implications across industries such as architecture, game development, and virtual reality, while also setting new precedents in multi-modal AI research.

SceneTeller: Language-to-3D Scene Generation

In this paper, Öcal et al. introduce SceneTeller, a comprehensive framework for text-driven 3D scene generation, which targets both novice and experienced users in fields such as architecture and game development. By leveraging advancements in generative AI, particularly LLMs and 3D Gaussian Splatting (3DGS), SceneTeller aims to automate and simplify the labor-intensive process of 3D scene creation.

Key Contributions

  1. Turnkey Pipeline for 3D Scene Generation: The paper presents a meticulously designed pipeline that transforms natural language descriptions into realistic 3D scenes. The pipeline is capable of both object- and scene-level appearance editing, making it versatile and user-friendly.
  2. LLM-based 3D Layout Generation: The authors propose a novel LLM-based module for generating 3D layouts from textual descriptions. The module uses in-context learning to ensure precise control over individual object placement, thereby enhancing the global consistency of the generated scenes.
  3. Quantitative & Qualitative Superiority: SceneTeller demonstrates superior performance compared to state-of-the-art methods, as evidenced by extensive evaluations that highlight its geometric fidelity, compositional plausibility, and overall user-friendliness.

Methodology

The methodology splits the task into three primary stages:

Language-driven 3D Layout Generation:

  • The approach begins with generating a 3D layout from the textual description. Using in-context learning, the LLM is given a task specification and a set of exemplars, and returns a set of 3D bounding boxes for object placement.
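
For illustration, the in-context querying could look roughly like the following sketch. It assumes the OpenAI chat-completions client; the exemplar format, the JSON box schema (category, center, size, yaw), and the model name are illustrative assumptions, not the paper's actual prompt.

```python
# Hypothetical sketch of in-context 3D layout generation; the prompt
# structure and JSON schema are assumptions for illustration.
import json
from openai import OpenAI  # assumes the official openai>=1.0 client

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Exemplars pairing a room description with its layout: each box is a
# center (x, y, z), a size (w, h, d) in meters, and a yaw in degrees.
EXEMPLARS = [
    {
        "description": "A bedroom with a double bed against the north wall "
                       "and a nightstand to its left.",
        "layout": [
            {"category": "double_bed", "center": [2.0, 0.3, 0.9],
             "size": [1.6, 0.6, 2.0], "yaw": 0},
            {"category": "nightstand", "center": [0.9, 0.25, 0.3],
             "size": [0.5, 0.5, 0.4], "yaw": 0},
        ],
    },
]

def generate_layout(description: str) -> list[dict]:
    """Ask the LLM for 3D bounding boxes matching a room description."""
    messages = [{"role": "system",
                 "content": "You generate indoor 3D layouts. Reply with a "
                            "JSON list of boxes: category, center, size, yaw."}]
    for ex in EXEMPLARS:  # in-context exemplars as prior dialogue turns
        messages.append({"role": "user", "content": ex["description"]})
        messages.append({"role": "assistant",
                         "content": json.dumps(ex["layout"])})
    messages.append({"role": "user", "content": description})
    response = client.chat.completions.create(
        model="gpt-4", messages=messages, temperature=0.2)
    # Assumes the model replies with bare JSON, as the exemplars do.
    return json.loads(response.choices[0].message.content)

boxes = generate_layout("A cozy living room with a sofa facing a TV stand.")
```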

3D Scene Assembling:

  • Given the generated layout, CAD furniture models matching the described objects are retrieved and placed into the predicted bounding boxes, assembling the complete scene; a retrieval sketch follows below.
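
One plausible way to implement the retrieval step is to match box categories against asset captions by text-embedding similarity, then fit each asset into its box. The toy asset database, the sentence-transformers encoder, and the per-axis scaling below are illustrative assumptions, not the paper's pipeline.

```python
# Hedged sketch of CAD retrieval and placement.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

# Toy asset database: a caption plus the model's axis-aligned extent (meters).
ASSETS = [
    {"caption": "modern grey fabric three-seat sofa", "extent": [2.1, 0.9, 0.95]},
    {"caption": "low wooden TV stand with drawers",   "extent": [1.5, 0.5, 0.4]},
    {"caption": "round glass coffee table",           "extent": [0.9, 0.45, 0.9]},
]
asset_embs = encoder.encode([a["caption"] for a in ASSETS],
                            normalize_embeddings=True)

def retrieve_and_place(box: dict) -> dict:
    """Pick the asset whose caption best matches the box category,
    then scale and translate it into the predicted bounding box."""
    query = encoder.encode([box["category"].replace("_", " ")],
                           normalize_embeddings=True)
    best = int(np.argmax(asset_embs @ query.T))  # cosine similarity
    scale = np.array(box["size"]) / np.array(ASSETS[best]["extent"])
    return {"asset": ASSETS[best]["caption"],
            "scale": scale.tolist(),             # per-axis fit to the box
            "translation": box["center"],
            "yaw_deg": box.get("yaw", 0)}

placed = retrieve_and_place({"category": "sofa", "center": [1.0, 0.45, 2.0],
                             "size": [2.0, 0.9, 1.0], "yaw": 90})
```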

3D Scene Stylization:

  • Once the 3D scene is assembled, it is represented using 3D Gaussian Splatting. This representation allows for rapid rendering and facilitates scene editing based on user-provided text prompts.
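
The optimization behind such prompt-driven editing can be sketched as follows. The dummy renderer and the simple color-matching loss are stand-ins for a differentiable Gaussian rasterizer and a CLIP- or diffusion-guided text objective, which this minimal sketch does not implement.

```python
# Minimal, runnable sketch of prompt-driven appearance editing on a
# 3DGS-style scene: only per-Gaussian colors are optimized.
import torch

N = 10_000
colors = torch.rand(N, 3, requires_grad=True)   # per-Gaussian RGB (editable)
weights = torch.softmax(torch.randn(N), dim=0)  # fixed splat weights (stand-in)

def render(colors: torch.Tensor) -> torch.Tensor:
    """Placeholder differentiable 'renderer': a weighted blend of Gaussians."""
    return (weights[:, None] * colors).sum(dim=0)

# Placeholder for a text-conditioned objective, e.g. CLIP similarity to
# "a warm, candle-lit bedroom"; here we simply pull toward a warm tone.
target_style = torch.tensor([0.9, 0.6, 0.3])

optim = torch.optim.Adam([colors], lr=1e-2)
for step in range(200):
    optim.zero_grad()
    loss = torch.nn.functional.mse_loss(render(colors), target_style)
    loss.backward()
    optim.step()
```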

Evaluation and Results

User Study:

  • A user study involving 30 participants rates SceneTeller higher on realism, text alignment, geometric fidelity, and compositional plausibility compared to existing methods like GSGEN, LucidDreamer, and Set-the-Scene.

Numerical Results:

  • SceneTeller demonstrates a superior CLIP Score, indicating high consistency between the generated scene styles and textual descriptions. It also achieves a lower Fréchet Inception Distance (FID) score, confirming the realism of the generated scenes when compared with real indoor datasets like ScanNet.
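
Both metrics are standard and can be computed, for example, with torchmetrics as sketched below; the CLIP checkpoint, image sizes, and tiny batches are placeholders, not the paper's evaluation protocol.

```python
# Sketch of the two reported metrics using torchmetrics.
import torch
from torchmetrics.multimodal.clip_score import CLIPScore
from torchmetrics.image.fid import FrechetInceptionDistance

renders = torch.randint(0, 255, (8, 3, 224, 224), dtype=torch.uint8)  # fake renders
real = torch.randint(0, 255, (8, 3, 224, 224), dtype=torch.uint8)     # e.g. real indoor crops
prompts = ["a scandinavian-style bedroom"] * 8

# CLIP score: similarity between image embeddings and text embeddings.
clip_metric = CLIPScore(model_name_or_path="openai/clip-vit-base-patch16")
print("CLIP score:", clip_metric(renders, prompts))

# FID: distance between Inception feature statistics of real vs. rendered sets.
fid = FrechetInceptionDistance(feature=2048)
fid.update(real, real=True)
fid.update(renders, real=False)
print("FID:", fid.compute())
```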

Implications and Future Directions

SceneTeller's implications are twofold: practical and theoretical. Practically, it simplifies the process of 3D scene creation, making it accessible to non-experts. This democratization could lead to broader adoption in various industries, such as virtual reality, urban planning, and interior design. Theoretically, the integration of LLMs and 3DGS sets a new precedent for multi-modal AI research, inspiring future work on improving text-to-3D generation's accuracy and scalability.

Future research directions might explore enhancing the training datasets for greater diversity, improving the speed and efficiency of the rendering process, and extending the framework to support more complex and dynamic scenes. Another avenue could be developing more sophisticated natural language processing techniques to better interpret nuanced user requirements.

Conclusion

The SceneTeller framework by Öcal et al. represents a significant advancement in the field of text-driven 3D scene generation. By integrating LLMs for layout reasoning and 3DGS for real-time rendering, it offers a robust, user-friendly approach to creating high-quality 3D scenes. The paper's comprehensive evaluations emphasize its practical applicability and superior performance relative to contemporary techniques, making it a noteworthy contribution to both the AI and digital design communities.
