SceneTeller: Language-to-3D Scene Generation

(arXiv:2407.20727)
Published Jul 30, 2024 in cs.CV

Abstract

Designing high-quality indoor 3D scenes is important in many practical applications, such as room planning or game development. Conventionally, this has been a time-consuming process that requires both artistic skill and familiarity with professional software, making it hardly accessible to lay users. However, recent advances in generative AI have established a solid foundation for democratizing 3D design. In this paper, we propose a pioneering approach for text-based 3D room design. Given a prompt in natural language describing the object placement in the room, our method produces a high-quality 3D scene corresponding to it. With an additional text prompt the users can change the appearance of the entire scene or of individual objects in it. Built using in-context learning, CAD model retrieval and 3D-Gaussian-Splatting-based stylization, our turnkey pipeline produces state-of-the-art 3D scenes, while being easy to use even for novices. Our project page is available at https://sceneteller.github.io/.

Figure: Comparison of SceneTeller and state-of-the-art methods in generating high-quality, geometrically accurate 3D scenes.

Overview

  • Öcal et al. introduce SceneTeller, a framework designed for converting natural language descriptions into realistic 3D scenes, leveraging generative AI advancements like LLMs and 3D Gaussian Splatting (3DGS).

  • The paper highlights a turnkey pipeline for 3D scene generation, an LLM-based module for precise 3D layout generation, and demonstrates SceneTeller's superior performance through extensive quantitative and qualitative evaluations.

  • SceneTeller simplifies 3D scene creation for both novices and experts, with implications across industries such as architecture, game development, and virtual reality, while also setting new precedents in multi-modal AI research.

SceneTeller: Language-to-3D Scene Generation

In this paper, Öcal et al. introduce SceneTeller, a comprehensive framework for text-driven 3D scene generation, which targets both novice and experienced users in fields such as architecture and game development. By leveraging advancements in generative AI, particularly LLMs and 3D Gaussian Splatting (3DGS), SceneTeller aims to automate and simplify the labor-intensive process of 3D scene creation.

Key Contributions

  1. Turnkey Pipeline for 3D Scene Generation: The paper presents a meticulously designed pipeline that transforms natural language descriptions into realistic 3D scenes. The pipeline is capable of both object- and scene-level appearance editing, making it versatile and user-friendly.
  2. LLM-based 3D Layout Generation: The authors propose a novel LLM-based module for generating 3D layouts from textual descriptions. The module uses in-context learning to ensure precise control over individual object placement, thereby enhancing the global consistency of the generated scenes.
  3. Quantitative & Qualitative Superiority: SceneTeller demonstrates superior performance compared to state-of-the-art methods, as evidenced by extensive evaluations that highlight its geometric fidelity, compositional plausibility, and overall user-friendliness.

Methodology

The methodology splits the task into three primary stages:

Language-driven 3D Layout Generation:

  • The approach begins with generating a 3D layout from the textual description. Using in-context learning, the LLM is given a task specification and a set of exemplars, and returns a set of 3D bounding boxes for object placement.
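
For illustration, the in-context querying could look roughly like the following sketch. It assumes the OpenAI chat-completions client; the exemplar format, the JSON box schema (category, center, size, yaw), and the model name are illustrative assumptions, not the paper's actual prompt.

```python
# Hypothetical sketch of in-context 3D layout generation; the prompt
# structure and JSON schema are assumptions for illustration.
import json
from openai import OpenAI  # assumes the official openai>=1.0 client

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Exemplars pairing a room description with its layout: each box is a
# center (x, y, z), a size (w, h, d) in meters, and a yaw in degrees.
EXEMPLARS = [
    {
        "description": "A bedroom with a double bed against the north wall "
                       "and a nightstand to its left.",
        "layout": [
            {"category": "double_bed", "center": [2.0, 0.3, 0.9],
             "size": [1.6, 0.6, 2.0], "yaw": 0},
            {"category": "nightstand", "center": [0.9, 0.25, 0.3],
             "size": [0.5, 0.5, 0.4], "yaw": 0},
        ],
    },
]

def generate_layout(description: str) -> list[dict]:
    """Ask the LLM for 3D bounding boxes matching a room description."""
    messages = [{"role": "system",
                 "content": "You generate indoor 3D layouts. Reply with a "
                            "JSON list of boxes: category, center, size, yaw."}]
    for ex in EXEMPLARS:  # in-context exemplars as prior dialogue turns
        messages.append({"role": "user", "content": ex["description"]})
        messages.append({"role": "assistant",
                         "content": json.dumps(ex["layout"])})
    messages.append({"role": "user", "content": description})
    response = client.chat.completions.create(
        model="gpt-4", messages=messages, temperature=0.2)
    # Assumes the model replies with bare JSON, as the exemplars do.
    return json.loads(response.choices[0].message.content)

boxes = generate_layout("A cozy living room with a sofa facing a TV stand.")
```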

3D Scene Assembling:

  • Given the generated layout, CAD furniture models matching the described objects are retrieved and placed into the predicted bounding boxes, assembling the complete scene; a retrieval sketch follows below.
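
One plausible way to implement the retrieval step is to match box categories against asset captions by text-embedding similarity, then fit each asset into its box. The toy asset database, the sentence-transformers encoder, and the per-axis scaling below are illustrative assumptions, not the paper's pipeline.

```python
# Hedged sketch of CAD retrieval and placement.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

# Toy asset database: a caption plus the model's axis-aligned extent (meters).
ASSETS = [
    {"caption": "modern grey fabric three-seat sofa", "extent": [2.1, 0.9, 0.95]},
    {"caption": "low wooden TV stand with drawers",   "extent": [1.5, 0.5, 0.4]},
    {"caption": "round glass coffee table",           "extent": [0.9, 0.45, 0.9]},
]
asset_embs = encoder.encode([a["caption"] for a in ASSETS],
                            normalize_embeddings=True)

def retrieve_and_place(box: dict) -> dict:
    """Pick the asset whose caption best matches the box category,
    then scale and translate it into the predicted bounding box."""
    query = encoder.encode([box["category"].replace("_", " ")],
                           normalize_embeddings=True)
    best = int(np.argmax(asset_embs @ query.T))  # cosine similarity
    scale = np.array(box["size"]) / np.array(ASSETS[best]["extent"])
    return {"asset": ASSETS[best]["caption"],
            "scale": scale.tolist(),             # per-axis fit to the box
            "translation": box["center"],
            "yaw_deg": box.get("yaw", 0)}

placed = retrieve_and_place({"category": "sofa", "center": [1.0, 0.45, 2.0],
                             "size": [2.0, 0.9, 1.0], "yaw": 90})
```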

3D Scene Stylization:

  • Once the 3D scene is assembled, it is represented using 3D Gaussian Splatting. This representation allows for rapid rendering and facilitates scene editing based on user-provided text prompts.
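
The optimization behind such prompt-driven editing can be sketched as follows. The dummy renderer and the simple color-matching loss are stand-ins for a differentiable Gaussian rasterizer and a CLIP- or diffusion-guided text objective, which this minimal sketch does not implement.

```python
# Minimal, runnable sketch of prompt-driven appearance editing on a
# 3DGS-style scene: only per-Gaussian colors are optimized.
import torch

N = 10_000
colors = torch.rand(N, 3, requires_grad=True)   # per-Gaussian RGB (editable)
weights = torch.softmax(torch.randn(N), dim=0)  # fixed splat weights (stand-in)

def render(colors: torch.Tensor) -> torch.Tensor:
    """Placeholder differentiable 'renderer': a weighted blend of Gaussians."""
    return (weights[:, None] * colors).sum(dim=0)

# Placeholder for a text-conditioned objective, e.g. CLIP similarity to
# "a warm, candle-lit bedroom"; here we simply pull toward a warm tone.
target_style = torch.tensor([0.9, 0.6, 0.3])

optim = torch.optim.Adam([colors], lr=1e-2)
for step in range(200):
    optim.zero_grad()
    loss = torch.nn.functional.mse_loss(render(colors), target_style)
    loss.backward()
    optim.step()
```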

Evaluation and Results

User Study:

  • A user study involving 30 participants rates SceneTeller higher on realism, text alignment, geometric fidelity, and compositional plausibility compared to existing methods like GSGEN, LucidDreamer, and Set-the-Scene.

Numerical Results:

  • SceneTeller demonstrates a superior CLIP Score, indicating high consistency between the generated scene styles and textual descriptions. It also achieves a lower Fréchet Inception Distance (FID) score, confirming the realism of the generated scenes when compared with real indoor datasets like ScanNet.
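
Both metrics are standard and can be computed, for example, with torchmetrics as sketched below; the CLIP checkpoint, image sizes, and tiny batches are placeholders, not the paper's evaluation protocol.

```python
# Sketch of the two reported metrics using torchmetrics.
import torch
from torchmetrics.multimodal.clip_score import CLIPScore
from torchmetrics.image.fid import FrechetInceptionDistance

renders = torch.randint(0, 255, (8, 3, 224, 224), dtype=torch.uint8)  # fake renders
real = torch.randint(0, 255, (8, 3, 224, 224), dtype=torch.uint8)     # e.g. real indoor crops
prompts = ["a scandinavian-style bedroom"] * 8

# CLIP score: similarity between image embeddings and text embeddings.
clip_metric = CLIPScore(model_name_or_path="openai/clip-vit-base-patch16")
print("CLIP score:", clip_metric(renders, prompts))

# FID: distance between Inception feature statistics of real vs. rendered sets.
fid = FrechetInceptionDistance(feature=2048)
fid.update(real, real=True)
fid.update(renders, real=False)
print("FID:", fid.compute())
```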

Implications and Future Directions

SceneTeller's implications are twofold: practical and theoretical. Practically, it simplifies the process of 3D scene creation, making it accessible to non-experts. This democratization could lead to broader adoption in various industries, such as virtual reality, urban planning, and interior design. Theoretically, the integration of LLMs and 3DGS sets a new precedent for multi-modal AI research, inspiring future work on improving text-to-3D generation's accuracy and scalability.

Future research directions might explore enhancing the training datasets for greater diversity, improving the speed and efficiency of the rendering process, and extending the framework to support more complex and dynamic scenes. Another avenue could be developing more sophisticated natural language processing techniques to better interpret nuanced user requirements.

Conclusion

The SceneTeller framework by Öcal et al. represents a significant advancement in the field of text-driven 3D scene generation. By integrating LLMs for layout reasoning and 3DGS for real-time rendering, it offers a robust, user-friendly approach to creating high-quality 3D scenes. The paper's comprehensive evaluations emphasize its practical applicability and superior performance relative to contemporary techniques, making it a noteworthy contribution to both the AI and digital design communities.
