Emergent Mind

Abstract

We introduce HouseCrafter, a novel approach that lifts a floorplan into a complete, large 3D indoor scene (e.g., a house). Our key insight is to adapt a 2D diffusion model, trained on web-scale images, to generate consistent multi-view color (RGB) and depth (D) images across different locations of the scene. Specifically, the RGB-D images are generated autoregressively in a batch-wise manner along locations sampled from the floorplan, with previously generated images serving as conditions for the diffusion model when producing images at nearby locations. The global floorplan and the attention design of the diffusion model together ensure the consistency of the generated images, from which a 3D scene can be reconstructed. Through extensive evaluation on the 3D-Front dataset, we demonstrate that HouseCrafter can generate high-quality house-scale 3D scenes. Ablation studies also validate the effectiveness of different design choices. We will release our code and model weights. Project page: https://neu-vi.github.io/houseCrafter/

HouseCrafter converts floorplans to 3D scenes, generating and fusing multi-view RGB-D images into detailed meshes.

Overview

  • HouseCrafter proposes a method to generate 3D indoor scenes from 2D floorplans using advanced 2D diffusion models, which leverage pre-trained models to synthesize and fuse RGB and depth images from multiple viewpoints.

  • Key contributions include an autoregressive generation of multi-view RGB-D images, the introduction of a layout-attention mechanism for incorporating floorplan guidance, and the use of depth information for enhanced view synthesis.

  • The methodology combines a novel-view synthesis model with depth-enhanced camera positional encoding, achieving superior results over baseline methods on quantitative metrics and in user studies, with practical and theoretical implications for several industries.

HouseCrafter: Lifting Floorplans to 3D Scenes with 2D Diffusion Models

The paper "HouseCrafter: Lifting Floorplans to 3D Scenes with 2D Diffusion Models" presents a method for generating large-scale 3D indoor scenes from 2D floorplans using 2D diffusion models. The system leverages pre-trained 2D models, originally trained on vast amounts of 2D image data, to synthesize RGB and depth images at multiple viewpoints. These generated images are subsequently fused to reconstruct a consistent and detailed 3D scene. HouseCrafter robustly handles the complexities of house-scale environments, producing detailed, coherent 3D representations that remain faithful to the given floorplans.

Key Contributions

  1. Autoregressive Generation of RGB-D Images: HouseCrafter adapts a 2D diffusion model to autoregressively generate multi-view RGB-D images. This generation is done in a batch-wise manner using previously generated images as conditions, ensuring inter-view consistency. The method uses a novel-view synthesis pipeline allowing efficient and semantically consistent generation.

  2. Integration of 2D Floorplan Guidance: The model introduces a layout-attention mechanism to incorporate floorplan information at different scales into the diffusion process, improving the global consistency of the generated large-scale scenes. The injection of geometric and semantic details from the floorplan ensures adherence to the specified configuration.

  3. Depth-Enhanced View Synthesis: HouseCrafter includes depth information in both input and output stages, decoupling geometry and appearance. This enhancement facilitates a more accurate 3D scene reconstruction, addressing the limitations of prior methods that suffer from scale ambiguity and depth inconsistencies.
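The batch-wise autoregressive scheme described in (1) can be sketched as a loop over sampled camera poses, where each new batch is conditioned on the nearest views generated so far. The interfaces below (`diffusion_step` and its arguments) are hypothetical stand-ins for the paper's conditional diffusion sampler, not its actual code:

```python
import numpy as np

def generate_scene_views(poses, diffusion_step, batch_size=4, n_refs=2):
    """Sketch of batch-wise autoregressive RGB-D generation.

    `poses` is an (N, 3) array of camera positions sampled from the
    floorplan; `diffusion_step(target_poses, ref_images, ref_poses)`
    stands in for one conditional sampling pass of the diffusion model.
    """
    generated = {}  # pose index -> synthesized RGB-D image
    for start in range(0, len(poses), batch_size):
        batch_idx = list(range(start, min(start + batch_size, len(poses))))
        # Condition on the nearest already-generated views, if any exist.
        ref_idx = sorted(
            generated,
            key=lambda i: min(np.linalg.norm(poses[i] - poses[j]) for j in batch_idx),
        )[:n_refs]
        ref_images = [generated[i] for i in ref_idx]
        ref_poses = poses[ref_idx] if ref_idx else poses[:0]
        outputs = diffusion_step(poses[batch_idx], ref_images, ref_poses)
        generated.update(zip(batch_idx, outputs))
    return [generated[i] for i in range(len(poses))]
```

The first batch has no references and is generated unconditionally; every later batch reuses earlier outputs, which is what propagates appearance and geometry across the scene.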

Methodology

Novel View RGB-D Image Generation

The core of HouseCrafter is its novel-view synthesis model, which extends the pre-trained UNet from Stable Diffusion v1.5 to handle RGB-D data. The model processes multiple views simultaneously, ensuring cross-view consistency. The floorplan is integrated at several layers of the UNet through a layout-attention mechanism, which lets the input latent features be modulated by the encoded layout information independently for each ray passing through the image.
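A minimal single-head version of such a layout-attention step can be written as cross-attention from per-ray latent features to floorplan tokens. The shapes and projection matrices here are illustrative assumptions, not the paper's exact operator:

```python
import numpy as np

def layout_attention(latent, layout_tokens, w_q, w_k, w_v):
    """Toy single-head layout-attention: every latent feature (one per
    image ray) independently attends to the encoded floorplan tokens.

    latent: (H*W, d) UNet features; layout_tokens: (L, d_l) layout encoding;
    w_q: (d, d_k), w_k: (d_l, d_k), w_v: (d_l, d) projection matrices.
    """
    q = latent @ w_q                               # queries, (H*W, d_k)
    k = layout_tokens @ w_k                        # keys,    (L, d_k)
    v = layout_tokens @ w_v                        # values,  (L, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])        # scaled dot-product
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)       # softmax over layout tokens
    return latent + attn @ v                       # residual modulation
```

Because each latent position forms its own query, the floorplan conditioning is applied per ray, matching the "independently for each ray" behavior described above.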

Depth-Enhanced Camera Positional Encoding (DeCaPE)

To leverage depth information from reference views, the model employs DeCaPE, an augmented positional encoding that incorporates 3D positions of reference image features. This encoding improves the cross-attention mechanism between target and reference features, enhancing the geometric consistency across views.
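The geometric ingredient behind a depth-aware encoding like DeCaPE is lifting each reference pixel to a world-space 3D point and encoding that position. The sketch below shows standard depth unprojection plus a sinusoidal (Fourier) feature map; the function interfaces are illustrative, not the paper's:

```python
import numpy as np

def unproject_depth(depth, K, cam_to_world):
    """Lift a depth map to world-space 3D points, one per pixel.

    depth: (H, W) metric depths; K: (3, 3) intrinsics;
    cam_to_world: (4, 4) camera-to-world extrinsics.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)
    rays = pix @ np.linalg.inv(K).T               # camera-space directions
    pts_cam = rays * depth.reshape(-1, 1)         # scale by metric depth
    pts_h = np.concatenate([pts_cam, np.ones((pts_cam.shape[0], 1))], axis=1)
    return (pts_h @ cam_to_world.T)[:, :3]        # world-space positions

def sinusoidal_encode(pts, n_freqs=4):
    """Map 3D positions to a Fourier feature vector per point."""
    freqs = 2.0 ** np.arange(n_freqs)
    angles = pts[:, :, None] * freqs              # (N, 3, n_freqs)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return enc.reshape(pts.shape[0], -1)          # (N, 3 * 2 * n_freqs)
```

Attaching such position features to reference-image tokens gives the cross-attention a shared 3D coordinate frame, which is the intuition behind the improved geometric consistency.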

Results

The method has been evaluated on the 3D-Front dataset, showcasing its capability to generate high-quality 3D scenes from floorplans. Quantitative metrics for image quality (FID, IS) and depth (AbsRel, $\delta_i$) demonstrate the superior performance of HouseCrafter over baseline methods like CC3D and Text2Room. The ablation studies underline the importance of depth conditioning and floorplan guidance, showing significant improvements in consistency and visual fidelity when these components are included.
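The depth metrics named above are the standard monocular-depth errors; for reference, AbsRel and the $\delta_i$ inlier ratios can be computed as follows (a generic implementation of these standard definitions, not the paper's evaluation code):

```python
import numpy as np

def depth_metrics(pred, gt):
    """AbsRel and delta_i inlier ratios for predicted vs. ground-truth depth.

    AbsRel = mean(|pred - gt| / gt);
    delta_i = fraction of pixels with max(pred/gt, gt/pred) < 1.25**i.
    """
    pred, gt = pred.ravel(), gt.ravel()
    valid = gt > 0                      # ignore pixels without ground truth
    pred, gt = pred[valid], gt[valid]
    abs_rel = np.mean(np.abs(pred - gt) / gt)
    ratio = np.maximum(pred / gt, gt / pred)
    deltas = {f"delta_{i}": np.mean(ratio < 1.25 ** i) for i in (1, 2, 3)}
    return {"abs_rel": abs_rel, **deltas}
```

Lower AbsRel and higher $\delta_i$ indicate more accurate depth, which is the direction of improvement reported for HouseCrafter.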

User Study and Layout Compliance

An extensive user study further corroborates the quantitative results, indicating a strong preference for HouseCrafter's outputs in terms of visual appeal and alignment with given floorplans. Additionally, the use of ODIN for layout compliance metrics confirms that HouseCrafter's generated scenes better adhere to the input floorplan configuration, with mAP scores significantly higher than those of the baselines.

Implications and Future Directions

The research presented in this paper holds substantial practical and theoretical implications. On a practical level, it offers a scalable and efficient tool for generating detailed 3D indoor scenes, which can significantly reduce manual effort in industries like architecture, interior design, and real estate visualization. Theoretically, this work demonstrates the potential of combining 2D generative models with floorplan guidance to overcome the challenges associated with scarce 3D data.

Future research could explore:

  • Enhanced 3D Reconstruction Techniques: Developing reconstruction methods that can model view-dependent colors to improve the realism of the textured meshes.
  • Optimized Pose Sampling: Designing more efficient pose sampling strategies that balance between consistency and computational efficiency.
  • Instance-aware Generation: Integrating instance-level information to further improve fidelity to the input floorplans.

Overall, HouseCrafter is a notable advancement towards automated, scalable, and high-fidelity 3D scene generation from 2D layouts, pushing the boundaries of current techniques and opening new avenues for practical applications and research enhancements.
