Director3D: Real-world Camera Trajectory and 3D Scene Generation from Text (2406.17601v1)

Published 25 Jun 2024 in cs.CV

Abstract: Recent advancements in 3D generation have leveraged synthetic datasets with ground truth 3D assets and predefined cameras. However, the potential of adopting real-world datasets, which can produce significantly more realistic 3D scenes, remains largely unexplored. In this work, we delve into the key challenge of the complex and scene-specific camera trajectories found in real-world captures. We introduce Director3D, a robust open-world text-to-3D generation framework, designed to generate both real-world 3D scenes and adaptive camera trajectories. To achieve this, (1) we first utilize a Trajectory Diffusion Transformer, acting as the Cinematographer, to model the distribution of camera trajectories based on textual descriptions. (2) Next, a Gaussian-driven Multi-view Latent Diffusion Model serves as the Decorator, modeling the image sequence distribution given the camera trajectories and texts. This model, fine-tuned from a 2D diffusion model, directly generates pixel-aligned 3D Gaussians as an immediate 3D scene representation for consistent denoising. (3) Lastly, the 3D Gaussians are refined by a novel SDS++ loss as the Detailer, which incorporates the prior of the 2D diffusion model. Extensive experiments demonstrate that Director3D outperforms existing methods, offering superior performance in real-world 3D generation.

Summary

  • The paper introduces a three-component framework—comprising a Trajectory Diffusion Transformer, a Gaussian-driven latent model, and an SDS++ refinement loss—to generate coherent 3D scenes from text.
  • The paper leverages real-world multi-view datasets to overcome limitations of synthetic models, achieving enhanced scene realism and consistent camera trajectories.
  • The paper validates its approach with qualitative improvements and quantitative gains in metrics such as BRISQUE, NIQE, and CLIP-Score over existing baseline methods.

Director3D (2406.17601) is a framework for generating realistic, open-world 3D scenes and corresponding camera trajectories from text descriptions. Addressing the limitations of previous text-to-3D methods that often rely on synthetic datasets or struggle with real-world complexity, Director3D leverages real-world multi-view data to produce higher-fidelity results. The paper identifies key challenges in using real-world data: complex and scene-specific camera trajectories, unbounded scenes, and the limited diversity/quantity of available real-world captures compared to synthetic datasets. Director3D tackles these issues with a three-component pipeline: a Cinematographer, a Decorator, and a Detailer.

The Cinematographer component is implemented as a Trajectory Diffusion Transformer (Traj-DiT). Its practical role is to generate a plausible, dense camera trajectory conditioned on the input text. Instead of using predefined camera paths, Traj-DiT learns to model the distribution of complex trajectories observed in real-world datasets like MVImgNet and DL3DV-10K. Camera parameters (rotation, translation, focal length, principal points) are treated as temporal tokens. An adaptation of the Diffusion Transformer (DiT) architecture, incorporating learnable temporal embeddings and cross-attention to text embeddings, is trained to denoise noisy camera trajectories. The training minimizes an $x_0$-prediction diffusion objective on camera parameters. For implementation, the paper uses the Adam optimizer with a learning rate of $1\times10^{-4}$ and trains for 50K iterations, which takes about 2 days on a single NVIDIA Tesla A100 GPU. The model uses 8 DiT blocks with a hidden size of 512. During inference, DDIM sampling with 100 steps is used for faster generation than the 1000 steps used in training. Trajectories are normalized to ensure consistency across scenes before being processed by the model.
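
To make the token layout and the $x_0$-prediction objective concrete, here is a minimal sketch of a Traj-DiT-style trajectory denoiser. The specific camera parameterization, block implementation (`nn.TransformerDecoderLayer` as a stand-in for a DiT block), noise schedule, and helper names are illustrative assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TrajDiTSketch(nn.Module):
    """Sketch: each camera is one temporal token of flattened parameters
    (here: 3 rotation + 3 translation + 1 focal + 2 principal point = 9 dims, an assumption)."""
    def __init__(self, cam_dim=9, num_frames=29, hidden=512, depth=8, text_dim=768):
        super().__init__()
        self.cam_proj = nn.Linear(cam_dim, hidden)
        self.text_proj = nn.Linear(text_dim, hidden)
        self.temporal_emb = nn.Parameter(torch.zeros(1, num_frames, hidden))  # learnable temporal embeddings
        self.t_emb = nn.Sequential(nn.Linear(1, hidden), nn.SiLU(), nn.Linear(hidden, hidden))
        # Self-attention over camera tokens plus cross-attention to text tokens per block.
        self.blocks = nn.ModuleList([
            nn.TransformerDecoderLayer(d_model=hidden, nhead=8, dim_feedforward=4 * hidden,
                                       batch_first=True, norm_first=True)
            for _ in range(depth)
        ])
        self.out = nn.Linear(hidden, cam_dim)  # x0-prediction head: clean camera parameters

    def forward(self, noisy_cams, t, text_tokens):
        # noisy_cams: (B, M, cam_dim); t: (B,) diffusion timesteps; text_tokens: (B, L, text_dim)
        h = self.cam_proj(noisy_cams) + self.temporal_emb \
            + self.t_emb(t[:, None].float() / 1000)[:, None]
        txt = self.text_proj(text_tokens)
        for blk in self.blocks:
            h = blk(h, txt)
        return self.out(h)

def training_step(model, cams, text_tokens, alphas_cumprod):
    """Standard DDPM forward noising with an x0-prediction MSE objective."""
    b = cams.shape[0]
    t = torch.randint(0, len(alphas_cumprod), (b,))
    a = alphas_cumprod[t].view(b, 1, 1)
    noise = torch.randn_like(cams)
    noisy = a.sqrt() * cams + (1 - a).sqrt() * noise
    pred_x0 = model(noisy, t, text_tokens)
    return F.mse_loss(pred_x0, cams)

# Toy usage (text embeddings assumed precomputed, e.g. from a frozen text encoder):
model = TrajDiTSketch()
cams = torch.randn(2, 29, 9)
text = torch.randn(2, 77, 768)
alphas = torch.linspace(0.9999, 0.01, 1000).cumprod(dim=0)  # illustrative schedule
loss = training_step(model, cams, text, alphas)
```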

The Decorator component, a Gaussian-driven Multi-view Latent Diffusion Model (GM-LDM), takes the generated camera trajectory and text as input to produce an initial 3D scene represented by pixel-aligned 3D Gaussians. This component is fine-tuned from a 2D Latent Diffusion Model (like Stable Diffusion v2.1) to leverage its strong image generation priors. To manage computational load, it operates on a sparse-view subset (N=8) sampled from the dense trajectory (M=29). The model performs denoising on multi-view latents obtained by encoding sparse-view images (real or rendered). Denoising involves both 2D-based steps (using modified cross-view self-attention in the U-Net) and rendering-based steps. In rendering-based denoising, denoised latents and additional features are passed through a Gaussians decoder ($\mathcal{D}_\mathcal{G}$), which is initialized from the original Stable Diffusion decoder, to output pixel-aligned 3D Gaussian parameters (depth, rotation, scale, opacity, color). These 3D Gaussians are then rendered, and the rendered images' latents are used in the denoising process; this rendering loop directly enforces 3D consistency during diffusion (a sketch of the pixel-aligned Gaussian assembly follows below). The model is trained with a combined loss $\mathcal{L} = \mathcal{L}_{\text{2d}} + \mathcal{L}_{\text{3d}}$, where $\mathcal{L}_{\text{2d}}$ is a standard multi-view latent diffusion objective and $\mathcal{L}_{\text{3d}}$ is a reconstruction loss comparing rendered images of the predicted Gaussians (using cameras from the dense trajectory) to ground truth images. A key practical consideration is addressing the limited real-world multi-view data; the GM-LDM is collaboratively trained with both multi-view datasets (MVImgNet, DL3DV-10K) and a large 2D dataset (LAION) to improve generalization. This significantly increases training data diversity. Training takes about a week on 16 NVIDIA Tesla A100 GPUs for 150K iterations, using the Adam optimizer with a learning rate of $5\times10^{-5}$. Image resolution is 256x256, corresponding to 32x32 latents. Classifier-free guidance scales vary between 2D-based (7.5) and rendering-based (1.0) denoising steps during inference to balance generalization and 3D consistency, with rendering-based steps used for 1/10 of the total denoising steps.
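
The sketch below illustrates how pixel-aligned 3D Gaussians can be assembled from per-pixel decoder outputs (depth, rotation, scale, opacity, color) by back-projecting every pixel with its predicted depth. The channel layout, activation functions, and tensor interfaces are assumptions for illustration; the paper's actual decoder is initialized from the Stable Diffusion decoder and its exact head is not reproduced here.

```python
import torch
import torch.nn.functional as F

def pixel_aligned_gaussians(decoder_out, K, cam2world):
    """
    decoder_out: (B, 12, H, W) per-pixel parameters, assumed layout:
                 [0] depth, [1:5] rotation quaternion, [5:8] scale, [8] opacity, [9:12] RGB.
    K:           (B, 3, 3) camera intrinsics.
    cam2world:   (B, 4, 4) camera-to-world extrinsics.
    Returns flattened Gaussian parameters, one Gaussian per pixel.
    """
    B, _, H, W = decoder_out.shape
    dev = decoder_out.device
    depth   = F.softplus(decoder_out[:, 0])            # positive per-pixel depth
    quat    = F.normalize(decoder_out[:, 1:5], dim=1)  # unit quaternion
    scale   = torch.exp(decoder_out[:, 5:8])           # positive scales
    opacity = torch.sigmoid(decoder_out[:, 8])
    color   = torch.sigmoid(decoder_out[:, 9:12])

    # Back-project each pixel with its depth to get the Gaussian center in world space.
    ys, xs = torch.meshgrid(torch.arange(H, device=dev), torch.arange(W, device=dev), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=-1).float()           # (H, W, 3) homogeneous pixels
    rays = torch.einsum("bij,hwj->bhwi", torch.inverse(K), pix)                # camera-space directions
    pts_cam = rays * depth[..., None]                                          # (B, H, W, 3)
    pts_h = torch.cat([pts_cam, torch.ones_like(pts_cam[..., :1])], dim=-1)
    centers = torch.einsum("bij,bhwj->bhwi", cam2world, pts_h)[..., :3]        # world-space centers

    flat = lambda t, c: t.permute(0, 2, 3, 1).reshape(B, -1, c)
    return {
        "xyz":     centers.reshape(B, -1, 3),
        "rot":     flat(quat, 4),
        "scale":   flat(scale, 3),
        "opacity": opacity.reshape(B, -1, 1),
        "rgb":     flat(color, 3),
    }
```

Because each Gaussian is tied to a pixel of a known camera, rendering these Gaussians from the other trajectory cameras gives the multi-view images whose latents feed back into the rendering-based denoising steps described above.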

The Detailer component refines the initial 3D Gaussians using a novel SDS++ loss. The purpose of this refinement is to improve visual details, which can be lacking in the initial Gaussians due to the GM-LDM's training data limitations. The SDS++ loss is designed to effectively leverage the prior of a pretrained 2D diffusion model (like Stable Diffusion v2.1). It optimizes the 3D Gaussian parameters by backpropagating gradients from images rendered at randomly interpolated camera poses along the dense trajectory. The loss formulation incorporates three key principles for effective score distillation: using an appropriate target distribution, adaptively estimating the current distribution of the rendered scene, and combining both latent-space and image-space objectives. The core of the SDS++ loss is defined as:

$\mathcal{L}_{\text{SDS++}} = \mathbb{E}_{t,c,\epsilon} \left[ w(t) \frac{\sqrt{\bar{\alpha}_t}}{\sqrt{1 - \bar{\alpha}_t}} \left( \lambda_z \| z - \hat{z} \|^2_2 + \lambda_x \| x - \hat{x} \|^2_2 \right) \right]$

where $z$ and $x$ are the latent and image of the rendered scene, $\hat{z}$ and $\hat{x}$ are the predicted latent and image from the 2D diffusion model, $t$ is the timestep, $c$ is the camera, $\epsilon$ is sampled noise, $w(t)$ is a weighting function, and $\lambda_z, \lambda_x$ are weights for the latent and image terms. The predicted noise $\hat{\epsilon}$ used to derive $\hat{z}$ incorporates a target prediction ($\hat{\epsilon}_{\text{trg}}$) based on the text condition and an adaptive source prediction ($\hat{\epsilon}_{\text{src}}$) estimated by the diffusion model with a learnable text embedding $\hat{y}$, allowing the optimization to push away from the current state. Refinement runs for 1000 iterations, which takes additional time after the initial GM-LDM generation (total generation time is around 5 minutes). Rendering for refinement is done at 512x512 resolution. Learning rates for Gaussian parameters are set adaptively (e.g., 0.0001 for position, 0.01 for rotation/opacity/color, 0.001 for scale), and the learnable text embedding $\hat{y}$ has a learning rate of 0.001. Timesteps for denoising are annealed.
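
The following is a minimal sketch of one SDS++ optimization step based on the loss as written above. The `render`, `vae_encode`, `vae_decode`, and `unet` callables are hypothetical interfaces, the CFG-style mixing of the target and adaptive source noise predictions is an assumption (the paper's exact combination rule is not reproduced here), and the update of the learnable embedding $\hat{y}$, which has its own learning rate, is not shown.

```python
import torch

def sds_pp_step(render, vae_encode, vae_decode, unet, text_emb, y_hat,
                alphas_cumprod, lambda_z=1.0, lambda_x=0.1, guidance=7.5):
    # Render the current 3D Gaussians at a randomly interpolated trajectory camera
    # (camera sampling assumed to happen inside `render`); gradients flow into the Gaussians.
    x = render()                                   # (B, 3, H, W)
    z = vae_encode(x)                              # (B, 4, h, w) latent of the rendered image

    t = torch.randint(20, 980, (z.shape[0],), device=z.device)   # annealing omitted for brevity
    a = alphas_cumprod[t].view(-1, 1, 1, 1)
    eps = torch.randn_like(z)
    z_t = a.sqrt() * z + (1 - a).sqrt() * eps      # forward-noised latent

    with torch.no_grad():
        eps_trg = unet(z_t, t, text_emb)           # target prediction, text-conditioned
        eps_src = unet(z_t, t, y_hat)              # adaptive source prediction (learnable embedding)
        eps_hat = eps_src + guidance * (eps_trg - eps_src)   # assumed CFG-style combination

    # Predicted clean latent/image from eps_hat; detached so gradients flow only through z and x.
    z_hat = ((z_t - (1 - a).sqrt() * eps_hat) / a.sqrt()).detach()
    x_hat = vae_decode(z_hat).detach()

    w = a.sqrt() / (1 - a).sqrt()                  # stands in for w(t) * sqrt(a_bar)/sqrt(1 - a_bar)
    loss_z = (w * (z - z_hat) ** 2).mean()
    loss_x = (w * (x - x_hat) ** 2).mean()
    return lambda_z * loss_z + lambda_x * loss_x
```

Backpropagating this loss updates only the 3D Gaussian parameters (and, in the paper, the source embedding via its own objective), so the pretrained 2D diffusion model stays frozen while supplying the refinement prior.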

In experiments, Director3D demonstrates superior performance. Qualitatively, it generates more realistic 3D scenes with consistent backgrounds, better lighting, shadows, and reflections compared to methods like GRM (synthetic, object-only), GaussianDreamer (oversaturated), DreamScene (oversaturated, cartoonish), and LucidDreamer (inconsistent, depth artifacts). Quantitative metrics such as BRISQUE and NIQE (lower indicates better image quality) and CLIP-Score (higher indicates better text alignment) show that Director3D significantly outperforms baseline methods, including its own version before refinement, on a T3Bench subset. Ablation studies confirm the importance of each component of the SDS++ loss for achieving high visual quality and the necessity of using scene-specific camera trajectories generated by Traj-DiT for coherent scene generation.

Implementation considerations include the computational cost, particularly for training the GM-LDM on large datasets and the iterative SDS++ refinement. The current GM-LDM supports a limited number of views, which restricts the range of camera motion it can handle directly. The open-world generalization, while improved by collaborative training and refinement, still struggles with highly complex or compositional prompts, exact object counts, and articulated objects. Despite these limitations, Director3D provides a robust framework for generating realistic, explorable 3D scenes from text, showcasing the potential of leveraging real-world multi-view data and structured diffusion models for this task. Future work could explore supporting more views, incorporating more diverse datasets, and improving efficiency and robustness for challenging prompts.
