MultiDiff: Consistent Novel View Synthesis from a Single Image

(2406.18524)
Published Jun 26, 2024 in cs.CV

Abstract

We introduce MultiDiff, a novel approach for consistent novel view synthesis of scenes from a single RGB image. The task of synthesizing novel views from a single reference image is highly ill-posed by nature, as there exist multiple plausible explanations for unobserved areas. To address this issue, we incorporate strong priors in the form of monocular depth predictors and video-diffusion models. Monocular depth enables us to condition our model on warped reference images for the target views, increasing geometric stability. The video-diffusion prior provides a strong proxy for 3D scenes, allowing the model to learn continuous and pixel-accurate correspondences across generated images. In contrast to approaches relying on autoregressive image generation that are prone to drift and error accumulation, MultiDiff jointly synthesizes a sequence of frames, yielding high-quality and multi-view consistent results -- even for long-term scene generation with large camera movements, while reducing inference time by an order of magnitude. For additional consistency and image quality improvements, we introduce a novel structured noise distribution. Our experimental results demonstrate that MultiDiff outperforms state-of-the-art methods on the challenging, real-world datasets RealEstate10K and ScanNet. Finally, our model naturally supports multi-view consistent editing without the need for further tuning.

MultiDiff: a pose-conditional diffusion model for novel view synthesis from a single image.

Overview

  • MultiDiff introduces a novel method for synthesizing new views from a single RGB image by leveraging monocular depth prediction and video diffusion models to maintain geometric stability and temporal consistency.

  • The method uses a latent diffusion model enhanced with monocular depth and video priors, as well as a structured noise distribution, to generate frame sequences jointly, reducing error accumulation and inference time.

  • Experimental evaluations on datasets like RealEstate10K and ScanNet show that MultiDiff outperforms state-of-the-art methods in key metrics such as PSNR, LPIPS, FID, KID, and FVD.

MultiDiff: Consistent Novel View Synthesis from a Single Image

The paper "MultiDiff: Consistent Novel View Synthesis from a Single Image" introduces an innovative methodology for synthesizing novel views of a scene using only a single reference RGB image as input. This task is inherently ill-posed due to the limited information available in a single image to accurately predict unobserved areas. The proposed method leverages strong priors through monocular depth prediction and video diffusion models to address this challenge, ensuring geometric stability and temporal consistency in the generated views.

Methodology

MultiDiff employs a latent diffusion model framework enhanced with complementary priors to generate consistent novel views. The key components and contributions of the method are:

Priors and Conditioning:

  • Monocular Depth Prediction: A monocular depth estimator is used to warp the reference image into each target view, and the model is conditioned on these warped images (see the sketch after this list). This improves geometric stability even when the predicted depth contains errors and noise.
  • Video Diffusion Models: These models serve as a proxy for 3D scene understanding, allowing the model to maintain pixel-accurate correspondences across generated frames, thus enhancing temporal consistency.
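
To make the depth-based conditioning concrete, below is a minimal sketch of forward-warping a reference image into a target view under a pinhole camera model. The function name, the nearest-neighbor splatting, and the missing z-buffer are simplifications for illustration, not the paper's implementation.

```python
import numpy as np

def warp_reference_to_target(ref_img, ref_depth, K, T_ref_to_tgt):
    """Forward-warp a reference image into a target view using predicted
    monocular depth. Simplified sketch: nearest-neighbor splatting and no
    z-buffer, so mutually occluding points may overwrite each other.

    ref_img:      (H, W, C) reference image (or any per-pixel feature map)
    ref_depth:    (H, W)    predicted depth per reference pixel
    K:            (3, 3)    pinhole intrinsics shared by both views
    T_ref_to_tgt: (4, 4)    relative camera pose (reference -> target)
    """
    H, W, C = ref_img.shape
    # Homogeneous pixel grid of the reference view.
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T  # (3, H*W)

    # Unproject to 3D points in the reference camera frame, scaled by depth.
    pts_ref = (np.linalg.inv(K) @ pix) * ref_depth.reshape(1, -1)

    # Move the points into the target camera frame and project them.
    pts_h = np.vstack([pts_ref, np.ones((1, pts_ref.shape[1]))])
    pts_tgt = (T_ref_to_tgt @ pts_h)[:3]
    uv = (K @ pts_tgt)[:2] / np.clip(pts_tgt[2:3], 1e-6, None)

    # Splat pixels into the target image; locations that receive nothing stay
    # zero and mark disoccluded regions the diffusion model has to fill in.
    warped = np.zeros_like(ref_img)
    x, y = np.round(uv).astype(int)
    valid = (x >= 0) & (x < W) & (y >= 0) & (y < H) & (pts_tgt[2] > 0)
    warped[y[valid], x[valid]] = ref_img.reshape(-1, C)[valid]
    return warped
```

The warped image serves as a per-frame conditioning signal; the diffusion model then corrects warping artifacts and fills in the disoccluded regions.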

Structured Noise Distribution:

  • A novel structured noise distribution is introduced, which ensures that noise is correlated across different views, further improving multi-view consistency (one possible construction is sketched below).
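
The exact construction is not spelled out in this summary, but one plausible way to correlate noise across views, shown below, is to sample a single noise map for the reference view, carry it into each target view with the same depth-based warp as above, and fill disoccluded pixels with fresh Gaussian noise. The helper reuse, the hole heuristic, and the assumption that depth is resized to the latent resolution are illustrative choices, not the paper's definition.

```python
import numpy as np

def structured_noise(ref_depth, K, poses_ref_to_tgt, channels=4, seed=0):
    """Sketch of view-correlated noise (one plausible reading, not the
    paper's exact construction): a shared noise map is warped into every
    target view, and pixels the warp misses get fresh i.i.d. Gaussian noise.
    Reuses warp_reference_to_target from the previous sketch and assumes
    ref_depth is already at the latent resolution.
    """
    rng = np.random.default_rng(seed)
    H, W = ref_depth.shape
    ref_noise = rng.standard_normal((H, W, channels))

    per_view_noise = []
    for T in poses_ref_to_tgt:
        warped = warp_reference_to_target(ref_noise, ref_depth, K, T)
        holes = np.all(warped == 0, axis=-1)   # crude disocclusion heuristic
        warped[holes] = rng.standard_normal((int(holes.sum()), channels))
        per_view_noise.append(warped)
    return np.stack(per_view_noise)            # (num_views, H, W, channels)
```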

Joint Frame Synthesis:

  • Unlike autoregressive models, which suffer from error accumulation over long sequences, MultiDiff synthesizes the entire sequence of frames jointly (illustrated by the toy sampler below). This reduces inference time by an order of magnitude and maintains high fidelity across large camera movements.
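
The contrast with autoregressive generation can be illustrated with a toy sampler: every denoising step makes a single network call over the whole frame stack, so cross-frame layers can enforce consistency, whereas an autoregressive pipeline conditions frame i on the generated frame i-1 and accumulates drift. The Euler-style update and the `model` callable below are placeholders, not the paper's network or scheduler.

```python
import numpy as np

def denoise_jointly(model, init_noise, conds, sigmas):
    """Toy sampler for joint multi-frame synthesis.

    init_noise: (N, H, W, C) initial noise, e.g. from structured_noise(...)
    conds:      per-frame conditioning (warped reference images, poses, ...)
    sigmas:     decreasing noise levels, ending near zero
    """
    x = init_noise.copy()
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        d = model(x, conds, sigma)          # one call for the entire sequence
        x = x + (sigma_next - sigma) * d    # placeholder Euler-style update
    return x
```

Because every frame is produced in the same set of denoising passes, a sequence needs only one sampling run instead of one run per frame, which is consistent with the reported order-of-magnitude reduction in inference time.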

Experimental Results

MultiDiff was evaluated on two challenging datasets: RealEstate10K and ScanNet. The method outperformed state-of-the-art approaches on several key metrics, demonstrating its effectiveness in both short-term and long-term novel view synthesis (a minimal PSNR computation follows the list):

  • Short-term View Synthesis (RealEstate10K):
    • Achieved a PSNR of 16.41 and an LPIPS of 0.318.
    • Demonstrated lower FID (25.30) and KID (0.003) scores compared to baselines.
  • Long-term View Synthesis (RealEstate10K):
    • Recorded significant improvements in FID (28.25) and KID (0.004), indicating better image quality over extended sequences.
    • Outperformed other methods in maintaining temporal consistency, with a lower FVD score (94.37).
  • ScanNet:
    • On this dataset, characterized by rapid and diverse camera movements, MultiDiff again showed superior fidelity and consistency, with a PSNR of 15.50 and an LPIPS of 0.356 for short-term synthesis.
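
For context on these numbers, PSNR is the only metric above with a closed-form per-pixel definition; LPIPS, FID, KID, and FVD all rely on learned feature extractors. A minimal PSNR computation for images scaled to [0, 1] looks like this:

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two images in [0, max_val]."""
    mse = np.mean((pred.astype(np.float64) - target.astype(np.float64)) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)
```

Higher PSNR is better; for LPIPS, FID, KID, and FVD, lower values are better.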

Ablation Studies

Ablation experiments underscored the importance of the key components of the MultiDiff framework:

  1. Priors: Removal of either monocular depth or video priors resulted in notable performance degradation, highlighting their critical role in improving geometric stability and consistency.
  2. Structured Noise: The use of structured noise significantly enhanced multi-view consistency, as demonstrated by improvements in FID and mTSED scores.

Implications and Future Work

The ability to generate consistent novel views from a single image opens up numerous practical applications, including augmented reality, 3D content creation, and virtual reality. The robust performance of MultiDiff across different datasets exemplifies the potential of integrating strong priors in generative models to tackle highly ill-posed problems.

From a theoretical standpoint, this work reinforces the effectiveness of video diffusion models in capturing temporal consistency and extends their utility to novel view synthesis. The introduction of a structured noise distribution also presents a novel technique for enhancing the consistency of generated frames.

Future research could explore further extensions of this methodology to incorporate additional priors or to improve the handling of more complex and dynamic scenes. Additionally, investigating real-time deployment of such models and reducing computational overhead could be valuable for practical implementations.

Conclusion

MultiDiff represents a considerable advancement in novel view synthesis, leveraging monocular depth and video diffusion models to achieve high-quality, consistent results from a single input image. Its demonstrated superiority over existing methods marks a significant step toward practical applications in 3D scene rendering and virtual environments.
