ZeroNVS: Zero-Shot 360-Degree View Synthesis from a Single Image (2310.17994v2)

Published 27 Oct 2023 in cs.CV and cs.GR

Abstract: We introduce a 3D-aware diffusion model, ZeroNVS, for single-image novel view synthesis for in-the-wild scenes. While existing methods are designed for single objects with masked backgrounds, we propose new techniques to address challenges introduced by in-the-wild multi-object scenes with complex backgrounds. Specifically, we train a generative prior on a mixture of data sources that capture object-centric, indoor, and outdoor scenes. To address issues from data mixture such as depth-scale ambiguity, we propose a novel camera conditioning parameterization and normalization scheme. Further, we observe that Score Distillation Sampling (SDS) tends to truncate the distribution of complex backgrounds during distillation of 360-degree scenes, and propose "SDS anchoring" to improve the diversity of synthesized novel views. Our model sets a new state-of-the-art result in LPIPS on the DTU dataset in the zero-shot setting, even outperforming methods specifically trained on DTU. We further adapt the challenging Mip-NeRF 360 dataset as a new benchmark for single-image novel view synthesis, and demonstrate strong performance in this setting. Our code and data are at http://kylesargent.github.io/zeronvs/

Citations (15)

Summary

  • The paper presents a novel 3D-aware diffusion approach for zero-shot 360° view synthesis, significantly improving scene detail and background diversity.
  • It introduces an innovative 6DoF+1 camera conditioning and scene normalization method that reduces ambiguity and enhances prediction accuracy in complex, multi-object scenes.
  • SDS anchoring is employed to overcome standard score distillation limits, achieving state-of-the-art performance on challenging benchmarks like DTU.

Overview of ZeroNVS: Zero-Shot 360-Degree View Synthesis from a Single Real Image

The paper introduces ZeroNVS, a 3D-aware diffusion model for novel view synthesis (NVS) from a single image in complex real-world scenes. Unlike existing techniques primarily focused on single objects with simple backgrounds, ZeroNVS addresses the challenges posed by multi-object scenes with intricate backgrounds. The authors propose innovative solutions such as a new camera conditioning parameterization, normalization scheme, and a novel sampling technique termed "SDS anchoring" to enhance synthesized view diversity.

Key Contributions

  1. Multi-Dataset Generative Prior Training: ZeroNVS trains its generative model on a mixture of datasets covering object-centric, indoor, and outdoor scenes, including CO3D, RealEstate10K, and ACID. This strategy lets the model handle a wide range of scene complexities and camera settings, moving beyond reliance on object-focused datasets such as Objaverse-XL.
  2. Camera Conditioning and Scale Normalization: The paper identifies inadequacies in prior camera conditioning schemes, which are either ambiguous or insufficient for real-world scenes. ZeroNVS instead proposes a "6DoF+1" representation paired with a viewer-centric normalization scheme that accounts for the scale of content visible in the input view, reducing scale ambiguity and improving prediction accuracy (see the first sketch after this list).
  3. SDS Anchoring for Enhanced Diversity: Standard Score Distillation Sampling (SDS) tends to truncate the distribution of backgrounds in generated scenes. SDS anchoring counteracts this by first sampling several "anchor" views from the diffusion model and using them to guide distillation, improving background variety without compromising 3D consistency (see the second sketch after this list).
  4. Benchmarking and Performance Evaluation: ZeroNVS achieves state-of-the-art LPIPS on the DTU dataset in the zero-shot setting, outperforming even methods trained specifically on DTU. The authors further adapt the Mip-NeRF 360 dataset as a new, challenging benchmark for 360-degree NVS, where the model demonstrates strong zero-shot generalization, reinforcing its practical applicability.
  5. Implications for 3D Scene Understanding: By enabling robust zero-shot NVS for complex scenes, ZeroNVS opens possibilities for advancements in various applications, such as augmented reality, autonomous driving, and robotics, where understanding scenes from limited viewpoints is crucial.
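
To make the camera conditioning concrete, the first sketch below shows, in Python, how a "6DoF+1"-style conditioning vector and a viewer-centric scale could be assembled. The quantile-based scale estimate, the world-to-camera matrix convention, and the exact vector layout are assumptions made for illustration, not the paper's exact implementation.

```python
import numpy as np

def relative_pose(E_input, E_target):
    """Relative extrinsics between input and target cameras
    (4x4 world-to-camera matrices; convention assumed)."""
    return E_target @ np.linalg.inv(E_input)

def viewer_centric_scale(depth_map, quantile=0.2):
    """Scalar scene scale from the input view's (monocular) depth map.
    The low-quantile statistic is an assumption for this sketch; the key idea
    is normalizing by the scale of content visible in the input view."""
    valid = depth_map[np.isfinite(depth_map) & (depth_map > 0)]
    return float(np.quantile(valid, quantile))

def camera_conditioning(E_input, E_target, fov_deg, depth_map):
    """Assemble a '6DoF+1'-style conditioning vector: scale-normalized relative
    pose, field of view, and log scene scale (exact layout assumed)."""
    s = viewer_centric_scale(depth_map)
    E_rel = relative_pose(E_input, E_target)
    E_rel[:3, 3] /= s                       # normalize translation by visible-content scale
    return np.concatenate([
        E_rel[:3, :4].reshape(-1),          # 12 values: rotation + normalized translation
        [np.deg2rad(fov_deg)],              # field of view
        [np.log(s)],                        # the "+1": scene scale
    ])
```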

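The second sketch outlines one NeRF distillation step with SDS anchoring. The interfaces `nerf.render(pose)`, `diffusion.add_noise(...)`, and `diffusion.pred_noise(...)` are hypothetical placeholders, and `anchor_views` is assumed to hold (pose, image) pairs previously sampled from the diffusion model; the sketch illustrates the idea of conditioning SDS on anchor views rather than reproducing the paper's implementation.

```python
import torch
import torch.nn.functional as F

def sds_anchoring_step(nerf, diffusion, anchor_views, optimizer,
                       t_range=(0.02, 0.98), guidance_scale=7.5):
    """One distillation step with SDS anchoring (simplified sketch)."""
    # Pick a random anchor; its image conditions the denoiser so backgrounds
    # are pulled toward diverse pre-sampled views rather than a single mode.
    idx = torch.randint(len(anchor_views), (1,)).item()
    pose, anchor_img = anchor_views[idx]
    rendered = nerf.render(pose)            # differentiable render at the anchor pose

    # Standard SDS perturb-and-denoise, conditioned on the anchor view.
    t = torch.empty(1).uniform_(*t_range)
    noise = torch.randn_like(rendered)
    noisy = diffusion.add_noise(rendered, noise, t)
    with torch.no_grad():
        eps_hat = diffusion.pred_noise(noisy, t, cond=anchor_img, cfg=guidance_scale)

    # SDS gradient surrogate: the residual (eps_hat - noise) is pushed back
    # into the NeRF parameters through the rendered image.
    target = (rendered - (eps_hat - noise)).detach()
    loss = 0.5 * F.mse_loss(rendered, target, reduction="sum")
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```
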
Technical Insights

  • Diffusion Model Training: ZeroNVS builds on the diffusion model architecture of Zero-1-to-3, replacing its conditioning modules with ones suited to real-world 6DoF camera motion (a minimal sketch follows this list).
  • Scene Normalization: Depth- and viewer-based scene normalization aligns the scale of the mixed training datasets, improving generalization and consistency across diverse scene types.
  • Computational Efficiency: The method retains efficiency comparable to prior models while handling substantially more complex, scene-level content.
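
As a rough illustration of how such a conditioning module could look, the sketch below follows the Zero-1-to-3 recipe of fusing a CLIP image embedding with the camera parameters into the cross-attention context of a latent-diffusion UNet. The dimensions, the simple linear fusion, and the class name `PoseConditioner` are assumptions for illustration, not the paper's exact module.

```python
import torch
import torch.nn as nn

class PoseConditioner(nn.Module):
    """Sketch of a Zero-1-to-3-style conditioning module, widened to accept a
    14-D '6DoF+1' camera vector (dimension and fusion scheme are assumed)."""
    def __init__(self, clip_dim=768, cam_dim=14):
        super().__init__()
        # Project [CLIP image embedding ; camera vector] back to the context
        # width that the latent-diffusion UNet's cross-attention expects.
        self.proj = nn.Linear(clip_dim + cam_dim, clip_dim)

    def forward(self, clip_embed, cam_vec):
        # clip_embed: (B, clip_dim) embedding of the input view from a frozen CLIP image encoder
        # cam_vec:    (B, cam_dim) relative pose, field of view, and log scene scale
        ctx = self.proj(torch.cat([clip_embed, cam_vec], dim=-1))
        return ctx.unsqueeze(1)   # (B, 1, clip_dim) context token for cross-attention
```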

Future Directions

  • Cross-Dataset Scalability: Training on additional emerging multiview datasets could further broaden ZeroNVS's applicability to new NVS settings.
  • Advanced Representation Methods: More sophisticated camera and scene representations could refine the model's handling of complex real-world data.
  • Enhanced 3D Consistency Techniques: Further improvements to SDS anchoring could enable more diverse and realistic synthesized scenes without sacrificing 3D consistency.

In conclusion, ZeroNVS sets a new direction in 3D-aware diffusion models by effectively bridging gaps between simplistic object-centric approaches and the complexities of real-world scene synthesis. The paper's contributions represent a significant step forward in zero-shot view synthesis, paving the way for future innovations in AI-driven scene understanding.
