MegaScenes: Scene-Level View Synthesis at Scale

(arXiv:2406.11819)
Published Jun 17, 2024 in cs.CV

Abstract

Scene-level novel view synthesis (NVS) is fundamental to many vision and graphics applications. Recently, pose-conditioned diffusion models have led to significant progress by extracting 3D information from 2D foundation models, but these methods are limited by the lack of scene-level training data. Common dataset choices either consist of isolated objects (Objaverse), or of object-centric scenes with limited pose distributions (DTU, CO3D). In this paper, we create a large-scale scene-level dataset from Internet photo collections, called MegaScenes, which contains over 100K structure from motion (SfM) reconstructions from around the world. Internet photos represent a scalable data source but come with challenges such as lighting and transient objects. We address these issues to further create a subset suitable for the task of NVS. Additionally, we analyze failure cases of state-of-the-art NVS methods and significantly improve generation consistency. Through extensive experiments, we validate the effectiveness of both our dataset and method on generating in-the-wild scenes. For details on the dataset and code, see our project page at https://megascenes.github.io .

Overview

  • The MegaScenes dataset, compiled from eight million internet images, aims to advance scene-level novel view synthesis by providing a large-scale, diverse dataset with extensive 3D annotations.

  • MegaScenes is built using structure-from-motion (SfM) and a multi-stage data pipeline that ensures data quality, capturing varied lighting conditions and camera viewpoints to support robust model training.

  • The paper demonstrates significant improvements in novel view synthesis by utilizing pose-conditioned diffusion models and warp-conditioning, showcasing enhanced realism and accuracy in the generated images.

MegaScenes: Scene-Level View Synthesis at Scale

The paper MegaScenes: Scene-Level View Synthesis at Scale introduces MegaScenes, a dataset meticulously curated to advance scene-level novel view synthesis (NVS). The authors address the limitations of existing NVS systems, which are typically object-centric and restricted by the scarcity and limited diversity of available training data, by constructing a large-scale dataset from eight million openly licensed internet images.

Key Contributions

Dataset Creation and Characteristics

MegaScenes stands out with the following features:

  1. Scale and Diversity: The dataset contains around 430,000 scenes with over 100,000 structure-from-motion (SfM) reconstructions and 2 million registered images. The scenes span an extensive range of categories, including statues, bridges, towers, religious buildings, and natural landscapes like the Teide volcano.
  2. 3D Annotations: The dataset provides 3D annotations comprising keypoints, descriptors, reconstructions, and camera poses (see the loading sketch after this list).
  3. Variability: The dataset captures scenes under various lighting conditions, times of day, weather scenarios, and with different camera intrinsics, supporting robust model training.
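The annotations are distributed as SfM reconstructions, which can be read with standard tooling. Below is a minimal sketch using pycolmap, assuming the reconstructions ship in the standard COLMAP format; the directory path is hypothetical, and the pose accessor varies slightly across pycolmap versions.

```python
# Minimal sketch: reading one MegaScenes SfM reconstruction with pycolmap.
# The directory path is hypothetical; reconstructions are assumed to ship
# in the standard COLMAP format (cameras, images, points3D).
import pycolmap

rec = pycolmap.Reconstruction("megascenes/reconstructions/0001/sparse/0")
print(f"{rec.num_reg_images()} registered images, {rec.num_points3D()} 3D points")

# Each registered image carries an estimated world-to-camera pose and a
# reference to its camera intrinsics.
for image_id, image in rec.images.items():
    camera = rec.cameras[image.camera_id]
    pose = image.cam_from_world  # Rigid3d in recent pycolmap; older versions expose qvec/tvec
    print(image.name, camera.model, pose.translation)
    break  # inspect only the first image
```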

Technical Approach

The construction of MegaScenes leverages a sophisticated data pipeline:

  1. Scene Identification: Potential scenes are identified through Wikimedia categories. Images and metadata are subsequently downloaded.
  2. SfM Reconstruction: Using COLMAP, SfM is applied to the images, producing point clouds and camera poses. Erroneous reconstructions caused by visually similar but distinct structures are corrected using the Doppelgangers pipeline.
  3. Filtering and Conditioning: Registered images are further filtered for lighting consistency and sufficient visual overlap to form suitable image pairs for NVS training (see the pair-selection sketch after this list).
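The paper's exact filtering criteria are not reproduced here; the sketch below illustrates one plausible way to score visual overlap between two registered images by counting the 3D points they co-observe in a COLMAP reconstruction. The thresholds and the exhaustive pairing loop are illustrative assumptions.

```python
# Illustrative sketch: selecting training pairs by visual overlap, measured
# as the fraction of co-observed 3D points in a COLMAP reconstruction. The
# thresholds and exhaustive pairing are assumptions, not the paper's exact
# criteria; large scenes would restrict candidates via image retrieval.
import itertools
import pycolmap

def observed_point_ids(image):
    # IDs of the 3D points that this image's keypoints are triangulated into.
    return {p.point3D_id for p in image.points2D if p.has_point3D()}

def covisibility(image_a, image_b):
    pts_a, pts_b = observed_point_ids(image_a), observed_point_ids(image_b)
    if not pts_a or not pts_b:
        return 0.0
    return len(pts_a & pts_b) / min(len(pts_a), len(pts_b))

rec = pycolmap.Reconstruction("megascenes/reconstructions/0001/sparse/0")
images = list(rec.images.values())

# Keep pairs with moderate overlap: enough shared structure to supervise
# NVS, but not near-duplicate viewpoints.
pairs = [
    (a.name, b.name)
    for a, b in itertools.combinations(images, 2)
    if 0.3 <= covisibility(a, b) <= 0.9
]
```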

Novel View Synthesis Application

The paper critically evaluates the application of MegaScenes in novel view synthesis:

  1. Pose-Conditioned Diffusion Models: The authors validate the dataset through extensive experiments, fine-tuning state-of-the-art NVS models such as Zero-1-to-3 and ZeroNVS. When trained on MegaScenes, these models show significant improvements across benchmark datasets, outperforming their counterparts trained on the original data.
  2. Warp-Conditioned Generation: A key innovation is the introduction of warp-conditioning to enhance pose accuracy. This method uses monocular depth estimation to warp the input view toward the target pose, supplemented by extrinsic-matrix conditioning to improve consistency and maintain visual fidelity (a simplified warping sketch follows this list).
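As a rough illustration of warp-conditioning, the sketch below unprojects the source view with an estimated depth map, transforms the points by the relative pose, and reprojects them into the target camera, leaving holes for the diffusion model to fill. It assumes a shared pinhole intrinsics matrix K and 4x4 world-to-camera matrices; occlusion handling and the paper's implementation details are omitted.

```python
# Simplified sketch of depth-based warping for conditioning: unproject the
# source view with a monocular depth map, transform by the relative pose,
# and reproject into the target camera. K, depth, and the pose matrices are
# placeholders; real pipelines also handle occlusion and hole filling.
import numpy as np

def warp_to_target(src_img, depth, K, src_from_world, tgt_from_world):
    h, w = depth.shape
    # Pixel grid in homogeneous coordinates, shaped 3 x N.
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T

    # Unproject to source camera space, then to world, then to target camera.
    cam_pts = np.linalg.inv(K) @ pix * depth.reshape(1, -1)
    cam_pts_h = np.vstack([cam_pts, np.ones((1, cam_pts.shape[1]))])
    world_pts = np.linalg.inv(src_from_world) @ cam_pts_h
    tgt_pts = (tgt_from_world @ world_pts)[:3]

    # Project into the target image plane.
    proj = K @ tgt_pts
    uv = (proj[:2] / np.clip(proj[2], 1e-6, None)).round().astype(int)

    # Nearest-neighbor splat; unfilled regions stay black for the diffusion
    # model to inpaint.
    warped = np.zeros_like(src_img)
    valid = (uv[0] >= 0) & (uv[0] < w) & (uv[1] >= 0) & (uv[1] < h) & (tgt_pts[2] > 0)
    warped[uv[1, valid], uv[0, valid]] = src_img.reshape(-1, src_img.shape[-1])[valid]
    return warped
```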

Evaluation and Results

The paper provides a comprehensive evaluation using qualitative and quantitative methods:

  1. Performance Metrics: The authors employ LPIPS for perceptual similarity, PSNR and SSIM for reconstruction fidelity, and FID and KID for generative quality. The proposed methods show superior performance across these metrics (see the sketch after this list).
  2. Visual Quality: The generated images from the fine-tuned models showcase enhanced realism and pose accuracy, with improved object positioning and structural detail.
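All five metrics are available in off-the-shelf libraries. The sketch below shows one way to compute them with torchmetrics; the paper's exact evaluation protocol (resolution, sample counts, splits) is not reproduced, and the random tensors are placeholders for real image batches.

```python
# Minimal sketch of the evaluation metrics using torchmetrics. Placeholder
# random tensors stand in for generated and ground-truth views, as float
# images in [0, 1] shaped (N, 3, H, W).
import torch
from torchmetrics.image import PeakSignalNoiseRatio, StructuralSimilarityIndexMeasure
from torchmetrics.image.lpip import LearnedPerceptualImagePatchSimilarity
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.image.kid import KernelInceptionDistance

pred = torch.rand(8, 3, 256, 256)    # stand-in for generated views
target = torch.rand(8, 3, 256, 256)  # stand-in for ground-truth views

psnr = PeakSignalNoiseRatio(data_range=1.0)(pred, target)
ssim = StructuralSimilarityIndexMeasure(data_range=1.0)(pred, target)
lpips = LearnedPerceptualImagePatchSimilarity(normalize=True)(pred, target)

# FID and KID compare feature distributions, so real and generated sets are
# accumulated separately; meaningful scores need hundreds of images or more.
fid = FrechetInceptionDistance(normalize=True)
fid.update(target, real=True)
fid.update(pred, real=False)

kid = KernelInceptionDistance(subset_size=4, normalize=True)  # tiny subset for the sketch
kid.update(target, real=True)
kid.update(pred, real=False)

kid_mean, _ = kid.compute()
print(psnr.item(), ssim.item(), lpips.item(), fid.compute().item(), kid_mean.item())
```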

Implications and Future Directions

The implications of this research are manifold:

  1. Enhanced 3D Learning: MegaScenes can significantly enhance the generalization capabilities of vision models, benefiting a wide range of applications including pose estimation, feature matching, and reconstruction.
  2. AI and Robotics: The ability to synthesize consistent and realistic views from sparse inputs can be a game-changer for AR/VR applications and robotic navigation.
  3. Extended Usage: The methodology and data pipeline of MegaScenes could be adapted for other vision tasks, promoting better usability and scalability in dataset creation.

Conclusion

MegaScenes marks a significant milestone in scene-level novel view synthesis. By addressing the limitations of existing datasets and introducing robust, innovative methods for pose-conditioned generation, the paper lays a foundation for future advances in 3D vision and related applications. This research promises to propel the field forward, enabling AI models to better understand and reconstruct complex real-world scenes.
