A Construct-Optimize Approach to Sparse View Synthesis without Camera Pose (2405.03659v2)

Published 6 May 2024 in cs.CV and cs.GR

Abstract: Novel view synthesis from a sparse set of input images is a challenging problem of great practical interest, especially when camera poses are absent or inaccurate. Direct optimization of camera poses and usage of estimated depths in neural radiance field algorithms usually do not produce good results because of the coupling between poses and depths, and inaccuracies in monocular depth estimation. In this paper, we leverage the recent 3D Gaussian splatting method to develop a novel construct-and-optimize method for sparse view synthesis without camera poses. Specifically, we construct a solution progressively by using monocular depth and projecting pixels back into the 3D world. During construction, we optimize the solution by detecting 2D correspondences between training views and the corresponding rendered images. We develop a unified differentiable pipeline for camera registration and adjustment of both camera poses and depths, followed by back-projection. We also introduce a novel notion of an expected surface in Gaussian splatting, which is critical to our optimization. These steps enable a coarse solution, which can then be low-pass filtered and refined using standard optimization methods. We demonstrate results on the Tanks and Temples and Static Hikes datasets with as few as three widely-spaced views, showing significantly better quality than competing methods, including those with approximate camera pose information. Moreover, our results improve with more views and outperform previous InstantNGP and Gaussian Splatting algorithms even when using half the dataset. Project page: https://raymondjiangkw.github.io/cogs.github.io/

Authors (7)

Kaiwen Jiang
Yang Fu
Mukund Varma T
Yash Belhe
Xiaolong Wang
Hao Su
Ravi Ramamoorthi

Summary

Sparse View Synthesis Sans Camera Pose Estimation

Introduction

Sparse view synthesis reconstructs a 3D scene from a small set of 2D images, and it is especially hard when those images come without camera poses. Methods such as Neural Radiance Fields (NeRF) typically require many views with accurately known camera poses, which is not always practical. The paper discussed here addresses the problem with a method that constructs and optimizes a solution when camera poses are unknown or unreliable. By combining monocular depth estimates with 2D correspondences detected between training views and the corresponding rendered images, the authors synthesize novel views from as few as three widely spaced images, without requiring camera poses as input.

The Approach

The methodology breaks down into three stages:

Initial Setup: The first image in the sequence is taken as the reference and assigned an identity camera pose. This image, together with its estimated monocular depth, initializes the reconstruction.
Progressive Construction and Optimization:
- Camera Pose Estimation: Each subsequent view is initialized with the pose of the previous view, then refined through optimization so that it aligns with the existing 3D reconstruction.
- Depth Adjustment: Alongside the camera pose, the estimated depth is adjusted so that it stays consistent across views, which keeps the constructed 3D scene coherent.
- Back-Projection: Pixels are back-projected based on adjusted depths and refined camera poses to progressively build the 3D scene.
Rendering and Refinement:
- Before final optimization, the coarse solution is low-pass filtered to suppress high-frequency noise.
- The scene is then refined with standard optimization methods to recover detail and improve the accuracy of the synthesized views.
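
The back-projection step above can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes a pinhole camera with intrinsics `K`, a 4x4 camera-to-world matrix, and a hypothetical scale-and-shift correction standing in for the depth adjustment. The function name `back_project` and all of its parameters are illustrative.

```python
import numpy as np

def back_project(depth, K, cam_to_world, scale=1.0, shift=0.0):
    """Lift every pixel of a depth map to a 3D point in world space.

    depth:        (H, W) per-pixel depth along the camera z-axis
    K:            (3, 3) pinhole intrinsics (assumed known here)
    cam_to_world: (4, 4) rigid camera-to-world transform
    scale, shift: hypothetical affine correction of the monocular depth
    """
    h, w = depth.shape
    # Pixel grid sampled at pixel centers, in homogeneous image coordinates.
    u, v = np.meshgrid(np.arange(w) + 0.5, np.arange(h) + 0.5)
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)
    # Rays in the camera frame with z = 1, scaled by the adjusted depth.
    rays = pixels @ np.linalg.inv(K).T
    z = (scale * depth + shift).reshape(-1, 1)
    points_cam = rays * z
    # Rigid transform from the camera frame into the world frame.
    R, t = cam_to_world[:3, :3], cam_to_world[:3, 3]
    return points_cam @ R.T + t

# Usage: a 2x4 depth map at constant depth, seen from the identity pose
# (the pose assigned to the first view in the pipeline described above).
K = np.array([[100.0, 0.0, 2.0],
              [0.0, 100.0, 1.0],
              [0.0, 0.0, 1.0]])
depth = np.full((2, 4), 2.0)
points = back_project(depth, K, np.eye(4))
print(points.shape)  # (8, 3)
```

In a pipeline like the one summarized here, each back-projected point would seed a 3D Gaussian, and the pose and depth correction would be optimized before back-projection rather than passed in as fixed values.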

Why It Matters

Reconstructing 3D scenes from sparse views has both practical and theoretical implications:

Practical Use: Because the method needs neither many views nor known camera poses, it can reduce the capture effort and hardware setup typically required for 3D reconstruction, potentially lowering the cost and complexity of 3D modeling tasks.
Theoretical Advancement: The method challenges the conventional reliance on dense sampling and precise camera poses, showing what can be achieved with limited data and pointing towards more robust and flexible 3D reconstruction techniques.

Performance and Comparisons

The reported results are strong:

On the Tanks and Temples and Static Hikes datasets, the method produces significantly better quality than competing methods, including those that are given approximate camera pose information.
Quality improves as more views are added, and the method outperforms previous InstantNGP and Gaussian Splatting algorithms even when using half the dataset.

Future Directions

What's next for view synthesis from sparse inputs? This paper lays a strong foundation, and two directions remain open:

Handling Unordered Collections: Adapting the framework to manage unordered image sets could widen its applicability, especially in scenarios where sequential data capture is challenging.
Enhancing Depth Adjustment: Further improvements in how depth estimation is integrated and adjusted could refine the reconstructions even further.

Conclusion

By constructing and optimizing a solution progressively for sparse view synthesis without known camera poses, the authors make practical, cost-effective 3D scene reconstruction possible from only a few images. The approach is relevant to both academic research and real-world applications, and it invites a rethink of how much a reconstruction pipeline really depends on dense views and precise camera poses.
