Abstract

Empowering 3D Gaussian Splatting with generalization ability is appealing. However, existing generalizable 3D Gaussian Splatting methods are largely confined to narrow-range interpolation between stereo images due to their heavy backbones, and thus lack the ability to accurately localize 3D Gaussians and support free-view synthesis across wide view ranges. In this paper, we present FreeSplat, a novel framework capable of reconstructing geometrically consistent 3D scenes from long-sequence input towards free-view synthesis. Specifically, we first introduce Low-cost Cross-View Aggregation, achieved by constructing adaptive cost volumes among nearby views and aggregating features using a multi-scale structure. Subsequently, we present Pixel-wise Triplet Fusion to eliminate the redundancy of 3D Gaussians in overlapping view regions and to aggregate features observed across multiple views. Additionally, we propose a simple but effective free-view training strategy that ensures robust view synthesis across a broader view range regardless of the number of views. Our empirical results demonstrate state-of-the-art novel view synthesis performance in both the color quality and the depth accuracy of rendered novel views across different numbers of input views. We also show that FreeSplat performs inference more efficiently and can effectively reduce redundant Gaussians, offering the possibility of feed-forward large scene reconstruction without depth priors.

Figure: FreeSplat framework for depth map prediction and feature extraction using Pixel-wise Triplet Fusion.

Overview

  • FreeSplat is an innovative framework designed to overcome the limitations of existing 3D Gaussian Splatting methods by enabling accurate localization of 3D Gaussians and free-view synthesis from long sequences of input images.

  • The FreeSplat framework comprises three key components: Low-cost Cross-View Aggregation, Pixel-wise Triplet Fusion (PTF), and Free-View Training (FVT), which together enhance the capability to synthesize views from arbitrary poses while maintaining computational efficiency.

  • Empirical evaluations on datasets like ScanNet and Replica demonstrate FreeSplat's superior performance in terms of view interpolation, depth estimation, and real-time rendering, showcasing its potential for widespread applications in virtual and augmented reality.

FreeSplat: Generalizable 3D Gaussian Splatting for Real-Time Long Sequence Reconstruction and Free-View Synthesis

Introduction

The paper presents an innovative framework, FreeSplat, which addresses the limitations of existing 3D Gaussian Splatting (3DGS) methods. The primary contributions are centered around the reconstruction of geometrically consistent 3D Gaussians from long sequences of input images and support for free-view synthesis across a wide range of viewpoints. Unlike previous methods constrained to narrow-range interpolation between stereo images, FreeSplat is designed to enable efficient, accurate localization of 3D Gaussians and can handle extensive view ranges.

Technical Overview

The proposed FreeSplat framework consists of three core components: Low-cost Cross-View Aggregation, Pixel-wise Triplet Fusion (PTF), and a novel Free-View Training (FVT) strategy. Each of these components contributes to FreeSplat's capability to localize 3D Gaussians accurately and synthesize views from arbitrary poses.

  1. Low-cost Cross-View Aggregation:

    • This component constructs adaptive cost volumes among nearby views and aggregates features using a multi-scale structure (a plane-sweep sketch follows this list).
    • It employs efficient CNN-based backbones to balance feature extraction and matching with computational feasibility.
    • The methodology enhances pose information integration by building cost volumes between nearby views, allowing for broader receptive fields and robust feature aggregation.
  2. Pixel-wise Triplet Fusion (PTF):

    • PTF is used to eliminate redundancy of 3D Gaussians in overlapping view regions and to aggregate features observed across multiple views. This is particularly crucial for real-time rendering and reducing computational load.
    • A pixel-wise alignment strategy that matches corresponding local and global Gaussian triplets facilitates this fusion (see the fusion sketch after this list).
    • The approach progressively integrates Gaussian triplets using geometric constraints and learnable feature updates, ensuring efficient feature aggregation across multiple views.
  3. Free-View Training (FVT):

    • The FVT strategy disentangles generalizable 3DGS performance from the specific number of input views by supervising rendered images over a broader range of view interpolations (a sampling sketch follows this list).
    • This training strategy ensures robust view synthesis across broader view ranges, contributing significantly to novel view depth rendering accuracy.
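
The summary describes these components at a high level and no code accompanies it, so the following is a minimal PyTorch sketch of the standard plane-sweep construction that cost volumes of this kind build on: features from one nearby source view are warped into the reference view over a set of depth hypotheses and correlated. The function name, tensor layouts, and single-view, single-scale setup are illustrative assumptions, not FreeSplat's actual implementation, which builds adaptive volumes across multiple nearby views and scales.

```python
import torch
import torch.nn.functional as F

def plane_sweep_cost_volume(ref_feat, src_feat, K, T_src_ref, depths):
    """Correlation cost volume between a reference and one nearby source view.

    ref_feat, src_feat: (C, H, W) feature maps from a shared CNN backbone.
    K:                  (3, 3) camera intrinsics (assumed identical for both views).
    T_src_ref:          (4, 4) transform taking reference-camera points to the source camera.
    depths:             iterable of depth hypotheses swept in front of the reference camera.
    Returns:            (D, H, W) matching scores; higher = better feature agreement.
    """
    C, H, W = ref_feat.shape
    device = ref_feat.device

    # Pixel grid in homogeneous coordinates, shape (3, H*W).
    ys, xs = torch.meshgrid(
        torch.arange(H, device=device, dtype=torch.float32),
        torch.arange(W, device=device, dtype=torch.float32),
        indexing="ij",
    )
    pix = torch.stack([xs.reshape(-1), ys.reshape(-1), torch.ones(H * W, device=device)])

    rays = torch.linalg.inv(K) @ pix            # back-projected rays, (3, H*W)
    R, t = T_src_ref[:3, :3], T_src_ref[:3, 3:]

    cost = []
    for d in depths:
        # Lift each pixel to 3D at depth d, move into the source camera, project back.
        p_src = K @ (R @ (rays * d) + t)        # (3, H*W)
        uv = p_src[:2] / p_src[2:].clamp(min=1e-6)  # points behind the camera not masked here

        # Normalize to [-1, 1] for grid_sample.
        grid = torch.stack(
            [uv[0] / (W - 1) * 2 - 1, uv[1] / (H - 1) * 2 - 1], dim=-1
        ).reshape(1, H, W, 2)
        warped = F.grid_sample(
            src_feat[None], grid, align_corners=True, padding_mode="zeros"
        )[0]                                    # (C, H, W)

        # Dot-product correlation, averaged over channels.
        cost.append((ref_feat * warped).mean(dim=0))

    return torch.stack(cost)                    # (D, H, W)
```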
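
In the same spirit, here is a sketch of the pixel-wise matching idea behind PTF, under assumed tensor layouts: existing global Gaussians are projected into the current view, gated by a relative depth check against the newly predicted pixel-aligned Gaussians, and merged where they agree, so overlapping regions do not accumulate duplicates. Simple averaging stands in for the paper's learnable feature update.

```python
import torch

def pixelwise_triplet_fusion(global_xyz, global_feat,
                             local_xyz, local_feat, local_depth,
                             K, T_cam_world, H, W, depth_thresh=0.1):
    """Merge global Gaussians that land on a pixel at a depth consistent with
    the newly predicted local Gaussian there, instead of adding a duplicate.

    global_xyz/global_feat: (M, 3) / (M, C)     Gaussians already in the scene.
    local_xyz/local_feat:   (H*W, 3) / (H*W, C) pixel-aligned new Gaussians.
    local_depth:            (H*W,) predicted depth of each new Gaussian.
    """
    # Project global centers into the current camera (zero-skew pinhole assumed).
    R, t = T_cam_world[:3, :3], T_cam_world[:3, 3]
    cam = global_xyz @ R.T + t
    z = cam[:, 2]
    uv = cam[:, :2] / z.clamp(min=1e-6).unsqueeze(1)
    uv = uv @ K[:2, :2].T + K[:2, 2]
    u, v = uv[:, 0].round().long(), uv[:, 1].round().long()

    in_view = (z > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    pix = (v * W + u).clamp(0, H * W - 1)       # flat pixel index per global Gaussian

    # Geometric gate: relative depth agreement at the hit pixel.
    close = (z - local_depth[pix]).abs() / local_depth[pix].clamp(min=1e-6) < depth_thresh
    matched = in_view & close

    # Merge each matched global Gaussian into its pixel's local one
    # (simple mean; duplicate hits on one pixel resolve arbitrarily in this sketch).
    fused_xyz, fused_feat = local_xyz.clone(), local_feat.clone()
    idx = pix[matched]
    fused_xyz[idx] = 0.5 * (fused_xyz[idx] + global_xyz[matched])
    fused_feat[idx] = 0.5 * (fused_feat[idx] + global_feat[matched])

    # Keep unmatched globals; matched ones were absorbed, so no duplicates remain.
    keep = ~matched
    return (torch.cat([global_xyz[keep], fused_xyz]),
            torch.cat([global_feat[keep], fused_feat]))
```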
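
Finally, a minimal sketch of what an FVT-style sampler could look like: the number and spacing of context views vary per batch, and supervision targets are drawn from anywhere inside the span the context views cover, decoupling rendering quality from the view count. All names and ranges are illustrative, not the paper's hyperparameters.

```python
import random

def sample_free_view_batch(seq_len, num_context=(2, 8), num_targets=4, stride=(1, 4)):
    """Pick context frames with random count and spacing from a long sequence,
    then pick supervision targets anywhere inside the covered span."""
    n_ctx = random.randint(*num_context)
    step = random.randint(*stride)
    span = (n_ctx - 1) * step
    assert seq_len > span, "sequence too short for this context configuration"

    start = random.randint(0, seq_len - span - 1)
    context = [start + i * step for i in range(n_ctx)]

    # Targets: any frame inside the covered span that is not a context view.
    candidates = [i for i in range(start, start + span + 1) if i not in context]
    targets = random.sample(candidates, min(num_targets, len(candidates)))
    return context, targets
```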

Empirical Results

The empirical evaluation of FreeSplat was conducted on the ScanNet and Replica datasets. Various experimental settings, including 2-view, 3-view, and 10-view sequences, were employed to assess the performance comprehensively.

View Interpolation on ScanNet:

- FreeSplat significantly outperformed existing methods, with PSNR improvements of over 1.67 dB in the 2-view setting and 1.48 dB in the 3-view setting.
- In terms of efficiency, FreeSplat exhibited a much faster inference speed than NeuRay while delivering superior image synthesis quality.
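
For reference, PSNR is the standard log-scale image fidelity metric these gains are reported in. A minimal implementation for images normalized to [0, 1] (the standard formula, not code from the paper):

```python
import torch

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB. Note the log scale: a +1.67 dB
    gain corresponds to roughly a 32% reduction in mean squared error."""
    mse = torch.mean((pred - target) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)
```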

Long Sequence Reconstruction on ScanNet:

- FreeSplat demonstrated marked improvements in both view interpolation and view extrapolation when provided with 10 input views, with a PSNR gain of over 1.80 dB compared to previous 3DGS methods.
- The FVT strategy provided an additional boost, enhancing PSNR by over 1.85 dB compared to the FreeSplat model trained without FVT.

Novel View Depth Rendering:

- FreeSplat achieved superior depth estimation accuracy across novel views, outperforming other methods significantly. The $\delta<1.25$ metric rose by over 27.0%, highlighting the framework's capacity to support accurate unsupervised depth estimation.
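
The $\delta<1.25$ metric is the standard depth threshold accuracy: the fraction of pixels whose predicted depth is within a factor of 1.25 of the ground truth. A minimal implementation (the standard metric, not code from the paper):

```python
import torch

def delta_accuracy(pred_depth, gt_depth, thresh=1.25, eps=1e-6):
    """Fraction of valid pixels where max(pred/gt, gt/pred) < thresh."""
    valid = gt_depth > eps
    ratio = torch.maximum(pred_depth[valid] / gt_depth[valid],
                          gt_depth[valid] / pred_depth[valid])
    return (ratio < thresh).float().mean()
```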

Zero-Shot Transfer to Replica

FreeSplat's generalizability was further validated through zero-shot evaluations on the Replica dataset, where the framework maintained superior performance in view interpolation and novel view depth rendering. Despite some performance degradation in long-sequence reconstruction due to the domain gap and depth estimation inaccuracies, FreeSplat's flexible architecture and FVT strategy provide a strong foundation for future improvements in cross-domain applications.

Conclusion and Future Directions

FreeSplat contributes significantly to the field of 3D scene reconstruction and novel view synthesis. Its efficient feature aggregation, redundancy elimination, and adaptable training strategy offer a compelling solution for real-time rendering and large scene reconstruction. Future research may explore enhancements in zero-shot depth estimation and further optimizations to reduce computational overhead while maintaining high visual fidelity across diverse datasets. Additional studies could also focus on integrating depth priors or leveraging advanced neural architectures to improve cross-domain generalizability and real-time performance.

Implications

FreeSplat's advancements have noteworthy implications for various applications, including virtual reality, augmented reality, and photorealistic scene reconstruction. The framework's efficiency and adaptability make it particularly suited for interactive systems that require rapid rendering of high-quality views from multiple perspectives. The elimination of redundant Gaussians and robust handling of long sequences highlight FreeSplat's potential to impact real-world scenarios where computational resources and real-time performance are critical.
