Abstract

This paper presents GGRt, a novel approach to generalizable novel view synthesis that removes the need for real camera poses, the complexity of processing high-resolution images, and lengthy per-scene optimization, thereby broadening the applicability of 3D Gaussian Splatting (3D-GS) to real-world scenarios. Specifically, we design a joint learning framework consisting of an Iterative Pose Optimization Network (IPO-Net) and a Generalizable 3D-Gaussians (G-3DG) model. Through joint learning, the framework estimates robust relative poses directly from image observations, largely eliminating the requirement for real camera poses. Moreover, we implement a deferred back-propagation mechanism that enables high-resolution training and inference, overcoming the resolution constraints of previous methods. To further improve speed and efficiency, we introduce a progressive Gaussian cache module that dynamically adjusts during training and inference. As the first pose-free generalizable 3D-GS framework, GGRt achieves inference at $\ge$ 5 FPS and real-time rendering at $\ge$ 100 FPS. Extensive experiments demonstrate that our method outperforms existing NeRF-based pose-free techniques in inference speed and effectiveness, and approaches the quality of pose-based 3D-GS methods. Our contributions mark a significant step toward integrating computer vision and computer graphics into practical applications, offering state-of-the-art results on the LLFF, KITTI, and Waymo Open datasets and enabling real-time rendering for immersive experiences.

GGRt: the first pose-free generalizable 3D Gaussian Splatting approach, achieving inference at $\ge$ 5 FPS and real-time rendering at $\ge$ 100 FPS.

Overview

  • GGRt introduces a novel framework for real-time, pose-free, generalizable 3D Gaussian Splatting, aimed at improving image-based novel view synthesis and 3D reconstruction.

  • The framework combines an Iterative Pose Optimization Network (IPO-Net) with a Generalizable 3D-Gaussians (G-3DG) model to estimate camera poses directly from images and generate precise 3D reconstructions.

  • It introduces a deferred back-propagation mechanism and a progressive Gaussian cache module to achieve high-resolution training, rapid inference, and real-time rendering speeds exceeding 100 FPS.

  • Extensive experiments demonstrate GGRt's superior performance in inference speed and effectiveness against existing pose-free NeRF-based techniques and comparable results to pose-based 3D-GS methods.

GGRt: A Novel Framework for Real-Time Generalizable 3D Gaussian Splatting without Camera Poses

Introduction

Innovations in Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3D-GS) have significantly narrowed the gap between computer vision and computer graphics, particularly in image-based novel view synthesis and 3D reconstruction. Despite these advancements, several challenges persist, notably the dependency on accurate camera poses, the complexity of processing high-resolution images, and the extensive optimization processes required. The paper presents GGRt, a novel framework designed to address these challenges by offering a real-time, pose-free, generalizable approach to 3D Gaussian Splatting.

Key Contributions

The key contributions of the paper can be condensed into the following points:

  • A novel pipeline combining an Iterative Pose Optimization Network (IPO-Net) with a Generalizable 3D-Gaussians (G-3DG) model. This combination facilitates the robust estimation of relative camera poses directly from image observations, bypassing the need for pre-acquired camera poses.
  • The introduction of a deferred back-propagation mechanism, which enables efficient high-resolution training and inference, overcoming the resolution limits of previous methods.
  • A progressive Gaussian cache module that dynamically adjusts during training and inference to further accelerate processing, enabling inference at $\ge$ 5 FPS and real-time rendering at $\ge$ 100 FPS.
  • Extensive experimentation validating GGRt's superior performance against existing pose-free NeRF-based techniques in terms of inference speed and effectiveness. Additionally, GGRt demonstrates comparable, if not superior, results to pose-based 3D-GS methods, thereby affirming its practical applicability.

Methodology

Framework Overview

GGRt's architecture is centered around joint learning between IPO-Net for iterative pose optimization and G-3DG for generalizable Gaussian point generation. This setup inherently facilitates pose-free operation by estimating relative poses through a shared image encoder that extracts geometric and semantic cues essential for robust 3D reconstruction.

Iterative Pose Optimization Network (IPO-Net)

IPO-Net iteratively estimates relative camera poses by minimizing a photometric loss together with edge-aware smoothness constraints. This component eliminates the requirement for pre-determined real camera poses.
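
The two objectives can be sketched as follows. This is a minimal NumPy stand-in, not the paper's exact formulation: `photometric_loss` compares a view warped by the current pose estimate against the target image, and `edge_aware_smoothness` penalizes disparity gradients except where the image itself has strong edges.

```python
import numpy as np

def photometric_loss(warped, target):
    """Mean absolute difference (simple L1 variant) between a view
    warped by the current pose estimate and the actual target image."""
    return np.abs(warped - target).mean()

def edge_aware_smoothness(disp, image):
    """Penalize disparity gradients, down-weighted by exp(-|dI|) so that
    discontinuities aligned with image edges are not punished."""
    dx_d = np.abs(np.diff(disp, axis=1))                 # disparity grads
    dy_d = np.abs(np.diff(disp, axis=0))
    dx_i = np.abs(np.diff(image, axis=1)).mean(axis=-1)  # image grads
    dy_i = np.abs(np.diff(image, axis=0)).mean(axis=-1)
    return (dx_d * np.exp(-dx_i)).mean() + (dy_d * np.exp(-dy_i)).mean()
```

In practice both terms would be computed on autograd tensors so their gradients flow back into the pose estimate; the arrays here only illustrate the shape of the objective.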

Generalizable 3D-Gaussians (G-3DG)

G-3DG generates Gaussian points from reference images and uses them to synthesize novel views. The module incorporates epipolar sampling, cross-attention, and local self-attention mechanisms to ensure precise 3D reconstruction.
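
The epipolar sampling step can be illustrated geometrically: a target pixel is back-projected at several candidate depths and re-projected into a reference camera, giving the locations along the epipolar line where reference features are gathered. The helper below is hypothetical (shared intrinsics `K`, relative rotation `R` and translation `t` are assumptions), not the paper's implementation.

```python
import numpy as np

def epipolar_samples(pixel, K, R, t, depths):
    """For one target pixel, unproject at candidate depths and project
    into a reference camera, yielding sample points on the epipolar line."""
    u, v = pixel
    ray = np.linalg.inv(K) @ np.array([u, v, 1.0])  # back-projected ray
    pts = []
    for d in depths:
        X = ray * d                  # 3D point in target camera frame
        Xr = R @ X + t               # transform into reference frame
        p = K @ Xr                   # project with shared intrinsics
        pts.append(p[:2] / p[2])     # perspective divide
    return np.array(pts)
```

Features sampled at these locations are then fused by cross-attention with the target pixel's feature, which is how the module resolves depth without explicit stereo matching.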

Gaussians Cache Mechanism

This mechanism avoids redundant computation by dynamically storing, querying, and releasing Gaussian points, which is key to the framework's efficiency and enables rapid inference and rendering.
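
The store/query/release cycle can be sketched with a small cache keyed by reference view. This is a minimal illustration under assumed names (`put`, `get`, `max_views` are not from the paper): Gaussians predicted for a view are stored once, reused on later queries, and released when the view leaves the active window.

```python
class GaussianCache:
    """Progressive cache sketch: Gaussian points predicted for a
    reference view are stored once, queried on reuse, and the oldest
    view is released when the cache exceeds its window."""

    def __init__(self, max_views=8):
        self.max_views = max_views
        self._store = {}   # view_id -> cached Gaussian points
        self._order = []   # insertion order, used for release

    def put(self, view_id, gaussians):
        if view_id not in self._store:
            self._order.append(view_id)
        self._store[view_id] = gaussians
        while len(self._order) > self.max_views:      # release oldest
            self._store.pop(self._order.pop(0), None)

    def get(self, view_id):
        return self._store.get(view_id)  # None means: recompute this view
```

A cache hit skips the whole G-3DG forward pass for that reference view, which is where the inference-time speedup comes from.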

Deferred Back-propagation

To tackle challenges related to GPU memory constraints, the framework employs a deferred back-propagation strategy. This approach allows for efficient handling of high-resolution images during both training and inference phases.
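A sketch of the idea, with stand-in callables rather than a real differentiable rasterizer: the full image is rendered once without retaining intermediate activations, the pixel-wise loss gradient is computed, and gradients are then back-propagated patch by patch so peak memory stays bounded regardless of resolution. `render_fn` and `grad_fn` are assumptions standing in for the renderer's forward and per-patch backward passes.

```python
import numpy as np

def deferred_backprop(render_fn, grad_fn, params, target, patch=64):
    """Deferred back-propagation sketch: full no-grad forward pass,
    then per-patch backward passes accumulated into one gradient."""
    full = render_fn(params)            # forward pass, no activations kept
    d_image = np.sign(full - target)    # dL/dI for an L1 loss
    grad = np.zeros_like(params)
    h, w = full.shape[:2]
    for y in range(0, h, patch):
        for x in range(0, w, patch):
            sl = (slice(y, y + patch), slice(x, x + patch))
            grad += grad_fn(params, sl, d_image[sl])  # re-render patch with grad
    return grad
```

Only one patch's activations are alive at a time, so memory scales with the patch size rather than the full image, at the cost of re-rendering each patch during the backward pass.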

Experimental Results

GGRt's capabilities are thoroughly assessed through extensive experiments across various datasets including LLFF, KITTI, and Waymo Open. These evaluations demonstrate GGRt's superior performance in inference speed and effectiveness compared to existing techniques. Remarkably, GGRt achieves real-time rendering speeds exceeding 100 FPS, a significant advancement in the field.

Implications and Future Directions

The development of GGRt presents significant implications for the practical application of novel view synthesis and real-time 3D reconstruction across diverse areas such as virtual reality and immersive entertainment. The framework's ability to operate without the need for real camera poses, combined with its efficiency in handling high-resolution images, marks a progressive step forward. Looking ahead, further exploration into enhancing the robustness of pose estimation and expanding the framework's applicability to even larger-scale scenes presents exciting avenues for future research.

In conclusion, GGRt emerges as a groundbreaking framework in the field of generalizable 3D Gaussian Splatting, promising enhancements in the efficacy and applicability of novel view synthesis techniques.
