GFlow: Recovering 4D World from Monocular Video

(arXiv:2405.18426)
Published May 28, 2024 in cs.CV and cs.AI

Abstract

Reconstructing 4D scenes from video inputs is a crucial yet challenging task. Conventional methods usually rely on the assumptions of multi-view video inputs, known camera parameters, or static scenes, all of which are typically absent in in-the-wild scenarios. In this paper, we relax all these constraints and tackle a highly ambitious but practical task, which we term AnyV4D: we assume only one monocular video is available without any camera parameters as input, and we aim to recover the dynamic 4D world alongside the camera poses. To this end, we introduce GFlow, a new framework that utilizes only 2D priors (depth and optical flow) to lift a video (3D) to a 4D explicit representation, entailing a flow of Gaussian splatting through space and time. GFlow first clusters the scene into still and moving parts, then applies a sequential optimization process that optimizes camera poses and the dynamics of the 3D Gaussian points based on the 2D priors and the scene clustering, ensuring fidelity among neighboring points and smooth movement across frames. Since dynamic scenes always introduce new content, we also propose a new pixel-wise densification strategy for Gaussian points to integrate the new visual content. Moreover, GFlow transcends the boundaries of mere 4D reconstruction; it also enables tracking of any point across frames without prior training and segments moving objects from the scene in an unsupervised way. Additionally, the camera pose of each frame can be derived from GFlow, allowing novel views of a video scene to be rendered by changing the camera pose. By employing an explicit representation, we can readily conduct scene-level or object-level editing as desired, underscoring its versatility and power. Visit our project website at: https://littlepure2333.github.io/GFlow

GFlow reconstructs dynamic scenes from monocular videos using 3D Gaussian splatting.

Overview

  • GFlow introduces a method to reconstruct dynamic 4D scenes from monocular video inputs by using 3D Gaussian points and leveraging depth and optical flow priors.

  • It features a scene clustering process that segregates video scenes into still and moving parts, optimized iteratively to refine camera poses and 3D Gaussian points, enhancing scene fidelity.

  • Experimental evaluations show GFlow outperforming existing methods in reconstruction quality and segmentation accuracy, making it highly suitable for applications in virtual reality, robotics, and video editing.

GFlow: Dynamic 4D Reconstruction from Monocular Video Inputs Using Gaussian Splatting

This paper presents "GFlow", a method for reconstructing dynamic 4D scenes from monocular video inputs, a task referred to as "AnyV4D". GFlow represents an advancement over conventional methods that rely on multi-view video inputs, pre-calibrated camera parameters, or static-scene assumptions. The proposed approach dispenses with these constraints, making it particularly suitable for in-the-wild scenarios where only a single uncalibrated video is available.

Overview

GFlow leverages explicit 3D Gaussian Splatting (3DGS) to model video content as a flow of Gaussian points through space and time, relying purely on 2D priors such as depth and optical flow. The system is organized around the following critical components:

  1. Scene Clustering: The video scene is segregated into still and moving parts, managed via a K-Means clustering algorithm.
  2. Sequential Optimization: An iterative optimization process refines camera poses and dynamically adjusts the 3D Gaussians based on RGB, depth, and optical flow constraints.
  3. Pixel-wise Densification: A novel strategy that dynamically introduces new Gaussian points to represent newly revealed content, enhancing the fidelity of the dynamic scene.
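
The per-frame interplay of these components can be sketched as below. All helper names (`initialize_gaussians`, `cluster_points`, `optimize_camera`, `optimize_gaussians`, `densify`, `depth_fn`, `flow_fn`) are hypothetical placeholders used only to illustrate the flow of the pipeline; they are not the authors' API.

```python
# Minimal sketch of a GFlow-style per-frame loop (hypothetical helpers).
def reconstruct_video(frames, depth_fn, flow_fn):
    depth0 = depth_fn(frames[0])                          # monocular depth prior
    gaussians = initialize_gaussians(frames[0], depth0)   # unproject first frame
    camera = identity_pose()                              # first frame defines the world frame
    trajectory = [(camera, gaussians)]
    for t in range(1, len(frames)):
        depth = depth_fn(frames[t])                       # 2D prior: depth
        flow = flow_fn(frames[t - 1], frames[t])          # 2D prior: optical flow
        still, moving = cluster_points(gaussians, flow)   # scene clustering
        camera = optimize_camera(camera, gaussians, still,        # fit pose on still points
                                 frames[t], depth, flow)
        gaussians = optimize_gaussians(gaussians, camera,         # then refine the points
                                       frames[t], depth, flow)
        gaussians = densify(gaussians, camera, frames[t], depth)  # add newly revealed content
        trajectory.append((camera, gaussians))
    return trajectory
```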

Contributions and Methodology

Scene Clustering

Scene clustering categorizes the 3D Gaussian points into still and moving clusters at each frame, enabling more accurate optimization by distinguishing static from dynamic components within the scene. Points are initially assigned labels based on the motion indicated by the optical flow map. In subsequent frames, existing Gaussian points inherit their labels, while newly added points are assigned to whichever cluster they most resemble.
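
A minimal sketch of the initial still/moving split is given below, assuming each Gaussian point has an associated 2D optical-flow displacement; the exact features fed to K-Means are an assumption here, not a detail taken from the paper.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_still_moving(flow_2d):
    """Split points into still/moving by their 2D motion magnitude.

    flow_2d: (N, 2) array of per-point optical-flow displacements
    (assumed interface). Returns a boolean mask, True = still point.
    """
    speed = np.linalg.norm(flow_2d, axis=1, keepdims=True)   # motion magnitude per point
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(speed)
    # The cluster whose center has the smaller speed is treated as "still".
    still_label = int(np.argmin(km.cluster_centers_.ravel()))
    return km.labels_ == still_label
```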

Alternating Optimization

The method alternates between optimizing camera poses and Gaussian points. First, the camera extrinsics are tuned so that the still points align with the observed static background, guided by the depth and optical flow priors. Once the camera pose is refined, the Gaussian points are optimized to minimize photometric, depth, and optical-flow errors, ensuring smooth temporal coherence across frames.
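
The alternation could look roughly like the sketch below. Here `render` stands in for a differentiable Gaussian-splatting rasterizer that returns rendered RGB, depth, and point motion, and `gaussians` is assumed to expose its learnable tensors via `parameters()`; loss weights, learning rates, and step counts are illustrative, not the authors' settings.

```python
import torch

def optimize_frame(gaussians, init_pose, frame, depth_prior, flow_prior,
                   render, still_mask, cam_steps=100, pts_steps=300):
    """Sketch of the alternating scheme; `render` and the Gaussian
    container are assumed interfaces, not the authors' implementation."""
    # Stage 1: refine the camera pose against the *still* points only,
    # supervised by the depth and optical-flow priors.
    pose = init_pose.clone().requires_grad_(True)
    cam_opt = torch.optim.Adam([pose], lr=1e-3)
    for _ in range(cam_steps):
        _, depth, flow = render(gaussians, pose, mask=still_mask)
        loss = (depth - depth_prior).abs().mean() + (flow - flow_prior).abs().mean()
        cam_opt.zero_grad()
        loss.backward()
        cam_opt.step()
    pose = pose.detach()

    # Stage 2: with the pose fixed, refine all Gaussian attributes for
    # photometric, depth, and flow consistency.
    pts_opt = torch.optim.Adam(gaussians.parameters(), lr=1e-2)
    for _ in range(pts_steps):
        rgb, depth, flow = render(gaussians, pose)
        loss = ((rgb - frame) ** 2).mean() \
             + (depth - depth_prior).abs().mean() \
             + (flow - flow_prior).abs().mean()
        pts_opt.zero_grad()
        loss.backward()
        pts_opt.step()
    return pose, gaussians
```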

Initialization and Densification

To initialize Gaussian points, an edge-based texture probability map is used to prioritize areas with more complex textures. Depth estimates from a monocular depth estimator are then used to unproject the sampled pixels into 3D space. A pixel-wise densification strategy subsequently enriches the Gaussian point set iteratively, targeting areas with high photometric error so that newly revealed dynamic content is modeled in detail.
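
A sketch of such an initialization is shown below, assuming a 3x3 pinhole intrinsics matrix and a dense monocular depth map; the Sobel-based edge measure and sampling details are illustrative assumptions.

```python
import numpy as np
import cv2

def sample_init_points(image, depth, intrinsics, n_points):
    """Sample pixels proportional to edge strength and unproject them to 3D.

    image: (H, W, 3) RGB frame; depth: (H, W) depth map;
    intrinsics: 3x3 pinhole matrix (assumed interfaces).
    """
    gray = cv2.cvtColor(image, cv2.COLOR_RGB2GRAY)
    gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0)
    gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1)
    edge = np.sqrt(gx ** 2 + gy ** 2)
    prob = edge.ravel() + 1e-6            # edge-based texture probability (unnormalized)
    prob /= prob.sum()

    h, w = gray.shape
    idx = np.random.choice(h * w, size=n_points, replace=False, p=prob)
    v, u = np.divmod(idx, w)              # pixel coordinates of sampled points

    # Unproject: x = (u - cx) * z / fx, y = (v - cy) * z / fy
    z = depth[v, u]
    fx, fy = intrinsics[0, 0], intrinsics[1, 1]
    cx, cy = intrinsics[0, 2], intrinsics[1, 2]
    return np.stack([(u - cx) * z / fx, (v - cy) * z / fy, z], axis=1)
```

Densification could reuse the same proportional sampling, but driven by a per-pixel photometric-error map of the current render instead of the edge map.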

Experimental Evaluation

The evaluation is conducted on the DAVIS and Tanks and Temples datasets, covering reconstruction quality, object segmentation, and camera pose accuracy. Notably, GFlow significantly outperforms CoDeF in PSNR, SSIM, and LPIPS, benefiting from an explicit representation that adapts to dynamic scenes without compromising visual fidelity.

Average scores:

| Dataset | Method | PSNR ↑ | SSIM ↑ | LPIPS ↓ |
|---|---|---|---|---|
| DAVIS | GFlow | 29.5508 | 0.9387 | 0.1067 |
| DAVIS | CoDeF | 24.8904 | 0.7703 | 0.2932 |
| Tanks and Temples | GFlow | 32.7258 | 0.9720 | 0.0363 |

The Tanks and Temples results highlight robust performance even in complex scenarios.
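
For reference, these full-frame metrics are standard and can be computed roughly as below, assuming the `torchmetrics` and `lpips` packages; this is not the authors' evaluation code.

```python
import lpips
from torchmetrics.functional import (
    peak_signal_noise_ratio,
    structural_similarity_index_measure,
)

lpips_fn = lpips.LPIPS(net="alex")  # perceptual distance network

def frame_metrics(pred, target):
    """pred/target: (1, 3, H, W) tensors in [0, 1]."""
    psnr = peak_signal_noise_ratio(pred, target, data_range=1.0)
    ssim = structural_similarity_index_measure(pred, target, data_range=1.0)
    lp = lpips_fn(pred * 2 - 1, target * 2 - 1)  # LPIPS expects inputs in [-1, 1]
    return psnr.item(), ssim.item(), lp.item()
```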

Qualitative and quantitative analyses also indicate that GFlow maintains superior segmentation capabilities as a by-product. Without specific training for segmentation, GFlow's intrinsic clustering allows accurate tracking and segmentation of moving objects.

Implications and Future Work

The ability to reconstruct dynamic scenes from monocular videos has broad implications for various domains, including virtual and augmented reality, robotics, and advanced video editing. GFlow's framework opens avenues for novel view synthesis, scene editing, and object manipulation, facilitated by its explicit scene representation.

Future research could focus on enhancing GFlow's robustness by integrating advanced depth estimation and optical flow techniques, improving clustering strategies, and adopting a more refined global optimization approach. Given its potential, GFlow is poised to influence further developments in dynamic scene reconstruction and understanding.

In conclusion, GFlow introduces a comprehensive framework for 4D reconstruction from monocular video, excelling in dynamic scene fidelity, flexibility, and practical utility. This research offers a foundational methodology likely to inspire subsequent advances in computer vision and related fields.
