Abstract

Most existing human rendering methods require every part of the human to be fully visible throughout the input video. However, this assumption does not hold in real-life settings, where obstructions are common and result in only partial visibility of the human. To address this, we present OccFusion, an approach that combines efficient 3D Gaussian splatting with supervision from pretrained 2D diffusion models for efficient, high-fidelity rendering of occluded humans. We propose a pipeline consisting of three stages. In the Initialization stage, complete human masks are generated from partial visibility masks. In the Optimization stage, 3D human Gaussians are optimized with additional supervision by Score-Distillation Sampling (SDS) to create a complete geometry of the human. Finally, in the Refinement stage, in-context inpainting is used to further improve rendering quality on the less observed parts of the human body. We evaluate OccFusion on ZJU-MoCap and on challenging OcMotion sequences and find that it achieves state-of-the-art performance in rendering occluded humans.

OccFusion's three-stage pipeline for occluded human rendering: Initialization (complete binary mask recovery), Optimization, and Refinement (final rendering).

Overview

  • The paper addresses the challenge of rendering 3D humans from monocular videos, especially in occlusion scenarios, through a method called OccFusion that combines 3D Gaussian splatting with pretrained 2D diffusion models.

  • OccFusion operates via a three-stage pipeline: Initialization (generating complete human occupancy masks), Optimization (optimizing the 3D human Gaussians with Score-Distillation Sampling supervision), and Refinement (enhancing rendering quality through in-context inpainting).

  • Evaluation on the ZJU-MoCap and OcMotion datasets demonstrates OccFusion's state-of-the-art performance, providing high-fidelity, artifact-free renderings with superior quantitative metrics (PSNR, SSIM, and LPIPS) compared to existing methods.

Overview of "OccFusion: Rendering Occluded Humans with Generative Diffusion Priors"

"OccFusion: Rendering Occluded Humans with Generative Diffusion Priors" by Adam Sun, Tiange Xiang, Scott Delp, Li Fei-Fei, and Ehsan Adeli presents a novel methodology designed to address the persistent challenge of rendering 3D humans from monocular videos, especially in scenarios involving occlusion. This problem is critical in numerous fields such as virtual and augmented reality, healthcare, and sports analytics. Traditional methods have largely ignored the issue of occlusion, assuming unobstructed views of the human subjects. This paper introduces "OccFusion", a method that leverages 3D Gaussian splatting in combination with pretrained 2D diffusion models to achieve high-fidelity human rendering despite occlusions.

Methodology

OccFusion operates via a structured, multi-stage pipeline consisting of three sequential stages: Initialization, Optimization, and Refinement.

Initialization Stage:

  • This stage generates complete human occupancy masks from partial visibility masks. The authors acknowledge the weaknesses of direct inpainting via diffusion models in challenging poses, particularly when dealing with self-occlusions. To address this, a simplified representation of pose priors is introduced, improving the diffusion model's ability to generate feasible outputs. The results are binary human masks extracted from inpainted images, which offer greater cross-frame consistency.
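The exact generative prior and pose conditioning used for mask completion are not reproduced in this summary, but the core idea can be sketched as follows: inpaint the occluded region of the frame with a pretrained diffusion model, then segment the result to recover a complete binary human mask. The snippet below is a minimal illustration of that idea, assuming a generic Stable Diffusion inpainting checkpoint and a hypothetical `segment_person` helper; it is not the authors' exact setup (in particular, their pose-prior conditioning is omitted).

```python
import numpy as np
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

# Stand-in inpainting model; the paper's actual generative prior and pose conditioning may differ.
pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")

def complete_human_mask(frame, partial_mask, occlusion_mask, segment_person):
    """Hallucinate the occluded body region, then extract a complete binary human mask.

    frame:          PIL.Image of the input video frame
    partial_mask:   (H, W) bool array of visible-human pixels
    occlusion_mask: (H, W) bool array of pixels hidden by the obstruction
    segment_person: callable returning an (H, W) bool person mask for an image
                    (hypothetical helper, e.g. any off-the-shelf human segmenter)
    """
    hole = Image.fromarray(occlusion_mask.astype(np.uint8) * 255)
    inpainted = pipe(
        prompt="a full human body, photorealistic",
        image=frame,
        mask_image=hole,
    ).images[0]
    # Complete mask = visible pixels plus person pixels recovered inside the occluded region.
    recovered = segment_person(inpainted)
    return partial_mask | (recovered & occlusion_mask)
```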

Optimization Stage:

  • In this stage, the 3D human Gaussians are optimized, with additional supervision provided by Score-Distillation Sampling (SDS). Using diffusion priors to enforce completeness of the body model is both novel and effective: the authors apply SDS to renderings of the human in its canonical pose, ensuring a consistent and complete 3D representation from arbitrary viewing angles.
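The summary does not reproduce the SDS formulation, so the sketch below shows the standard DreamFusion-style SDS gradient as it would apply to an image rendered from the Gaussians, assuming a frozen diffusers-style denoiser `unet` and its noise schedule `alphas_cumprod`. Classifier-free guidance and the latent-space encoding used by Stable Diffusion are omitted for brevity, and the exact variant used in OccFusion may differ.

```python
import torch

def sds_loss(rendered, text_embed, unet, alphas_cumprod):
    """Standard Score-Distillation Sampling surrogate loss (DreamFusion-style sketch).

    rendered:       (B, 3, H, W) image rendered from the 3D human Gaussians (requires grad)
    text_embed:     text-conditioning embeddings for the frozen diffusion model
    unet:           frozen denoiser predicting the noise eps_theta(x_t, t, text)
    alphas_cumprod: (T,) cumulative product of the diffusion noise schedule
    """
    B, T = rendered.shape[0], alphas_cumprod.shape[0]
    t = torch.randint(int(0.02 * T), int(0.98 * T), (B,), device=rendered.device)

    # Forward-diffuse the rendering: x_t = sqrt(a_t) * x_0 + sqrt(1 - a_t) * eps
    noise = torch.randn_like(rendered)
    a_t = alphas_cumprod[t].view(B, 1, 1, 1)
    x_t = a_t.sqrt() * rendered + (1.0 - a_t).sqrt() * noise

    with torch.no_grad():
        eps_pred = unet(x_t, t, encoder_hidden_states=text_embed).sample

    # SDS gradient w(t) * (eps_pred - eps); the dot-product surrogate below
    # back-propagates exactly this gradient to the rendered image.
    grad = (1.0 - a_t) * (eps_pred - noise)
    return (grad.detach() * rendered).sum() / B
```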

Refinement Stage:

  • The final stage involves enhancing the rendering quality using in-context inpainting. Here, coarse renderings from the Optimization stage are utilized as contextual references for the diffusion model to generate high-fidelity appearances for occluded regions. This approach significantly refines both the appearance and the geometry of the human model, ensuring the final rendering is of superior quality.
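The summary does not state how the inpainted images feed back into training. One plausible way to consume them, sketched below under that assumption, is to keep the usual reconstruction loss on visible pixels and add a down-weighted loss that pulls the less observed regions toward the in-context inpainting result; the function name and the weight `w_inpaint` are illustrative, not the authors' exact formulation.

```python
import torch

def refinement_loss(render, observed_rgb, visible_mask, inpainted_rgb,
                    less_observed_mask, w_inpaint=0.1):
    """Mix ground-truth supervision on visible pixels with diffusion-inpainted pseudo
    supervision on less observed pixels (illustrative sketch; weights are assumptions).

    render, observed_rgb, inpainted_rgb: (3, H, W) float tensors in [0, 1]
    visible_mask, less_observed_mask:    (1, H, W) float tensors in {0, 1}
    """
    loss_vis = (visible_mask * (render - observed_rgb).abs()).sum() / visible_mask.sum().clamp(min=1)
    loss_inp = (less_observed_mask * (render - inpainted_rgb).abs()).sum() / less_observed_mask.sum().clamp(min=1)
    return loss_vis + w_inpaint * loss_inp
```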

Results

The paper’s evaluation of OccFusion on the ZJU-MoCap and OcMotion datasets demonstrates its superiority over existing methods. The quantitative metrics, including Peak Signal-to-Noise Ratio (PSNR), Structural Similarity (SSIM), and Learned Perceptual Image Patch Similarity (LPIPS), highlight the method's effectiveness. Notably, OccFusion achieves state-of-the-art performance, with higher PSNR and SSIM values and considerably lower LPIPS values than competing methods. Qualitative evaluations further show that OccFusion produces sharp, artifact-free renderings, even under complex occlusions.
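For reference, all three reported metrics can be computed off the shelf. The snippet below uses torchmetrics on dummy tensors; it is not the authors' evaluation script, and the image layout, data range, and VGG backbone for LPIPS are assumptions.

```python
import torch
from torchmetrics.image import PeakSignalNoiseRatio, StructuralSimilarityIndexMeasure
from torchmetrics.image.lpip import LearnedPerceptualImagePatchSimilarity

# Rendered and ground-truth frames as (N, 3, H, W) float tensors in [0, 1]
pred = torch.rand(4, 3, 256, 256)
target = torch.rand(4, 3, 256, 256)

psnr = PeakSignalNoiseRatio(data_range=1.0)
ssim = StructuralSimilarityIndexMeasure(data_range=1.0)
lpips = LearnedPerceptualImagePatchSimilarity(net_type="vgg", normalize=True)  # normalize=True expects [0, 1] inputs

print(f"PSNR:  {psnr(pred, target):.2f} dB  (higher is better)")
print(f"SSIM:  {ssim(pred, target):.4f}     (higher is better)")
print(f"LPIPS: {lpips(pred, target):.4f}    (lower is better)")
```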

Implications and Future Directions

The practical implications of OccFusion are profound. In healthcare, for example, accurate 3D reconstructions of occluded human figures can enhance remote patient monitoring and telemedicine. In sports analytics, it can provide precise replay analysis of athletes occluded by equipment or other players. Additionally, in augmented reality, it enables more realistic and immersive user experiences by accurately reconstructing environments with occluded humans.

Theoretically, this research illuminates the potential of combining explicit geometric techniques (like 3D Gaussian splatting) with rapid advancements in generative models (such as diffusion priors). The structured approach of isolating and addressing specific weaknesses at different stages of the pipeline is particularly noteworthy.

Future directions could involve exploring more sophisticated generative models tailored exclusively for human rendering tasks, potentially improving cross-frame consistency and overall rendering fidelity. Further integration with real-time processing systems could also be explored, enabling instant applications in dynamic environments.

In conclusion, "OccFusion" pushes the boundaries of monocular human rendering, presenting a significant step forward in addressing occlusions. The robust experimental evidence provided validates its place as a leading technique in the domain, pointing towards a fertile ground for subsequent innovations.
