Decomposing 3D Scenes into Objects via Unsupervised Volume Segmentation

Published 2 Apr 2021 in cs.CV, cs.LG, and stat.ML | (2104.01148v1)

Abstract: We present ObSuRF, a method which turns a single image of a scene into a 3D model represented as a set of Neural Radiance Fields (NeRFs), with each NeRF corresponding to a different object. A single forward pass of an encoder network outputs a set of latent vectors describing the objects in the scene. These vectors are used independently to condition a NeRF decoder, defining the geometry and appearance of each object. We make learning more computationally efficient by deriving a novel loss, which allows training NeRFs on RGB-D inputs without explicit ray marching. After confirming that the model performs as well as or better than the state of the art on three 2D image segmentation benchmarks, we apply it to two multi-object 3D datasets: a multiview version of CLEVR, and a novel dataset in which scenes are populated by ShapeNet models. We find that after training ObSuRF on RGB-D views of training scenes, it is capable not only of recovering the 3D geometry of a scene depicted in a single input image, but also of segmenting it into objects, despite receiving no supervision in that regard.

Authors (3)

Citations (104)

Summary

The paper introduces ObSuRF, an unsupervised approach that decomposes 3D scenes into distinct object representations from a single image using neural radiance fields.
It uses an encoder to generate a set of latent vectors, each of which independently conditions a NeRF decoder to define the geometry and appearance of one object.
The method matches or outperforms state-of-the-art methods on three 2D image segmentation benchmarks and is then applied to two multi-object 3D datasets.

The paper "Decomposing 3D Scenes into Objects via Unsupervised Volume Segmentation" presents ObSuRF, a method that turns a single image of a scene into a 3D model. The model is represented as a set of Neural Radiance Fields (NeRFs), where each field corresponds to a different object in the scene.

Methodology

ObSuRF first passes the input image through an encoder network, which outputs a set of latent vectors in a single forward pass. Each vector independently conditions a NeRF decoder, which defines the geometry and appearance of one object. The paper also derives a novel loss that makes learning more computationally efficient: it allows NeRFs to be trained on RGB-D inputs without explicit ray marching, which is typically expensive.
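One way to read "without explicit ray marching" is as a likelihood on the observed depth. The sketch below is an illustration under stated assumptions, not the paper's implementation. It treats the density along a ray as the rate of a point process, so an observed depth d has negative log-likelihood -log sigma(x_d) + integral from 0 to d of sigma(x_t) dt, and it estimates the integral from one uniformly drawn sample instead of a dense march. Summing per-object densities to obtain the scene density is also an assumption, and `object_density`, `scene_density`, and `depth_nll` are hypothetical names. The objects are hand-written blobs standing in for latent-conditioned decoders.

```python
import math
import random

def object_density(center, radius, peak, point):
    """Toy stand-in for one latent-conditioned decoder: a soft blob of density."""
    dist2 = sum((p - c) ** 2 for p, c in zip(point, center))
    return peak * math.exp(-dist2 / (2.0 * radius ** 2))

def scene_density(objects, point):
    """Assumed composition rule: scene density is the sum of object densities."""
    return sum(object_density(c, r, k, point) for (c, r, k) in objects)

def point_on_ray(origin, direction, t):
    """Point at distance t along the ray origin + t * direction."""
    return tuple(o + t * d for o, d in zip(origin, direction))

def depth_nll(objects, origin, direction, depth, rng, eps=1e-8):
    """Negative log-likelihood of an observed depth from two density queries.

    The integral of the density over [0, depth] is replaced by a one-sample
    Monte Carlo estimate: depth * sigma(x_t) with t drawn uniformly.
    """
    surface = point_on_ray(origin, direction, depth)
    t = rng.uniform(0.0, depth)
    free_space = point_on_ray(origin, direction, t)
    surface_term = -math.log(scene_density(objects, surface) + eps)
    integral_estimate = depth * scene_density(objects, free_space)
    return surface_term + integral_estimate

# Two hypothetical objects: (center, radius, peak density).
objects = [((0.0, 0.0, 3.0), 0.3, 20.0), ((1.0, 0.0, 5.0), 0.3, 20.0)]
origin, direction = (0.0, 0.0, 0.0), (0.0, 0.0, 1.0)
rng = random.Random(0)

# Average the noisy estimate to compare a depth near the first object's
# surface with a depth that ends in empty space.
n = 200
good = sum(depth_nll(objects, origin, direction, 2.8, rng) for _ in range(n)) / n
bad = sum(depth_nll(objects, origin, direction, 1.0, rng) for _ in range(n)) / n
print(good < bad)
```

Each call queries the density at only two points per ray, which is where the saving over a dense march would come from in this reading. A colour term evaluated at the surface point would be added in the same way.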

Evaluation and Performance

ObSuRF was first compared with state-of-the-art methods on three 2D image segmentation benchmarks, where it performed as well or better. This is notable because the method is designed primarily for 3D reconstruction rather than 2D segmentation.

3D Datasets and Generalization

The evaluation extended to two multi-object 3D datasets:

A multiview version of CLEVR.
A new dataset populated by ShapeNet models.

After training on RGB-D views of scenes from these datasets, ObSuRF could not only recover the 3D geometry of a scene from a single input image but also segment that scene into individual objects. The model received no segmentation supervision, so the decomposition into objects was learned entirely unsupervised.
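A set of per-object fields can yield a segmentation roughly as follows. This is a minimal sketch under assumptions, not the paper's procedure: a 3D point is assigned to whichever object contributes the most density there, and colours are mixed in proportion to density. The blob fields, the dictionary layout, and the function names `blob_density` and `compose` are hypothetical.

```python
import math

def blob_density(center, radius, peak, point):
    """Toy per-object density field (stand-in for a conditioned NeRF decoder)."""
    dist2 = sum((p - c) ** 2 for p, c in zip(point, center))
    return peak * math.exp(-dist2 / (2.0 * radius ** 2))

def compose(objects, point):
    """Return (total density, mixed colour, index of the dominant object)."""
    densities = [
        blob_density(o["center"], o["radius"], o["peak"], point) for o in objects
    ]
    total = sum(densities)
    if total == 0.0:
        # Empty space: no object claims this point.
        return 0.0, (0.0, 0.0, 0.0), None
    weights = [d / total for d in densities]
    colour = tuple(
        sum(w * o["colour"][ch] for w, o in zip(weights, objects))
        for ch in range(3)
    )
    dominant = max(range(len(objects)), key=lambda i: densities[i])
    return total, colour, dominant

# Two hypothetical objects, one red and one blue.
objects = [
    {"center": (0.0, 0.0, 3.0), "radius": 0.3, "peak": 20.0, "colour": (1.0, 0.0, 0.0)},
    {"center": (1.0, 0.0, 5.0), "radius": 0.3, "peak": 20.0, "colour": (0.0, 0.0, 1.0)},
]

# Surface points, e.g. obtained by unprojecting pixels with their depth values.
near_red = compose(objects, (0.0, 0.0, 2.9))
near_blue = compose(objects, (1.0, 0.1, 5.0))
print(near_red[2], near_blue[2])
```

Applying `compose` to the unprojected point of every pixel would give a per-pixel object index, that is, a 2D segmentation mask derived from the 3D decomposition.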

Significance and Contributions

The contributions of this paper are multifaceted:

It introduces an unsupervised approach to decomposing a scene shown in a single image into 3D object representations.
It derives a novel loss that enables more efficient training on RGB-D inputs without explicit ray marching.
It matches or exceeds the state of the art on three 2D segmentation benchmarks and extends to two multi-object 3D datasets.

Overall, ObSuRF advances 3D scene understanding and segmentation, with potential applications in computer vision and graphics. Its ability to learn object decompositions without segmentation supervision, and to infer 3D structure from a single 2D image, is a step forward for neural rendering and scene decomposition.
