NeSF: Neural Semantic Fields for Generalizable Semantic Segmentation of 3D Scenes (2111.13260v3)

Published 25 Nov 2021 in cs.CV and cs.RO

Abstract: We present NeSF, a method for producing 3D semantic fields from posed RGB images alone. In place of classical 3D representations, our method builds on recent work in implicit neural scene representations wherein 3D structure is captured by point-wise functions. We leverage this methodology to recover 3D density fields upon which we then train a 3D semantic segmentation model supervised by posed 2D semantic maps. Despite being trained on 2D signals alone, our method is able to generate 3D-consistent semantic maps from novel camera poses and can be queried at arbitrary 3D points. Notably, NeSF is compatible with any method producing a density field, and its accuracy improves as the quality of the density field improves. Our empirical analysis demonstrates comparable quality to competitive 2D and 3D semantic segmentation baselines on complex, realistically rendered synthetic scenes. Our method is the first to offer truly dense 3D scene segmentations requiring only 2D supervision for training, and does not require any semantic input for inference on novel scenes. We encourage the readers to visit the project website.

Citations (116)

Summary

  • The paper presents NeSF, which learns 3D semantic segmentation solely from posed 2D images and maps, achieving robust generalization to unseen scenes.
  • The paper leverages implicit neural representations and refined NeRF density fields to transform 2D semantic supervision into detailed 3D segmentations.
  • The paper introduces three synthetic datasets with over 1,000 scenes to benchmark its performance against established 2D and 3D segmentation methods.

Overview

Google Research introduces Neural Semantic Fields (NeSF), a method for semantic segmentation of 3D scenes from posed RGB images alone. The approach builds on implicit neural scene representations, in which 3D structure is captured by point-wise functions that can be queried at arbitrary 3D coordinates. NeSF uses posed 2D semantic maps to train a 3D semantic segmentation model, which can then produce 3D-consistent semantic maps from novel viewpoints.
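The "point-wise function" idea can be illustrated with a toy sketch: a semantic field is just a function from a 3D coordinate to per-class logits, queryable at any point in space. The MLP below is a hypothetical stand-in with random weights, not the paper's architecture; in NeSF the field is produced by a network trained across many scenes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical point-wise semantic field: a tiny MLP mapping a 3D
# coordinate to per-class logits. Weights here are random placeholders.
W1 = rng.standard_normal((3, 64))
b1 = np.zeros(64)
W2 = rng.standard_normal((64, 13))  # e.g. 13 classes, as in ToyBox13
b2 = np.zeros(13)

def semantic_field(xyz):
    """Query class logits at arbitrary 3D points: (N, 3) -> (N, 13)."""
    h = np.maximum(xyz @ W1 + b1, 0.0)  # ReLU hidden layer
    return h @ W2 + b2

points = rng.uniform(-1.0, 1.0, size=(5, 3))  # any 3D locations
logits = semantic_field(points)
print(logits.shape)  # (5, 13)
```

Because the field is a function rather than a voxel grid, predictions are not tied to any fixed resolution or camera pose.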

Methodology

The methodology involves creating a 3D semantic field by training a neural network on posed 2D images and their corresponding semantic maps. For each scene, a density field is first recovered (e.g., with NeRF); a segmentation network then maps this density field to a semantic field, supervised by rendering the semantic predictions into the 2D views and comparing them against the posed semantic maps. Despite training on 2D signals alone, the network generalizes to new scenes, producing semantic segmentations in both 2D and 3D. NeSF's accuracy is tied to the quality of the underlying density field: improvements in the density field translate directly into better segmentations. The ability to generalize to unseen scenes from sparse 2D semantic supervision could significantly broaden the deployment of 3D vision applications.
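The 2D-supervision step can be sketched as standard volume rendering applied to class logits instead of colors: composite per-sample logits along a ray into a single 2D prediction, then apply a cross-entropy loss against the ground-truth semantic map pixel. This is a minimal NumPy sketch under assumed shapes and function names, not the paper's implementation.

```python
import numpy as np

def composite_semantics(sigmas, logits, deltas):
    """Alpha-composite per-sample class logits along one camera ray.

    sigmas: (S,) densities from a pretrained density field (e.g. NeRF).
    logits: (S, C) class logits from the semantic field at the samples.
    deltas: (S,) distances between consecutive samples.
    Returns (C,) rendered logits, via standard volume-rendering weights.
    """
    alphas = 1.0 - np.exp(-sigmas * deltas)  # per-sample opacity
    trans = np.cumprod(np.concatenate(([1.0], 1.0 - alphas[:-1])))  # transmittance
    weights = alphas * trans                 # (S,)
    return (weights[:, None] * logits).sum(axis=0)

def cross_entropy(rendered_logits, label):
    """2D supervision: softmax cross-entropy against one semantic-map pixel."""
    z = rendered_logits - rendered_logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[label]

# Toy example: 8 samples along a ray, 4 classes.
rng = np.random.default_rng(1)
sigmas = rng.uniform(0.0, 2.0, size=8)
logits = rng.standard_normal((8, 4))
deltas = np.full(8, 0.1)
loss = cross_entropy(composite_semantics(sigmas, logits, deltas), label=2)
print(loss >= 0.0)  # True
```

Because the density field is fixed during this step, gradients flow only into the semantic field, which is what lets 2D labels supervise a 3D representation.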

Empirical Evaluation

NeSF's empirical robustness was tested on three custom-built synthetic datasets of varying complexity: KLEVR, ToyBox5, and ToyBox13. The model performs comparably to established 2D and 3D semantic segmentation baselines under controlled, synthetic conditions. Its distinguishing strength is producing truly dense 3D segmentations of novel scenes when only 2D supervision is available during training.

Contributions and Future Work

NeSF's main contribution is enabling generalizable semantic segmentation of novel scenes without requiring any semantic input at inference time. The method is a step toward more comprehensive scene understanding from 2D data alone. Additionally, three novel synthetic datasets comprising over 1,000 scenes have been introduced for evaluating both 2D and 3D semantic segmentation, enabling tests of generalizability across complex environments.

While NeSF sets a foundational benchmark, it struggles with small objects and thin structures, owing to the limited spatial resolution of its geometric reasoning and the absence of direct 2D visual cues at inference time. Future work may integrate 2D feature projection methods to refine segmentation accuracy and exploit spatiotemporal sparsity for efficiency gains.