Emergent Mind

Contrastive Gaussian Clustering: Weakly Supervised 3D Scene Segmentation

(2404.12784)
Published Apr 19, 2024 in cs.CV and cs.LG

Abstract

We introduce Contrastive Gaussian Clustering, a novel approach capable of providing segmentation masks from any viewpoint and of enabling 3D segmentation of the scene. Recent works in novel-view synthesis have shown how to model the appearance of a scene via a cloud of 3D Gaussians, and how to generate accurate images from a given viewpoint by projecting the Gaussians onto it and $\alpha$-blending their colors. Following this example, we train a model that also includes a segmentation feature vector for each Gaussian. These vectors can then be used for 3D scene segmentation, by clustering Gaussians according to their feature vectors; and to generate 2D segmentation masks, by projecting the Gaussians onto a plane and $\alpha$-blending their segmentation features. Using a combination of contrastive learning and spatial regularization, our method can be trained on inconsistent 2D segmentation masks, and still learn to generate segmentation masks consistent across all views. Moreover, the resulting model is extremely accurate, improving the IoU accuracy of the predicted masks by $+8\%$ over the state of the art. Code and trained models will be released soon.
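The rendering step the abstract describes — projecting depth-sorted Gaussians onto a pixel and $\alpha$-blending their per-Gaussian segmentation features exactly as 3DGS blends colors — can be sketched for a single camera ray as follows. This is a minimal illustration with hypothetical helper names; the actual method rasterizes all Gaussians in parallel on the GPU.

```python
import numpy as np

def alpha_blend_features(features, alphas):
    """Front-to-back alpha compositing of per-Gaussian segmentation
    features along one camera ray (illustrative sketch, not the
    paper's rasterizer).

    features: (N, D) feature vector of each depth-sorted Gaussian
    alphas:   (N,)   opacity contribution of each Gaussian at this pixel
    """
    out = np.zeros(features.shape[1])
    transmittance = 1.0  # fraction of light not yet absorbed
    for f, a in zip(features, alphas):
        out += transmittance * a * f
        transmittance *= (1.0 - a)
    return out

# Two Gaussians on a ray: the nearer, more opaque one dominates the pixel.
feats = np.array([[1.0, 0.0], [0.0, 1.0]])
alphas = np.array([0.7, 0.9])
pixel_feature = alpha_blend_features(feats, alphas)  # -> [0.7, 0.27]
```

The same compositing weights are used for color in standard 3DGS; swapping colors for feature vectors is what lets 2D supervision flow back to the 3D Gaussians.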

Contrastive Gaussian Clustering processes images and masks for various visual and segmentation tasks.

Overview

  • The paper introduces a new method called Contrastive Gaussian Clustering which enhances 3D Gaussian Splatting for effective 3D scene segmentation.

  • This method embeds a learnable 3D feature field in each Gaussian, optimized via contrastive learning to handle inconsistent 2D segmentation masks.

  • The model achieves an 8% improvement in IoU accuracy over existing state-of-the-art models by using contrastive clustering.

  • Potential practical applications include autonomous driving and augmented reality, with future research directions focusing on reducing computational demands and enhancing semantic context.

Exploring Contrastive Gaussian Clustering for 3D Scene Segmentation

Introduction

Recent advances in 3D scene segmentation involve models that integrate varying forms of input data, notably lifting 2D image understanding into 3D space through representations that capture both geometry and semantics. Among these, 3D Gaussian Splatting (3DGS) has emerged as a powerful approach thanks to its rendering quality and computational efficiency. The paper introduces a method named Contrastive Gaussian Clustering that extends 3DGS to scene segmentation tasks while ensuring consistency across views, without requiring consistent input segmentation masks.

Methodology

The core innovation revolves around embedding a learnable 3D feature field within each Gaussian in a 3DGS model. This feature represents instance segmentation details, which are learned via a contrastive learning approach adapted to handle inconsistent 2D segmentation masks. The process involves:

  1. 3DGS Parameterization: The scene is represented by a collection of 3D Gaussians, each described by parameters related to their position, covariance, opacity, and additional segmentation features.
  2. Feature Learning with Contrastive Loss: Contrastive learning is employed to align the 3D features with 2D segmentation masks, discretizing features into clusters that enhance segmentation consistency across views.
  3. Spatial Regularization: To improve model robustness and provide contextual cohesion, a spatial regularization is applied. It ensures that nearby Gaussians in the 3D space share similar feature vectors, promoting smoother and more contiguous segmentation boundaries.
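Steps 2 and 3 above can be sketched in a simplified form: a contrastive loss over rendered pixel features that pulls together pixels sharing a 2D mask id and pushes apart the rest, plus a k-nearest-neighbor smoothness term over the Gaussians' 3D positions. All function names and the exact loss form here are illustrative assumptions; the paper's implementation may differ in detail.

```python
import numpy as np

def contrastive_mask_loss(pixel_feats, mask_ids, temperature=0.1):
    """Simplified supervised contrastive loss: rendered pixel features
    with the same mask id are positives, all others negatives.

    pixel_feats: (P, D) rendered segmentation features
    mask_ids:    (P,)   integer mask labels from one (possibly
                        view-inconsistent) 2D segmentation
    """
    # Normalize so the dot product is cosine similarity.
    f = pixel_feats / np.linalg.norm(pixel_feats, axis=1, keepdims=True)
    sim = f @ f.T / temperature
    same = (mask_ids[:, None] == mask_ids[None, :]).astype(float)
    np.fill_diagonal(same, 0.0)          # a pixel is not its own positive
    np.fill_diagonal(sim, -np.inf)       # exclude self from the softmax
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    np.fill_diagonal(log_prob, 0.0)      # avoid 0 * (-inf) below
    pos_counts = same.sum(axis=1)
    loss = -(same * log_prob).sum(axis=1) / np.maximum(pos_counts, 1.0)
    return loss.mean()

def knn_smoothness(positions, feats, k=3):
    """Spatial regularization sketch: each Gaussian's feature should
    match those of its k nearest neighbors in 3D space."""
    d2 = ((positions[:, None, :] - positions[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)                 # ignore self-distance
    nn = np.argsort(d2, axis=1)[:, :k]           # (N, k) neighbor indices
    diff = feats[:, None, :] - feats[nn]         # (N, k, D)
    return (diff ** 2).sum(-1).mean()
```

As a sanity check, labels consistent with the feature geometry should give a lower contrastive loss than shuffled ones, and identical features give zero smoothness penalty; because only relative similarities within each view enter the loss, the mask ids need not agree across views, which is what lets training tolerate inconsistent 2D masks.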

Results

The technique was rigorously validated against classical and contemporary benchmarks on diverse datasets. Notably, the use of contrastive clustering aids the model in achieving an 8% improvement in IoU accuracy over the state-of-the-art models. These results underscore the method's efficacy, particularly in handling complex, real-world scenes with varying object arrangements and occlusions.

Theoretical and Practical Implications

From a theoretical perspective, the infusion of contrastive learning within a 3D Gaussian representation framework for scene segmentation uncovers new avenues in semantically interpretable 3D scene analyses. Practically, this research could significantly enhance automated scene understanding in critical applications such as autonomous driving, augmented reality, and robotic navigation, where precise and reliable 3D segmentation is pivotal.

Future Directions

The introduced method, while robust, opens several research pathways. One potential area of exploration could be the reduction of computational overhead introduced by the segmentation feature vectors. Additionally, integrating richer semantic context or leveraging advancements in unsupervised learning could refine the segmentation outputs further, especially in dynamically changing environments. Another avenue could be exploring the fusion of linguistic models to provide semantic labels for the segmented clusters, enabling even more detailed scene descriptions and interactions.

Conclusion

Contrastive Gaussian Clustering represents a significant step forward in 3D scene segmentation. By effectively learning from inconsistent segmentation labels across views and providing high-quality segmentation output, it sets a new benchmark for future research in the domain of scene understanding and computer vision at large.
