
OpenScene: 3D Scene Understanding with Open Vocabularies

(arXiv:2211.15654)
Published Nov 28, 2022 in cs.CV

Abstract

Traditional 3D scene understanding approaches rely on labeled 3D datasets to train a model for a single task with supervision. We propose OpenScene, an alternative approach where a model predicts dense features for 3D scene points that are co-embedded with text and image pixels in CLIP feature space. This zero-shot approach enables task-agnostic training and open-vocabulary queries. For example, to perform SOTA zero-shot 3D semantic segmentation it first infers CLIP features for every 3D point and later classifies them based on similarities to embeddings of arbitrary class labels. More interestingly, it enables a suite of open-vocabulary scene understanding applications that have never been done before. For example, it allows a user to enter an arbitrary text query and then see a heat map indicating which parts of a scene match. Our approach is effective at identifying objects, materials, affordances, activities, and room types in complex 3D scenes, all using a single model trained without any labeled 3D data.

Figure: Open-vocabulary 3D scene exploration highlighting object properties, surface materials, and activity sites based on text queries.

Overview

  • OpenScene introduces a zero-shot approach for 3D scene understanding using large-scale multi-modal pre-trained models to enable open-vocabulary queries on 3D scenes.

  • The method co-embeds 3D point features with text and image pixels, leveraging 2D and 3D feature fusion to perform various tasks like semantic segmentation and object retrieval without labeled 3D data.

  • Extensive experiments demonstrate the method's effectiveness: it delivers strong zero-shot 3D semantic segmentation, outperforms prior zero-shot methods, and remains robust across a range of applications and datasets.


This paper introduces OpenScene, a novel zero-shot approach for 3D scene understanding that leverages large-scale multi-modal pre-trained models like CLIP to achieve open-vocabulary queries on 3D scenes. The core idea involves predicting dense features for 3D points that exist in a shared feature space with text and images, allowing for flexible and generalized scene understanding without the need for labeled 3D data. This method enables a range of applications in 3D semantic segmentation, object retrieval, scene exploration, and more.

Methodology

The core innovation in OpenScene is the co-embedding of 3D point features with text and image pixels in the CLIP feature space. This method circumvents traditional supervised learning, which relies on substantial labeled datasets, by utilizing pre-trained image-text models to predict features for 3D points. The following are the key components of the proposed methodology:

2D Feature Extraction and Fusion:

  • The method begins by extracting per-pixel features from 2D images using a pre-trained open-vocabulary segmentation model such as OpenSeg or LSeg.
  • These per-pixel features are then back-projected onto the 3D surface points: each point is projected into every view in which it is visible, and the corresponding pixel features are averaged across views to form a fused feature vector for that point (a sketch follows this list).
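Below is one way this fusion step could look in code. It is a minimal NumPy sketch, not the authors' implementation: the view dictionary keys ('K', 'w2c', 'depth', 'feat'), the occlusion tolerance, and the plain averaging over visible views are assumptions for illustration.

```python
import numpy as np

def fuse_2d_features(points, views, depth_tol=0.05):
    """Average per-pixel 2D features over the views in which each 3D point is visible.

    points: (N, 3) world-space surface points.
    views:  list of dicts with hypothetical keys
            'K' (3, 3) intrinsics, 'w2c' (4, 4) world-to-camera pose,
            'depth' (H, W) depth map, 'feat' (H, W, C) per-pixel OpenSeg/LSeg features.
    """
    n, c = len(points), views[0]["feat"].shape[-1]
    fused = np.zeros((n, c), dtype=np.float32)
    counts = np.zeros(n, dtype=np.float32)
    pts_h = np.concatenate([points, np.ones((n, 1))], axis=1)   # homogeneous coordinates

    for v in views:
        cam = (v["w2c"] @ pts_h.T).T[:, :3]                     # points in camera frame
        z = cam[:, 2]
        uvw = (v["K"] @ cam.T).T
        px = np.round(uvw[:, 0] / np.maximum(z, 1e-8)).astype(np.int64)
        py = np.round(uvw[:, 1] / np.maximum(z, 1e-8)).astype(np.int64)
        H, W = v["depth"].shape
        inside = (z > 0) & (px >= 0) & (px < W) & (py >= 0) & (py < H)
        vis = np.zeros(n, dtype=bool)                           # occlusion test against the depth map
        vis[inside] = np.abs(v["depth"][py[inside], px[inside]] - z[inside]) < depth_tol
        fused[vis] += v["feat"][py[vis], px[vis]]
        counts[vis] += 1

    return fused / np.maximum(counts[:, None], 1.0)             # multi-view average per point
```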

3D Feature Distillation:

  • A 3D network, instantiated as a sparse-convolutional MinkowskiNet, is trained to reproduce the fused 2D features using only the 3D point cloud as input.
  • A cosine similarity loss aligns the distilled 3D feature representations with their fused 2D counterparts (a loss sketch follows this list).
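A minimal PyTorch sketch of this distillation objective is shown below. The `backbone` call signature and the `train_step` helper are hypothetical placeholders; the element taken from the description above is the cosine-similarity loss against fixed 2D targets.

```python
import torch
import torch.nn.functional as F

def distillation_loss(feat_3d, feat_2d_fused):
    """Cosine distillation loss: pull per-point 3D features toward the fused 2D features.

    feat_3d:       (N, C) features predicted by the 3D backbone (e.g. a MinkowskiNet).
    feat_2d_fused: (N, C) multi-view fused 2D features, treated as fixed targets.
    """
    return (1.0 - F.cosine_similarity(feat_3d, feat_2d_fused.detach(), dim=-1)).mean()

# Hypothetical training step: `backbone` stands in for any point-cloud network that
# maps per-point coordinates (and colors) to CLIP-dimensional features.
def train_step(backbone, optimizer, coords, colors, target_feats):
    optimizer.zero_grad()
    pred = backbone(coords, colors)              # (N, C) distilled 3D features
    loss = distillation_loss(pred, target_feats)
    loss.backward()
    optimizer.step()
    return loss.item()
```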

2D-3D Feature Ensemble:

  • During inference, the method ensembles the features from both 2D and 3D domains to form a robust feature representation for each 3D point.
  • For each point, both the fused 2D feature and the distilled 3D feature are compared with the text-prompt embeddings via cosine similarity, and the feature that scores higher is kept (a sketch follows this list).
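The per-point selection could be sketched as follows; the function name and tensor layout are assumptions, but the rule, keeping whichever of the 2D or 3D feature has the higher maximum similarity to the text prompts, matches the description above.

```python
import torch
import torch.nn.functional as F

def ensemble_features(feat_2d, feat_3d, text_emb):
    """Per-point 2D/3D ensemble: keep whichever feature correlates better with the prompts.

    feat_2d:  (N, C) multi-view fused 2D features.
    feat_3d:  (N, C) distilled 3D features.
    text_emb: (K, C) CLIP text embeddings of the query/label prompts.
    """
    f2d = F.normalize(feat_2d, dim=-1)
    f3d = F.normalize(feat_3d, dim=-1)
    txt = F.normalize(text_emb, dim=-1)
    score_2d = (f2d @ txt.T).max(dim=-1).values    # best prompt similarity per point (2D)
    score_3d = (f3d @ txt.T).max(dim=-1).values    # best prompt similarity per point (3D)
    use_2d = (score_2d > score_3d).unsqueeze(-1)   # per-point selection mask
    return torch.where(use_2d, feat_2d, feat_3d)
```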

Open-Vocabulary Inference:

Arbitrary text queries, such as class names, materials, affordances, or free-form descriptions, are encoded with the CLIP text encoder. Each point's feature is then compared with these text embeddings via cosine similarity: for semantic segmentation, every point is assigned the label with the highest similarity, while for open-ended queries the similarities themselves are visualized as a heat map over the scene.
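A minimal sketch of zero-shot segmentation with the OpenAI `clip` package follows. The CLIP variant ("ViT-L/14@336px") and the prompt template are assumptions for illustration and may differ from the paper's exact prompt engineering; `point_feats` is assumed to be a tensor of ensembled per-point features already on the target device.

```python
import clip
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_segment(point_feats, class_names, device="cuda"):
    """Assign each point the class whose CLIP text embedding best matches its feature.

    point_feats: (N, C) ensembled per-point features in CLIP space (torch tensor on `device`).
    class_names: list of arbitrary label strings, e.g. ["chair", "sofa", "bookshelf"].
    """
    model, _ = clip.load("ViT-L/14@336px", device=device)     # only the text encoder is used
    tokens = clip.tokenize([f"a {c} in a scene" for c in class_names]).to(device)
    text_emb = model.encode_text(tokens).float()              # (K, C) label embeddings

    sims = F.normalize(point_feats, dim=-1) @ F.normalize(text_emb, dim=-1).T   # (N, K)
    return sims.argmax(dim=-1)                                # per-point predicted class index
```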

Experimental Results

The authors present extensive experiments showcasing the effectiveness of the method across several benchmarks, including ScanNet, Matterport3D, and nuScenes.

Zero-Shot 3D Semantic Segmentation:

OpenScene achieves notable gains over other zero-shot methods such as MSeg Voting and 3DGenZ. For instance, on the ScanNet dataset, OpenScene yields an mIoU of 62.8% on unseen classes, significantly higher than the 7.7% achieved by 3DGenZ.

Closed-Set 3D Semantic Segmentation:

Although it does not match current fully supervised state-of-the-art methods, OpenScene is competitive with supervised methods from a few years ago. The 2D-3D ensemble features are particularly effective, leaving only a small performance gap on diverse datasets such as Matterport3D.

Handling Long-Tailed Classes:

The method remains robust as the number of classes increases, especially for less common and fine-grained categories: its advantage grows as the label set scales up, and it outperforms fully supervised baselines on long-tailed class distributions.

Application Scenarios

OpenScene also demonstrates versatile applications beyond standard semantic segmentation:

  1. 3D Object Retrieval: The method effectively retrieves specific object instances within a 3D scene based solely on text queries. The retrieval accuracy suggests high precision in identifying objects, even those with challenging or specific descriptors.

  2. Image-Based 3D Object Detection: By querying the 3D scene using example images, the method successfully retrieves relevant 3D points matching the input images, thereby validating the co-embedding's robustness.

  3. Abstract Scene Exploration: The flexibility of the approach is highlighted through queries about material properties, activities, and abstract concepts. Queries such as "comfy," "metal," and "play" yield semantically meaningful segmentations that traditional methods would struggle to achieve without extensive retraining (a heat-map sketch follows this list).
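As an illustration of both text- and image-driven exploration, the sketch below produces a per-point similarity heat map for an arbitrary query. The CLIP variant and the min-max normalization for display are assumptions rather than the paper's exact visualization pipeline.

```python
import clip
import torch
import torch.nn.functional as F

@torch.no_grad()
def query_heatmap(point_feats, query, device="cuda"):
    """Per-point similarity heat map for an arbitrary text or image query.

    point_feats: (N, C) ensembled per-point features in CLIP space (torch tensor on `device`).
    query:       a free-form string (e.g. "comfy", "somewhere to play") or a PIL image.
    Returns (N,) scores scaled to [0, 1] for colormap visualization.
    """
    model, preprocess = clip.load("ViT-L/14@336px", device=device)   # assumed CLIP variant
    if isinstance(query, str):
        emb = model.encode_text(clip.tokenize([query]).to(device))
    else:                                        # query given as an example image
        emb = model.encode_image(preprocess(query).unsqueeze(0).to(device))
    sims = F.normalize(point_feats, dim=-1) @ F.normalize(emb.float(), dim=-1).T
    sims = sims.squeeze(-1)
    return (sims - sims.min()) / (sims.max() - sims.min() + 1e-8)    # normalize for display
```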

Implications and Future Work

The implications of OpenScene are both practical and theoretical. Practically, it democratizes 3D scene understanding by reducing the dependency on labeled 3D datasets, making it easier to adapt to new environments and tasks. Theoretically, it introduces a scalable approach for applying large pre-trained models to 3D data, bridging the gap between 2D and 3D semantic understanding.

Future developments could explore enhancing the fusion of 2D and 3D features, optimizing performance in real-time applications, and expanding the methodology to more complex tasks such as dynamic scene understanding and task-oriented object manipulation in robotics.

In conclusion, OpenScene represents a promising direction in 3D scene understanding, leveraging the power of large-scale pre-trained models to perform a wide array of tasks without the need for extensive 3D annotations. The potential of such approaches to generalize across domains while maintaining high performance indicates a significant step forward in AI-driven scene understanding.
