
Abstract

We study open-world 3D scene understanding, a family of tasks that require agents to reason about their 3D environment with an open-set vocabulary and out-of-domain visual inputs - a critical skill for robots to operate in the unstructured 3D world. Towards this end, we propose Semantic Abstraction (SemAbs), a framework that equips 2D Vision-Language Models (VLMs) with new 3D spatial capabilities, while maintaining their zero-shot robustness. We achieve this abstraction using relevancy maps extracted from CLIP, and learn 3D spatial and geometric reasoning skills on top of those abstractions in a semantic-agnostic manner. We demonstrate the usefulness of SemAbs on two open-world 3D scene understanding tasks: 1) completing partially observed objects and 2) localizing hidden objects from language descriptions. Experiments show that SemAbs can generalize to novel vocabulary, materials/lighting, classes, and domains (i.e., real-world scans) from training on limited 3D synthetic data. Code and data are available at https://semantic-abstraction.cs.columbia.edu/

Figure: Applying the SemAbs module for 3D scene understanding, creating relevancy maps and handling unseen semantic labels.

Overview

  • The paper introduces the 'Semantic Abstraction' (SemAbs) framework, which enhances open-world 3D scene understanding by integrating 2D Vision-Language Models (VLMs) with 3D spatial capabilities.

  • SemAbs pairs a semantic-aware wrapper with a semantic-abstracted 3D module: 2D relevancy maps are projected onto 3D point clouds, abstracting away semantic labels so the 3D module can complete objects' spatial representations in a label-agnostic way.

  • The framework is evaluated through tasks like Open-Vocabulary Semantic Scene Completion (OVSSC) and Visually Obscured Object Localization (VOOL), demonstrating robust generalization, high accuracy, and significant efficiency improvements.

Semantic Abstraction: Open-World 3D Scene Understanding from 2D Vision-Language Models

Overview

This paper introduces "Semantic Abstraction" (SemAbs), a framework devised to enhance open-world 3D scene understanding by augmenting 2D Vision-Language Models (VLMs) with 3D spatial capabilities. SemAbs leverages the strength of 2D VLMs, such as CLIP, and extends their application to more complex 3D environments, critical for robotic operations in dynamic, unstructured settings. The core innovation lies in transforming 2D relevancy maps into 3D spatial representations, allowing for efficient and generalized 3D scene understanding.

Methodology

SemAbs integrates two submodules: a semantic-aware wrapper and a semantic-abstracted 3D module. The semantic-aware wrapper uses relevancy maps extracted from a 2D VLM to generate abstracted representations of objects in a given RGB-D image. These relevancy maps are projected onto the 3D point cloud, abstracting away the objects' semantic labels. The semantic-abstracted 3D module then completes the 3D spatial and geometric representation of these objects, covering both object geometry and potential object locations. Because the 3D module never sees the labels themselves, it generalizes well to unseen vocabulary and new domains.
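
The projection of a 2D relevancy map onto the 3D point cloud can be illustrated with standard pinhole back-projection. The following is a minimal sketch, not the authors' implementation: it assumes a per-pixel relevancy map for a single label (e.g., from a CLIP relevancy extractor), an aligned depth image, and known camera intrinsics (fx, fy, cx, cy).

```python
# Minimal sketch (not the paper's code): lifting a 2D relevancy map onto a
# 3D point cloud using the depth image and pinhole camera intrinsics.
# `relevancy` is assumed to be an HxW array in [0, 1] for one object label.
import numpy as np

def lift_relevancy_to_pointcloud(relevancy, depth, fx, fy, cx, cy):
    """Back-project each pixel to 3D and attach its relevancy value."""
    h, w = depth.shape
    us, vs = np.meshgrid(np.arange(w), np.arange(h))     # pixel grid (x, y)
    z = depth                                             # depth in meters
    x = (us - cx) * z / fx                                # pinhole model
    y = (vs - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)  # (H*W, 3) coordinates
    features = relevancy.reshape(-1, 1)                   # per-point relevancy
    valid = points[:, 2] > 0                              # drop missing depth
    return points[valid], features[valid]
```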

Applications and Tasks

SemAbs is evaluated through two primary tasks: Open-Vocabulary Semantic Scene Completion (OVSSC) and Visually Obscured Object Localization (VOOL).

  1. OVSSC: This task involves completing the 3D geometry of partially observed objects in a scene, which requires recognizing objects from a varied and previously unseen set of semantic categories. The approach uses abstracted relevancy maps to fill in the missing geometry of objects in new and varied environments (a sketch of the 3D module's interface follows this list).
  2. VOOL: Here, the framework localizes hidden objects based on language descriptions, which are often nuanced and context-dependent. The model combines spatial embeddings of closed-vocabulary spatial relations with generalized semantic understanding to locate objects that may not be directly visible in the input scene.
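
To make the interface of the semantic-abstracted 3D module concrete, here is a hedged PyTorch sketch. It assumes the scene has been voxelized into a D x H x W grid with one occupancy channel derived from depth and one relevancy channel per query label; the small convolutional stack is purely illustrative and stands in for the paper's actual completion network. The key property, that no semantic labels enter the network, is preserved.

```python
# Illustrative interface sketch of a semantic-abstracted 3D completion module.
# Architecture is a placeholder, not the paper's network.
import torch
import torch.nn as nn

class SemanticAbstracted3DModule(nn.Module):
    def __init__(self, in_channels=2, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(in_channels, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv3d(hidden, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv3d(hidden, 1, 1),   # per-voxel completion/occupancy logit
        )

    def forward(self, occupancy, relevancy):
        # occupancy, relevancy: (B, 1, D, H, W) voxel grids.
        # Semantic labels never enter the network, only their relevancy grids,
        # so the module stays semantic-agnostic.
        return self.net(torch.cat([occupancy, relevancy], dim=1))
```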

Experimental Results

The experiments highlight several important results:

  • Generalization: SemAbs demonstrates robust generalization capabilities, outperforming baseline models in scenarios involving novel rooms, visual properties, synonyms, and object classes. Specifically, it maintains high accuracy even when dealing with previously unseen objects and descriptions.
  • Accuracy: On tasks requiring both 3D spatial reasoning and semantic understanding, SemAbs outperforms several baselines, achieving up to a 2x improvement in IoU across the evaluated generalization settings.
  • Efficiency: The paper proposes a multi-scale relevancy extractor that improves both the speed and accuracy of relevancy-map extraction, reporting roughly a 60x speedup over the baseline relevancy extraction method (a sketch of the multi-scale idea follows this list).
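
The multi-scale idea can be sketched as follows. This is an illustration only, not the paper's extractor: `clip_relevancy(image, text)` is a hypothetical helper that returns a per-pixel relevancy map from a CLIP-style model, and the sketch simply averages relevancy over sliding windows at several scales.

```python
# Illustrative multi-scale relevancy aggregation (assumed helper, not the
# paper's implementation). clip_relevancy(image, text) -> HxW relevancy map.
import numpy as np

def multiscale_relevancy(image, text, clip_relevancy, scales=(1.0, 0.5, 0.25)):
    h, w = image.shape[:2]
    accumulated = np.zeros((h, w), dtype=np.float32)
    counts = np.zeros((h, w), dtype=np.float32)
    for s in scales:
        win_h, win_w = int(h * s), int(w * s)
        # Slide a window of this scale over the image with 50% overlap.
        for top in range(0, h - win_h + 1, max(win_h // 2, 1)):
            for left in range(0, w - win_w + 1, max(win_w // 2, 1)):
                crop = image[top:top + win_h, left:left + win_w]
                rel = clip_relevancy(crop, text)   # relevancy for this crop
                accumulated[top:top + win_h, left:left + win_w] += rel
                counts[top:top + win_h, left:left + win_w] += 1
    return accumulated / np.maximum(counts, 1)   # average over covering windows
```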

Implications and Future Directions

SemAbs provides a robust pathway for enhancing robotic perception in real-world, unstructured environments. By offloading semantic reasoning to pre-trained, high-capacity 2D VLMs and specializing the learned 3D module in spatial and geometric reasoning, the framework makes efficient use of limited 3D training data while remaining robust across domains.

Future research could benefit from exploring more intricate natural language understanding capabilities to incorporate richer spatial descriptions. Additionally, improving the quality and granularity of relevancy maps, potentially by integrating newer and more robust VLMs, could further enhance the system's efficacy.

Conclusion

Semantic Abstraction represents a significant progression in the intersection of 2D VLMs and 3D scene understanding. It leverages the robust visual-semantic reasoning of VLMs for complex 3D spatial tasks, demonstrating strong potential for practical applications in robotics. The framework's ability to generalize across a wide range of scenarios marks a notable advance in open-world 3D scene understanding, laying the groundwork for future innovations in AI-driven robotic perception and interaction.
