
SceneVerse: Scaling 3D Vision-Language Learning for Grounded Scene Understanding

(arXiv:2401.09340)
Published Jan 17, 2024 in cs.CV, cs.AI, cs.CL, cs.LG, and cs.RO

Abstract

3D vision-language grounding, which focuses on aligning language with the 3D physical environment, stands as a cornerstone in the development of embodied agents. In comparison to recent advancements in the 2D domain, grounding language in 3D scenes faces several significant challenges: (i) the inherent complexity of 3D scenes due to the diverse object configurations, their rich attributes, and intricate relationships; (ii) the scarcity of paired 3D vision-language data to support grounded learning; and (iii) the absence of a unified learning framework to distill knowledge from grounded 3D data. In this work, we aim to address these three major challenges in 3D vision-language by examining the potential of systematically upscaling 3D vision-language learning in indoor environments. We introduce the first million-scale 3D vision-language dataset, SceneVerse, encompassing about 68K 3D indoor scenes and comprising 2.5M vision-language pairs derived from both human annotations and our scalable scene-graph-based generation approach. We demonstrate that this scaling allows for a unified pre-training framework, Grounded Pre-training for Scenes (GPS), for 3D vision-language learning. Through extensive experiments, we showcase the effectiveness of GPS by achieving state-of-the-art performance on all existing 3D visual grounding benchmarks. The vast potential of SceneVerse and GPS is unveiled through zero-shot transfer experiments in the challenging 3D vision-language tasks. Project website: https://scene-verse.github.io.

Figure: Model performance improves with data scaling in pre-training and zero-shot settings on ScanRefer and SceneVerse-val.

Overview

  • SceneVerse is a new dataset for 3D vision-language learning, featuring 68,406 indoor scenes and 2.5 million scene-language pairs.

  • Grounded Pre-training for Scenes (GPS) is a unified pre-training framework that uses multi-level contrastive learning to align language with 3D physical environments.

  • 3D vision-language grounding is inherently complex due to the detailed nature of 3D scenes and the scarcity of appropriate datasets.

  • GPS achieves state-of-the-art performance without relying on complex auxiliary modules and can generalize to new tasks without prior exposure (zero-shot generalization).

  • Experiments show that performance improves as training data scales, and the ability to transfer learning to unseen situations marks GPS and SceneVerse as valuable for future embodied AI research.

Overview of SceneVerse and GPS Framework

Embodied AI, which combines 3D spatial understanding with natural language processing, is critical for the development of robots and systems that can navigate and interact in real-world spaces. However, aligning language with 3D physical environments presents significant hurdles due to the complex nature of 3D data and the scarcity of structured learning datasets. Addressing these challenges, the SceneVerse dataset marks a substantial step forward in 3D vision-language learning, and with it comes a new pre-training framework known as Grounded Pre-training for Scenes (GPS).

The Problem with 3D Vision-Language Grounding

Integrating language with 3D environments is more challenging than in 2D due to inherent complexities. The rich attributes of objects, their diverse configurations, and the intricate relationships they share heavily complicate scene understanding. Moreover, the paired 3D vision-language data required for training is scarce, and prior to this work no unified learning framework existed to distill knowledge from grounded 3D data.

Introducing SceneVerse and GPS

SceneVerse is the first million-scale dataset for 3D vision-language learning, featuring 68,406 indoor scenes and 2.5 million scene-language pairs. This scale is achieved by combining human annotations with a scalable pipeline that automatically generates scene descriptions from scene graphs using LLMs.
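
To make the generation idea concrete, here is a minimal sketch of turning scene-graph triplets into draft descriptions that an LLM could then rewrite into more natural language. The graph structure, relation names, and template are illustrative assumptions, not the paper's actual schema.

```python
# Sketch: template-based caption drafts from a scene graph, in the spirit
# of SceneVerse's scene-graph-based pipeline. All names here are
# hypothetical placeholders.

# A scene graph as (subject, relation, object) triplets over labeled objects.
scene_graph = [
    ("chair_1", "next to", "table_1"),
    ("lamp_1", "on top of", "table_1"),
    ("sofa_1", "facing", "tv_1"),
]

# A simple template; an LLM could rewrite these drafts into varied,
# fluent referential descriptions.
TEMPLATE = "The {subj} is {rel} the {obj}."

def label(node: str) -> str:
    """Strip the instance suffix, e.g. 'chair_1' -> 'chair'."""
    return node.rsplit("_", 1)[0]

def generate_captions(graph):
    """Turn each triplet into a draft referential description."""
    return [
        TEMPLATE.format(subj=label(s), rel=r, obj=label(o))
        for s, r, o in graph
    ]

for caption in generate_captions(scene_graph):
    print(caption)
# The chair is next to the table.
# The lamp is on top of the table.
# The sofa is facing the tv.
```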

Exploring GPS Capabilities through Extensive Experiments

Alongside SceneVerse, the authors introduce GPS, a model trained with contrastive learning at multiple levels of scene-text alignment. Unlike other models, GPS does not rely on complex auxiliary structures; it simplifies the training process yet achieves state-of-the-art performance. The model's capacity for zero-shot generalization across varied 3D vision-language tasks indicates the effectiveness of the underlying data and framework.
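
To illustrate the core objective, here is a minimal sketch of the symmetric contrastive (InfoNCE) loss that aligns scene and text embeddings, which GPS applies at multiple alignment levels. The embedding size, temperature, and random features standing in for encoder outputs are placeholder assumptions, not the paper's exact configuration.

```python
# Sketch: CLIP-style symmetric contrastive loss over matched
# (scene, text) pairs. Encoders are omitted; random tensors stand in
# for their pooled outputs.
import torch
import torch.nn.functional as F

def contrastive_loss(scene_emb: torch.Tensor,
                     text_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of matched (scene, text) pairs."""
    scene_emb = F.normalize(scene_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = scene_emb @ text_emb.t() / temperature   # (B, B) similarities
    targets = torch.arange(logits.size(0))            # diagonal = true pairs
    loss_s2t = F.cross_entropy(logits, targets)       # scene -> text
    loss_t2s = F.cross_entropy(logits.t(), targets)   # text -> scene
    return (loss_s2t + loss_t2s) / 2

# Toy usage with random features standing in for encoder outputs.
scene_emb = torch.randn(8, 256)  # e.g. pooled point-cloud features
text_emb = torch.randn(8, 256)   # e.g. pooled language features
print(contrastive_loss(scene_emb, text_emb).item())
```

The symmetric form penalizes mismatches in both directions, so the model learns both to retrieve the right description for a scene and the right scene for a description.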

Potential Revealed Through Data Scaling and Model Generalization

A series of experiments reveals that GPS improves consistently as training data scales up, indicating a strong correlation between data volume and model performance. Additionally, GPS can adapt knowledge learned from SceneVerse to unseen scenarios, known as zero-shot transfer, which highlights the model's potential and the dataset's robustness. This underscores the value of SceneVerse as a rich training ground for future research in 3D vision-language tasks.
