
SceneVerse: Scaling 3D Vision-Language Learning for Grounded Scene Understanding

(arXiv:2401.09340)
Published Jan 17, 2024 in cs.CV, cs.AI, cs.CL, cs.LG, and cs.RO

Abstract

3D vision-language grounding, which focuses on aligning language with the 3D physical environment, stands as a cornerstone in the development of embodied agents. In comparison to recent advancements in the 2D domain, grounding language in 3D scenes faces several significant challenges: (i) the inherent complexity of 3D scenes due to the diverse object configurations, their rich attributes, and intricate relationships; (ii) the scarcity of paired 3D vision-language data to support grounded learning; and (iii) the absence of a unified learning framework to distill knowledge from grounded 3D data. In this work, we aim to address these three major challenges in 3D vision-language by examining the potential of systematically upscaling 3D vision-language learning in indoor environments. We introduce the first million-scale 3D vision-language dataset, SceneVerse, encompassing about 68K 3D indoor scenes and comprising 2.5M vision-language pairs derived from both human annotations and our scalable scene-graph-based generation approach. We demonstrate that this scaling allows for a unified pre-training framework, Grounded Pre-training for Scenes (GPS), for 3D vision-language learning. Through extensive experiments, we showcase the effectiveness of GPS by achieving state-of-the-art performance on all existing 3D visual grounding benchmarks. The vast potential of SceneVerse and GPS is unveiled through zero-shot transfer experiments in the challenging 3D vision-language tasks. Project website: https://scene-verse.github.io.

Figure: Model performance improves with data scaling in pre-training and zero-shot settings on ScanRefer and SceneVerse-val.

Overview

  • SceneVerse is a new dataset for 3D vision-language learning, featuring 68,406 indoor scenes and 2.5 million scene-language pairs.

  • Grounded Pre-training for Scenes (GPS) is a unified pre-training framework that uses multi-level contrastive learning to align language with 3D physical environments.

  • 3D vision-language grounding is inherently complex due to the detailed nature of 3D scenes and the scarcity of appropriate datasets.

  • GPS achieves state-of-the-art performance without relying on complex auxiliary modules and can generalize to new tasks without prior exposure (zero-shot generalization).

  • Experiments show that performance improves as training data scales, and the ability to transfer learning to unseen situations marks GPS and SceneVerse as valuable for future embodied AI research.

Overview of SceneVerse and GPS Framework

Embodied AI, which combines 3D spatial understanding with natural language processing, is critical for the development of robots and systems that can navigate and interact in real-world spaces. However, aligning language with 3D physical environments presents significant hurdles due to the complex nature of 3D data and the scarcity of structured learning datasets. Addressing these challenges, the SceneVerse dataset marks a substantial step forward in 3D vision-language learning, and with it comes a new pre-training framework known as Grounded Pre-training for Scenes (GPS).

The Problem with 3D Vision-Language Grounding

Integrating language with 3D environments is more challenging than in 2D due to inherent complexities. The rich attributes of objects, their diverse configurations, and the intricate relationships they share heavily complicate scene understanding. Moreover, the paired 3D vision-language data required for training is scarce, and prior to this work no unified learning framework existed to distill knowledge from grounded 3D data.

Introducing SceneVerse and GPS

SceneVerse is the first million-scale dataset for 3D vision-language learning, featuring 68,406 indoor scenes and 2.5 million scene-language pairs. This scale is achieved by combining human annotations with a scalable pipeline that automatically generates scene descriptions from scene graphs using LLMs.
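
To make the generation idea concrete, here is a minimal sketch of turning scene-graph triplets into draft descriptions that an LLM could then rewrite into more natural language. The graph structure, relation names, and template are illustrative assumptions, not the paper's actual schema.

```python
# Sketch: template-based caption drafts from a scene graph, in the spirit
# of SceneVerse's scene-graph-based pipeline. All names here are
# hypothetical placeholders.

# A scene graph as (subject, relation, object) triplets over labeled objects.
scene_graph = [
    ("chair_1", "next to", "table_1"),
    ("lamp_1", "on top of", "table_1"),
    ("sofa_1", "facing", "tv_1"),
]

# A simple template; an LLM could rewrite these drafts into varied,
# fluent referential descriptions.
TEMPLATE = "The {subj} is {rel} the {obj}."

def label(node: str) -> str:
    """Strip the instance suffix, e.g. 'chair_1' -> 'chair'."""
    return node.rsplit("_", 1)[0]

def generate_captions(graph):
    """Turn each triplet into a draft referential description."""
    return [
        TEMPLATE.format(subj=label(s), rel=r, obj=label(o))
        for s, r, o in graph
    ]

for caption in generate_captions(scene_graph):
    print(caption)
# The chair is next to the table.
# The lamp is on top of the table.
# The sofa is facing the tv.
```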

Exploring GPS Capabilities through Extensive Experiments

Alongside SceneVerse, the authors introduce GPS, a model trained with contrastive learning at multiple levels of scene-text alignment. Unlike other models, GPS does not rely on complex auxiliary structures; it simplifies the training process yet achieves state-of-the-art performance. The model's capacity for zero-shot generalization across varied 3D vision-language tasks indicates the effectiveness of the underlying data and framework.
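
To illustrate the core objective, here is a minimal sketch of the symmetric contrastive (InfoNCE) loss that aligns scene and text embeddings, which GPS applies at multiple alignment levels. The embedding size, temperature, and random features standing in for encoder outputs are placeholder assumptions, not the paper's exact configuration.

```python
# Sketch: CLIP-style symmetric contrastive loss over matched
# (scene, text) pairs. Encoders are omitted; random tensors stand in
# for their pooled outputs.
import torch
import torch.nn.functional as F

def contrastive_loss(scene_emb: torch.Tensor,
                     text_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of matched (scene, text) pairs."""
    scene_emb = F.normalize(scene_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = scene_emb @ text_emb.t() / temperature   # (B, B) similarities
    targets = torch.arange(logits.size(0))            # diagonal = true pairs
    loss_s2t = F.cross_entropy(logits, targets)       # scene -> text
    loss_t2s = F.cross_entropy(logits.t(), targets)   # text -> scene
    return (loss_s2t + loss_t2s) / 2

# Toy usage with random features standing in for encoder outputs.
scene_emb = torch.randn(8, 256)  # e.g. pooled point-cloud features
text_emb = torch.randn(8, 256)   # e.g. pooled language features
print(contrastive_loss(scene_emb, text_emb).item())
```

The symmetric form penalizes mismatches in both directions, so the model learns both to retrieve the right description for a scene and the right scene for a description.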

Potential Revealed Through Data Scaling and Model Generalization

A series of experiments reveals that GPS improves consistently as training data scales up, indicating a strong correlation between data volume and model performance. Additionally, GPS can adapt knowledge learned from SceneVerse to unseen scenarios, known as zero-shot transfer, which highlights the model's potential and the dataset's robustness. This underscores the value of SceneVerse as a rich training ground for future research in 3D vision-language tasks.
