Joint 2D-3D-Semantic Data for Indoor Scene Understanding

Published 3 Feb 2017 in cs.CV and cs.RO | (1702.01105v2)

Abstract: We present a dataset of large-scale indoor spaces that provides a variety of mutually registered modalities from 2D, 2.5D and 3D domains, with instance-level semantic and geometric annotations. The dataset covers over 6,000 m² and contains over 70,000 RGB images, along with the corresponding depths, surface normals, semantic annotations, global XYZ images (all in forms of both regular and 360° equirectangular images) as well as camera information. It also includes registered raw and semantically annotated 3D meshes and point clouds. The dataset enables development of joint and cross-modal learning models and potentially unsupervised approaches utilizing the regularities present in large-scale indoor spaces. The dataset is available here: http://3Dsemantics.stanford.edu/

Summary

The paper introduces a novel multi-modal dataset integrating 2D, 2.5D, and 3D semantic data for enhanced indoor scene analysis.
The dataset covers over 6,000 m² with 70,000+ RGB images, detailed 3D meshes, and consistent annotations across modalities for various vision tasks.
A baseline for 3D object detection, which uses a CRF to enforce contextual consistency, reaches a mean average precision of 49.93% across all classes.

The paper "Joint 2D-3D-Semantic Data for Indoor Scene Understanding" by Armeni et al. introduces a dataset for indoor scene understanding that combines mutually registered 2D, 2.5D, and 3D data with instance-level semantic annotations. Compared with earlier indoor datasets, it offers more registered modalities and a larger number of images.

Dataset Overview

The dataset spans over 6,000 m² and contains over 70,000 RGB images with corresponding depth maps, surface normals, semantic annotations, and global XYZ images, each available as both regular and 360° equirectangular images. It also provides raw and semantically annotated 3D meshes and point clouds. The annotations cover 13 object classes and 11 scene categories and are consistent across all modalities. This makes the collection suitable for tasks such as object detection, segmentation, scene reconstruction, and depth estimation.
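To make the equirectangular format concrete, the sketch below maps a pixel of a 360° panorama to a viewing direction and back. It is a minimal illustration under an assumed convention (longitude 0 at the image center pointing along +x, z up, pixel centers at half-integer offsets); the function names are hypothetical, and the dataset's own axis conventions may differ and should be taken from its camera information.

```python
import math


def equirect_pixel_to_ray(u, v, width, height):
    """Map pixel (u, v) of an equirectangular image to a unit direction.

    Assumed convention (not taken from the dataset): longitude runs from
    -pi at the left edge to +pi at the right edge, latitude runs from
    +pi/2 at the top edge to -pi/2 at the bottom edge, and z is up.
    """
    lon = ((u + 0.5) / width) * 2.0 * math.pi - math.pi
    lat = math.pi / 2.0 - ((v + 0.5) / height) * math.pi
    x = math.cos(lat) * math.cos(lon)
    y = math.cos(lat) * math.sin(lon)
    z = math.sin(lat)
    return (x, y, z)


def ray_to_equirect_pixel(direction, width, height):
    """Inverse mapping: a (not necessarily unit) direction to pixel (u, v)."""
    x, y, z = direction
    norm = math.sqrt(x * x + y * y + z * z)
    lon = math.atan2(y, x)
    lat = math.asin(z / norm)
    u = (lon + math.pi) / (2.0 * math.pi) * width - 0.5
    v = (math.pi / 2.0 - lat) / math.pi * height - 0.5
    return (u, v)
```

Under these assumptions, every pixel of a panorama corresponds to exactly one direction on the unit sphere, which is what allows regular perspective views to be resampled from a single equirectangular image.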

Data Collection and Processing

The data was collected with Matterport cameras, which use structured-light sensors to capture RGB and depth images during a 360° rotation at each scan location. The resulting high-density 3D reconstructions are further processed to produce additional modalities such as surface normals and equirectangular images. Annotation is performed on the 3D point clouds, and the labels are then projected onto the 3D meshes and 2D images, which keeps them consistent across all formats. Annotating in 3D also uses geometric context to make the labels more complete and accurate, for example when handling occlusions or supporting amodal detection.
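The idea of transferring 3D annotations to 2D images can be sketched as a point projection with a depth test. This is a simplified illustration, not the authors' pipeline: it assumes a pinhole camera with hypothetical intrinsics `(fx, fy, cx, cy)` and a world-to-camera pose `(R, t)`, and it splats individual points rather than rendering the annotated mesh.

```python
def project_labels(points, labels, intrinsics, R, t, width, height, background=-1):
    """Project labeled 3D points into a 2D label image (nearest point wins).

    points:     list of (X, Y, Z) world coordinates
    labels:     list of integer instance or class ids, one per point
    intrinsics: (fx, fy, cx, cy) of an assumed pinhole camera
    R, t:       3x3 rotation (nested lists) and translation, world -> camera
    """
    fx, fy, cx, cy = intrinsics
    label_img = [[background] * width for _ in range(height)]
    depth_img = [[float("inf")] * width for _ in range(height)]

    for (X, Y, Z), lab in zip(points, labels):
        # Transform the point from world to camera coordinates.
        xc = R[0][0] * X + R[0][1] * Y + R[0][2] * Z + t[0]
        yc = R[1][0] * X + R[1][1] * Y + R[1][2] * Z + t[1]
        zc = R[2][0] * X + R[2][1] * Y + R[2][2] * Z + t[2]
        if zc <= 0:
            continue  # point is behind the camera

        # Perspective projection to the nearest pixel.
        u = int(round(fx * xc / zc + cx))
        v = int(round(fy * yc / zc + cy))

        # Z-buffer test: keep the label of the closest point per pixel,
        # so occluded surfaces do not overwrite visible ones.
        if 0 <= u < width and 0 <= v < height and zc < depth_img[v][u]:
            depth_img[v][u] = zc
            label_img[v][u] = lab

    return label_img
```

Because every image is labeled from the same 3D annotation, an object keeps the same identifier in every view in which it appears.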

Comparison with Existing Datasets

The dataset stands out for its scale, diversity, and number of modalities. Compared with prominent datasets such as NYU Depth v2, SUN RGB-D, and SceneNN, it includes more registered modalities and a much larger number of images. The equirectangular images and the ability to generate new images from the 3D mesh models further widen the range of data that can be synthesized for machine learning tasks.

Baseline Results and Evaluation

The authors present baseline results for a 3D object detection task on the dataset. The approach parses the large-scale data hierarchically, then applies supervised learning and enforces contextual consistency with a Conditional Random Field (CRF). The full model achieves a mean average precision (mAP) of 49.93% across all classes.
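For readers unfamiliar with the metric, the sketch below shows one common way to compute average precision (AP) for a class and average it into mAP. The matching rule that decides whether a detection counts as a true positive (for example, an overlap threshold) and the exact interpolation used in the paper are not stated here, so both are assumptions; the function names are hypothetical.

```python
def average_precision(scores, is_true_positive, num_ground_truth):
    """Area under the precision-recall curve for one class.

    scores:            confidence of each detection
    is_true_positive:  whether each detection matched a ground-truth object
                       (the matching rule is assumed to be applied upstream)
    num_ground_truth:  number of ground-truth objects of this class
    """
    if num_ground_truth == 0:
        return 0.0

    # Rank detections from most to least confident.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

    tp = 0
    fp = 0
    recalls = []
    precisions = []
    for i in order:
        if is_true_positive[i]:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_ground_truth)
        precisions.append(tp / (tp + fp))

    # Replace each precision by the maximum precision at any higher recall.
    for k in range(len(precisions) - 2, -1, -1):
        precisions[k] = max(precisions[k], precisions[k + 1])

    # Sum rectangle areas under the resulting step curve.
    ap = 0.0
    prev_recall = 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


def mean_average_precision(per_class_ap):
    """Unweighted mean of per-class AP values (dict: class name -> AP)."""
    return sum(per_class_ap.values()) / len(per_class_ap)
```

With 13 object classes, the reported mAP would correspond to averaging 13 such per-class values, assuming an unweighted mean.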

Implications and Future Directions

The dataset supports the development of joint and cross-modal learning models. Combining registered data types gives learning algorithms richer context, which may lead to more robust and accurate models for indoor scene understanding. Future research could explore unsupervised approaches that exploit the regularities of large-scale indoor spaces, as well as the synthesis of new modalities to augment the dataset.

Consistent annotations across data formats also open avenues for research in transfer learning and domain adaptation, where a model trained on one modality is adapted to perform tasks on another. This is particularly useful when acquiring labeled data is expensive or impractical.

In summary, the Joint 2D-3D-Semantic dataset presented by Armeni et al. is a valuable resource for indoor scene understanding. Its scale and consistent annotations support the development of more capable models and make it possible to study how different data modalities interact in machine learning.
