Abstract

We introduce HOT3D, a publicly available dataset for egocentric hand and object tracking in 3D. The dataset offers over 833 minutes (more than 3.7M images) of multi-view RGB/monochrome image streams showing 19 subjects interacting with 33 diverse rigid objects, multi-modal signals such as eye gaze and scene point clouds, as well as comprehensive ground-truth annotations including 3D poses of objects, hands, and cameras, and 3D models of hands and objects. In addition to simple pick-up/observe/put-down actions, HOT3D contains scenarios resembling typical actions in a kitchen, office, and living room environment. The dataset is recorded by two head-mounted devices from Meta: Project Aria, a research prototype of lightweight AR/AI glasses, and Quest 3, a production VR headset sold in millions of units. Ground-truth poses were obtained by a professional motion-capture system using small optical markers attached to hands and objects. Hand annotations are provided in the UmeTrack and MANO formats, and objects are represented by 3D meshes with PBR materials obtained by an in-house scanner. We aim to accelerate research on egocentric hand-object interaction by making the HOT3D dataset publicly available and by co-organizing public challenges on the dataset at ECCV 2024. The dataset can be downloaded from the project website: https://facebookresearch.github.io/hot3d/.

Figure: Motion-capture lab setup with infrared OptiTrack cameras and light diffuser panels used to record the HOT3D dataset.

Overview

  • HOT3D provides a rich, egocentric dataset for 3D hand and object tracking with over 3.7 million meticulously annotated images captured using head-mounted devices from Meta Reality Labs.

  • The dataset includes interactions with 33 diverse objects across various everyday scenarios, with ground-truth poses captured by a professional motion-capture system and detailed 3D object models that support realistic rendering and evaluation.

  • HOT3D aims to advance AR/VR systems by enhancing model training with high-resolution ground-truth data and multimodal sensory inputs, enabling more intuitive human-computer interactions.

Overview of HOT3D: An Egocentric Dataset for 3D Hand and Object Tracking

HOT3D is a substantial contribution to the domain of egocentric hand and object tracking in 3D computer vision. The dataset, created by researchers at Meta Reality Labs, encompasses over 833 minutes of multi-view image streams, amounting to more than 3.7 million images captured through head-mounted devices. This dataset is meticulously annotated with detailed ground-truth information on the 3D poses and models of both hands and objects, thus offering a comprehensive resource for the development and evaluation of advanced tracking algorithms.

Dataset Composition and Tools

The dataset involves 19 subjects interacting with 33 diverse rigid objects across everyday scenarios, including activities performed in kitchen, office, and living room settings. The recordings were captured with two head-mounted devices from Meta: Project Aria and Quest 3. Project Aria, a research prototype of lightweight AR/AI glasses, provides high-resolution RGB and monochrome image streams along with multimodal signals such as scene point clouds and eye-gaze information. Quest 3, a production VR headset, provides synchronized monochrome image streams conducive to robust tracking in VR applications.

Ground-truth annotations of hand and object poses were obtained with a professional motion-capture system using small optical markers attached to hands and objects, ensuring high-quality and consistent labeling throughout the dataset. This is complemented by 3D mesh models of the objects, scanned in-house to include detailed geometry and PBR (Physically-Based Rendering) materials, which enable realistic rendering and evaluation in synthetic settings.
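As a purely illustrative example (not part of the official HOT3D tooling), a scanned object model exported as a GLB/GLTF file with PBR materials could be inspected with a general-purpose mesh library such as trimesh; the file name below is a placeholder:

```python
import trimesh

# Hypothetical file name; a scanned object mesh with PBR materials,
# assumed here to be exported as GLB/GLTF.
scene = trimesh.load("object_scan.glb")

for name, mesh in scene.geometry.items():
    print(name, mesh.vertices.shape, mesh.faces.shape)
    # PBR material parameters (base color, metallic/roughness textures)
    # are attached to the mesh's visual when present in the source file.
    print(type(mesh.visual.material).__name__)
```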

Key Numerical Results and Comparative Analysis

The HOT3D dataset’s scale and the quality of its annotations set it apart from preceding datasets like HO-3D, H2O, and ContactPose. Specifically, it includes:

  • 833 minutes of recordings, leading to over 1.5 million multi-view frames
  • Accurate annotation of over 3.7 million images with ground-truth 3D poses
  • Inclusion of dynamic, non-trivial scenarios with detailed information about hand and object interactions

Hand poses are provided in both the UmeTrack and MANO formats, and object poses are represented as 3D rigid transformations (6DoF rotation and translation). The availability of diverse hand interactions with everyday objects in varied, dynamic settings provides a rich testbed for practical applications in AR/VR and human-computer interaction.
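For intuition, a 6DoF object pose can be applied to model-space points and the result projected into an image with a simple pinhole camera model. The sketch below is generic and not the dataset's API; the headset cameras use their own calibrated camera models, and the pose matrix, intrinsics, and points here are placeholders:

```python
import numpy as np

def apply_rigid_transform(T_world_object, points_object):
    """Map (N, 3) model-space points to world space with a 4x4 rigid transform."""
    R, t = T_world_object[:3, :3], T_world_object[:3, 3]
    return points_object @ R.T + t

def project_pinhole(K, points_cam):
    """Project (N, 3) camera-space points to (N, 2) pixel coordinates."""
    uv = points_cam @ K.T
    return uv[:, :2] / uv[:, 2:3]

# Placeholder inputs; for simplicity the camera frame is assumed to
# coincide with the world frame.
T_world_object = np.eye(4)                     # identity pose
K = np.array([[500.0,   0.0, 320.0],           # generic intrinsics
              [  0.0, 500.0, 240.0],
              [  0.0,   0.0,   1.0]])
points = np.array([[0.00, 0.00, 1.0],
                   [0.05, 0.00, 1.0]])         # model points in metres

print(project_pinhole(K, apply_rigid_transform(T_world_object, points)))
```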

Implications and Future Directions

HOT3D stands to significantly accelerate research in several key areas:

  1. Egocentric Interaction Understanding: By providing detailed annotations and multimodal data, this dataset aids in developing models that understand and interpret complex hand-object interactions from a first-person perspective.
  2. Model-Free and Model-Based Tracking: The comprehensive object onboarding sequences facilitate the training and testing of both model-based and model-free 3D object tracking algorithms; a common pose-error metric for such evaluation is sketched after this list.
  3. Synthetic Data Generation: Detailed 3D object models with PBR materials enable the generation of photo-realistic synthetic datasets that can further enhance model training.
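As referenced in item 2 above, a common way to score object trackers in pose-estimation benchmarks is the average distance (ADD) between model points transformed by the estimated and ground-truth poses. The sketch below is a generic illustration of that metric, not necessarily the evaluation protocol used by HOT3D; the pose matrices and points are placeholders:

```python
import numpy as np

def add_error(T_est, T_gt, model_points):
    """Average distance (ADD) between model points under estimated vs. GT pose."""
    def transform(T, X):
        return X @ T[:3, :3].T + T[:3, 3]
    diff = transform(T_est, model_points) - transform(T_gt, model_points)
    return float(np.linalg.norm(diff, axis=1).mean())

# Placeholder example: a 1 cm translation error yields an ADD of ~0.01 m.
T_gt = np.eye(4)
T_est = np.eye(4)
T_est[0, 3] = 0.01
points = np.random.default_rng(0).uniform(-0.05, 0.05, size=(100, 3))
print(add_error(T_est, T_gt, points))
```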

Looking ahead, HOT3D paves the way for more intelligent and responsive AR/VR systems that can accurately interpret and predict user intent from detailed analysis of hand and object movement. The coupling of high-quality ground-truth annotations with multimodal sensory inputs opens up new possibilities in contextual AI, where machine understanding of user actions can lead to more seamless and intuitive interactions.

Conclusion

HOT3D is a meticulously curated and richly annotated dataset that offers significant resources for the advancement of egocentric hand and object tracking in 3D. Through its extensive recordings, detailed ground-truth data, and versatile applications, it holds promise for pushing the boundaries of computer vision research and enabling more sophisticated AI-driven applications in augmented and virtual reality contexts. Researchers are encouraged to leverage this dataset for developing and validating next-generation tracking algorithms, contributing to the ongoing evolution of computer vision and interactive AI systems.
