How2Sign: A Large-scale Multimodal Dataset for Continuous American Sign Language

Published 18 Aug 2020 in cs.CV | (2008.08143v2)

Abstract: One of the factors that have hindered progress in the areas of sign language recognition, translation, and production is the absence of large annotated datasets. Towards this end, we introduce How2Sign, a multimodal and multiview continuous American Sign Language (ASL) dataset, consisting of a parallel corpus of more than 80 hours of sign language videos and a set of corresponding modalities including speech, English transcripts, and depth. A three-hour subset was further recorded in the Panoptic studio enabling detailed 3D pose estimation. To evaluate the potential of How2Sign for real-world impact, we conduct a study with ASL signers and show that synthesized videos using our dataset can indeed be understood. The study further gives insights on challenges that computer vision should address in order to make progress in this field. Dataset website: http://how2sign.github.io/

Abstract PDF Upgrade to Chat

Authors (8)

Citations (169)

View on Semantic Scholar

Summary

The paper introduces a comprehensive dataset with 80+ hours of continuous ASL videos, enabling advanced sign language recognition and synthesis research.
It employs dual recording environments with HD green screen and 3D motion capture to ensure precise 2D and 3D pose estimations.
The dataset aligns multimodal inputs such as speech, transcripts, gloss annotations, and depth data, enhancing context-aware machine learning models.

A Multimodal Resource for Advancing Sign Language Processing: The How2Sign Dataset

The How2Sign dataset represents a significant development in the resources available for the study and advancement of sign language processing technologies. This work addresses the crucial barrier of data scarcity that impedes advances in sign language recognition, translation, and production by introducing a comprehensive multimodal collection of continuous American Sign Language (ASL) data. With over 80 hours of annotated ASL videos accompanied by parallel modalities such as speech, transcripts, and depth information, How2Sign marks a substantial contribution to linguistics and computer vision research communities involved in sign language studies.

Dataset Composition and Methodology

The How2Sign dataset aggregates more than 80 hours of multiview ASL video recordings annotated with parallel modalities—English transcripts, gloss annotations, and depth data. This diverse collection of data was achieved through collaboration with native ASL signers and interpreters, ensuring data quality and linguistic accuracy. The dataset includes recordings from two distinct environments: a Green Screen studio and the Panoptic Studio, the latter facilitating detailed 3D pose estimation via its geodesic dome equipped with numerous cameras and sensors.

The Green Screen captures enabled multiview recordings with HD and depth cameras, allowing for robust 2D keypoint analysis using OpenPose software. A subset of this footage was recorded in the Panoptic studio, providing data enriched with 3D keypoints, advantageous for fine-grained 3D motion analysis and sign language semantics capture.

The dataset's construction adheres to high standards of varying input, encompassing a diverse vocabulary stemming from its alignment with the instructional How2 dataset. This alignment not only provides synchronicity across modalities but enables researchers to draw upon extensive pre-existing linguistic resources in their analysis and model-building endeavors.

Potential Impact and Evaluation

The implications of the How2Sign dataset extend across multiple dimensions in sign language processing research. Firstly, the dataset's breadth facilitates the training of machine learning models capable of transcending the limitations imposed by previous smaller, domain-restricted datasets. In machine translation and sign language synthesis research, How2Sign delivers the multimodal support necessary for developing more sophisticated, context-aware models.

A study conducted to gauge the real-world applicability of synthesized videos from How2Sign data showcased promising results in terms of participant comprehension of rendered ASL clips, indicating the dataset’s potential to support models in producing interpretable, realistic sign language outputs. The study emphasizes the need for improved human pose estimation, especially concerning fast-moving gestures typical in sign language, to capture granularity in meaning.

Future Directions and Research Needs

While How2Sign provides a solid foundation, continuous improvements to pose estimation algorithms remain crucial for advancing sign language recognition and synthesis technologies. The dataset paves the way for research on signer-independent models, exploring personalized signing nuances and impact on communication. Additionally, the potential for developing bilingual or multilingual sign language systems may be realized by expanding this data-driven approach to other sign languages, enhancing accessibility for Deaf communities worldwide.

In conclusion, How2Sign offers a valuable open-resource platform with its comprehensive, multimodal data reflecting real-world variability in sign language. By bridging a significant data gap, How2Sign promotes rigorous advancements in computational sign language tasks, fostering enhancements in speech-to-sign, video-to-text, and realistic sign generation that can significantly enhance communication accessibility for sign language users.

Markdown Report Issue