Abstract

Sign Languages (SL) serve as the primary mode of communication for the Deaf and Hard of Hearing communities. Deep learning methods for SL recognition and translation have achieved promising results. However, Sign Language Production (SLP) poses a challenge as the generated motions must be realistic and have precise semantic meaning. Most SLP methods rely on 2D data, which hinders their realism. In this work, a diffusion-based SLP model is trained on a curated large-scale dataset of 4D signing avatars and their corresponding text transcripts. The proposed method can generate dynamic sequences of 3D avatars from an unconstrained domain of discourse using a diffusion process formed on a novel and anatomically informed graph neural network defined on the SMPL-X body skeleton. Through quantitative and qualitative experiments, we show that the proposed method considerably outperforms previous methods of SLP. This work makes an important step towards realistic neural sign avatars, bridging the communication gap between Deaf and hearing communities.

Overview

  • The paper addresses the challenges of 3D Sign Language Production from text for the Deaf and Hard of Hearing communities.

  • A diffusion-based graph neural network model built on the SMPL-X skeleton achieves realistic 3D sign language generation.

  • Researchers created a comprehensive 3D dataset based on the How2Sign dataset with detailed SMPL-X annotations.

  • The model outperforms existing methods in semantic alignment with the input text and in the accuracy of hand and body movements.

  • A user study with individuals fluent in American Sign Language validates the model's effectiveness and accuracy.

Significance and Challenges of Sign Language Production (SLP)

Sign language is the primary mode of communication for the Deaf and Hard of Hearing communities. Despite advancements in recognition and translation, producing realistic sign language through computer vision poses significant challenges. Many existing methods depend on 2D data, limiting their ability to capture the full complexity of sign language, which features a combination of manual gestures and non-manual elements like facial expressions and body movements.

Innovative Approach to 3D Sign Language Production

This paper introduces a model that generates three-dimensional sign language sequences from text input via a diffusion process. The model employs a graph neural network built on the anatomically detailed SMPL-X skeleton, enabling dynamic and anatomically correct animation of sign language avatars.
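
The two central ingredients, a skeleton-structured graph network and a diffusion sampler, can be sketched compactly. The Python sketch below builds a normalized adjacency matrix from a kinematic tree, applies one graph convolution over per-joint features, and runs DDPM-style ancestral sampling; the truncated joint list, 6D pose dimensionality, layer sizes, and text-conditioning scheme are illustrative assumptions, not the authors' released architecture.

```python
import torch
import torch.nn as nn

# Hypothetical, abbreviated kinematic tree (parent index per joint).
# The real SMPL-X skeleton has 55 joints including fingers and face;
# only a few are listed to keep the sketch short.
PARENTS = [-1, 0, 0, 0, 1, 2, 3]
NUM_JOINTS = len(PARENTS)

def skeleton_adjacency(parents):
    """Adjacency with self-loops built from the kinematic tree, so that
    message passing follows anatomical bone connections."""
    A = torch.eye(len(parents))
    for child, parent in enumerate(parents):
        if parent >= 0:
            A[child, parent] = A[parent, child] = 1.0
    d_inv_sqrt = torch.diag(A.sum(-1).pow(-0.5))  # symmetric GCN norm
    return d_inv_sqrt @ A @ d_inv_sqrt

class SkeletonGCN(nn.Module):
    """One graph-convolution layer over per-joint features."""
    def __init__(self, dim_in, dim_out, A):
        super().__init__()
        self.register_buffer("A", A)
        self.lin = nn.Linear(dim_in, dim_out)

    def forward(self, x):  # x: (batch, joints, dim_in)
        return torch.relu(self.lin(self.A @ x))

class PoseDenoiser(nn.Module):
    """Predicts the noise added to per-joint rotations, conditioned on a
    text embedding and the diffusion timestep (concatenated per joint)."""
    def __init__(self, A, pose_dim=6, text_dim=32, hidden=64):
        super().__init__()
        self.gcn = SkeletonGCN(pose_dim + text_dim + 1, hidden, A)
        self.out = nn.Linear(hidden, pose_dim)

    def forward(self, x_t, t, text_emb):
        b, j, _ = x_t.shape
        feats = torch.cat([
            x_t,                                      # noisy pose
            text_emb[:, None, :].expand(b, j, -1),    # text condition
            t.float().view(b, 1, 1).expand(b, j, 1),  # timestep
        ], dim=-1)
        return self.out(self.gcn(feats))

@torch.no_grad()
def sample(model, text_emb, steps=50, pose_dim=6):
    """Minimal DDPM ancestral sampling of one pose; a full signing
    sequence would stack a time axis as well."""
    betas = torch.linspace(1e-4, 2e-2, steps)
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    x = torch.randn(text_emb.shape[0], NUM_JOINTS, pose_dim)
    for t in reversed(range(steps)):
        t_batch = torch.full((x.shape[0],), t)
        eps = model(x, t_batch, text_emb)
        x = (x - betas[t] / (1 - alpha_bar[t]).sqrt() * eps) / alphas[t].sqrt()
        if t > 0:
            x = x + betas[t].sqrt() * torch.randn_like(x)
    return x
```

Concatenating the text embedding to every joint is the simplest conditioning choice; the actual model may condition differently, for example through attention over the text encoding.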

Creation of a Comprehensive 3D Dataset

To support the training of the model, researchers have developed the first large-scale dataset of 3D sign language, annotated with detailed SMPL-X parameters. The dataset is derived from the existing How2Sign dataset and includes high-fidelity reconstructions of signing avatars paired with their text transcripts. The reconstruction pipeline surpasses previous methods in accuracy by applying a novel pose optimization constrained by realistic human pose priors.
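
The summary does not spell out the optimization, but prior-constrained SMPL-X fitting is conventionally posed as minimizing a 2D reprojection error plus a pose-plausibility penalty. Below is a minimal sketch of that generic pattern, not the authors' exact pipeline; `smplx_forward`, `camera_project`, and `pose_prior` are hypothetical stand-ins for an SMPL-X forward pass (e.g. via the `smplx` package), a camera model, and a learned prior such as VPoser.

```python
import torch

def fit_pose(keypoints_2d, camera_project, smplx_forward, pose_prior,
             steps=200, lr=0.05, prior_weight=1e-3):
    """Prior-constrained pose fitting for one video frame: a generic
    sketch of the kind of optimization the paper describes, not the
    authors' exact pipeline.

    keypoints_2d   : (J, 2) tensor of detected 2D joint locations
    camera_project : maps (J, 3) joints to (J, 2) image points
    smplx_forward  : maps pose parameters to (J, 3) joint positions,
                     e.g. a wrapper around the `smplx` Python package
    pose_prior     : scores pose plausibility (e.g. negative
                     log-likelihood under a learned prior such as
                     VPoser); lower means more human-like
    """
    # Axis-angle per joint; a simplification of the full SMPL-X
    # parameterization (body, hands, jaw, eyes, shape, expression).
    pose = torch.zeros(55, 3, requires_grad=True)
    opt = torch.optim.Adam([pose], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        joints_3d = smplx_forward(pose)
        reproj = (camera_project(joints_3d) - keypoints_2d).pow(2).sum()
        loss = reproj + prior_weight * pose_prior(pose)  # data + prior
        loss.backward()
        opt.step()
    return pose.detach()
```

The prior term is what keeps per-frame fits from drifting into implausible contortions when 2D keypoints are noisy, which is the failure mode the paper's pipeline is described as improving on.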

Evaluation and Impact

The model is evaluated against several benchmarks, outperforming current state-of-the-art approaches to generating sign language from text, with more accurate hand articulation and body movement and better alignment with the meaning of the input text. A user study involving individuals fluent in American Sign Language further validates the model's efficacy, with generated signs reflecting the intended message with high accuracy.
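
Alignment with text meaning in sign language production is often quantified by back-translation: running the generated motion through a pretrained sign-language-recognition model and scoring its transcript against the input text. The snippet below sketches that protocol under the assumption that such a `recognizer` is available; it is an illustration of the standard technique, not necessarily the paper's exact metric.

```python
from nltk.translate.bleu_score import corpus_bleu

def back_translation_bleu(generated_motions, reference_texts, recognizer):
    """Back-translation scoring: transcribe each generated motion with a
    pretrained sign-language-recognition model (`recognizer`, a
    hypothetical stand-in here) and compare against the input text."""
    hypotheses = [recognizer(motion).split() for motion in generated_motions]
    references = [[text.split()] for text in reference_texts]
    return corpus_bleu(references, hypotheses)
```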

In summary, the paper presents an advancement in bridging the communication gap for the Deaf and Hard of Hearing, with a text-to-sign generation model that produces more realistic signing avatars. This progress highlights the potential of diffusion models and graph neural networks in improving accessibility through technology.
