Progressive Transformers for End-to-End Sign Language Production (2004.14874v2)

Published 30 Apr 2020 in cs.CV, cs.CL, and cs.LG

Abstract: The goal of automatic Sign Language Production (SLP) is to translate spoken language to a continuous stream of sign language video at a level comparable to a human translator. If this was achievable, then it would revolutionise Deaf hearing communications. Previous work on predominantly isolated SLP has shown the need for architectures that are better suited to the continuous domain of full sign sequences. In this paper, we propose Progressive Transformers, a novel architecture that can translate from discrete spoken language sentences to continuous 3D skeleton pose outputs representing sign language. We present two model configurations, an end-to-end network that produces sign direct from text and a stacked network that utilises a gloss intermediary. Our transformer network architecture introduces a counter that enables continuous sequence generation at training and inference. We also provide several data augmentation processes to overcome the problem of drift and improve the performance of SLP models. We propose a back translation evaluation mechanism for SLP, presenting benchmark quantitative results on the challenging RWTH-PHOENIX-Weather-2014T (PHOENIX14T) dataset and setting baselines for future research.

Summary

The paper introduces a novel end-to-end model using progressive transformers to convert text into continuous 3D sign pose sequences.
It employs counter decoding and robust data augmentation techniques, such as future prediction and Gaussian noise, to mitigate model drift.
Evaluation via back-translation and BLEU scores shows that the T2P approach outperforms T2G2P, setting a new benchmark for sign language production.

Overview of "Progressive Transformers for End-to-End Sign Language Production"

The paper introduces a novel approach to Sign Language Production (SLP) by proposing Progressive Transformers, a transformer architecture designed to translate spoken language sentences into continuous 3D skeleton pose sequences representing sign language. Mapping discrete text to a coherent, continuous sign sequence is a significant challenge that spans computational linguistics and computer vision.

Key Contributions

The authors present two configurations for the SLP task:

Text to Pose (T2P): An end-to-end model directly translating text to pose sequences without intermediate representations.
Text to Gloss to Pose (T2G2P): A stacked network that first translates text into gloss, a written representation of individual signs, and then translates the gloss sequence into pose sequences. The gloss step is intended to bridge the gap between spoken language text and sign pose output.

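The paper employs a novel decoding methodology called "Counter Decoding," in which the model predicts a counter value between 0 and 1 alongside each pose frame to track progress through the sequence. Generation ends when the counter reaches 1, so the model determines the output length itself and does not need an end-of-sequence token from a predefined vocabulary. This is particularly valuable when producing continuous sequences from discrete input, where source and target differ in both grammatical structure and temporal length.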
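The counter idea can be illustrated with a minimal sketch. This is not the authors' implementation: `counter_targets`, `counter_decode`, `toy_step`, and the `max_frames` safety cap are hypothetical names introduced here, and the linear `t / (T - 1)` counter schedule is an assumption about how frame position is normalised.

```python
from typing import Callable, List, Tuple

Pose = List[float]
StepFn = Callable[[List[Pose], List[float]], Tuple[Pose, float]]


def counter_targets(num_frames: int) -> List[float]:
    """Training targets: frame position scaled to [0, 1] (assumed linear schedule)."""
    if num_frames < 2:
        return [1.0] * num_frames
    return [t / (num_frames - 1) for t in range(num_frames)]


def counter_decode(step_fn: StepFn, first_pose: Pose,
                   max_frames: int = 300) -> Tuple[List[Pose], List[float]]:
    """Generate frames autoregressively until the predicted counter reaches 1."""
    poses: List[Pose] = [first_pose]
    counters: List[float] = [0.0]
    while len(poses) < max_frames:  # safety cap in case the counter never reaches 1
        next_pose, next_counter = step_fn(poses, counters)
        poses.append(next_pose)
        counters.append(min(next_counter, 1.0))
        if next_counter >= 1.0:  # counter at 1 marks the end of the sequence
            break
    return poses, counters


def toy_step(poses: List[Pose], counters: List[float]) -> Tuple[Pose, float]:
    """Hypothetical stand-in for a trained decoder: advances the counter by 0.25."""
    next_pose = [x + 0.1 for x in poses[-1]]
    return next_pose, counters[-1] + 0.25


poses, counters = counter_decode(toy_step, [0.0, 0.0, 0.0])
```

With the toy step function the loop stops on its own after the counter has climbed from 0 to 1, without any end-of-sequence token. In a real model `step_fn` would be the trained decoder conditioned on the encoded source sentence.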

Data Augmentation and Model Robustness

An essential aspect of the paper is addressing model drift during sign language production. The authors implement several data augmentation techniques to counteract the drift:

Future Prediction: The model predicts multiple future frames at each step, encouraging robust sequence modeling.
Just Counter Input: During training, the decoder is given only the counter values and not the previous skeleton poses, which curbs reliance on skeletal inputs and reduces drift by forcing the model to generate each frame from the temporal embedding alone.
Gaussian Noise: Adds Gaussian noise to the skeleton pose inputs during training, so the model becomes robust to imperfect input frames.

The combination of these techniques results in enhanced model performance, leading to the generation of smoother and more accurate sign language sequences.
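The three augmentations listed above can be sketched on plain Python lists, where each frame is a flat list of joint coordinates. This is a rough illustration under stated assumptions rather than the paper's code: the function names are hypothetical, padding the future targets by repeating the last frame is an assumption, and representing "just counter" input by zeroing the pose values and appending the counter is one possible encoding among several.

```python
import random
from typing import List

Frame = List[float]


def add_gaussian_noise(frames: List[Frame], sigma: float,
                       rng: random.Random) -> List[Frame]:
    """Perturb every joint coordinate with zero-mean Gaussian noise."""
    return [[x + rng.gauss(0.0, sigma) for x in frame] for frame in frames]


def future_prediction_targets(frames: List[Frame],
                              horizon: int) -> List[List[Frame]]:
    """For step t, collect frames t+1 .. t+horizon as targets.

    Steps near the end are padded by repeating the last frame (an assumption).
    """
    last = len(frames) - 1
    return [
        [frames[min(t + k, last)] for k in range(1, horizon + 1)]
        for t in range(len(frames))
    ]


def just_counter_inputs(frames: List[Frame]) -> List[Frame]:
    """Decoder inputs that carry only the counter: pose values are zeroed."""
    n = len(frames)
    inputs = []
    for t, frame in enumerate(frames):
        counter = t / (n - 1) if n > 1 else 1.0
        inputs.append([0.0] * len(frame) + [counter])
    return inputs


frames = [[0.0], [1.0], [2.0]]
noisy = add_gaussian_noise(frames, sigma=0.05, rng=random.Random(0))
targets = future_prediction_targets(frames, horizon=2)
counter_only = just_counter_inputs(frames)
```

Noise and counter-only inputs change what the decoder sees, while future prediction changes what it is asked to output, so the three can be combined in one training step.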

Evaluation and Results

The evaluation uses a back-translation approach: generated sign pose sequences are translated back into text, and the result is compared with the original spoken language sentence using BLEU and ROUGE scores. Notably, the T2P configuration marginally outperformed the T2G2P setup, suggesting that the additional gloss step, while intuitively beneficial, may introduce unnecessary complexity. The paper reports these back-translation scores on the PHOENIX14T dataset as benchmark results, setting baselines for future SLP research.
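To make the scoring step concrete, the sketch below computes a simplified corpus-level BLEU between reference sentences and back-translated text. It is an illustration only: it uses a single reference per sentence and no smoothing, it is not the scoring script used in the paper, and the example sentences are invented. In a full pipeline the hypotheses would come from a pose-to-text translation model, which is outside the scope of this sketch.

```python
import math
from collections import Counter
from typing import List, Sequence


def ngrams(tokens: Sequence[str], n: int) -> Counter:
    """Count all n-grams of order n in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(references: List[List[str]], hypotheses: List[List[str]],
                max_n: int = 4) -> float:
    """Simplified BLEU: clipped n-gram precision with a brevity penalty."""
    matches = [0] * max_n
    totals = [0] * max_n
    ref_len = hyp_len = 0
    for ref, hyp in zip(references, hypotheses):
        ref_len += len(ref)
        hyp_len += len(hyp)
        for n in range(1, max_n + 1):
            ref_counts = ngrams(ref, n)
            hyp_counts = ngrams(hyp, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # no smoothing: any empty n-gram order gives a zero score
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity * math.exp(log_precision)


reference = "tomorrow it rains in the north".split()
back_translated = "tomorrow it rains in the south".split()  # invented example
score = corpus_bleu([reference], [back_translated])
```

A higher score means the back-translated text recovers more of the original sentence, which is taken as indirect evidence that the produced sign sequence preserved its content.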

Implications and Future Directions

This work has substantial implications for improving communication accessibility for the Deaf community by enhancing SLP systems. Extending the model to non-manual features such as facial expression and body language would be a significant step toward comprehensive sign language production tools that integrate seamlessly into assistive technologies.

The paper also opens avenues for research in continuous sequence generation across other modalities, potentially impacting areas such as music synthesis and complex action recognition in video streams. As AI capabilities expand, the integration of such systems into real-world applications could significantly improve interaction for communities that communicate through sign languages.

This paper delineates a pathway toward robust, interpretable, and accurate sign language generation, demonstrating the effectiveness of transformer architectures in bridging symbolic and continuous data realms. The benchmarks set in this work are vital for guiding future research and development in machine translation and SLP.
