WLASL-LEX: a Dataset for Recognising Phonological Properties in American Sign Language

Published 11 Mar 2022 in cs.CL, cs.CV, and cs.LG | (2203.06096v1)

Abstract: Signed Language Processing (SLP) concerns the automated processing of signed languages, the main means of communication of Deaf and hearing impaired individuals. SLP features many different tasks, ranging from sign recognition to translation and production of signed speech, but has been overlooked by the NLP community thus far. In this paper, we bring to attention the task of modelling the phonology of sign languages. We leverage existing resources to construct a large-scale dataset of American Sign Language signs annotated with six different phonological properties. We then conduct an extensive empirical study to investigate whether data-driven end-to-end and feature-based approaches can be optimised to automatically recognise these properties. We find that, despite the inherent challenges of the task, graph-based neural networks that operate over skeleton features extracted from raw videos are able to succeed at the task to a varying degree. Most importantly, we show that this performance pertains even on signs unobserved during training.

Summary

  • The paper introduces WLASL-LEX, a large-scale dataset that explicitly labels six key phonological properties in ASL videos.
  • It employs skeleton keypoints from HRNet and FrankMocap, achieving up to 84.5% accuracy on sign type classification with graph-based models.
  • The study demonstrates that sublexical modeling can generalize to unseen glosses, paving the way for zero-shot sign recognition in low-resource scenarios.

WLASL-LEX: Phonological Property Recognition Dataset for American Sign Language

Motivation and Context

Sign language processing (SLP) presents distinct challenges relative to spoken language processing, owing to its reliance on visuo-gestural communication and a complex multimodal phonology. In American Sign Language (ASL), a small, linguistically well-defined phonological inventory, composed of manual and non-manual features, serves as the sublexical basis for sign formation. While phonological analysis is central in spoken language technology, SLP research has only recently turned toward modeling sign phonology as a core objective rather than as an implicit byproduct of data-driven pattern recognition. The scarcity of large-scale annotated phonological datasets and the limited diversity of existing lexical resources have hindered progress. This paper addresses the gap by formalizing Phonological Property Recognition (PPR) for ASL and assembling WLASL-LEX, a large-scale dataset annotated with six phonological properties.

Dataset Construction and Properties

WLASL-LEX is constructed by cross-referencing glosses in the WLASL video dataset (over 20,000 sign videos from 100+ signers, covering more than 2,000 glosses) with the manually curated ASL-LEX lexical database, which provides explicit phonological annotation [Caselli et al., 2017]. The authors select six manual phonological properties with high discriminatory power for sign identity:

  • Flexion: Aperture of selected fingers
  • Major Location: Broad region of the signer's body where the sign is located
  • Minor Location: Specific anatomical site
  • Movement: Path movement of the dominant hand
  • Selected Fingers: Foregrounded fingers during sign production
  • Sign Type: Symmetry/asymmetry of hand usage

Videos are automatically labeled with these properties, resulting in 10,017 videos across 800 glosses. The dataset also includes upper-body skeleton keypoints extracted with two established pose estimation tools (FrankMocap for 3D and HRNet for 2D), providing spatio-temporal feature representations suited to SLP architectures.
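Conceptually, this construction amounts to a join on gloss between the WLASL video index and the ASL-LEX property table. The sketch below illustrates the idea with pandas; the file names, column names, and formats are hypothetical placeholders, not the actual release formats of either resource.

```python
import pandas as pd

# Hypothetical inputs: a flat WLASL video index and an ASL-LEX export.
# The real releases use their own formats (WLASL ships a JSON index,
# ASL-LEX a spreadsheet), so treat these names/columns as placeholders.
wlasl = pd.read_csv("wlasl_videos.csv")    # columns: video_id, gloss, signer_id
asllex = pd.read_csv("asl_lex.csv")        # columns: gloss + phonological properties

properties = ["flexion", "major_location", "minor_location",
              "movement", "selected_fingers", "sign_type"]

# Inner join on gloss: each WLASL video whose gloss appears in ASL-LEX
# inherits that gloss's phonological labels.
wlasl_lex = wlasl.merge(asllex[["gloss"] + properties], on="gloss", how="inner")

print(f"{len(wlasl_lex)} labeled videos across "
      f"{wlasl_lex['gloss'].nunique()} glosses")
```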

The dataset exhibits expected skewness in class distribution (long tail), motivating the sampling and evaluation strategies.

Experimental Setup

PPR is operationalized as a multi-class classification task, with each of the six phonological properties treated as an independent prediction problem. The evaluation considers two key splits:

  • Phoneme split: Train/test partitions stratified on the phonological property, allowing overlapping glosses across splits (subject to signer division).
  • Gloss split: Stratification ensures no gloss overlap between train and test, assessing the true generalization to unseen lexical items.

This dual scheme gauges the extent to which a model can learn canonical phonological representations that transfer to unobserved signs.
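To make the two regimes concrete, the sketch below derives both splits from a labeled video table like the one built above; it is an illustrative reconstruction using scikit-learn utilities, not the authors' released split code, and it ignores the signer-level constraints for brevity.

```python
from sklearn.model_selection import GroupShuffleSplit, train_test_split

def phoneme_split(df, prop, test_size=0.2, seed=0):
    """Phoneme split: stratify on the property's class labels;
    the same gloss may appear in both partitions."""
    return train_test_split(df, test_size=test_size,
                            stratify=df[prop], random_state=seed)

def gloss_split(df, test_size=0.2, seed=0):
    """Gloss split: keep every gloss entirely in one partition,
    so the test set contains only signs unseen during training."""
    splitter = GroupShuffleSplit(n_splits=1, test_size=test_size,
                                 random_state=seed)
    train_idx, test_idx = next(splitter.split(df, groups=df["gloss"]))
    return df.iloc[train_idx], df.iloc[test_idx]

# Example usage (one split per property in the Phoneme setting,
# a single shared split in the Gloss setting):
# train_df, test_df = phoneme_split(wlasl_lex, "sign_type")
# train_df, test_df = gloss_split(wlasl_lex)
```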

Modeling and Results

A range of neural models is evaluated. Baselines include simple MLPs and RNNs applied to sequences of skeleton features. More advanced models involve:

  • STGCN: Spatio-temporal graph convolutional networks operating over skeleton graphs
  • I3D: Inflated 3D ConvNet for end-to-end learning directly from videos
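For intuition, the toy model below shows what a skeleton-sequence classifier with one independent head per phonological property might look like; it uses plain temporal convolutions over flattened 2D keypoints rather than the paper's graph convolutions, and all layer sizes and property class counts are assumptions for illustration.

```python
import torch
import torch.nn as nn

class SkeletonPropertyClassifier(nn.Module):
    """Toy PPR model: temporal convolutions over flattened 2D keypoints,
    followed by one classification head per phonological property.
    Illustrative only; the paper's strongest model is an ST-GCN that
    operates over the skeleton graph."""

    def __init__(self, num_joints, classes_per_property, hidden=128):
        super().__init__()
        in_dim = num_joints * 2                     # (x, y) per joint per frame
        self.encoder = nn.Sequential(
            nn.Conv1d(in_dim, hidden, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),                # pool over the time axis
        )
        # One independent classifier per property, mirroring the paper's
        # treatment of PPR as six separate prediction problems.
        self.heads = nn.ModuleDict({
            name: nn.Linear(hidden, n) for name, n in classes_per_property.items()
        })

    def forward(self, keypoints):
        # keypoints: (batch, frames, joints, 2)
        b, t, j, c = keypoints.shape
        x = keypoints.reshape(b, t, j * c).transpose(1, 2)   # (batch, joints*2, frames)
        z = self.encoder(x).squeeze(-1)                      # (batch, hidden)
        return {name: head(z) for name, head in self.heads.items()}

# Hypothetical class counts per property, e.g. for 50 upper-body keypoints:
# model = SkeletonPropertyClassifier(50, {"sign_type": 6, "flexion": 8})
# logits = model(torch.randn(4, 64, 50, 2))   # one logit tensor per property
```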

Key findings:

  • STGCN with HRNet features outperforms all other methods across all six phonological properties, reaching up to 84.5% test accuracy (Phoneme split, Sign Type).
  • Performance drops when models are tested on unseen glosses (Gloss split), but the best architecture (STGCN) retains high accuracy, with most properties degrading by less than 10%—demonstrating above-chance structural generalization.
  • Simple MLPs and RNNs provide only marginal improvement over the (high) majority-class baseline, especially for properties with uneven representation or more subtle spatial components.
  • Models leveraging skeleton features exhibit stronger performance than those operating on raw pixels, suggesting the importance of explicit structural encoding.

The study further probes whether performance is capped by data noise or annotation quality and finds little agreement between models on which examples they misclassify, suggesting that the remaining errors are not driven by systematically mislabeled examples or low input fidelity.

Theoretical and Practical Implications

The explicit demonstration that neural architectures can learn to predict ASL phonological properties—including generalization to out-of-vocabulary signs—substantiates the utility of phonological supervision in SLP. Notably, the abstraction away from lexical memorization toward sublexical feature modeling establishes a path for zero-shot sign recognition in low-resource SL contexts and motivates cross-linguistic phonology transfer.

The current approach treats each phonological property in isolation; future work that jointly models the interdependencies among these properties may further exploit phonotactic constraints inherent to signed languages. Moreover, leveraging PPR as a building block for robust sign tokenization, glossing of continuous signing, and the automatic creation of silver-standard annotations for new sign languages will likely have significant downstream impact. Such advances could ease data bottlenecks in SLP, leading to improved translation and more accessible communication tools.

Outlook on Future Directions

Key open directions include:

  • Multi-task models for joint phonological property learning, capturing inter-property constraints.
  • Transfer learning and zero-shot recognition leveraging shared phonological subunits across sign languages.
  • Integration of non-manual features (e.g., facial expression and eyebrow movement), currently excluded, for a more holistic phonological representation.
  • Systematic benchmarking against human expert annotation reliability in PPR classification tasks.
  • Use of PPR for automated segmentation of continuous signing, facilitating robust gloss boundary detection.

Conclusion

The WLASL-LEX paper systematizes phonological property recognition for ASL, providing a large-scale, diverse, and automatically labeled dataset. The empirical results underscore that graph-based models using structured skeleton features can reliably predict key phonological attributes, even for unseen lexical items, pointing to compositional inductive generalization in SLP architectures. This work constitutes a foundational resource and methodology for fine-grained, interpretable SL modeling and for scaling SLP across languages and domains (2203.06096).
