A Tale of Two Languages: Large-Vocabulary Continuous Sign Language Recognition from Spoken Language Supervision (2405.10266v1)
Abstract: In this work, our goals are twofold: large-vocabulary continuous sign language recognition (CSLR), and sign language retrieval. To this end, we introduce a multi-task Transformer model, CSLR2, that is able to ingest a signing sequence and output in a joint embedding space between signed language and spoken language text. To enable CSLR evaluation in the large-vocabulary setting, we introduce new dataset annotations that have been manually collected. These provide continuous sign-level annotations for six hours of test videos, and will be made publicly available. We demonstrate that by a careful choice of loss functions, training the model for both the CSLR and retrieval tasks is mutually beneficial in terms of performance: retrieval improves CSLR performance by providing context, while CSLR improves retrieval with more fine-grained supervision. We further show the benefits of leveraging weak and noisy supervision from large-vocabulary datasets such as BOBSL, namely sign-level pseudo-labels and English subtitles. Our model significantly outperforms the previous state of the art on both tasks.
 