Real-time and Continuous Turn-taking Prediction Using Voice Activity Projection (2401.04868v1)
Published 10 Jan 2024 in cs.CL, cs.HC, cs.SD, and eess.AS
Abstract: A demonstration of a real-time and continuous turn-taking prediction system is presented. The system is based on a voice activity projection (VAP) model, which directly maps dialogue stereo audio to future voice activities. The VAP model includes contrastive predictive coding (CPC) and self-attention transformers, followed by a cross-attention transformer. We examine the effect of the input context audio length and demonstrate that the proposed system can operate in real time on a CPU with minimal performance degradation.
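The abstract's point about input context audio length can be illustrated with a small sketch of how a streaming system might cap the audio passed to the model at each step. This is not the authors' implementation: the class `StereoContextBuffer`, the function `stream`, and the `predict` callable are hypothetical names, and the sample rate and context length are placeholder parameters. It only shows the general idea of keeping a fixed-length window over two speaker channels so that per-step compute stays bounded.

```python
from collections import deque


class StereoContextBuffer:
    """Keeps only the most recent `context_seconds` of two-channel audio.

    Hypothetical helper for illustration; not part of the VAP codebase.
    """

    def __init__(self, sample_rate: int, context_seconds: float):
        self.max_samples = int(sample_rate * context_seconds)
        # One deque per speaker channel; the oldest samples are dropped
        # automatically once the window is full.
        self.channels = (deque(maxlen=self.max_samples),
                         deque(maxlen=self.max_samples))

    def push(self, chunk_a, chunk_b):
        """Append one chunk of samples per speaker channel."""
        if len(chunk_a) != len(chunk_b):
            raise ValueError("both channels must supply the same number of samples")
        self.channels[0].extend(chunk_a)
        self.channels[1].extend(chunk_b)

    def context(self):
        """Return the current context window as two lists of samples."""
        return list(self.channels[0]), list(self.channels[1])


def stream(chunks, buffer, predict):
    """Feed chunks in arrival order and collect one prediction per chunk.

    `predict` stands in for a model call (assumed, not a real API).
    """
    outputs = []
    for chunk_a, chunk_b in chunks:
        buffer.push(chunk_a, chunk_b)
        outputs.append(predict(*buffer.context()))
    return outputs
```

A shorter window lowers the cost of each prediction step but gives the model less dialogue history, which is the trade-off the abstract says the authors examine.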