
Real-time and Continuous Turn-taking Prediction Using Voice Activity Projection

(arXiv:2401.04868)

Published Jan 10, 2024 in cs.CL, cs.HC, cs.SD, and eess.AS

Abstract

A demonstration of a real-time and continuous turn-taking prediction system is presented. The system is based on a voice activity projection (VAP) model, which directly maps dialogue stereo audio to future voice activities. The VAP model includes contrastive predictive coding (CPC) and self-attention transformers, followed by a cross-attention transformer. We examine the effect of the input context audio length and demonstrate that the proposed system can operate in real-time with CPU settings, with minimal performance degradation.

Overview

  • The paper demonstrates a real-time, continuous turn-taking prediction system built on the voice activity projection (VAP) framework.

  • VAP uses contrastive predictive coding (CPC) and transformer networks to map audio to future voice activities.

  • The model predicts future speaking probabilities and detects current voice activity in real time via multitask learning.

  • Performance evaluations demonstrate that the VAP system can accurately predict speaking turns in real time on a CPU, even at moments where the next speaker is not signaled by clear cues.

  • The research paves the way for more nuanced spoken dialogue systems (SDSs) and calls for further integration and testing across dialogue scenarios.

Introduction to Turn-taking Prediction

Turn-taking is an integral part of communication: predicting when a speaker will start or stop speaking is key to facilitating smooth dialogue exchanges. Traditional spoken dialogue systems often rely on fixed silence thresholds to determine speaker transitions, which can lead to interruptions or delayed responses. This approach overlooks the complexity of human conversational patterns, where a pause is not necessarily the end of a turn.
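
To make the baseline concrete, the following is a minimal sketch of such a silence-timeout rule; the frame length, the 700 ms timeout, and the function name are illustrative assumptions, not values from the paper.

```python
# Minimal sketch of conventional silence-threshold endpointing.
# FRAME_MS, SILENCE_TIMEOUT_MS, and user_turn_ended are illustrative
# assumptions, not taken from the paper.

FRAME_MS = 20              # analysis frame length
SILENCE_TIMEOUT_MS = 700   # fixed timeout of the kind traditional SDSs use

def user_turn_ended(vad_frames: list[bool]) -> bool:
    """Declare the turn over once trailing silence exceeds the timeout.

    vad_frames: per-frame voice-activity flags (True = speech), newest last.
    """
    trailing_silence_ms = 0
    for is_speech in reversed(vad_frames):
        if is_speech:
            break
        trailing_silence_ms += FRAME_MS
    return trailing_silence_ms >= SILENCE_TIMEOUT_MS
```

A rule like this fires on any sufficiently long pause, whether or not the speaker intends to yield the turn, which is exactly the failure mode the VAP model targets.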

Voice Activity Projection Model

To address these challenges, the system builds on voice activity projection (VAP), a framework that directly maps the stereo audio of a dialogue to future voice activities. Each speaker's channel is encoded with contrastive predictive coding (CPC) and processed by self-attention transformers; a cross-attention transformer then combines the two streams. A multitask learning objective trains the model for both voice activity projection (predicting future speaking probabilities) and voice activity detection (identifying current speech).
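
The following is a minimal PyTorch sketch of this pipeline. Layer sizes, the single-layer transformers, and the 256-class output (the common VAP discretization of a 2-second projection window into four bins per speaker, giving 2^8 joint states) are assumptions drawn from the wider VAP literature rather than this paper's exact configuration, and a simple convolution stands in for the pretrained CPC encoder.

```python
# Hedged sketch of the VAP architecture described above. Dimensions, head
# counts, and the 256-class output are assumptions, not exact settings.
import torch
import torch.nn as nn

class VAPSketch(nn.Module):
    def __init__(self, dim: int = 256, n_heads: int = 4):
        super().__init__()
        # Placeholder for the pretrained CPC wave encoder (frozen in VAP).
        self.encoder = nn.Conv1d(1, dim, kernel_size=400, stride=320)
        enc_layer = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        self.self_attn = nn.TransformerEncoder(enc_layer, num_layers=1)
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.vap_head = nn.Linear(2 * dim, 256)  # joint future-activity states
        self.vad_head = nn.Linear(2 * dim, 2)    # current VAD, one per speaker

    def forward(self, stereo: torch.Tensor):
        # stereo: (batch, 2, samples) -- one channel per dialogue participant
        ch = [self.self_attn(self.encoder(stereo[:, i:i+1]).transpose(1, 2))
              for i in range(2)]
        # Cross-attend each speaker's stream to the other's.
        a, _ = self.cross_attn(ch[0], ch[1], ch[1])
        b, _ = self.cross_attn(ch[1], ch[0], ch[0])
        z = torch.cat([a, b], dim=-1)
        return self.vap_head(z), self.vad_head(z)  # projection + detection
```

Training both heads jointly is what the multitask setup amounts to: the detection task grounds the representation in current speech activity while the projection head learns to anticipate what comes next.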

The VAP System in Action

The study examines the VAP system's performance and showcases its potential for real-time applications. Crucially, the model runs effectively in real time on a CPU alone. By varying the input context length, the authors show that a balance can be struck between prediction accuracy and the processing speed needed for real-time interaction. Through an explanatory GUI, the VAP model indicates the likelihood of a speaker continuing their turn or yielding it to the other participant, including during moments of uncertainty where the next speaker is not immediately clear.
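
A streaming loop consistent with this evaluation might look like the sketch below, reusing the VAPSketch module from above. The chunk size, the 10-second context window, and capture_chunk are hypothetical stand-ins; the actual demo reads microphone audio and varies the context length.

```python
# Hedged sketch of a real-time inference loop: keep only the most recent
# CONTEXT_SEC of stereo audio (the input context length the authors vary)
# and re-run the model on CPU at each step. capture_chunk and the step
# size are hypothetical stand-ins for the actual demo's audio capture.
import torch

SAMPLE_RATE = 16_000   # CPC encoders typically expect 16 kHz audio
CONTEXT_SEC = 10       # sliding input context; shorter contexts run faster
STEP_SEC = 0.5         # how often a new prediction is produced
MAX_SAMPLES = int(CONTEXT_SEC * SAMPLE_RATE)

def capture_chunk(n_samples: int) -> torch.Tensor:
    """Hypothetical stand-in for reading n_samples of stereo mic audio."""
    return torch.zeros(2, n_samples)

model = VAPSketch().eval()     # from the architecture sketch above
buffer = torch.zeros(1, 2, 0)  # (batch, channels, samples)

with torch.no_grad():
    while True:                # one iteration per captured audio chunk
        chunk = capture_chunk(int(STEP_SEC * SAMPLE_RATE))
        buffer = torch.cat([buffer, chunk.unsqueeze(0)], dim=2)
        buffer = buffer[:, :, -MAX_SAMPLES:]  # truncate to the context window
        vap_logits, vad_logits = model(buffer)
        # The distribution over future-activity states at the newest frame
        # indicates how likely each speaker is to hold or take the turn.
        p_states = vap_logits[0, -1].softmax(dim=-1)
```

Because the whole window is reprocessed at every step, shortening CONTEXT_SEC directly trades prediction accuracy for latency, which is the balance the paper quantifies.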

Future Directions and Conclusion

The presented work shows that SDSs can be endowed with a level of turn-taking nuance unattainable with simple silence-timeout thresholds. With robust performance even at reduced input context lengths, the VAP model emerges as a promising component for real-time SDSs. Future work includes integrating the VAP system into complete dialogue systems and evaluating it through extensive dialogue experiments, a concrete step forward in the evolution of human-machine communication.
