
Real-time and Continuous Turn-taking Prediction Using Voice Activity Projection

(arXiv:2401.04868)

Published Jan 10, 2024 in cs.CL, cs.HC, cs.SD, and eess.AS

Abstract

A demonstration of a real-time and continuous turn-taking prediction system is presented. The system is based on a voice activity projection (VAP) model, which directly maps dialogue stereo audio to future voice activities. The VAP model includes contrastive predictive coding (CPC) and self-attention transformers, followed by a cross-attention transformer. We examine the effect of the input context audio length and demonstrate that the proposed system can operate in real-time with CPU settings, with minimal performance degradation.

Overview

  • The paper demonstrates a real-time, continuous turn-taking prediction system built on the voice activity projection (VAP) framework.

  • VAP uses contrastive predictive coding (CPC) and transformer networks to map audio to future voice activities.

  • The model predicts future speaking probabilities and detects current voice activity in real time via multitask learning.

  • Performance evaluations demonstrate that the VAP system can accurately predict speaking turns in real time on a CPU, even at moments where the next speaker is not signaled by clear cues.

  • The research paves the way for more nuanced spoken dialogue systems (SDSs) and calls for further integration and testing across dialogue scenarios.

Introduction to Turn-taking Prediction

Turn-taking is an integral part of communication: predicting when a speaker will start or stop speaking is key to facilitating smooth dialogue exchanges. Traditional spoken dialogue systems often rely on fixed silence thresholds to determine speaker transitions, which can lead to interruptions or delayed responses. This approach overlooks the complexity of human conversational patterns, where a pause is not necessarily the end of a turn.
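
To make the baseline concrete, the following is a minimal sketch of such a silence-timeout rule; the frame length, the 700 ms timeout, and the function name are illustrative assumptions, not values from the paper.

```python
# Minimal sketch of conventional silence-threshold endpointing.
# FRAME_MS, SILENCE_TIMEOUT_MS, and user_turn_ended are illustrative
# assumptions, not taken from the paper.

FRAME_MS = 20              # analysis frame length
SILENCE_TIMEOUT_MS = 700   # fixed timeout of the kind traditional SDSs use

def user_turn_ended(vad_frames: list[bool]) -> bool:
    """Declare the turn over once trailing silence exceeds the timeout.

    vad_frames: per-frame voice-activity flags (True = speech), newest last.
    """
    trailing_silence_ms = 0
    for is_speech in reversed(vad_frames):
        if is_speech:
            break
        trailing_silence_ms += FRAME_MS
    return trailing_silence_ms >= SILENCE_TIMEOUT_MS
```

A rule like this fires on any sufficiently long pause, whether or not the speaker intends to yield the turn, which is exactly the failure mode the VAP model targets.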

Voice Activity Projection Model

To address these challenges, the system builds on voice activity projection (VAP), a framework that directly maps the stereo audio of a dialogue to future voice activities. Each speaker's channel is encoded with contrastive predictive coding (CPC) and processed by self-attention transformers; a cross-attention transformer then combines the two streams. A multitask learning objective trains the model for both voice activity projection (predicting future speaking probabilities) and voice activity detection (identifying current speech).
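
The following is a minimal PyTorch sketch of this pipeline. Layer sizes, the single-layer transformers, and the 256-class output (the common VAP discretization of a 2-second projection window into four bins per speaker, giving 2^8 joint states) are assumptions drawn from the wider VAP literature rather than this paper's exact configuration, and a simple convolution stands in for the pretrained CPC encoder.

```python
# Hedged sketch of the VAP architecture described above. Dimensions, head
# counts, and the 256-class output are assumptions, not exact settings.
import torch
import torch.nn as nn

class VAPSketch(nn.Module):
    def __init__(self, dim: int = 256, n_heads: int = 4):
        super().__init__()
        # Placeholder for the pretrained CPC wave encoder (frozen in VAP).
        self.encoder = nn.Conv1d(1, dim, kernel_size=400, stride=320)
        enc_layer = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        self.self_attn = nn.TransformerEncoder(enc_layer, num_layers=1)
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.vap_head = nn.Linear(2 * dim, 256)  # joint future-activity states
        self.vad_head = nn.Linear(2 * dim, 2)    # current VAD, one per speaker

    def forward(self, stereo: torch.Tensor):
        # stereo: (batch, 2, samples) -- one channel per dialogue participant
        ch = [self.self_attn(self.encoder(stereo[:, i:i+1]).transpose(1, 2))
              for i in range(2)]
        # Cross-attend each speaker's stream to the other's.
        a, _ = self.cross_attn(ch[0], ch[1], ch[1])
        b, _ = self.cross_attn(ch[1], ch[0], ch[0])
        z = torch.cat([a, b], dim=-1)
        return self.vap_head(z), self.vad_head(z)  # projection + detection
```

Training both heads jointly is what the multitask setup amounts to: the detection task grounds the representation in current speech activity while the projection head learns to anticipate what comes next.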

The VAP System in Action

The study examines the VAP system's performance and showcases its potential for real-time applications. Crucially, the model runs effectively in real time on a CPU alone. By varying the input context length, the authors show that a balance can be struck between prediction accuracy and the processing speed needed for real-time interaction. Through an explanatory GUI, the VAP model indicates the likelihood of a speaker continuing their turn or yielding it to the other participant, including during moments of uncertainty where the next speaker is not immediately clear.
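
A streaming loop consistent with this evaluation might look like the sketch below, reusing the VAPSketch module from above. The chunk size, the 10-second context window, and capture_chunk are hypothetical stand-ins; the actual demo reads microphone audio and varies the context length.

```python
# Hedged sketch of a real-time inference loop: keep only the most recent
# CONTEXT_SEC of stereo audio (the input context length the authors vary)
# and re-run the model on CPU at each step. capture_chunk and the step
# size are hypothetical stand-ins for the actual demo's audio capture.
import torch

SAMPLE_RATE = 16_000   # CPC encoders typically expect 16 kHz audio
CONTEXT_SEC = 10       # sliding input context; shorter contexts run faster
STEP_SEC = 0.5         # how often a new prediction is produced
MAX_SAMPLES = int(CONTEXT_SEC * SAMPLE_RATE)

def capture_chunk(n_samples: int) -> torch.Tensor:
    """Hypothetical stand-in for reading n_samples of stereo mic audio."""
    return torch.zeros(2, n_samples)

model = VAPSketch().eval()     # from the architecture sketch above
buffer = torch.zeros(1, 2, 0)  # (batch, channels, samples)

with torch.no_grad():
    while True:                # one iteration per captured audio chunk
        chunk = capture_chunk(int(STEP_SEC * SAMPLE_RATE))
        buffer = torch.cat([buffer, chunk.unsqueeze(0)], dim=2)
        buffer = buffer[:, :, -MAX_SAMPLES:]  # truncate to the context window
        vap_logits, vad_logits = model(buffer)
        # The distribution over future-activity states at the newest frame
        # indicates how likely each speaker is to hold or take the turn.
        p_states = vap_logits[0, -1].softmax(dim=-1)
```

Because the whole window is reprocessed at every step, shortening CONTEXT_SEC directly trades prediction accuracy for latency, which is the balance the paper quantifies.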

Future Directions and Conclusion

The presented work shows that SDSs can be endowed with a level of turn-taking nuance unattainable with simple silence-timeout thresholds. With robust performance even at reduced input context lengths, the VAP model emerges as a promising component for real-time SDSs. Future work includes integrating the VAP system into complete dialogue systems and evaluating it through extensive dialogue experiments, a concrete step forward in the evolution of human-machine communication.
