- The paper introduces a full duplex Listening-while-Speaking Language Model (LSLM) that fuses a decoder-only TTS model with a streaming SSL encoder to enable simultaneous listening and speaking.
- The Middle Fusion strategy outperforms Early and Late Fusion, achieving a 4.05% word error rate on speech generation in clean conditions and a 98.00% F1 score on turn-taking detection in real-time interactive settings.
- The model handles interruptions robustly and adapts to previously unseen speakers, laying the groundwork for advanced real-time human-computer interaction.
LLM Can Listen While Speaking
Introduction
The paper "LLM Can Listen While Speaking" (2408.02622) introduces a novel approach to interactive speech LLMs (iSLM) by incorporating full duplex modeling (FDM), a crucial feature enabling simultaneous listening and speaking. Traditional speech LLMs are often constrained to turn-based interactions, limiting their applicability in real-time human-computer interaction (HCI) scenarios. This research addresses these limitations by proposing the Listening-while-Speaking LLM (LSLM), which integrates both a decoder-only TTS model for speech generation and a streaming SSL encoder for real-time audio processing.
Model Architecture
The proposed LSLM architecture comprises two main components: a token-based decoder-only Transformer for generating speaking tokens and a streaming SSL encoder for processing listening tokens. This configuration allows the model to engage in full duplex communication, enhancing its capability to detect and respond to turn-taking in real-time interactions.
Figure 1: Illustration of simplex, half duplex, and full duplex speech LLMs. (A): Simplex speech LLM with listening ability. (B): Simplex speech LLM with speaking ability. (C): Half duplex speech LLM with both listening and speaking abilities. (D): Full duplex speech LLM can listen while speaking.
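To make the two-channel design concrete, here is a minimal PyTorch sketch of the split between a speaking decoder and a streaming listener. All names, layer sizes, and the GRU stand-in for the streaming SSL encoder are illustrative assumptions rather than the paper's implementation; adding the listening embedding at the input, as done here, corresponds to the Early Fusion variant discussed next.

```python
import torch
import torch.nn as nn

class StreamingListener(nn.Module):
    """Stand-in for the streaming SSL encoder: any causal module that maps
    incoming audio features to one listening embedding per step fits here.
    A GRU is used purely for illustration."""
    def __init__(self, n_feats=80, d_model=256):
        super().__init__()
        self.rnn = nn.GRU(n_feats, d_model, batch_first=True)

    def forward(self, feats):                  # feats: (B, T, n_feats)
        out, _ = self.rnn(feats)               # (B, T, d_model); step t sees only <= t
        return out

class SpeakingDecoder(nn.Module):
    """Decoder-only Transformer over discrete speech tokens, with the
    vocabulary extended by one IRQ token that ends the turn when emitted."""
    def __init__(self, vocab=1024, d_model=256, heads=4, layers=4):
        super().__init__()
        self.irq_id = vocab                    # reserve the last id for IRQ
        self.embed = nn.Embedding(vocab + 1, d_model)
        block = nn.TransformerEncoderLayer(d_model, heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(block, layers)
        self.head = nn.Linear(d_model, vocab + 1)

    def forward(self, tokens, listen_emb):     # tokens: (B, T) ints
        x = self.embed(tokens) + listen_emb    # input-level (Early) fusion
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        return self.head(self.backbone(x, mask=mask))  # (B, T, vocab + 1) logits

listener, speaker = StreamingListener(), SpeakingDecoder()
feats = torch.randn(1, 20, 80)                 # 20 frames of listening input
tokens = torch.randint(0, 1024, (1, 20))       # speaking tokens generated so far
print(speaker(tokens, listener(feats)).shape)  # torch.Size([1, 20, 1025])
```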
To achieve effective integration of speaking and listening channels, the authors explore three fusion strategies: Early Fusion, Middle Fusion, and Late Fusion. Among these, Middle Fusion demonstrates superior performance, striking an optimal balance between speech generation and real-time interaction capabilities.
Figure 2: Proposed LSLM. The model contains a decoder-only Transformer to generate speaking tokens and a streaming SSL encoder to process listening tokens. An interruption token (IRQ) is added to allow the model to terminate early if a turn-taking occurs.
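Conceptually, the three strategies differ only in where the listening embedding enters the speaking channel: at the input embedding (Early), inside every Transformer layer (Middle), or just before the output head (Late). Below is a minimal sketch of the best-performing Middle Fusion variant, reusing the illustrative sizes from the sketch above; it is an interpretation of the idea, not the paper's code.

```python
import torch
import torch.nn as nn

class MiddleFusionBlock(nn.Module):
    """One decoder layer with Middle Fusion: the listening embedding is
    added to the hidden state at *every* layer. Early Fusion would add it
    once at the input; Late Fusion once before the output head."""
    def __init__(self, d_model=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                nn.Linear(4 * d_model, d_model))
        self.ln1, self.ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x, listen_emb, causal_mask):
        x = x + listen_emb                     # Middle Fusion injection point
        a, _ = self.attn(x, x, x, attn_mask=causal_mask, need_weights=False)
        x = self.ln1(x + a)
        return self.ln2(x + self.ff(x))

blocks = nn.ModuleList(MiddleFusionBlock() for _ in range(4))
head = nn.Linear(256, 1025)                    # speech vocab + IRQ token
x = torch.randn(1, 20, 256)                    # speaking-token embeddings
listen = torch.randn(1, 20, 256)               # time-aligned listening embeddings
mask = nn.Transformer.generate_square_subsequent_mask(20)
for blk in blocks:
    x = blk(x, listen, mask)
print(head(x).shape)                           # torch.Size([1, 20, 1025])
```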
Experimental Evaluation
The paper evaluates LSLM in two experimental settings: command-based FDM and voice-based FDM. In both, Middle Fusion outperforms the other strategies in TTS capability and interactive functionality. The model achieves a word error rate (WER) of 4.05% on speech generation under clean conditions, and its turn-taking detection remains robust, with a precision of 97.80%, a recall of 98.19%, and an F1 score of 98.00%.
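As a quick sanity check, F1 is the harmonic mean of precision and recall, so the three reported numbers should be mutually consistent. The snippet below verifies this; the TP/FP/FN counts are invented purely to illustrate how such detection metrics are computed.

```python
def prf1(tp, fp, fn):
    """Precision/recall/F1 for turn-taking detection, where a true positive
    is a detected interruption that matches a real one."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return p, r, 2 * p * r / (p + r)

print(prf1(tp=978, fp=22, fn=18))    # illustrative counts, not the paper's

# Consistency of the reported numbers: harmonic mean of 97.80% and 98.19%.
p, r = 0.9780, 0.9819
print(f"F1 = {2 * p * r / (p + r):.3f}")  # 0.980, matching the reported 98.00%
```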
Moreover, the model remains robust under noisy conditions and stays sensitive to previously unseen speakers, demonstrating its suitability for real-world use. For instance, the voice-based FDM tests highlight its ability to handle diverse interruption commands and adapt to new speakers without significant performance degradation.
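Operationally, interruption handling reduces to a check inside the decoding loop: stop emitting speech tokens as soon as IRQ becomes the most probable next token. Here is a greedy-decoding sketch under that assumption, reusing the hypothetical `speaker` and `listener` modules from the earlier snippets.

```python
import torch

@torch.no_grad()
def speak_until_interrupted(speaker, listener, feats, bos_id=0, max_steps=200):
    """Greedy generation that terminates early when the model emits IRQ.
    `feats` must hold at least max_steps + 1 frames of listening input."""
    tokens = torch.tensor([[bos_id]])
    for _ in range(max_steps):
        listen_emb = listener(feats[:, : tokens.size(1)])  # audio heard so far
        logits = speaker(tokens, listen_emb)
        next_tok = logits[0, -1].argmax().item()
        if next_tok == speaker.irq_id:        # turn-taking detected: stop speaking
            return tokens[0, 1:], True
        tokens = torch.cat([tokens, torch.tensor([[next_tok]])], dim=1)
    return tokens[0, 1:], False               # finished the turn uninterrupted
```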
Visualization and Analysis
To further examine the turn-taking mechanism, the researchers visualize the probability assigned to the interruption token (IRQ) over time. The analysis shows that the IRQ probability rises sharply when a real-time turn-taking signal is detected, enabling effective interruption management.
Figure 3: Probability of the IRQ token (being interrupted) over time. Probabilities are plotted on a logarithmic scale for clear visualization.
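A similar trace is straightforward to reproduce on synthetic data: take the per-step softmax probability of the IRQ id from the decoder logits and plot it on a log scale. Everything below (ids, shapes, the dashed 0.5 threshold) is illustrative and not taken from the paper.

```python
import torch
import matplotlib.pyplot as plt

def irq_trace(logits, irq_id):
    """Probability assigned to the IRQ token at each generation step."""
    return torch.softmax(logits, dim=-1)[..., irq_id]   # (B, T)

# Synthetic logits: 1 utterance, 50 steps, 1025-entry vocab (IRQ is id 1024).
torch.manual_seed(0)
logits = torch.randn(1, 50, 1025)
logits[0, 35:, 1024] += 8.0           # pretend an interruption arrives at step 35
probs = irq_trace(logits, irq_id=1024)[0]

plt.semilogy(probs.numpy())           # log scale, as in Figure 3
plt.axhline(0.5, linestyle="--", color="gray")   # one possible stop threshold
plt.xlabel("generation step")
plt.ylabel("p(IRQ)")
plt.show()
```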
Future Directions
This research paves the way for future developments in duplex communication within speech LLMs. Potential avenues include integrating audio-visual co-guidance to enhance turn-taking, developing full duplex speech-in, speech-out dialogue systems, and refining models to handle a wider range of acoustic conditions and speaker variations.
Conclusion
The introduction of the LSLM marks a significant advance in interactive speech LLMs by enabling full duplex communication. This capability supports more natural and flexible HCI, with implications for applications ranging from virtual assistants to more sophisticated conversational agents. The research shows that the Middle Fusion strategy markedly improves both speech generation and interactive capability, establishing a foundation for future innovations in interactive AI systems.