
Language Model Can Listen While Speaking

(arXiv:2408.02622)
Published Aug 5, 2024 in cs.CL, cs.AI, cs.HC, cs.SD, and eess.AS

Abstract

Dialogue serves as the most natural manner of human-computer interaction (HCI). Recent advancements in speech language models (SLM) have significantly enhanced speech-based conversational AI. However, these models are limited to turn-based conversation, lacking the ability to interact with humans in real-time spoken scenarios, for example, being interrupted when the generated content is not satisfactory. To address these limitations, we explore full duplex modeling (FDM) in interactive speech language models (iSLM), focusing on enhancing real-time interaction and, more explicitly, exploring the quintessential ability of interruption. We introduce a novel model design, namely listening-while-speaking language model (LSLM), an end-to-end system equipped with both listening and speaking channels. Our LSLM employs a token-based decoder-only TTS for speech generation and a streaming self-supervised learning (SSL) encoder for real-time audio input. LSLM fuses both channels for autoregressive generation and detects turn-taking in real time. Three fusion strategies -- early fusion, middle fusion, and late fusion -- are explored, with middle fusion achieving an optimal balance between speech generation and real-time interaction. Two experimental settings, command-based FDM and voice-based FDM, demonstrate LSLM's robustness to noise and sensitivity to diverse instructions. Our results highlight LSLM's capability to achieve duplex communication with minimal impact on existing systems. This study aims to advance the development of interactive speech dialogue systems, enhancing their applicability in real-world contexts.

Figure: LSLM design with a decoder-only Transformer and a streaming SSL encoder, plus an interruption token.

Overview

  • The paper proposes a novel approach to real-time human-computer interaction in speech language models (SLMs) by introducing Full Duplex Modeling (FDM) for interactive speech language models (iSLMs).

  • An innovative Listening-while-Speaking Language Model (LSLM) is developed with three fusion strategies to enable simultaneous listening and speaking, with the middle fusion strategy proving most effective.

  • Experimental evaluations show that the LSLM performs robustly in both command-based and voice-based FDM settings, achieving high precision, recall, and F1 scores even in noisy conditions.

Language Model Can Listen While Speaking

The paper "Language Model Can Listen While Speaking" by Ma et al. primarily addresses the limitations of turn-based conversational AI models by proposing a novel approach to real-time human-computer interaction (HCI) in speech language models (SLMs). The authors introduce the concept of Full Duplex Modeling (FDM) within interactive speech language models (iSLMs), which enables simultaneous listening and speaking during conversations. This paper presents the design and evaluation of an innovative model called the Listening-while-Speaking Language Model (LSLM), which demonstrates this capability.

Introduction

The paper begins by highlighting the naturalness of dialogue as a mode of HCI and reviews recent advancements in speech language models built on LLMs. Existing models, however, cannot handle real-time spoken interaction because of their turn-based design. To overcome this limitation, the authors propose FDM for iSLMs, equipping the models to handle interruptions and thereby improving real-time interaction.

Model Design

The LSLM is introduced as an end-to-end system incorporating both listening and speaking channels. For speech generation, the model uses a token-based decoder-only Text-to-Speech (TTS) system. For real-time audio input, it employs a streaming self-supervised learning (SSL) encoder. The fusion of these channels enables autoregressive generation and real-time turn-taking detection.
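
To make the dual-channel design concrete, below is a minimal PyTorch-style sketch of such a generation loop; the module interfaces (`ssl_encoder`, `tts_decoder`) and the `IRQ` token id are assumptions for illustration, not the authors' implementation.

```python
import torch

IRQ = 1024  # assumed id of an interruption token appended to the speech-token vocabulary

@torch.no_grad()
def listen_while_speaking(tts_decoder, ssl_encoder, text_ids, mic_stream, max_steps=2000):
    """Generate speech tokens autoregressively while monitoring a listening channel.

    tts_decoder: decoder-only Transformer over discrete speech tokens (speaking channel)
    ssl_encoder: streaming SSL encoder mapping each incoming audio chunk to features
    mic_stream:  iterator yielding raw audio chunks in real time (listening channel)
    """
    spoken = []  # speech tokens generated so far
    for _, chunk in zip(range(max_steps), mic_stream):
        listen_feats = ssl_encoder(chunk)                     # real-time audio input
        logits = tts_decoder(text_ids, spoken, listen_feats)  # channels fused inside
        next_tok = logits[-1].argmax().item()
        if next_tok == IRQ:      # turn-taking detected: the model stops speaking
            break
        spoken.append(next_tok)  # otherwise keep speaking
    return spoken
```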

Three fusion strategies are explored:

  1. Early Fusion: Integrates the listening and speaking channels at the input embeddings before autoregressive prediction.
  2. Middle Fusion: Merges the channels by adding the listening channel's representation to the input of every Transformer block.
  3. Late Fusion: Combines the channels at the output logits before the softmax operation.

Among these strategies, middle fusion proves the most effective at balancing speech generation quality with real-time interaction; a sketch of this variant follows.
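
As an illustration of the middle-fusion variant, here is a hedged PyTorch sketch in which projected listening features are added to the hidden states entering every Transformer block; the layer types, dimensions, and vocabulary size (speech tokens plus one interruption token) are assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class MiddleFusionDecoder(nn.Module):
    """Decoder-only Transformer with the listening channel injected at each block."""

    def __init__(self, vocab_size=1025, d_model=512, n_layers=12, n_heads=8, listen_dim=768):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)   # speaking channel
        self.listen_proj = nn.Linear(listen_dim, d_model)  # SSL features -> model width
        self.blocks = nn.ModuleList(
            [nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
             for _ in range(n_layers)]
        )
        self.head = nn.Linear(d_model, vocab_size)         # logits incl. the IRQ token

    def forward(self, speak_tokens, listen_feats):
        # speak_tokens: (B, T) speech-token ids; listen_feats: (B, T, listen_dim)
        h = self.tok_emb(speak_tokens)
        fused = self.listen_proj(listen_feats)
        T = speak_tokens.size(1)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool,
                                       device=speak_tokens.device), diagonal=1)
        for block in self.blocks:
            h = block(h + fused, src_mask=causal)  # middle fusion at every block
        return self.head(h)
```

Under the same sketch, early fusion would add `fused` to the token embeddings once, before the first block, while late fusion would instead combine the two channels at the output logits just before the softmax.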

Experimental Evaluation

The LSLM's performance is evaluated under two experimental settings: command-based FDM and voice-based FDM. The experiments demonstrate that the LSLM is robust to noise and can respond to a variety of instructions from unseen speakers. The following key results were reported:

  • The command-based FDM setting showed that the middle fusion approach achieved a Word Error Rate (WER) of 4.05% in clean conditions and 4.51% in noisy conditions, indicating minimal impact on the speech generation capability.
  • For interactive capability, the LSLM with middle fusion achieved high precision (97.80%), recall (98.19%), and an F1 score of 98.00% under clean conditions, and remained competitive under noisy conditions (the metric arithmetic is illustrated after this list).
  • The voice-based FDM setting, which involved diverse interruption commands and unseen speakers, further challenged the model. Despite this, the LSLM maintained reasonable performance with a WER of 5.33% in clean conditions and 8.50% in noisy conditions.
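
For intuition about these metrics, the helpers below show one generic way to compute WER and the turn-taking detection scores; the counting scheme is an illustrative assumption, not the paper's evaluation code.

```python
def wer(ref, hyp):
    """Word error rate: word-level edit distance divided by reference length."""
    d = list(range(len(hyp) + 1))  # DP row for the empty reference prefix
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            # min over deletion, insertion, and substitution/match
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[-1] / len(ref)

def turn_taking_scores(predicted, actual):
    """predicted/actual: sets of time steps where an interruption was flagged/occurred."""
    tp = len(predicted & actual)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(actual) if actual else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Sanity check against the reported numbers:
# precision 0.9780, recall 0.9819 -> F1 = 2*0.9780*0.9819/(0.9780+0.9819) ≈ 0.9800
```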

Implications and Future Directions

The introduction of FDM in iSLMs marks a significant step toward enhancing real-time spoken interaction in conversational AI. The ability to handle interruptions in real-time not only improves user experience but also broadens the applicability of speech dialogue systems in dynamic and noisy environments.

Future research directions could include integrating audiovisual co-guidance mechanisms to improve turn-taking, enhancing speaker-following capabilities to identify interrupting speakers more accurately, and expanding the duplex modeling to more complex dialogue scenarios involving multiple speakers and more intricate interaction patterns.

Conclusion

The paper by Ma et al. presents a comprehensive approach to overcoming the limitations of turn-based SLMs by introducing full duplex modeling. The proposed LSLM demonstrates the feasibility and effectiveness of simultaneous listening and speaking in speech language models. By achieving significant results in real-time interaction metrics, this study lays the groundwork for the development of more advanced and user-friendly interactive speech dialogue systems.
