Language Model Can Listen While Speaking (2408.02622v1)

Published 5 Aug 2024 in cs.CL, cs.AI, cs.HC, cs.SD, and eess.AS

Abstract: Dialogue serves as the most natural manner of human-computer interaction (HCI). Recent advancements in speech LLMs (SLM) have significantly enhanced speech-based conversational AI. However, these models are limited to turn-based conversation, lacking the ability to interact with humans in real-time spoken scenarios, for example, being interrupted when the generated content is not satisfactory. To address these limitations, we explore full duplex modeling (FDM) in interactive speech LLMs (iSLM), focusing on enhancing real-time interaction and, more explicitly, exploring the quintessential ability of interruption. We introduce a novel model design, namely listening-while-speaking LLM (LSLM), an end-to-end system equipped with both listening and speaking channels. Our LSLM employs a token-based decoder-only TTS for speech generation and a streaming self-supervised learning (SSL) encoder for real-time audio input. LSLM fuses both channels for autoregressive generation and detects turn-taking in real time. Three fusion strategies -- early fusion, middle fusion, and late fusion -- are explored, with middle fusion achieving an optimal balance between speech generation and real-time interaction. Two experimental settings, command-based FDM and voice-based FDM, demonstrate LSLM's robustness to noise and sensitivity to diverse instructions. Our results highlight LSLM's capability to achieve duplex communication with minimal impact on existing systems. This study aims to advance the development of interactive speech dialogue systems, enhancing their applicability in real-world contexts.

Summary

  • The paper introduces a full duplex Listening-while-Speaking Language Model (LSLM) that fuses a decoder-only TTS model with a streaming SSL encoder to enable simultaneous listening and speaking.
  • The Middle Fusion strategy outperforms Early and Late Fusion, achieving a 4.05% word error rate in clean conditions and a 98.00% F1 score for turn-taking detection in real-time interactive settings.
  • The model demonstrates robust performance in handling interruptions and adapting to new speakers, setting the groundwork for advanced real-time human-computer interaction.

Language Model Can Listen While Speaking

Introduction

The paper "LLM Can Listen While Speaking" (2408.02622) introduces a novel approach to interactive speech LLMs (iSLM) by incorporating full duplex modeling (FDM), a crucial feature enabling simultaneous listening and speaking. Traditional speech LLMs are often constrained to turn-based interactions, limiting their applicability in real-time human-computer interaction (HCI) scenarios. This research addresses these limitations by proposing the Listening-while-Speaking LLM (LSLM), which integrates both a decoder-only TTS model for speech generation and a streaming SSL encoder for real-time audio processing.

Model Architecture

The proposed LSLM architecture comprises two main components: a token-based decoder-only Transformer for generating speaking tokens and a streaming SSL encoder for processing listening tokens. This configuration allows the model to engage in full duplex communication, enhancing its capability to detect and respond to turn-taking in real-time interactions (Figure 1).

Figure 1: Illustration of simplex, half duplex, and full duplex speech LLMs. (A): Simplex speech LLM with listening ability. (B): Simplex speech LLM with speaking ability. (C): Half duplex speech LLM with both listening and speaking abilities. (D): Full duplex speech LLM can listen while speaking.
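
For concreteness, the following minimal sketch illustrates these two channels. It is not the authors' implementation: the module names, dimensions, and the additive injection of the listening embeddings into each Transformer block are assumptions made for illustration.

```python
# Illustrative stand-ins for the two LSLM channels (not the released code).
import torch
import torch.nn as nn

class StreamingListener(nn.Module):
    """Stand-in for the streaming SSL encoder (listening channel)."""
    def __init__(self, feat_dim: int = 80, d_model: int = 512):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)  # placeholder for the real SSL stack

    def forward(self, audio_frames: torch.Tensor) -> torch.Tensor:
        # audio_frames: (batch, time, feat_dim) -> listening embeddings (batch, time, d_model)
        return self.proj(audio_frames)

class SpeakingDecoder(nn.Module):
    """Stand-in for the token-based decoder-only TTS Transformer (speaking channel)."""
    def __init__(self, vocab_size: int = 1024, d_model: int = 512,
                 n_layers: int = 6, n_heads: int = 8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            for _ in range(n_layers)
        )
        self.head = nn.Linear(d_model, vocab_size)  # logits over speaking tokens (incl. IRQ)

    def forward(self, speak_tokens: torch.Tensor, listen_emb: torch.Tensor) -> torch.Tensor:
        # speak_tokens: (batch, seq); listen_emb: (batch, seq, d_model), time-aligned per step
        h = self.embed(speak_tokens)
        seq = h.size(1)
        # Additive causal mask: positions may only attend to themselves and the past.
        causal = torch.triu(torch.full((seq, seq), float("-inf")), diagonal=1)
        for block in self.blocks:
            h = h + listen_emb              # inject the listening channel at every block
            h = block(h, src_mask=causal)   # causal self-attention over the speaking stream
        return self.head(h)
```

A forward pass on aligned dummy inputs, e.g. `SpeakingDecoder()(torch.randint(0, 1024, (1, 10)), StreamingListener()(torch.randn(1, 10, 80)))`, yields per-step logits over the speaking vocabulary. The point at which the listening embeddings are injected corresponds to the fusion strategies compared next.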

To achieve effective integration of speaking and listening channels, the authors explore three fusion strategies: Early Fusion, Middle Fusion, and Late Fusion. Among these, Middle Fusion demonstrates superior performance, striking an optimal balance between speech generation and real-time interaction capabilities (Figure 2).

Figure 2: Proposed LSLM. The model contains a decoder-only Transformer to generate speaking tokens and a streaming SSL encoder to process listening tokens. An interruption token (IRQ) is added so that the model can terminate its output early when turn-taking occurs.
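
Building on the stand-ins above, the sketch below shows one way the autoregressive loop might consume one listening frame per generated speaking token and terminate as soon as the IRQ token becomes the greedy prediction. The token ids and the greedy stopping rule are illustrative assumptions, not details taken from the paper.

```python
# Illustrative decoding loop with IRQ-based early termination (assumptions only).
import torch

IRQ_ID = 0   # hypothetical id of the interruption (IRQ) token
BOS_ID = 1   # hypothetical id of the beginning-of-sequence token

@torch.no_grad()
def generate_while_listening(decoder, listener, audio_stream, max_steps=500):
    """Greedy autoregressive TTS decoding that listens at every step.

    decoder      : SpeakingDecoder-like module (speak tokens + listen embs -> logits)
    listener     : StreamingListener-like module (audio frames -> listen embs)
    audio_stream : iterator yielding one (1, 1, feat_dim) audio frame per decoding step
    """
    speak_tokens = torch.tensor([[BOS_ID]])            # (batch=1, seq=1)
    listen_embs = []                                   # listening channel, grown per step

    for frame in audio_stream:
        listen_embs.append(listener(frame))            # (1, 1, d_model) for this step
        listen_seq = torch.cat(listen_embs, dim=1)     # (1, t, d_model), aligned with speak_tokens
        logits = decoder(speak_tokens, listen_seq)     # (1, t, vocab)
        next_token = logits[:, -1].argmax(dim=-1)      # greedy choice for the next step

        if next_token.item() == IRQ_ID:                # turn-taking detected: stop speaking
            break
        speak_tokens = torch.cat([speak_tokens, next_token.unsqueeze(0)], dim=1)
        if speak_tokens.size(1) >= max_steps:
            break

    return speak_tokens                                # speaking tokens generated so far
```

In a deployed system the generated speaking tokens would be converted back to audio (e.g. by a vocoder) while the listening channel keeps running; the strict per-step alignment here is kept only to make the sketch self-contained.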

Experimental Evaluation

The paper evaluates LSLM's performance in two experimental settings: command-based FDM and voice-based FDM. The results reveal that Middle Fusion outperforms other strategies in both TTS capability and interactive functionality. The model achieves a word error rate (WER) of 4.05% in clean conditions and maintains robust interaction performance with a Precision of 97.80%, Recall of 98.19%, and an F1 score of 98.00%.
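
These three interaction numbers are mutually consistent, since F1 is the harmonic mean of precision and recall:

```python
# Sanity check: F1 is the harmonic mean of precision and recall.
precision, recall = 0.9780, 0.9819
f1 = 2 * precision * recall / (precision + recall)
print(f"{f1:.4f}")  # 0.9799, consistent with the reported 98.00% F1 given rounded inputs
```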

Moreover, the model exhibits robustness to background noise and generalizes to previously unseen speakers, demonstrating its effectiveness in real-world scenarios. For instance, the voice-based FDM tests highlight the model's ability to handle diverse interruption commands and adapt to new speakers without significant performance degradation.

Visualization and Analysis

To further understand the model's turn-taking mechanism, the researchers visualize the probability distribution of the interruption token (IRQ) over time. This analysis reveals that IRQ probabilities increase sharply when real-time turn-taking signals are detected, allowing for effective interruption management (Figure 3).

Figure 3: Illustration of the probability distribution of the IRQ token (being interrupted) over time. Probabilities are plotted on a logarithmic scale for clearer visualization.
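
The same kind of plot can be reproduced schematically, as in the sketch below; the probability values are synthetic and serve only to illustrate the log-scale rise at the interruption onset.

```python
# Schematic reproduction of a Figure 3 style plot with synthetic values.
import numpy as np
import matplotlib.pyplot as plt

steps = np.arange(200)                    # decoding steps (a proxy for time)
irq_prob = np.full(200, 1e-4)             # near-zero IRQ probability while speaking normally
# Sharp rise after a (synthetic) interruption at step 120, capped below 1.
irq_prob[120:] = np.minimum(1e-4 * 10.0 ** ((steps[120:] - 120) / 10.0), 0.99)

plt.plot(steps, irq_prob)
plt.yscale("log")                         # logarithmic scale, as in the paper's visualization
plt.axvline(120, linestyle="--", label="interruption onset")
plt.xlabel("decoding step")
plt.ylabel("P(IRQ)")
plt.legend()
plt.show()
```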

Future Directions

This research paves the way for future developments in duplex communication within speech LLMs. Potential avenues for exploration include integrating audio-visual co-guidance to enhance turn-taking, developing full duplex speech-in, speech-out dialogue systems, and refining models to accommodate a wider range of environmental conditions and speaker variations.

Conclusion

The introduction of the LSLM marks a significant advancement in interactive speech LLMs by enabling full duplex communication. This capability facilitates more natural and flexible HCI, with implications for a wide range of applications, from virtual assistants to more sophisticated conversational agents. The research highlights that the middle fusion strategy significantly enhances both speech generation and interactive capabilities, setting a foundation for future innovations in interactive AI systems.
