Listening while Speaking: Speech Chain by Deep Learning

Published 16 Jul 2017 in cs.CL, cs.LG, and cs.SD | (1707.04879v1)

Abstract: Despite the close relationship between speech perception and production, research in automatic speech recognition (ASR) and text-to-speech synthesis (TTS) has progressed more or less independently without exerting much mutual influence on each other. In human communication, on the other hand, a closed-loop speech chain mechanism with auditory feedback from the speaker's mouth to her ear is crucial. In this paper, we take a step further and develop a closed-loop speech chain model based on deep learning. The sequence-to-sequence model in close-loop architecture allows us to train our model on the concatenation of both labeled and unlabeled data. While ASR transcribes the unlabeled speech features, TTS attempts to reconstruct the original speech waveform based on the text from ASR. In the opposite direction, ASR also attempts to reconstruct the original text transcription given the synthesized speech. To the best of our knowledge, this is the first deep learning model that integrates human speech perception and production behaviors. Our experimental results show that the proposed approach significantly improved the performance more than separate systems that were only trained with labeled data.




Summary

The paper presents a deep learning Speech Chain model that integrates ASR and TTS in a closed-loop system to enhance performance.
It utilizes both labeled and unlabeled data to significantly reduce error rates and improve speech processing accuracy.
Experimental results demonstrate a 4.6% reduction in character error rates for single-speaker data and robust gains in multi-speaker settings.

Listening while Speaking: A Speech Chain by Deep Learning

This paper introduces a deep learning approach called the "Speech Chain" model. It aims to exploit the close relationship between speech perception and production, whose automated counterparts, automatic speech recognition (ASR) and text-to-speech synthesis (TTS), have developed largely independently. The core idea is a closed-loop architecture that mimics human speech communication, including auditory feedback. This loop allows the model to be trained on both labeled and unlabeled data: given unlabeled speech, ASR transcribes the speech features and TTS attempts to reconstruct the original speech waveform from that transcription; in the opposite direction, TTS synthesizes speech from text and ASR attempts to reconstruct the original text from the synthesized speech. This mutual training of ASR and TTS is proposed as a way to improve performance without requiring large labeled datasets.
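The closed loop can be illustrated with a toy sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the names `speech_chain_loss`, `toy_asr`, and `toy_tts`, the weights `alpha` and `beta`, and the crude loss functions are hypothetical, and the "models" are simple lookups, not sequence-to-sequence networks. The sketch only shows which signal is compared with which in each direction of the chain.

```python
from typing import Callable, List, Sequence, Tuple

Speech = List[float]  # stand-in for a sequence of acoustic features
Text = str


def text_loss(pred: Text, ref: Text) -> float:
    """Fraction of mismatched character positions (toy stand-in for a text loss)."""
    n = max(len(pred), len(ref))
    if n == 0:
        return 0.0
    mismatches = sum(1 for a, b in zip(pred, ref) if a != b)
    return (mismatches + abs(len(pred) - len(ref))) / n


def speech_loss(pred: Speech, ref: Speech) -> float:
    """Mean squared error after zero-padding to equal length (toy stand-in)."""
    n = max(len(pred), len(ref))
    if n == 0:
        return 0.0
    p = list(pred) + [0.0] * (n - len(pred))
    r = list(ref) + [0.0] * (n - len(ref))
    return sum((a - b) ** 2 for a, b in zip(p, r)) / n


def speech_chain_loss(
    asr: Callable[[Speech], Text],
    tts: Callable[[Text], Speech],
    paired: Sequence[Tuple[Speech, Text]],
    speech_only: Sequence[Speech],
    text_only: Sequence[Text],
    alpha: float = 0.5,  # hypothetical weight on the supervised terms
    beta: float = 0.5,   # hypothetical weight on the unsupervised terms
) -> float:
    # Labeled data: each model is compared with its ground-truth target.
    supervised = sum(
        text_loss(asr(x), y) + speech_loss(tts(y), x) for x, y in paired
    )
    # Unlabeled speech: ASR transcribes, TTS tries to reconstruct the speech.
    unsupervised = sum(speech_loss(tts(asr(x)), x) for x in speech_only)
    # Unlabeled text: TTS synthesizes, ASR tries to reconstruct the text.
    unsupervised += sum(text_loss(asr(tts(y)), y) for y in text_only)
    return alpha * supervised + beta * unsupervised


# Toy "models": each character maps to one feature value and back.
def toy_tts(text: Text) -> Speech:
    return [float(ord(c)) for c in text]


def toy_asr(speech: Speech) -> Text:
    return "".join(chr(round(f)) for f in speech)


loss = speech_chain_loss(
    toy_asr,
    toy_tts,
    paired=[(toy_tts("hi"), "hi")],
    speech_only=[toy_tts("a")],
    text_only=["ok"],
)
```

In a real system the two reconstruction terms would drive gradient updates of the TTS and ASR networks respectively; here they are only summed so that the data flow is visible.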

The paper reports numerical results supporting the closed-loop mechanism. Experiments cover single-speaker synthetic datasets and multi-speaker natural speech corpora. In both settings, the Speech Chain model significantly outperforms separate ASR and TTS systems trained only on labeled data. In the single-speaker experiments, the character error rate decreased by approximately 4.6%, which indicates that the model can use unlabeled data effectively. In the multi-speaker setting, the model likewise improved both ASR and TTS performance, suggesting that it is robust across diverse speaking styles and conditions.
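For readers unfamiliar with the metric, character error rate (CER) is conventionally the character-level edit distance between a hypothesis and its reference, divided by the reference length. The sketch below is a generic implementation under that standard definition; the function names are hypothetical, and it is not the paper's evaluation script. Note that this summary does not say whether the 4.6% figure is an absolute or a relative change.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference character
                            curr[j - 1] + 1,      # insert a hypothesis character
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate of hypothesis `hyp` against reference `ref`."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


example = cer("chain", "chan")  # one deleted character out of five
```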

The implications of this work are both practical and theoretical. Practically, the Speech Chain model promises reduced dependency on labeled training data, which could make speech processing systems cheaper to build and easier to scale. Theoretically, it points toward integrated models that more closely emulate human speech perception and production, and it motivates further research into closed-loop systems in AI. Future research could explore other languages, spontaneous speech, and emotional speech to test the model's versatility and adaptability.

In conclusion, this research is a notable step toward integrating ASR and TTS using deep learning. The Speech Chain architecture both improves system accuracy and reduces reliance on labeled data, moving toward more adaptive spoken language systems. Researchers and practitioners in AI and machine learning may be able to extend this work to other domains where perception and production modalities must work together.
