
Very Deep Multilingual Convolutional Neural Networks for LVCSR (1509.08967v2)

Published 29 Sep 2015 in cs.CL and cs.NE

Abstract: Convolutional neural networks (CNNs) are a standard component of many current state-of-the-art Large Vocabulary Continuous Speech Recognition (LVCSR) systems. However, CNNs in LVCSR have not kept pace with recent advances in other domains where deeper neural networks provide superior performance. In this paper we propose a number of architectural advances in CNNs for LVCSR. First, we introduce a very deep convolutional network architecture with up to 14 weight layers. There are multiple convolutional layers before each pooling layer, with small 3x3 kernels, inspired by the VGG Imagenet 2014 architecture. Then, we introduce multilingual CNNs with multiple untied layers. Finally, we introduce multi-scale input features aimed at exploiting more context at negligible computational cost. We evaluate the improvements first on a Babel task for low resource speech recognition, obtaining an absolute 5.77% WER improvement over the baseline PLP DNN by training our CNN on the combined data of six different languages. We then evaluate the very deep CNNs on the Hub5'00 benchmark (using the 262 hours of SWB-1 training data) achieving a word error rate of 11.8% after cross-entropy training, a 1.4% WER improvement (10.6% relative) over the best published CNN result so far.

Authors (4)
  1. Tom Sercu (17 papers)
  2. Christian Puhrsch (9 papers)
  3. Brian Kingsbury (54 papers)
  4. Yann LeCun (173 papers)
Citations (223)

Summary

  • The paper introduces a very deep CNN architecture, inspired by VGG-19, featuring up to 14 layers with small 3x3 kernels to enhance expressiveness over traditional models.
  • It presents a multilingual training strategy that shares initial network layers across languages to boost performance in low-resource speech recognition tasks.
  • The study reports substantial improvements in Word Error Rate, achieving a 5.77% absolute gain on Babel and a 1.4% absolute gain (10.6% relative) on the Hub5'00 benchmark.

Very Deep Multilingual Convolutional Neural Networks for LVCSR

The paper investigates architectural advances in Convolutional Neural Networks (CNNs) to improve performance in Large Vocabulary Continuous Speech Recognition (LVCSR). It builds on the established role of CNNs in state-of-the-art LVCSR systems and addresses the lag in their architecture relative to other fields, where much deeper networks have achieved superior results. The research introduces a very deep convolutional design with multiple convolutional layers before each pooling layer, inspired by the VGG model, and examines multilingual training and multi-scale input techniques.

Key Contributions

  1. Deep CNN Architectures: One of the principal contributions is the adaptation of very deep CNN architectures with up to 14 weight layers, inspired by the VGG Net model. These architectures stack multiple convolutional layers with small 3x3 kernels before each pooling layer, which increases expressiveness while keeping the parameter count low. This contrasts with traditional CNN structures in speech recognition, which typically consisted of only two convolutional layers with larger kernels.
  2. Multilingual Training: The authors present a strategy for training CNNs across multiple languages, leveraging their data simultaneously, a practice not previously extended to CNNs. This approach facilitates improved performance in low-resource language tasks by sharing initial network layers across languages, while untying the upper layers for language-specific tasks.
  3. Multi-Scale Input Features: By using multi-scale input features, the research achieves an increased context with minimal computational overhead. This leverages feature maps downsampled from larger context windows to allow the network to consider wider input ranges.
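The parameter-versus-receptive-field trade-off behind the 3x3 design (point 1 above) can be illustrated with a short, self-contained sketch. The channel count here is hypothetical, not the paper's exact configuration; the arithmetic is the standard VGG argument that stacked 3x3 convolutions cover the same input region as one larger kernel with fewer weights and more non-linearities.

```python
def receptive_field(num_3x3_layers: int) -> int:
    """Receptive field (per side) of a stack of stride-1 3x3 convs."""
    return 2 * num_3x3_layers + 1

def conv_params(kernel: int, in_ch: int, out_ch: int) -> int:
    """Weight count of a single 2D convolution (biases ignored)."""
    return kernel * kernel * in_ch * out_ch

ch = 64  # hypothetical channel count

# Two stacked 3x3 convs see the same 5x5 input region as one 5x5 conv,
# but with fewer parameters and an extra non-linearity in between.
stacked_3x3 = 2 * conv_params(3, ch, ch)  # 2 * 9 * 64 * 64
single_5x5 = conv_params(5, ch, ch)       # 25 * 64 * 64

print(receptive_field(2), stacked_3x3, single_5x5)
```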

Experimental Results

Evaluation demonstrates substantial improvements on both the Babel and Switchboard datasets. The key numerical results:

  • On a Babel task, the proposed CNN architecture achieved a 5.77% absolute improvement in Word Error Rate (WER) over the baseline system by training on data from six different languages.
  • The very deep network reached 11.8% WER on the Hub5'00 benchmark after cross-entropy training, a 1.4% absolute WER improvement (10.6% relative) over the best previously published CNN result.
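The multilingual setup behind the Babel result, lower layers shared across all languages and upper layers untied per language, can be sketched minimally as follows. Layer names, counts, and language labels here are illustrative placeholders, not the paper's exact topology.

```python
class MultilingualNet:
    """Toy sketch: shared lower layers, per-language untied upper layers."""

    def __init__(self, languages, num_shared=10, num_untied=4):
        # One set of lower-layer parameters, shared by every language.
        self.shared = [f"conv{i}" for i in range(num_shared)]
        # Each language gets its own copy of the upper (untied) layers.
        self.untied = {
            lang: [f"{lang}_fc{i}" for i in range(num_untied)]
            for lang in languages
        }

    def layers_for(self, lang):
        """Full layer stack used when training/decoding language `lang`."""
        return self.shared + self.untied[lang]

langs = ["lang_a", "lang_b", "lang_c"]  # placeholders for the six languages
model = MultilingualNet(langs)

# Gradients from every language update the same shared stack,
# while the untied top layers stay language-specific.
print(model.layers_for("lang_a"))
print(model.layers_for("lang_b"))
```

The design intuition is that low-level acoustic features transfer across languages, so pooling the combined training data in the shared layers effectively enlarges the training set for each low-resource language.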

Implications and Future Prospects

The research underscores the potential of deeper network architectures in LVCSR and advocates for leveraging multilingual datasets for model training, particularly in low-resource scenarios. The improvement in WER demonstrates a tangible potential for real-world applications in multilingual contexts, which is significant for global accessibility and communication technologies.

Future exploration may involve sequence training techniques, joint CNN and DNN integration, and enhancements like annealed dropout and maxout non-linearities. These efforts could further expand the efficacy and applicability of deep CNNs in LVCSR, contributing to more accurate and efficient speech recognition systems across varied contexts and languages. Such advancements may also set the stage for exploring similar approaches across other domains requiring complex pattern recognition.