- The paper introduces a deep convolutional Siamese neural network architecture that learns fixed-dimensional acoustic word embeddings using only same/different word-pair side information.
- Experiments show this Siamese CNN approach achieves 0.549 average precision on word discrimination, outperforming prior frame-level and DTW methods.
- This method enables scalable speech processing applications, especially in low-resource scenarios, by leveraging weaker supervision to create effective embeddings.
Deep Convolutional Acoustic Word Embeddings Using Word-Pair Side Information
This paper contributes to the domain of acoustic word embeddings by introducing a novel approach based on deep convolutional neural networks (CNNs) with a Siamese architecture, designed for tasks that require embeddings of whole-word speech segments. The authors address the limitations of frame-level embeddings combined with dynamic time warping (DTW) by seeking fixed-dimensional embeddings that can efficiently discriminate between different word types.
Methodology
The paper compares several approaches to acoustic word embeddings on the same-different word discrimination task. A key idea is to use a weaker form of supervision: known word pairs. The model is trained using only pairs of speech segments labeled as the same or different word type, rather than requiring a full word label for each segment. This side information is exploited within a Siamese CNN framework, in which two networks with tied weights process a pair of speech segments and a distance-based loss function is optimized.
Two CNN-based models are developed: a word classification CNN and a word similarity Siamese CNN. The word classification CNN uses full supervision, predicting word types directly from labeled segments, while the Siamese CNN relies only on word-pair side information, using a cosine-based hinge loss to learn embeddings that separate same-word pairs from different-word pairs.
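The pairwise training signal can be illustrated with a minimal sketch of a cosine-based hinge loss. The margin value and the exact form of the cosine distance below are illustrative assumptions, not details taken from the paper:

```python
import numpy as np

def cos_hinge_loss(emb_a, emb_b, same, margin=0.5):
    """Cosine-based hinge loss for one embedding pair (illustrative sketch).

    Same-type pairs are pulled together (loss equals their cosine distance);
    different-type pairs are pushed apart until their cosine distance
    exceeds the margin. The margin of 0.5 is an assumed value.
    """
    cos_sim = np.dot(emb_a, emb_b) / (np.linalg.norm(emb_a) * np.linalg.norm(emb_b))
    cos_dist = 1.0 - cos_sim
    if same:
        return cos_dist
    return max(0.0, margin - cos_dist)
```

Identical same-type embeddings incur zero loss, and different-type embeddings incur zero loss once they are at least `margin` apart, so only violating pairs produce a gradient signal.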
The paper also explores dimensionality reduction to produce compact yet effective embeddings. Linear discriminant analysis (LDA) is applied to the embeddings after training with the Siamese network, yielding lower-dimensional representations with little loss in discrimination performance.
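A plain-NumPy sketch of the LDA projection step follows. This is a generic Fisher-LDA implementation, not the paper's code; the small regularization constant is an assumption added for numerical stability:

```python
import numpy as np

def lda_projection(X, labels, out_dim):
    """Fit a linear discriminant projection (generic NumPy sketch).

    X: (n_samples, dim) array of embeddings; labels: word-type label per row.
    Returns a (dim, out_dim) matrix of the top LDA directions, i.e. the
    leading eigenvectors of inv(S_w) @ S_b.
    """
    mean_all = X.mean(axis=0)
    dim = X.shape[1]
    S_w = np.zeros((dim, dim))  # within-class scatter
    S_b = np.zeros((dim, dim))  # between-class scatter
    for c in np.unique(labels):
        Xc = X[labels == c]
        mean_c = Xc.mean(axis=0)
        S_w += (Xc - mean_c).T @ (Xc - mean_c)
        diff = (mean_c - mean_all)[:, None]
        S_b += len(Xc) * (diff @ diff.T)
    # Slight regularization keeps S_w invertible when scatter is degenerate
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_w + 1e-6 * np.eye(dim), S_b))
    order = np.argsort(-eigvals.real)
    return eigvecs.real[:, order[:out_dim]]
```

Embeddings are then reduced by a single matrix multiplication, `X @ lda_projection(X, labels, out_dim)`, which keeps evaluation-time cost trivial.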
Results
The experiments show that the proposed Siamese CNN with the hinge loss achieves an average precision (AP) of 0.549 on the test set, outperforming previously reported results based on frame-level embeddings with DTW or template-based reference-vector approaches. The word classifier CNN achieves comparable AP scores, but requires stronger supervision.
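The same-different evaluation behind these AP numbers can be sketched as follows: all segment pairs are ranked by cosine distance, and AP is computed over the retrieval of same-word pairs. The brute-force pair enumeration here is an illustrative simplification:

```python
from itertools import combinations
import numpy as np

def same_different_ap(embeddings, labels):
    """Average precision on the same-different task (illustrative sketch).

    Every pair of segments is scored by cosine distance; AP averages the
    precision at each rank where a same-word pair is retrieved.
    """
    dists, matches = [], []
    for i, j in combinations(range(len(embeddings)), 2):
        a, b = embeddings[i], embeddings[j]
        cos_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
        dists.append(1.0 - cos_sim)
        matches.append(labels[i] == labels[j])
    order = np.argsort(dists)                 # most similar pairs first
    matches = np.array(matches)[order]
    hits = np.cumsum(matches)                 # same-word pairs found so far
    precisions = hits / np.arange(1, len(matches) + 1)
    return precisions[matches].mean()
```

A perfect embedding space, where every same-word pair is closer than every different-word pair, scores an AP of 1.0; the reported 0.549 sits on this same 0-to-1 scale.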
Embedding dimensionality is highlighted as a significant factor: even at reduced dimensions, the Siamese CNN maintains its advantage over conventional approaches. This makes the method attractive for practical applications in which computational cost and data efficiency are critical.
Implications and Future Work
This research provides a substantial improvement over existing acoustic word embedding techniques, offering potential benefits for numerous applications in speech recognition and search systems, particularly in low-resource settings where full label information may not be available. The use of weaker supervision in combination with a robust embedding framework presents practical advantages for the development of scalable speech processing systems.
Future work suggested by the authors includes exploring sequence models such as RNNs and LSTMs, and applying the learned embeddings to term discovery, recognition, and search tasks. These directions could further extend the flexibility and effectiveness of acoustic word embeddings in more complex speech processing scenarios.
The paper positions the Siamese CNN framework as a promising tool for advancing fixed-dimensional acoustic representations that capture whole-word characteristics in speech data, paving the way for further innovations in speech technology.