Multi-Speaker DOA Estimation Using Deep Convolutional Networks Trained with Noise Signals (1807.11722v1)

Published 31 Jul 2018 in eess.AS, cs.LG, and cs.SD

Abstract: Supervised learning based methods for source localization, being data driven, can be adapted to different acoustic conditions via training and have been shown to be robust to adverse acoustic environments. In this paper, a convolutional neural network (CNN) based supervised learning method for estimating the direction-of-arrival (DOA) of multiple speakers is proposed. Multi-speaker DOA estimation is formulated as a multi-class multi-label classification problem, where the assignment of each DOA label to the input feature is treated as a separate binary classification problem. The phase component of the short-time Fourier transform (STFT) coefficients of the received microphone signals are directly fed into the CNN, and the features for DOA estimation are learnt during training. Utilizing the assumption of disjoint speaker activity in the STFT domain, a novel method is proposed to train the CNN with synthesized noise signals. Through experimental evaluation with both simulated and measured acoustic impulse responses, the ability of the proposed DOA estimation approach to adapt to unseen acoustic conditions and its robustness to unseen noise type is demonstrated. Through additional empirical investigation, it is also shown that with an array of M microphones our proposed framework yields the best localization performance with M-1 convolution layers. The ability of the proposed method to accurately localize speakers in a dynamic acoustic scenario with varying number of sources is also shown.



Summary

The paper introduces a novel CNN architecture that reformulates multi-speaker DOA estimation as a multi-label classification task using phase information from STFT coefficients.
It employs a design with M-1 convolution layers to capture phase correlations across microphone arrays, enhancing localization accuracy in reverberant settings.
Empirical results demonstrate superior performance against traditional methods, highlighting its potential for robust applications in telecommunications and robotic auditory systems.

Multi-Speaker DOA Estimation Using Deep Convolutional Networks Trained with Noise Signals

The paper presents a convolutional neural network (CNN)-based approach for estimating the direction-of-arrival (DOA) of multiple speakers, trained in a supervised fashion on synthesized noise signals. The primary challenge addressed is the localization of multiple sound sources in complex, reverberant environments where traditional signal processing methods often fall short. Unlike conventional narrowband or broadband estimation frameworks, the authors formulate the problem as multi-class multi-label classification and feed the phase of the short-time Fourier transform (STFT) coefficients directly into the CNN, so that the features for DOA estimation are learnt during training.
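The multi-label formulation can be sketched as follows. This is a minimal illustration rather than the authors' code: the 5° grid over 0° to 180°, the helper names `doa_targets` and `binary_cross_entropy`, and the use of NumPy are all assumptions made for the example.

```python
import numpy as np

def doa_targets(source_angles_deg, resolution_deg=5, max_angle_deg=180):
    """Build a multi-hot target: one binary label per discretized DOA class.

    The grid (0..180 degrees in 5-degree steps) is an assumed discretization.
    """
    classes = np.arange(0, max_angle_deg + 1, resolution_deg)
    target = np.zeros(len(classes))
    for angle in source_angles_deg:
        # Assign each source to the nearest DOA class on the grid.
        target[np.argmin(np.abs(classes - angle))] = 1.0
    return classes, target

def binary_cross_entropy(probs, target, eps=1e-12):
    """Mean of the per-class binary cross-entropies.

    Each class is scored as its own binary problem, so several labels
    can be active at once (unlike a softmax over classes).
    """
    probs = np.clip(probs, eps, 1 - eps)
    losses = target * np.log(probs) + (1 - target) * np.log(1 - probs)
    return float(-np.mean(losses))

# Two simultaneous speakers at 30 and 120 degrees (hypothetical scene).
classes, target = doa_targets([30, 120])
uninformed = np.full(len(classes), 0.5)  # network output with no knowledge
loss = binary_cross_entropy(uninformed, target)
```

Because every class has its own sigmoid-style output, the number of active labels is free to vary from frame to frame, which is what makes the formulation suitable for a varying number of sources.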

A noteworthy contribution of the paper is the innovative use of synthesized noise signals to train the CNN, circumventing the challenges associated with acquiring extensive real-world noisy datasets. This not only simplifies the data generation process but also enhances the network’s robustness against a variety of noise types in unforeseen acoustic conditions. Extensive empirical results demonstrate the efficacy of the proposed method, highlighting its ability to accurately localize speakers even under variable acoustic scenes with a dynamic number of sources.
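One way to picture how disjoint speaker activity in the STFT domain makes training on synthesized signals possible is sketched below. It assumes that single-source STFT frames (for example, noise convolved with a room impulse response) are already available, and builds a multi-source frame by letting exactly one source dominate each frequency bin. The function name `mix_disjoint`, the array layout, and the uniform random assignment are assumptions for illustration; the paper's exact data generation procedure is not reproduced here.

```python
import numpy as np

def mix_disjoint(stft_frames, rng):
    """Combine single-source STFT frames under a disjoint-activity assumption.

    stft_frames: complex array of shape (num_sources, M, K), one frame per
                 source, with M microphones and K frequency bins.
    Returns the mixed frame of shape (M, K) and the active source per bin.
    """
    num_sources, M, K = stft_frames.shape
    # Pick the dominant source independently for every frequency bin.
    active = rng.integers(0, num_sources, size=K)
    # Advanced indexing yields shape (K, M); transpose back to (M, K).
    mixed = stft_frames[active, :, np.arange(K)].T
    return mixed, active

rng = np.random.default_rng(0)
# Placeholder single-source frames: random complex values for 2 sources,
# 4 microphones and 8 frequency bins (stand-ins for simulated noise frames).
frames = rng.standard_normal((2, 4, 8)) + 1j * rng.standard_normal((2, 4, 8))
mixed, active = mix_disjoint(frames, rng)
phase_features = np.angle(mixed)  # only the phase is passed to the network
```

The training label for such a frame would be the multi-hot vector marking the DOAs of all sources that contributed to it.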

Key Methodological Features

Input Representation: The network takes the phase component of the STFT coefficients of the microphone signals as input and learns the features for DOA estimation during training. Training with noise inputs relies on the assumption of W-disjoint orthogonality for speech signals, i.e., that at most one speaker dominates each time-frequency bin.
Proposed Network Architecture: The network architecture is predicated on the hypothesis that $M-1$ convolution layers (where $M$ is the number of microphones) are necessary to optimally capture phase correlations across a microphone array, a premise substantiated by experimental results.
Handling Multi-Source Localization: The proposed method treats the assignment of each DOA label as a separate binary classification task, using the binary relevance method to decompose the multi-label problem.
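The input representation and the $M-1$ layer count can be illustrated with a short sketch. It assumes a Hann window, a one-sided FFT, and unpadded stride-1 convolutions with a kernel of height 2 along the microphone axis; these choices, and the helper names `phase_map` and `mic_axis_after_layers`, are assumptions for the example rather than details confirmed by the summary above.

```python
import numpy as np

def phase_map(frame):
    """Phase of the STFT coefficients for one time frame.

    frame: real array of shape (M, N), N samples from each of M microphones.
    Returns an array of shape (M, N // 2 + 1) with values in [-pi, pi].
    """
    window = np.hanning(frame.shape[-1])  # assumed analysis window
    spectrum = np.fft.rfft(frame * window, axis=-1)
    return np.angle(spectrum)

def mic_axis_after_layers(num_mics, num_layers, kernel_height=2):
    """Size of the microphone axis after unpadded, stride-1 convolutions.

    Each layer with a kernel of height 2 merges neighbouring rows, so the
    axis shrinks by one per layer.
    """
    return num_mics - num_layers * (kernel_height - 1)

rng = np.random.default_rng(0)
frame = rng.standard_normal((4, 512))  # 4 microphones, 512 samples (made up)
features = phase_map(frame)            # shape (4, 257)
remaining = mic_axis_after_layers(num_mics=4, num_layers=3)  # M - 1 layers
```

Under these assumptions, $M-1$ layers is exactly the depth at which the microphone axis collapses to a single row, so that every output feature has seen the phase of all $M$ channels. This gives an intuition for the hypothesis, while the paper itself supports it empirically.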

Experimental Insights

The proposed method was evaluated with both simulated and measured acoustic impulse responses, showing higher localization accuracy than traditional methods such as MUSIC and SRP-PHAT, especially under strong reverberation and varied noise conditions. The influence of parameters such as source-to-array distance and the number of convolution layers was also analyzed, and the results support the choice of $M-1$ convolution layers for an array of $M$ microphones.

Implications and Future Directions

The implications of this research are significant for applications requiring robust multi-speaker localization under varying acoustic conditions, such as telecommunications, robotic auditory navigation, and smart home audio systems. By embedding feature extraction into the learning framework, the method adapts more readily to diverse scenarios, an advantage over static signal processing techniques. Future research may explore integration with more advanced post-processing for peak detection and the handling of near-field sources to further increase the system's flexibility. Additionally, expanding the training regime to cover a greater diversity of noise types and real-world conditions could be leveraged to improve generalization.

Overall, this paper provides a substantial contribution to the field of acoustic signal processing by harnessing deep learning techniques to address complexities in multi-speaker localization, offering a viable pathway for robust source localization in varied and challenging acoustic environments.
