Bayesian HMM clustering of x-vector sequences (VBx) in speaker diarization: theory, implementation and analysis on standard tasks (2012.14952v1)

Published 29 Dec 2020 in eess.AS and cs.SD

Abstract: The recently proposed VBx diarization method uses a Bayesian hidden Markov model to find speaker clusters in a sequence of x-vectors. In this work we perform an extensive comparison of performance of the VBx diarization with other approaches in the literature and we show that VBx achieves superior performance on three of the most popular datasets for evaluating diarization: CALLHOME, AMI and DIHARDII datasets. Further, we present for the first time the derivation and update formulae for the VBx model, focusing on the efficiency and simplicity of this model as compared to the previous and more complex BHMM model working on frame-by-frame standard Cepstral features. Together with this publication, we release the recipe for training the x-vector extractors used in our experiments on both wide and narrowband data, and the VBx recipes that attain state-of-the-art performance on all three datasets. Besides, we point out the lack of a standardized evaluation protocol for AMI dataset and we propose a new protocol for both Beamformed and Mix-Headset audios based on the official AMI partitions and transcriptions.

Summary

The paper introduces the VBx method, which leverages a Bayesian HMM with PLDA modeling of x-vectors to efficiently cluster speakers.
It reports a 4.42% DER on CALLHOME under a forgiving evaluation setup and state-of-the-art results on AMI and DIHARDII, and it proposes a standardized evaluation protocol for AMI.
The open-source implementation enables replicability and further research in real-world speaker diarization applications.

Bayesian HMM Clustering of X-Vector Sequences (VBx) in Speaker Diarization

The paper "Bayesian HMM clustering of x-vector sequences (VBx) in speaker diarization: theory, implementation and analysis on standard tasks" presents the VBx diarization method, which uses a Bayesian hidden Markov model (BHMM) to cluster x-vectors into speakers. The authors compare VBx with other published approaches on three established datasets, CALLHOME, AMI, and DIHARDII, and report the best results on all three.

VBx Diarization Methodology

The core of the VBx method is a Bayesian HMM in which each state corresponds to a speaker and the state-specific distribution over x-vectors is derived from a PLDA model. The input x-vector sequence is assumed to be generated by this HMM, so clustering amounts to inferring which state (speaker) produced each x-vector. Because the model works on a sequence of x-vectors rather than on frame-by-frame cepstral features, it is simpler and computationally cheaper than the earlier BHMM model while maintaining high accuracy.
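The generative view described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes a "sticky" transition structure governed by a loop probability `p_loop`, and a simplified PLDA in which speaker means have a diagonal across-speaker variance `phi` and within-speaker noise has unit variance. The function names and all parameter values are hypothetical.

```python
import numpy as np


def sticky_transitions(pi, p_loop):
    """Transition matrix: stay on the same speaker with probability p_loop,
    otherwise re-draw the speaker from the prior pi (assumed structure)."""
    pi = np.asarray(pi, dtype=float)
    n = len(pi)
    return (1.0 - p_loop) * np.tile(pi, (n, 1)) + p_loop * np.eye(n)


def sample_sequence(n_frames, pi, p_loop, phi, rng):
    """Sample speaker labels and x-vector-like observations from the sketch model."""
    pi = np.asarray(pi, dtype=float)
    phi = np.asarray(phi, dtype=float)
    A = sticky_transitions(pi, p_loop)
    n_speakers, dim = len(pi), len(phi)

    # Speaker-specific means drawn from a PLDA-style prior:
    # zero mean, diagonal across-speaker variance phi.
    means = rng.standard_normal((n_speakers, dim)) * np.sqrt(phi)

    # HMM state sequence: one speaker label per x-vector.
    labels = np.empty(n_frames, dtype=int)
    labels[0] = rng.choice(n_speakers, p=pi)
    for t in range(1, n_frames):
        labels[t] = rng.choice(n_speakers, p=A[labels[t - 1]])

    # Observations: speaker mean plus unit-variance within-speaker noise.
    x = means[labels] + rng.standard_normal((n_frames, dim))
    return labels, x, A


rng = np.random.default_rng(0)
labels, x, A = sample_sequence(
    n_frames=50, pi=[0.5, 0.3, 0.2], p_loop=0.9, phi=[4.0, 1.0], rng=rng
)
print(A.round(3))
print(labels[:20])
```

A high `p_loop` makes long single-speaker runs likely, which is the kind of temporal continuity an HMM adds over clustering each x-vector independently. Diarization is the inverse problem: given only `x`, infer `labels`.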

Key Contributions and Results

Improved Performance: VBx is benchmarked on three datasets and compared with previously published diarization methods.
- On the CALLHOME dataset, VBx achieves a diarization error rate (DER) of 4.42% under a forgiving evaluation setup, outperforming existing methods.
- On the AMI and DIHARDII datasets, VBx sets new performance baselines, and the analysis highlights how the choice of evaluation protocol affects reported results.
Standardization of Evaluation Protocols: The authors point out that AMI lacks a standardized evaluation protocol and propose one, for both Beamformed and Mix-Headset audio, so that results from different studies can be compared meaningfully. The new protocol adopts the official Full-corpus-ASR partition, which ensures no speaker overlap between training, development, and evaluation sets, and it derives the reference speech annotations systematically from the official transcriptions.
Open-source Availability: The authors provide open-source access to their implementation, including recipes for training x-vector extractors, thus offering a practical resource for the community to replicate and further develop VBx-based solutions.
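For readers unfamiliar with the DER figures quoted above, the following is a minimal sketch of how a diarization error rate can be computed at the frame level. It is a simplification, not the scoring tool used in the paper: it assumes oracle VAD and no overlapped speech, so missed speech and false alarms are zero and only speaker confusion remains. The function name `frame_der` and the example labels are hypothetical. Because hypothesis speaker labels are arbitrary, the score uses the one-to-one mapping between hypothesis and reference speakers that maximizes agreement.

```python
from itertools import permutations


def frame_der(reference, hypothesis):
    """Frame-level DER with speaker confusion only.

    Assumes every frame is single-speaker speech in both sequences
    (oracle VAD, no overlap), so misses and false alarms are zero.
    """
    if len(reference) != len(hypothesis) or len(reference) == 0:
        raise ValueError("sequences must be non-empty and of equal length")

    ref_ids = sorted(set(reference))
    hyp_ids = sorted(set(hypothesis))
    # Pad with None so a hypothesis speaker may stay unmapped.
    ref_slots = ref_ids + [None] * len(hyp_ids)

    best_correct = 0
    # Brute-force search over one-to-one mappings; fine for a few speakers.
    for perm in permutations(ref_slots, len(hyp_ids)):
        mapping = dict(zip(hyp_ids, perm))
        correct = sum(mapping[h] == r for r, h in zip(reference, hypothesis))
        best_correct = max(best_correct, correct)

    return 1.0 - best_correct / len(reference)


reference = [0, 0, 0, 1, 1, 1, 1, 0, 0, 0]
hypothesis = ["a", "a", "b", "b", "b", "b", "b", "a", "a", "a"]
print(frame_der(reference, hypothesis))  # one of ten frames is confused
```

Real scoring works on time segments rather than frames and also counts missed speech and false alarms. Choices such as a collar around reference boundaries, or whether overlapped speech is scored, can change the reported number considerably, which is why the evaluation setup is stated alongside each DER.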

Implications and Future Directions

The VBx diarization method improves both the accuracy and the efficiency of speaker diarization. By simplifying the model and backing it with strong empirical results on three datasets, VBx offers a practical reference point for speaker clustering with x-vectors. This research suggests several paths for future exploration:

Integration with real-world VAD systems to extend beyond oracle VAD.
Exploration of VBx with different types of embeddings beyond x-vectors, potentially enhancing its ability to handle a broader scope of speaker diarization challenges.
Application to more diverse datasets or real-time scenarios where the combination of accuracy and computational cost becomes crucial.

Overall, the paper shows that a comparatively simple Bayesian model can be both efficient and accurate for speaker diarization, and it provides a well-documented baseline for further algorithmic work in the area.

GitHub

BUTSpeechFIT/VBx: Variational Bayes HMM over x-vectors diarization
phonexiaresearch/VBx-training-recipe