LLM-based speaker diarization correction: A generalizable approach (2406.04927v3)

Published 7 Jun 2024 in eess.AS and cs.CL

Abstract: Speaker diarization is necessary for interpreting conversations transcribed using automated speech recognition (ASR) tools. Despite significant developments in diarization methods, diarization accuracy remains an issue. Here, we investigate the use of LLMs for diarization correction as a post-processing step. LLMs were fine-tuned using the Fisher corpus, a large dataset of transcribed conversations. The ability of the models to improve diarization accuracy in a holdout dataset from the Fisher corpus as well as an independent dataset was measured. We report that fine-tuned LLMs can markedly improve diarization accuracy. However, model performance is constrained to transcripts produced using the same ASR tool as the transcripts used for fine-tuning, limiting generalizability. To address this constraint, an ensemble model was developed by combining weights from three separate models, each fine-tuned using transcripts from a different ASR tool. The ensemble model demonstrated better overall performance than each of the ASR-specific models, suggesting that a generalizable and ASR-agnostic approach may be achievable. We have made the weights of these models publicly available on HuggingFace at https://huggingface.co/bklynhlth.

Summary

  • The paper introduces a novel method that uses fine-tuned LLMs to post-process and correct speaker mislabeling in ASR-generated transcripts.
  • It employs an ensemble of ASR-specific models fine-tuned via QLoRA to generalize across different ASR systems, markedly reducing the diarization error metrics deltaCP and deltaSA.
  • The approach improves diarization accuracy without increasing inference time, supporting practical deployment.

LLM-Based Speaker Diarization Correction: A Generalizable Approach

The paper "LLM-based speaker diarization correction: A generalizable approach" (2406.04927) explores the application of LLMs to improve speaker diarization accuracy in transcripts generated by Automated Speech Recognition (ASR) systems. The paper presents a post-processing strategy leveraging fine-tuned LLMs to correct speaker mislabeling and offers a pathway towards achieving ASR-agnostic diarization tools.

Introduction and Background

Speaker diarization, the process of identifying who spoke when in an audio recording, is crucial for accurate conversation interpretation, particularly in domains like medical transcription and legal proceedings. Existing diarization solutions often rely on acoustic analysis, using toolkits such as Pyannote or techniques such as x-vector clustering. End-to-end systems that integrate transcription with diarization are also popular but are limited by their dependence on specific ASR outputs.

ASR systems such as AWS Transcribe, Azure Speech to Text, and WhisperX vary in their diarization techniques and accuracy. This inconsistency across ASR tools limits the generalizability of post-processing correction methods. To address these challenges, the authors fine-tune LLMs on the Fisher corpus, a large dataset of conversational transcripts, to recognize and correct diarization errors.

Figure 1: Accurate speaker diarization is necessary for interpretation of important conversations.

Methodology

Data and Preprocessing

The English Fisher corpus, comprising 1,960 hours of transcribed phone conversations, is used for fine-tuning and evaluating the LLMs. The dataset spans diverse speaking styles and demographics, providing a robust foundation for training. ASR tools from AWS, Azure, and WhisperX generate the initial transcripts, which are then standardized and preprocessed to align speaker labels using a Transcript-Preserving Speaker Transfer (TPST) algorithm.

Figure 2: Creation of oracle transcripts using the TPST algorithm. Words and speaker labels are extracted from each transcript. The algorithm aligns word sequences, such that the resulting speaker labels from the reference transcript match the text of the ASR transcript. This corrects speaker labeling in the ASR transcript without changing the underlying transcription.
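
The paper does not reproduce the TPST implementation here; the following is a minimal, illustrative sketch of the label-transfer idea described in the caption above, using Python's difflib for the word alignment. The function and variable names are hypothetical, not from the paper.

```python
from difflib import SequenceMatcher

def transfer_speaker_labels(ref_words, ref_speakers, asr_words):
    """Transfer reference speaker labels onto ASR words.

    A simplified, illustrative take on transcript-preserving speaker
    transfer (TPST): align the two word sequences, copy the reference
    speaker label wherever words match, and fill gaps from the nearest
    preceding label. The ASR text itself is never modified.
    """
    labels = [None] * len(asr_words)
    matcher = SequenceMatcher(a=ref_words, b=asr_words, autojunk=False)
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            labels[block.b + k] = ref_speakers[block.a + k]
    # Fill unaligned words (insertions/substitutions) from the previous label.
    last = ref_speakers[0] if ref_speakers else None
    for i, lab in enumerate(labels):
        if lab is None:
            labels[i] = last
        else:
            last = lab
    return labels

# Example: the oracle transcript keeps the ASR words but reference speakers.
ref_words    = ["hi", "there", "how", "are", "you"]
ref_speakers = ["A",  "A",     "B",   "B",   "B"]
asr_words    = ["hi", "there", "uh", "how", "are", "you"]
print(transfer_speaker_labels(ref_words, ref_speakers, asr_words))
# ['A', 'A', 'A', 'B', 'B', 'B']
```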

Model Fine-Tuning and Evaluation

The Mistral 7B model serves as the base model for fine-tuning. Separate models tailored to AWS, Azure, and WhisperX transcripts are trained using Quantized Low-Rank Adaptation (QLoRA) with FlashAttention for computational efficiency. An ensemble model, formed by combining the weights of the three ASR-specific models, is developed to enhance generalization across ASR systems.
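
As a rough illustration of such a setup (not the paper's actual training code), a QLoRA configuration for Mistral 7B with FlashAttention using Hugging Face transformers and peft might look like the following; all hyperparameters are placeholders rather than the paper's reported values.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # 4-bit base weights (the "Q" in QLoRA)
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    quantization_config=bnb_config,
    attn_implementation="flash_attention_2",  # FlashAttention kernels
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,                                   # illustrative rank, not the paper's value
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()          # only the low-rank adapters are trained
```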

Diarization accuracy is measured using delta concatenated minimum-permutation word error rate (deltaCP) and delta speaker-attributed word error rate (deltaSA), metrics that isolate errors introduced by speaker mislabeling from transcription inaccuracies.
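
Assuming the usual construction of these delta metrics, deltaCP is the gap between the concatenated minimum-permutation WER (cpWER), which penalizes speaker mislabeling, and the plain WER, which ignores speakers; deltaSA is defined analogously from SA-WER. A minimal sketch of deltaCP under that assumption, using the jiwer library (pip install jiwer):

```python
from itertools import permutations
import jiwer

def cp_wer(ref_by_spk, hyp_by_spk):
    """Concatenated minimum-permutation WER (simplified: assumes at least
    as many hypothesis speakers as reference speakers).

    ref_by_spk / hyp_by_spk map a speaker id to that speaker's words joined
    into one string. Try every mapping of hypothesis speakers onto reference
    speakers and keep the lowest aggregate WER over the concatenated streams.
    """
    ref_ids, hyp_ids = list(ref_by_spk), list(hyp_by_spk)
    refs = [ref_by_spk[r] for r in ref_ids]
    best = float("inf")
    for perm in permutations(hyp_ids, len(ref_ids)):
        hyps = [hyp_by_spk[h] for h in perm]
        best = min(best, jiwer.wer(refs, hyps))
    return best

ref_by_spk = {"A": "hi there", "B": "how are you"}
hyp_by_spk = {"0": "hi there how", "1": "are you"}  # one word mislabeled

# Plain WER ignores speakers: here the word sequences are identical.
wer = jiwer.wer("hi there how are you", "hi there how are you")
delta_cp = cp_wer(ref_by_spk, hyp_by_spk) - wer
print(delta_cp)  # > 0: the residual error is purely speaker mislabeling
```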

Results

Baseline ASR outputs show varying word error rates (WER) and diarization accuracy, with Azure performing best and WhisperX worst. The fine-tuned LLMs markedly improve diarization accuracy, as evidenced by substantial reductions in deltaCP and deltaSA relative to the baseline error rates.

The ensemble model outperforms the individual ASR-specific models, demonstrating its ability to generalize across different ASR outputs. The improvement in accuracy is achieved without increasing inference time or altering the model architecture, suggesting practical applicability for real-time processing.
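
The paper describes the ensemble as combining the weights of the three ASR-specific models; one plausible reading, consistent with the unchanged inference time, is simple parameter averaging, since it yields a single merged model that runs in one forward pass. A sketch under that assumption (the paper's exact combination rule may differ, and the checkpoint paths are hypothetical):

```python
import torch

def average_state_dicts(state_dicts):
    """Element-wise mean of parameter tensors across fine-tuned models."""
    merged = {}
    for name in state_dicts[0]:
        merged[name] = torch.stack(
            [sd[name].float() for sd in state_dicts]
        ).mean(dim=0)
    return merged

# Hypothetical checkpoint paths for the three ASR-specific models.
paths = ["aws_model.pt", "azure_model.pt", "whisperx_model.pt"]
state_dicts = [torch.load(p, map_location="cpu") for p in paths]
merged = average_state_dicts(state_dicts)
torch.save(merged, "ensemble_model.pt")  # one model, one forward pass at inference
```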

Discussion

The results underscore the necessity of fine-tuning for LLM-driven diarization correction. Zero-shot models, lacking task-specific adaptation, perform poorly given the diverse characteristics of ASR outputs. Although the fine-tuned models exhibit ASR-specific improvements, the ensemble approach demonstrates potential for broader applicability and better generalization, even on a previously untested ASR tool (GCP).

Future work could address these limitations by testing on more diverse conversational domains, extending to multilingual datasets, and integrating multimodal data for more robust solutions. Incorporating contextual metadata (e.g., speaker roles) could further improve the accuracy of speaker labeling.

Conclusion

The research highlights the effectiveness of fine-tuning LLMs for diarization correction while identifying the limits of ASR-specific adaptation. The proposed ensemble model offers a promising path toward ASR-agnostic speaker diarization, facilitating integration into applications that depend on accurate transcription and speaker attribution. By releasing the model weights publicly on HuggingFace, the authors pave the way for improved computational tools in fields reliant on precise audio analysis.
