WhisperX: Time-Accurate Speech Transcription of Long-Form Audio

Published 1 Mar 2023 in cs.SD and eess.AS | (2303.00747v2)

Abstract: Large-scale, weakly-supervised speech recognition models, such as Whisper, have demonstrated impressive results on speech recognition across domains and languages. However, their application to long audio transcription via buffered or sliding window approaches is prone to drifting, hallucination, and repetition, and prohibits batched transcription due to their sequential nature. Further, timestamps corresponding to each utterance are prone to inaccuracies, and word-level timestamps are not available out-of-the-box. To overcome these challenges, we present WhisperX, a time-accurate speech recognition system with word-level timestamps utilising voice activity detection and forced phoneme alignment. In doing so, we demonstrate state-of-the-art performance on long-form transcription and word segmentation benchmarks. Additionally, we show that pre-segmenting audio with our proposed VAD Cut & Merge strategy improves transcription quality and enables a twelve-fold transcription speedup via batched inference.

Summary

The paper introduces WhisperX, a novel system that enhances transcription speed and accuracy via VAD segmentation, min-cut processing, and forced phoneme alignment.
It employs parallel transcription of segmented audio to maximize hardware utilization and minimize boundary errors for precise word-level timestamps.
Experimental results show a twelve-fold transcription speedup over Whisper and higher word segmentation precision and recall than the speech recognition models it is compared against.

The paper "WhisperX: Time-Accurate Speech Transcription of Long-Form Audio" presents a novel system for efficient and precise transcription of long-form audio data. Developed by researchers from the Visual Geometry Group at the University of Oxford, WhisperX effectively addresses the limitations of existing speech recognition models by providing accurate word-level timestamps and improving transcription speed.

Overview

WhisperX builds upon the Whisper speech recognition model, which is trained on large-scale, weakly supervised data and is known for robust multilingual transcription. However, applying Whisper to long-form audio through buffered or sliding-window transcription raises two problems: word-level timestamps are not available out of the box, and the sequential nature of the procedure prevents batched transcription.

To overcome these limitations, WhisperX employs a multi-stage approach:

Voice Activity Detection (VAD): The initial stage utilizes a VAD model to segment audio based on speech activity. This segmentation avoids unnecessary transcription during silent periods and minimizes boundary errors, enabling parallel transcription of audio segments.
VAD Cut and Merge: The paper introduces a min-cut strategy that splits overly long speech segments at points of low voice activity and merges neighboring short segments, so that each chunk fits within the ASR model's maximum input duration while keeping as much context as possible. This improves both transcription speed and accuracy.
Parallel Transcription: The segmented audio is transcribed in parallel, maximizing hardware utilization and improving throughput.
Forced Phoneme Alignment: Finally, WhisperX applies an external phoneme recognition model for alignment, ensuring accurate word-level timestamps.
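
The VAD Cut and Merge stage can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names Segment, cut_long_segment, merge_short_segments and cut_and_merge, the per-frame score list, and the choice to search for the cut point in the second half of the allowed window are all hypothetical. max_len stands in for the ASR model's maximum input duration (30 seconds for Whisper).

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Segment:
    start: float  # seconds
    end: float    # seconds


def cut_long_segment(seg: Segment, scores: Sequence[float],
                     frame_sec: float, max_len: float) -> List[Segment]:
    """Split `seg` until every piece is at most `max_len` seconds long.

    `scores[i]` is an assumed per-frame voice activity score for the frame
    starting at `i * frame_sec`. Each cut is placed at the lowest-scoring
    frame between max_len/2 and max_len after the current piece start.
    """
    pieces: List[Segment] = []
    start = seg.start
    while seg.end - start > max_len:
        lo = int(round((start + max_len / 2) / frame_sec))
        hi = min(int(round((start + max_len) / frame_sec)), len(scores) - 1)
        cut_time = start + max_len  # fallback if no usable score frames
        if lo <= hi:
            cut_frame = min(range(lo, hi + 1), key=lambda i: scores[i])
            if cut_frame * frame_sec > start:
                cut_time = cut_frame * frame_sec
        pieces.append(Segment(start, cut_time))
        start = cut_time
    pieces.append(Segment(start, seg.end))
    return pieces


def merge_short_segments(segments: Sequence[Segment],
                         max_len: float) -> List[Segment]:
    """Greedily merge neighbors while the merged span stays within max_len."""
    merged: List[Segment] = []
    for seg in segments:
        if merged and seg.end - merged[-1].start <= max_len:
            merged[-1] = Segment(merged[-1].start, seg.end)
        else:
            merged.append(Segment(seg.start, seg.end))
    return merged


def cut_and_merge(segments: Sequence[Segment], scores: Sequence[float],
                  frame_sec: float, max_len: float) -> List[Segment]:
    """Cut every over-long VAD segment, then merge short neighbors."""
    pieces: List[Segment] = []
    for seg in segments:
        pieces.extend(cut_long_segment(seg, scores, frame_sec, max_len))
    return merge_short_segments(pieces, max_len)
```

Because every resulting chunk is bounded by max_len and does not depend on the transcript of the previous chunk, the chunks can be stacked into a batch and transcribed together, which is what makes the parallel transcription stage possible.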

Experimental Evaluation

The authors evaluate WhisperX on multiple datasets, including the AMI Meeting Corpus, Switchboard-1, TED-LIUM, and Kincaid46. The results show state-of-the-art performance on both transcription quality and word segmentation, surpassing existing solutions such as Whisper and wav2vec 2.0.

Key findings include:

Transcription Speed: Through the VAD pre-processing and batch transcription strategies, WhisperX achieves a twelve-fold increase in transcription speed compared to Whisper, while maintaining transcription accuracy.
Word Segmentation: The system delivers superior recall and precision in word segmentation tasks, effectively managing timestamp inaccuracies inherent in sequential transcription models.
Error Reduction: The VAD Cut and Merge strategy reduces insertion errors and repeated text, both of which are common in buffered transcription systems.
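
To make the word segmentation metrics concrete, the sketch below computes precision and recall for predicted word timings against reference timings. It is a simplified illustration, not the paper's evaluation code: the function name, the greedy one-to-one matching, and the default collar of 0.2 seconds are assumptions. A predicted word counts as a true positive when its text matches a not-yet-matched reference word exactly and the two time spans overlap once the reference span is widened by the collar.

```python
from typing import List, Tuple

Word = Tuple[str, float, float]  # (text, start_sec, end_sec)


def segmentation_precision_recall(predicted: List[Word],
                                  reference: List[Word],
                                  collar: float = 0.2) -> Tuple[float, float]:
    """Return (precision, recall) for word-level timestamps."""
    used = [False] * len(reference)
    true_positives = 0
    for text, start, end in predicted:
        for j, (ref_text, ref_start, ref_end) in enumerate(reference):
            if used[j] or ref_text != text:
                continue
            # Overlap test with the reference span widened by the collar.
            if start <= ref_end + collar and end >= ref_start - collar:
                used[j] = True
                true_positives += 1
                break
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall
```

Under this definition, a system that emits few but well-timed words scores high precision and low recall, while one that drifts in time loses both, which is why the two numbers are reported together.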

Implications and Future Directions

WhisperX has direct implications for practical applications that need efficient and accurate long-form audio transcription. Its capabilities are particularly beneficial for automatic subtitling, audio content indexing, and speaker diarization.

From a theoretical standpoint, WhisperX demonstrates the benefits of integrating phoneme-level forced alignment with ASR models, emphasizing the potential for refinement in the models' ability to capture temporal structures in speech data.
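
The idea behind forced alignment can be illustrated with a small dynamic program. The sketch below is a toy version under stated assumptions and not the procedure used in WhisperX: it takes a hypothetical matrix of per-frame log-probabilities over phoneme classes (as an external phoneme model might produce) and a known phoneme sequence, ignores blank tokens, and finds the monotonic frame-to-phoneme assignment with the highest total score. The function name force_align and its inputs are illustrative only.

```python
from typing import List, Sequence, Tuple


def force_align(log_probs: Sequence[Sequence[float]],
                tokens: Sequence[int]) -> List[Tuple[int, int]]:
    """Return (first_frame, last_frame) for each token in `tokens`.

    log_probs[t][k] is the log-probability of class k at frame t.
    Each frame is assigned to exactly one token, tokens appear in order,
    and every token receives at least one frame.
    """
    n_frames, n_tokens = len(log_probs), len(tokens)
    if n_tokens == 0 or n_frames < n_tokens:
        raise ValueError("need at least one frame per token")

    neg_inf = float("-inf")
    score = [[neg_inf] * n_tokens for _ in range(n_frames)]
    back = [[0] * n_tokens for _ in range(n_frames)]
    score[0][0] = log_probs[0][tokens[0]]

    for t in range(1, n_frames):
        for n in range(min(n_tokens, t + 1)):
            stay = score[t - 1][n]                          # same token
            move = score[t - 1][n - 1] if n > 0 else neg_inf  # next token
            if move > stay:
                best, back[t][n] = move, n - 1
            else:
                best, back[t][n] = stay, n
            score[t][n] = best + log_probs[t][tokens[n]]

    # Backtrack from the last token at the last frame.
    path = [0] * n_frames
    n = n_tokens - 1
    for t in range(n_frames - 1, -1, -1):
        path[t] = n
        n = back[t][n]

    spans: List[Tuple[int, int]] = []
    for n in range(n_tokens):
        frames = [t for t in range(n_frames) if path[t] == n]
        spans.append((frames[0], frames[-1]))
    return spans
```

Multiplying the returned frame indices by the frame duration of the acoustic model gives phoneme start and end times, and word-level timestamps follow by taking the first and last phoneme of each word.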

Future work may explore the development of end-to-end systems that can inherently generate accurate word-level timestamps. There is also potential to expand the multilingual capabilities of WhisperX by leveraging broader phoneme model training datasets.

In conclusion, WhisperX exemplifies a significant advancement in the field of speech recognition, offering a robust solution to the challenges of transcribing long-form audio with precision and efficiency. The release of its open-source code further encourages future research and development, fostering innovation in speech processing technologies.
