
Abstract

Training speaker-discriminative and robust speaker verification systems without explicit speaker labels remains a persistent challenge that demands further investigation. Previous studies have noted a substantial performance gap between self-supervised and fully supervised approaches. In this paper, we propose an effective Self-Distillation network with Ensemble Prototypes (SDEP) to facilitate self-supervised speaker representation learning. It assigns the representations of augmented views of an utterance to the same prototypes as the representation of the original view, thereby enabling effective knowledge transfer between the views. A range of experiments on the VoxCeleb datasets demonstrates the superiority of the SDEP framework in self-supervised speaker verification. SDEP achieves a new state of the art on the VoxCeleb1 speaker verification benchmark (i.e., equal error rates of 1.94%, 1.99%, and 3.77% on the Vox1-O, Vox1-E, and Vox1-H trials, respectively), without using any speaker labels in the training phase.
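To make the core idea concrete, below is a minimal, hypothetical sketch of a prototype-assignment self-distillation loss of the kind the abstract describes: both the original and the augmented view are scored against a shared set of learnable prototypes, and the augmented view's assignment is pulled toward the original view's. All names, dimensions, the temperature, and the exact loss form are assumptions for illustration; the paper's actual SDEP objective (including its ensemble of prototypes) is not reproduced here.

```python
import torch
import torch.nn.functional as F

class PrototypeAssignmentLoss(torch.nn.Module):
    """Illustrative prototype-assignment self-distillation loss (not the paper's exact SDEP loss)."""

    def __init__(self, embed_dim=256, num_prototypes=4096, temperature=0.1):
        super().__init__()
        # Shared, learnable prototypes onto which both views are projected (hypothetical sizes).
        self.prototypes = torch.nn.Linear(embed_dim, num_prototypes, bias=False)
        self.temperature = temperature

    def forward(self, z_original, z_augmented):
        # L2-normalize speaker embeddings of the original and augmented views.
        z_original = F.normalize(z_original, dim=-1)
        z_augmented = F.normalize(z_augmented, dim=-1)

        # Similarity of each view to every prototype, sharpened by a temperature.
        scores_orig = self.prototypes(z_original) / self.temperature
        scores_aug = self.prototypes(z_augmented) / self.temperature

        # Treat the original view's soft prototype assignment as the target
        # and pull the augmented view's assignment toward it (cross-entropy).
        targets = F.softmax(scores_orig.detach(), dim=-1)
        log_probs = F.log_softmax(scores_aug, dim=-1)
        return -(targets * log_probs).sum(dim=-1).mean()


# Usage with random embeddings standing in for speaker-encoder outputs.
loss_fn = PrototypeAssignmentLoss()
z_orig = torch.randn(32, 256)   # embeddings of original utterances
z_aug = torch.randn(32, 256)    # embeddings of augmented views
loss = loss_fn(z_orig, z_aug)
loss.backward()
```

In this sketch, stopping the gradient on the original view's scores plays the role of the teacher in a self-distillation setup; the trainable prototypes are what both views are assigned to, mirroring the shared-prototype idea stated in the abstract.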
