Asymmetric Clean Segments-Guided Self-Supervised Learning for Robust Speaker Verification
Contrastive self-supervised learning (CSL) for speaker verification (SV) has
drawn increasing interest recently due to its ability to exploit unlabeled
data. Performing data augmentation on raw waveforms, such as adding noise or
reverberation, plays a pivotal role in achieving promising results in SV. Data
augmentation, however, demands meticulous calibration to ensure intact
speaker-specific information, which is difficult to achieve without speaker
labels. To address this issue, we introduce a novel framework by incorporating
clean and augmented segments into the contrastive training pipeline. The clean
segments are repurposed to pair with noisy segments to form additional positive
and negative pairs. Moreover, the contrastive loss is weighted to increase the
difference between the clean and augmented embeddings of different speakers.
Experimental results on VoxCeleb1 suggest that the proposed framework achieves
a remarkable 19% improvement over conventional methods, and it surpasses many
existing state-of-the-art techniques.
Comment: 5 pages, 2 figures, submitted to ICASSP 202
Self multi-head attention for speaker recognition
Most state-of-the-art Deep Learning (DL) approaches for speaker recognition work on a short utterance level. Given the speech signal, these algorithms extract a sequence of speaker embeddings from short segments, and those are averaged to obtain an utterance-level speaker representation. In this work we propose the use of an attention mechanism to obtain a discriminative speaker embedding given non-fixed-length speech utterances. Our system is based on a Convolutional Neural Network (CNN) that encodes short-term speaker features from the spectrogram and a self multi-head attention model that maps these representations into a long-term speaker embedding. The attention model that we propose produces multiple alignments from different subsegments of the CNN encoded states over the sequence. Hence this mechanism works as a pooling layer which decides the most discriminative features over the sequence to obtain an utterance-level representation. We have tested this approach on the verification task for the VoxCeleb1 dataset. The results show that self multi-head attention outperforms both temporal and statistical pooling methods, with an 18% relative improvement in EER. Obtained results show a 58% relative improvement in EER compared to i-vector+PLDA.
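The multi-head attention pooling described above might be sketched as follows, in plain NumPy. This is an illustrative reading of the abstract, not the paper's exact architecture: the per-head scoring matrix `W` and the way channels are split across heads are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attentive_pooling(frames, W, heads=4):
    """Pool a variable-length sequence into a fixed-size embedding.

    frames: (T, D) CNN-encoded frame features for one utterance.
    W: (heads, D // heads) learned per-head scoring vectors (hypothetical
    parameterization). Each head attends over its own D // heads channel
    subsegment; each head's attention weights select the frames it finds
    most discriminative, and the per-head pooled sub-vectors are
    concatenated into a (D,) utterance-level embedding, independent of T.
    """
    T, D = frames.shape
    sub = frames.reshape(T, heads, D // heads)      # split channels per head
    scores = np.einsum('thd,hd->th', sub, W)        # (T, heads) frame scores
    attn = softmax(scores, axis=0)                  # attention over time
    pooled = np.einsum('th,thd->hd', attn, sub)     # weighted sum per head
    return pooled.reshape(-1)                       # concatenate heads -> (D,)
```

Because the attention weights are normalized over the time axis, utterances of any length map to the same embedding size, which is the property that lets the pooling layer replace simple temporal averaging.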