Asymmetric Clean Segments-Guided Self-Supervised Learning for Robust Speaker Verification
Contrastive self-supervised learning (CSL) for speaker verification (SV) has
drawn increasing interest recently due to its ability to exploit unlabeled
data. Performing data augmentation on raw waveforms, such as adding noise or
reverberation, plays a pivotal role in achieving promising results in SV. Data
augmentation, however, demands meticulous calibration to ensure intact
speaker-specific information, which is difficult to achieve without speaker
labels. To address this issue, we introduce a novel framework by incorporating
clean and augmented segments into the contrastive training pipeline. The clean
segments are repurposed to pair with noisy segments to form additional positive
and negative pairs. Moreover, the contrastive loss is weighted to increase the
difference between the clean and augmented embeddings of different speakers.
Experimental results on VoxCeleb1 suggest that the proposed framework achieves
a remarkable 19% improvement over conventional methods, and it surpasses many
existing state-of-the-art techniques.
Comment: 5 pages, 2 figures, submitted to ICASSP 202
Self multi-head attention for speaker recognition
Most state-of-the-art Deep Learning (DL) approaches for speaker recognition work on a short utterance level. Given the speech signal, these algorithms extract a sequence of speaker embeddings from short segments, and those are averaged to obtain an utterance-level speaker representation. In this work we propose the use of an attention mechanism to obtain a discriminative speaker embedding given non-fixed-length speech utterances. Our system is based on a Convolutional Neural Network (CNN) that encodes short-term speaker features from the spectrogram and a self multi-head attention model that maps these representations into a long-term speaker embedding. The attention model that we propose produces multiple alignments from different subsegments of the CNN encoded states over the sequence. Hence this mechanism works as a pooling layer which decides the most discriminative features over the sequence to obtain an utterance-level representation. We have tested this approach on the verification task for the VoxCeleb1 dataset. The results show that self multi-head attention outperforms both temporal and statistical pooling methods, with an 18% relative improvement in EER. Obtained results show a 58% relative improvement in EER compared to i-vector+PLDA.
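The multi-head attention pooling described above might be sketched as follows, in plain NumPy. This is an illustrative reading of the abstract, not the paper's exact architecture: the per-head scoring matrix `W` and the way channels are split across heads are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attentive_pooling(frames, W, heads=4):
    """Pool a variable-length sequence into a fixed-size embedding.

    frames: (T, D) CNN-encoded frame features for one utterance.
    W: (heads, D // heads) learned per-head scoring vectors (hypothetical
    parameterization). Each head attends over its own D // heads channel
    subsegment; each head's attention weights select the frames it finds
    most discriminative, and the per-head pooled sub-vectors are
    concatenated into a (D,) utterance-level embedding, independent of T.
    """
    T, D = frames.shape
    sub = frames.reshape(T, heads, D // heads)      # split channels per head
    scores = np.einsum('thd,hd->th', sub, W)        # (T, heads) frame scores
    attn = softmax(scores, axis=0)                  # attention over time
    pooled = np.einsum('th,thd->hd', attn, sub)     # weighted sum per head
    return pooled.reshape(-1)                       # concatenate heads -> (D,)
```

Because the attention weights are normalized over the time axis, utterances of any length map to the same embedding size, which is the property that lets the pooling layer replace simple temporal averaging.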