Introducing Phonetic Information to Speaker Embedding for Speaker Verification
Phonetic information is one of the most essential components of a speech signal, playing an important role in many speech processing tasks. However, it is difficult to integrate phonetic information into speaker verification systems since it occurs primarily at the frame level while speaker characteristics typically reside at the segment level. In deep neural network-based speaker verification, existing methods only apply phonetic information to the frame-wise trained speaker embeddings. To address this weakness, this paper proposes phonetic adaptation and hybrid multi-task learning and further combines these into c-vector and simplified c-vector architectures. Experiments on National Institute of Standards and Technology (NIST) speaker recognition evaluation (SRE) 2010 show that the four proposed speaker embeddings achieve better performance than the baseline. The c-vector system performs best, providing over 30% and 15% relative improvements in equal error rate (EER) for the core-extended and 10 s–10 s conditions, respectively. On the NIST SRE 2016, 2018, and VoxCeleb datasets, the proposed c-vector approach improves performance even when there is a language mismatch within the training sets or between the training and evaluation sets. Extensive experimental results demonstrate the effectiveness and robustness of the proposed methods.
Contextual Phonetic Pretraining for End-to-end Utterance-level Language and Speaker Recognition
Pretrained contextual word representations in NLP have greatly improved
performance on various downstream tasks. For speech, we propose contextual
frame representations that capture phonetic information at the acoustic frame
level and can be used for utterance-level language, speaker, and speech
recognition. These representations come from the frame-wise intermediate
representations of an end-to-end, self-attentive ASR model (SAN-CTC) on spoken
utterances. We first train the model on the Fisher English corpus with
context-independent phoneme labels, then use its representations at inference
time as features for task-specific models on the NIST LRE07 closed-set language
recognition task and a Fisher speaker recognition task, giving significant
improvements over the state-of-the-art on both (e.g., language EER of 4.68% on
3 s utterances, 23% relative reduction in speaker EER). Results remain
competitive when using a novel dilated convolutional model for language
recognition, or when ASR pretraining is done with character labels only.
Comment: submitted to INTERSPEECH 201
Leveraging ASR Pretrained Conformers for Speaker Verification through Transfer Learning and Knowledge Distillation
This paper explores the use of ASR-pretrained Conformers for speaker
verification, leveraging their strengths in modeling speech signals. We
introduce three strategies: (1) Transfer learning to initialize the speaker
embedding network, improving generalization and reducing overfitting. (2)
Knowledge distillation to train a more flexible speaker verification model,
incorporating frame-level ASR loss as an auxiliary task. (3) A lightweight
speaker adaptor for efficient feature conversion without altering the original
ASR Conformer, allowing parallel ASR and speaker verification. Experiments on
VoxCeleb show significant improvements: transfer learning yields a 0.48% EER,
knowledge distillation results in a 0.43% EER, and the speaker adaptor
approach, with just an added 4.92M parameters to a 130.94M-parameter model,
achieves a 0.57% EER. Overall, our methods effectively transfer ASR
capabilities to speaker verification tasks.
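
The abstracts above report results as equal error rate (EER) and as relative EER reductions. As a rough illustration of what those figures mean, the following is a minimal sketch, not the evaluation code used by any of these papers. It assumes trial scores where higher means "same speaker"; the function names `equal_error_rate` and `relative_improvement` and the toy scores are hypothetical.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping every observed score as a threshold.

    Assumption: a trial is accepted when its score is >= the threshold.
    Returns the mean of the false-accept and false-reject rates at the
    threshold where the two rates are closest.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        # False rejects: genuine (target) trials scored below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False accepts: impostor (non-target) trials scored at or above it.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


def relative_improvement(baseline_eer, new_eer):
    """Relative EER reduction of a new system with respect to a baseline."""
    return (baseline_eer - new_eer) / baseline_eer


# Toy scores, invented purely for illustration.
targets = [0.9, 0.8, 0.7, 0.4]
nontargets = [0.6, 0.3, 0.2, 0.1]
toy_eer = equal_error_rate(targets, nontargets)
```

On this reading, a "30% relative improvement" means the new EER is 70% of the baseline EER, for example a hypothetical drop from 10% to 7%. Production toolkits usually interpolate between operating points on the error curve rather than picking the closest observed threshold, so values from this sketch can differ slightly from reported ones.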