Analyzing And Improving Neural Speaker Embeddings for ASR
Neural speaker embeddings encode the speaker's speech characteristics through
a DNN model and are prevalent for speaker verification tasks. However, few
studies have investigated the usage of neural speaker embeddings for an ASR
system. In this work, we present our efforts on integrating neural speaker
embeddings into a Conformer-based hybrid HMM ASR system. For ASR, our improved
embedding extraction pipeline in combination with the Weighted-Simple-Add
integration method results in x-vectors and c-vectors reaching performance on
par with i-vectors. We further compare and analyze different speaker embeddings. We
present our acoustic model improvements obtained by switching from the Newbob
learning rate schedule to the one-cycle learning rate schedule, resulting in a ~3%
relative WER reduction on Switchboard, additionally reducing the overall
training time by 17%. By further adding neural speaker embeddings, we gain an
additional ~3% relative WER improvement on Hub5'00. Our best Conformer-based
hybrid ASR system with speaker embeddings achieves 9.0% WER on Hub5'00 and
Hub5'01 with training on SWB 300h.
Comment: Accepted at ITG Speech Communications 202
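The abstract above credits part of its WER gain to replacing the Newbob schedule with a one-cycle learning rate schedule. The paper does not spell out the exact schedule parameters; the sketch below illustrates the general one-cycle idea with assumed phase fractions (45% warmup, 45% anneal, 10% final decay) and assumed learning rate values:

```python
def one_cycle_lr(step, total_steps, peak_lr=8e-4, initial_lr=8e-5, final_lr=1e-6):
    """Piecewise-linear one-cycle schedule: ramp up from initial_lr to
    peak_lr over the first 45% of steps, ramp back down to initial_lr
    over the next 45%, then decay to final_lr in the remaining 10%.
    Phase boundaries and rates are illustrative assumptions."""
    warmup_end = int(0.45 * total_steps)
    anneal_end = int(0.90 * total_steps)
    if step < warmup_end:
        frac = step / warmup_end
        return initial_lr + frac * (peak_lr - initial_lr)
    if step < anneal_end:
        frac = (step - warmup_end) / (anneal_end - warmup_end)
        return peak_lr + frac * (initial_lr - peak_lr)
    frac = (step - anneal_end) / max(1, total_steps - anneal_end)
    return initial_lr + frac * (final_lr - initial_lr)
```

Unlike Newbob, which halves the rate reactively when validation loss plateaus, this schedule is fixed in advance, which is what makes the shorter overall training time reported above possible.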
Syllable classification using static matrices and prosodic features
In this paper we explore the usefulness of prosodic features for
syllable classification. To do this, we represent the
syllable as a static analysis unit so that its acoustic-temporal
dynamics can be merged into a set of features that the SVM
classifier considers as a whole. In the first part of our
experiment we used MFCCs as features for classification,
obtaining a maximum accuracy of 86.66%. The second part of
our study tests whether the prosodic information is
complementary to the cepstral information for syllable
classification. The results obtained show that combining the
two types of information does improve classification, but
further analysis is necessary for a more successful
combination of the two feature types.
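The "static analysis unit" idea above can be sketched as resampling a variable-length sequence of per-frame vectors (e.g. MFCCs) to a fixed frame count and flattening, then concatenating syllable-level prosodic features. The paper does not specify its exact procedure; the function names, target frame count, and nearest-neighbour resampling below are illustrative assumptions:

```python
def to_static_matrix(frames, n_target=10):
    """Resample a variable-length list of per-frame feature vectors to a
    fixed number of frames via nearest-neighbour index mapping, then
    flatten into one fixed-size vector an SVM can consume directly."""
    n = len(frames)
    picked = [frames[min(n - 1, round(i * (n - 1) / (n_target - 1)))]
              for i in range(n_target)]
    return [v for frame in picked for v in frame]

def combine_features(static_vec, prosodic_vec):
    """Concatenate cepstral (static-matrix) and prosodic features into a
    single input vector, as in the combined-feature experiment above."""
    return list(static_vec) + list(prosodic_vec)
```

The resulting fixed-size vectors could be fed to any standard SVM implementation (e.g. an RBF-kernel classifier), since the whole syllable is now one point in a fixed-dimensional space.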
Leveraging ASR Pretrained Conformers for Speaker Verification through Transfer Learning and Knowledge Distillation
This paper explores the use of ASR-pretrained Conformers for speaker
verification, leveraging their strengths in modeling speech signals. We
introduce three strategies: (1) Transfer learning to initialize the speaker
embedding network, improving generalization and reducing overfitting. (2)
Knowledge distillation to train a more flexible speaker verification model,
incorporating frame-level ASR loss as an auxiliary task. (3) A lightweight
speaker adaptor for efficient feature conversion without altering the original
ASR Conformer, allowing parallel ASR and speaker verification. Experiments on
VoxCeleb show significant improvements: transfer learning yields a 0.48% EER,
knowledge distillation results in a 0.43% EER, and the speaker adaptor
approach, with just an added 4.92M parameters to a 130.94M-parameter model,
achieves a 0.57% EER. Overall, our methods effectively transfer ASR
capabilities to speaker verification tasks.
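Strategy (2) above trains the speaker model with a frame-level ASR loss as an auxiliary distillation term. The abstract does not give the loss formulation; a common shape, sketched below with an assumed KL-divergence term and an assumed mixing weight, is a weighted sum of the speaker-verification loss and the mean frame-level divergence from the ASR teacher:

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete probability distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def distillation_loss(speaker_loss, student_frames, teacher_frames, weight=0.3):
    """Illustrative combined objective: speaker-verification loss plus a
    weighted frame-level distillation term (mean KL between the ASR
    teacher's and the student's per-frame posteriors). The weight value
    and KL choice are assumptions, not the paper's exact recipe."""
    kd = sum(kl_divergence(t, s) for t, s in zip(teacher_frames, student_frames))
    kd /= max(1, len(teacher_frames))
    return speaker_loss + weight * kd
```

When the student's frame posteriors match the teacher's, the auxiliary term vanishes and only the speaker loss remains; the weight trades off speaker discrimination against retaining the teacher's phonetic knowledge.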