Factorization of Discriminatively Trained i-vector Extractor for Speaker Recognition
In this work, we continue our research on the i-vector extractor for speaker
verification (SV) and optimize its architecture for fast and effective
discriminative training. We were motivated by computational and memory
requirements caused by the large number of parameters of the original
generative i-vector model. Our aim is to preserve the power of the original
generative model, and at the same time focus the model towards extraction of
speaker-related information. We show that it is possible to represent a
standard generative i-vector extractor by a model with significantly fewer
parameters and obtain similar performance on SV tasks. We can further refine
this compact model by discriminative training and obtain i-vectors that lead to
better performance on various SV benchmarks representing different acoustic
domains. Comment: Submitted to Interspeech 2019, Graz, Austria. arXiv admin note:
substantial text overlap with arXiv:1810.1318
Exploiting i–vector posterior covariances for short–duration language recognition
Linear models in i-vector space have been shown to be an effective solution not only for speaker identification, but also for language recognition. The i-vector extraction process, however, is affected by several factors, such as the noise level, the acoustic content of the utterance and the duration of the spoken segments. These factors influence both the i-vector estimate and its uncertainty, represented by the i-vector posterior covariance matrix. Modeling i-vector uncertainty with Probabilistic Linear Discriminant Analysis has been shown to be effective for short-duration speaker identification. This paper extends the approach to language recognition, analyzing the effects of i-vector covariances on a state-of-the-art Gaussian classifier, and proposes an effective solution for reducing the average detection cost (Cavg) for short segments.
Target Speech Extraction with Pre-trained Self-supervised Learning Models
Pre-trained self-supervised learning (SSL) models have achieved remarkable
success in various speech tasks. However, their potential in target speech
extraction (TSE) has not been fully exploited. TSE aims to extract the speech
of a target speaker in a mixture guided by enrollment utterances. We exploit
pre-trained SSL models for two purposes within a TSE framework, i.e., to
process the input mixture and to derive speaker embeddings from the enrollment.
In this paper, we focus on how to effectively use SSL models for TSE. We first
introduce a novel TSE downstream task following the SUPERB principles. This
simple experiment shows the potential of SSL models for TSE, but extraction
performance remains far behind the state-of-the-art. We then extend a powerful
TSE architecture by incorporating two SSL-based modules: an Adaptive Input
Enhancer (AIE) and a speaker encoder. Specifically, the proposed AIE utilizes
intermediate representations from the CNN encoder by adjusting the time
resolution of the CNN encoder and transformer blocks through progressive
upsampling, capturing both fine-grained and hierarchical features. Our method
outperforms current TSE systems, achieving an SI-SDR improvement of 14.0 dB on
LibriMix. Moreover, we can further improve performance by 0.7 dB by fine-tuning
the whole model including the SSL model parameters. Comment: Accepted to ICASSP 202
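The abstract above reports results as SI-SDR improvement. As a minimal sketch of how that metric is commonly computed (the function names and the pure-Python list representation are assumptions for illustration, not taken from the paper), the estimate is projected onto the target and the energy of the projection is compared with the energy of the residual:

```python
import math


def _zero_mean(signal):
    """Remove the mean, as the usual SI-SDR definition assumes zero-mean signals."""
    mean = sum(signal) / len(signal)
    return [x - mean for x in signal]


def si_sdr(estimate, target):
    """Scale-invariant SDR in dB for two equal-length sample sequences.

    Assumes the estimate is not a perfect scaled copy of the target
    (otherwise the residual energy is zero and the ratio is undefined).
    """
    estimate = _zero_mean(estimate)
    target = _zero_mean(target)
    # Optimal scaling of the target that best explains the estimate.
    scale = sum(e * t for e, t in zip(estimate, target)) / sum(t * t for t in target)
    projection = [scale * t for t in target]
    residual = [e - p for e, p in zip(estimate, projection)]
    signal_energy = sum(p * p for p in projection)
    residual_energy = sum(r * r for r in residual)
    return 10.0 * math.log10(signal_energy / residual_energy)


def si_sdr_improvement(estimate, mixture, target):
    """Gain of the extracted signal over the unprocessed mixture, in dB."""
    return si_sdr(estimate, target) - si_sdr(mixture, target)
```

Because of the projection step, rescaling the estimate by any non-zero constant leaves the score unchanged, which is what makes the measure scale-invariant.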
An attention-based backend allowing efficient fine-tuning of transformer models for speaker verification
In recent years, the self-supervised learning paradigm has received extensive
attention due to its great success in various downstream tasks. However, the
fine-tuning strategies for adapting those pre-trained models to the speaker
verification task have yet to be fully explored. In this paper, we analyze
several feature extraction approaches built on top of a pre-trained model, as
well as regularization and a learning rate schedule that stabilize the
fine-tuning process and further boost performance. We propose multi-head
factorized attentive pooling to factorize the comparison of speaker
representations into multiple phonetic clusters. We regularize towards the parameters of the
pre-trained model and we set different learning rates for each layer of the
pre-trained model during fine-tuning. The experimental results show that our method
can significantly shorten the training time to 4 hours and achieve SOTA
performance: 0.59%, 0.79% and 1.77% EER on Vox1-O, Vox1-E and Vox1-H,
respectively. Comment: Accepted by SLT202
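The abstract above mentions two fine-tuning devices: a separate learning rate per layer and a regularizer that pulls the weights towards the pre-trained ones. The sketch below illustrates one plausible form of each. The geometric decay schedule, the squared-distance penalty and all names are assumptions for illustration; the abstract does not specify the actual schedule or penalty.

```python
def layerwise_learning_rates(num_layers, top_lr, decay):
    """Return one learning rate per layer, with layer 0 closest to the input.

    Hypothetical schedule: the top layer uses top_lr and each layer below it
    is scaled by a further factor of decay, so lower layers move more slowly.
    """
    return [top_lr * decay ** (num_layers - 1 - i) for i in range(num_layers)]


def l2_to_pretrained(params, pretrained, weight):
    """Penalty weight * sum((theta - theta_0)^2) over flat parameter lists.

    Adding this term to the training loss discourages the fine-tuned
    parameters from drifting far from their pre-trained values.
    """
    return weight * sum((p - q) ** 2 for p, q in zip(params, pretrained))
```

In a deep learning framework, the per-layer rates would typically be passed to the optimizer as separate parameter groups, and the penalty would be added to the task loss at each step.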
Automatic language identification using deep neural networks
I. López-Moreno, J. González-Domínguez, P. Oldrich, D. R. Martínez, J. González-Rodríguez, "Automatic language identification using deep neural networks", IEEE International Conference on Acoustics, Speech, and Signal Processing ICASSP, Florence (Italy), 2014.
This work studies the use of deep neural networks (DNNs)
to address automatic language identification (LID). Motivated
by their recent success in acoustic modelling, we adapt DNNs
to the problem of identifying the language of a given spoken
utterance from short-term acoustic features. The proposed approach
is compared to state-of-the-art i-vector based acoustic
systems on two different datasets: Google 5M LID corpus and
NIST LRE 2009. Results show how LID can largely benefit
from using DNNs, especially when a large amount of training
data is available. We found relative improvements of up to 70%
in Cavg over the baseline system.
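For readers unfamiliar with the metric, Cavg is the average detection cost used in the NIST language recognition evaluations. The abstract does not define it; the form below is the commonly cited closed-set definition, and the parameter values noted afterwards are the usual defaults, stated here as an assumption rather than as the exact setup of this paper.

```latex
C_{\mathrm{avg}} = \frac{1}{N_L} \sum_{L_T}
  \Big[ C_{\mathrm{Miss}} \, P_{\mathrm{Target}} \, P_{\mathrm{Miss}}(L_T)
      + \sum_{L_N \neq L_T} C_{\mathrm{FA}} \, P_{\mathrm{Non\text{-}Target}} \, P_{\mathrm{FA}}(L_T, L_N) \Big],
\qquad
P_{\mathrm{Non\text{-}Target}} = \frac{1 - P_{\mathrm{Target}}}{N_L - 1}
```

Here N_L is the number of languages, L_T and L_N range over target and non-target languages, and the usual settings are C_Miss = C_FA = 1 and P_Target = 0.5. Lower values are better, so a 70% relative improvement means the cost falls to 30% of the baseline value.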