187 research outputs found
Factorization of Discriminatively Trained i-vector Extractor for Speaker Recognition
In this work, we continue our research on the i-vector extractor for speaker
verification (SV) and we optimize its architecture for fast and effective
discriminative training. We were motivated by computational and memory
requirements caused by the large number of parameters of the original
generative i-vector model. Our aim is to preserve the power of the original
generative model, and at the same time focus the model towards extraction of
speaker-related information. We show that it is possible to represent a
standard generative i-vector extractor by a model with significantly fewer
parameters and obtain similar performance on SV tasks. We can further refine
this compact model by discriminative training and obtain i-vectors that lead to
better performance on various SV benchmarks representing different acoustic
domains.
Comment: Submitted to Interspeech 2019, Graz, Austria. arXiv admin note:
substantial text overlap with arXiv:1810.1318
On deep speaker embeddings for text-independent speaker recognition
We investigate deep neural network performance in the text-independent speaker
recognition task. We demonstrate that using angular softmax activation at the
last classification layer of a classification neural network instead of a
simple softmax activation makes it possible to train a discriminative speaker
embedding extractor that generalizes better. Cosine similarity is an effective metric for
speaker verification in this embedding space. We also address the problem of
choosing an architecture for the extractor. We found that deep networks with
residual frame level connections outperform wide but relatively shallow
architectures. This paper also proposes several improvements for previous
DNN-based extractor systems to increase the speaker recognition accuracy. We
show that the discriminatively trained similarity metric learning approach
outperforms the standard LDA-PLDA method as an embedding backend. The results
obtained on Speakers in the Wild and NIST SRE 2016 evaluation sets demonstrate
robustness of the proposed systems when dealing with close to real-life
conditions.
Comment: Submitted to Odyssey 201
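The cosine-similarity scoring mentioned in this abstract can be illustrated with a minimal sketch. It assumes embeddings are plain lists of floats; the function names and the decision threshold are hypothetical and are not taken from the paper.

```python
import math


def cosine_score(enroll, test):
    """Cosine similarity between an enrollment and a test embedding."""
    dot = sum(a * b for a, b in zip(enroll, test))
    norm = math.sqrt(sum(a * a for a in enroll)) * math.sqrt(sum(b * b for b in test))
    if norm == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / norm


def same_speaker(enroll, test, threshold=0.5):
    """Accept the trial when the score reaches the threshold.

    The default threshold is an arbitrary placeholder; in practice it
    would be tuned on development data.
    """
    return cosine_score(enroll, test) >= threshold
```

Because the score depends only on the angle between the two vectors, rescaling either embedding leaves the decision unchanged.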
A Speaker Verification Backend with Robust Performance across Conditions
In this paper, we address the problem of speaker verification in conditions
unseen or unknown during development. A standard method for speaker
verification consists of extracting speaker embeddings with a deep neural
network and processing them through a backend composed of probabilistic linear
discriminant analysis (PLDA) and global logistic regression score calibration.
This method is known to result in systems that work poorly on conditions
different from those used to train the calibration model. We propose to modify
the standard backend, introducing an adaptive calibrator that uses duration and
other automatically extracted side-information to adapt to the conditions of
the inputs. The backend is trained discriminatively to optimize binary
cross-entropy. When trained on a number of diverse datasets that are labeled
only with respect to speaker, the proposed backend consistently and, in some
cases, dramatically improves calibration, compared to the standard PLDA
approach, on a number of held-out datasets, some of which are markedly
different from the training data. Discrimination performance is also
consistently improved. We show that joint training of the PLDA and the adaptive
calibrator is essential -- the same benefits cannot be achieved when freezing
PLDA and fine-tuning the calibrator. To our knowledge, the results in this
paper are the first evidence in the literature that it is possible to develop a
speaker verification system with robust out-of-the-box performance on a large
variety of conditions.
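The difference between a global calibrator and a condition-dependent one can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not give the functional form of the adaptive calibrator, so the affine dependence on log-duration, the parameter names, and the use of equal priors in the loss are all hypothetical.

```python
import math


def global_calibration(score, scale, offset):
    """Global logistic-regression calibration: one scale and offset for all trials."""
    return scale * score + offset


def adaptive_calibration(score, enroll_dur, test_dur, params):
    """Hypothetical adaptive calibrator.

    Assumption: scale and offset are affine functions of the summed
    log-durations (in seconds, must be positive) of the two sides of the trial.
    """
    side = math.log(enroll_dur) + math.log(test_dur)
    scale = params["scale_base"] + params["scale_dur"] * side
    offset = params["offset_base"] + params["offset_dur"] * side
    return scale * score + offset


def binary_cross_entropy(llrs, labels):
    """Mean binary cross-entropy, treating each calibrated score as log-odds.

    labels: 1 for same-speaker trials, 0 for different-speaker trials.
    Equal priors are assumed for simplicity.
    """
    total = 0.0
    for llr, label in zip(llrs, labels):
        x = -llr if label == 1 else llr
        # Numerically stable softplus: log(1 + exp(x)).
        total += max(x, 0.0) + math.log1p(math.exp(-abs(x)))
    return total / len(llrs)
```

With both duration weights set to zero, the adaptive form reduces to the global one, which makes the global calibrator a special case of this sketch.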
Graph Neural Network Backend for Speaker Recognition
Currently, most speaker recognition backends, such as cosine, linear
discriminant analysis (LDA), or probabilistic linear discriminant analysis
(PLDA), make decisions by calculating similarity or distance between enrollment
and test embeddings that have already been extracted by neural networks. However,
the local structure formed by each embedding and its neighbors in the
low-dimensional space differs from one embedding to another; this structure may
help recognition but is often ignored. To take advantage of it, we propose
a graph neural network (GNN) backend to mine latent relationships among
embeddings for classification. We treat all embeddings as nodes of a
graph and compute their edges with a similarity function, such as
cosine, LDA+cosine, or LDA+PLDA. We study different graph settings and explore
variants of GNN to find a better message-passing and aggregation scheme for
the recognition task. Experimental results on the NIST SRE14 i-vector
challenge, VoxCeleb1-O, VoxCeleb1-E, and VoxCeleb1-H datasets demonstrate
that our proposed GNN backends significantly outperform current mainstream
methods.
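The graph construction described in this abstract (embeddings as nodes, edges from a similarity function) can be sketched in a few lines. This is a simplified illustration, not the paper's model: it assumes cosine similarity with a hypothetical threshold for edge creation and uses a single parameter-free mean-aggregation step in place of a trained GNN layer.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def build_graph(embeddings, threshold):
    """Connect two nodes when their cosine similarity reaches the threshold.

    Returns an adjacency mapping: node index -> list of neighbor indices.
    """
    neighbors = {i: [] for i in range(len(embeddings))}
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            if cosine(embeddings[i], embeddings[j]) >= threshold:
                neighbors[i].append(j)
                neighbors[j].append(i)
    return neighbors


def mean_aggregate(embeddings, neighbors):
    """One message-passing step: each node averages itself with its neighbors."""
    updated = []
    for i, emb in enumerate(embeddings):
        group = [emb] + [embeddings[j] for j in neighbors[i]]
        updated.append([sum(column) / len(group) for column in zip(*group)])
    return updated
```

A trained GNN would replace the plain average with learned transformations, but the flow of information along similarity edges is the same.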
Deep learning backend for single and multisession i-vector speaker recognition
The lack of labeled background data creates a large performance gap between the cosine and Probabilistic Linear Discriminant Analysis (PLDA) scoring baseline techniques for i-vectors in speaker recognition. Although some unsupervised clustering techniques can estimate the labels, they cannot accurately predict the true labels, and they also assume that the background data contain several samples from the same speaker, which may not hold in practice. In this paper, the authors make use of Deep Learning (DL) to fill this performance gap given unlabeled background data. To this end, the authors propose an impostor selection algorithm and a universal model adaptation process in a hybrid system based on deep belief networks and deep neural networks to discriminatively model each target speaker. To gain more insight into the behavior of DL techniques in both single- and multisession speaker enrollment tasks, experiments are carried out in both scenarios. Experiments on the National Institute of Standards and Technology 2014 i-vector challenge show that 46% of this performance gap, in terms of the minimum of the decision cost function, is filled by the proposed DL-based system. Furthermore, the score combination of the proposed DL-based system and PLDA with estimated labels covers 79% of this gap.
Peer Reviewed. Postprint (published version)
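The "fraction of the gap filled" reported above can be read as a simple ratio. The sketch below is one plausible interpretation, not the authors' stated formula: it assumes the gap is measured between the minimum decision cost of cosine scoring and that of PLDA, and the function and argument names are hypothetical.

```python
def gap_fraction_closed(dcf_cosine, dcf_plda, dcf_system):
    """Fraction of the cosine-to-PLDA minDCF gap closed by a system.

    Lower minDCF is better. A value of 0.0 means the system matches
    cosine scoring; 1.0 means it matches PLDA.
    """
    gap = dcf_cosine - dcf_plda
    if gap <= 0.0:
        raise ValueError("expected PLDA to have a lower minDCF than cosine scoring")
    return (dcf_cosine - dcf_system) / gap
```

Under this reading, a system whose minDCF lies exactly halfway between the two baselines closes 50% of the gap.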