187 research outputs found

    Factorization of Discriminatively Trained i-vector Extractor for Speaker Recognition

    Full text link
    In this work, we continue in our research on i-vector extractor for speaker verification (SV) and we optimize its architecture for fast and effective discriminative training. We were motivated by computational and memory requirements caused by the large number of parameters of the original generative i-vector model. Our aim is to preserve the power of the original generative model, and at the same time focus the model towards extraction of speaker-related information. We show that it is possible to represent a standard generative i-vector extractor by a model with significantly less parameters and obtain similar performance on SV tasks. We can further refine this compact model by discriminative training and obtain i-vectors that lead to better performance on various SV benchmarks representing different acoustic domains.Comment: Submitted to Interspeech 2019, Graz, Austria. arXiv admin note: substantial text overlap with arXiv:1810.1318

    On deep speaker embeddings for text-independent speaker recognition

    Full text link
    We investigate deep neural network performance in the textindependent speaker recognition task. We demonstrate that using angular softmax activation at the last classification layer of a classification neural network instead of a simple softmax activation allows to train a more generalized discriminative speaker embedding extractor. Cosine similarity is an effective metric for speaker verification in this embedding space. We also address the problem of choosing an architecture for the extractor. We found that deep networks with residual frame level connections outperform wide but relatively shallow architectures. This paper also proposes several improvements for previous DNN-based extractor systems to increase the speaker recognition accuracy. We show that the discriminatively trained similarity metric learning approach outperforms the standard LDA-PLDA method as an embedding backend. The results obtained on Speakers in the Wild and NIST SRE 2016 evaluation sets demonstrate robustness of the proposed systems when dealing with close to real-life conditions.Comment: Submitted to Odyssey 201

    A Speaker Verification Backend with Robust Performance across Conditions

    Full text link
    In this paper, we address the problem of speaker verification in conditions unseen or unknown during development. A standard method for speaker verification consists of extracting speaker embeddings with a deep neural network and processing them through a backend composed of probabilistic linear discriminant analysis (PLDA) and global logistic regression score calibration. This method is known to result in systems that work poorly on conditions different from those used to train the calibration model. We propose to modify the standard backend, introducing an adaptive calibrator that uses duration and other automatically extracted side-information to adapt to the conditions of the inputs. The backend is trained discriminatively to optimize binary cross-entropy. When trained on a number of diverse datasets that are labeled only with respect to speaker, the proposed backend consistently and, in some cases, dramatically improves calibration, compared to the standard PLDA approach, on a number of held-out datasets, some of which are markedly different from the training data. Discrimination performance is also consistently improved. We show that joint training of the PLDA and the adaptive calibrator is essential -- the same benefits cannot be achieved when freezing PLDA and fine-tuning the calibrator. To our knowledge, the results in this paper are the first evidence in the literature that it is possible to develop a speaker verification system with robust out-of-the-box performance on a large variety of conditions

    Graph Neural Network Backend for Speaker Recognition

    Full text link
    Currently, most speaker recognition backends, such as cosine, linear discriminant analysis (LDA), or probabilistic linear discriminant analysis (PLDA), make decisions by calculating similarity or distance between enrollment and test embeddings which are already extracted from neural networks. However, for each embedding, the local structure of itself and its neighbor embeddings in the low-dimensional space is different, which may be helpful for the recognition but is often ignored. In order to take advantage of it, we propose a graph neural network (GNN) backend to mine latent relationships among embeddings for classification. We assume all the embeddings as nodes on a graph, and their edges are computed based on some similarity function, such as cosine, LDA+cosine, or LDA+PLDA. We study different graph settings and explore variants of GNN to find a better message passing and aggregation way to accomplish the recognition task. Experimental results on NIST SRE14 i-vector challenging, VoxCeleb1-O, VoxCeleb1-E, and VoxCeleb1-H datasets demonstrate that our proposed GNN backends significantly outperform current mainstream methods

    Deep learning backend for single and multisession i-vector speaker recognition

    Get PDF
    The lack of labeled background data makes a big performance gap between cosine and Probabilistic Linear Discriminant Analysis (PLDA) scoring baseline techniques for i-vectors in speaker recognition. Although there are some unsupervised clustering techniques to estimate the labels, they cannot accurately predict the true labels and they also assume that there are several samples from the same speaker in the background data that could not be true in reality. In this paper, the authors make use of Deep Learning (DL) to fill this performance gap given unlabeled background data. To this goal, the authors have proposed an impostor selection algorithm and a universal model adaptation process in a hybrid system based on deep belief networks and deep neural networks to discriminatively model each target speaker. In order to have more insight into the behavior of DL techniques in both single- and multisession speaker enrollment tasks, some experiments have been carried out in this paper in both scenarios. Experiments on National Institute of Standards and Technology 2014 i-vector challenge show that 46% of this performance gap, in terms of minimum of the decision cost function, is filled by the proposed DL-based system. Furthermore, the score combination of the proposed DL-based system and PLDA with estimated labels covers 79% of this gap.Peer ReviewedPostprint (published version
    • …
    corecore