Generative pairwise models for speaker recognition
This paper proposes a simple model for speaker recognition based on i-vector pairs, and analyzes its similarities and differences with respect to the state-of-the-art Probabilistic Linear Discriminant Analysis (PLDA) and Pairwise Support Vector Machine (PSVM) models. As in the discriminative PSVM approach, we propose a generative model of i-vector pairs, rather than a usual i-vector based model. The model is based on two Gaussian distributions, one for the "same speakers" and the other for the "different speakers" i-vector pairs, and on the assumption that the i-vector pairs are independent. This independence assumption allows the distributions of the two classes to be independently estimated. The "Two-Gaussian" approach can be extended to Heavy-Tailed distributions, still allowing a fast closed-form solution to be obtained for testing i-vector pairs. We show that this model is closely related to the PLDA and PSVM models, and that, tested on the female part of the tel-tel NIST SRE 2010 extended evaluation set, it achieves comparable accuracy with respect to the other models, trained with different objective functions and training procedures.
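The scoring rule described above (two Gaussians, one over "same speaker" pairs and one over "different speaker" pairs, compared by likelihood ratio) can be sketched in a few lines of numpy. This is not the paper's implementation, only a minimal illustration: full-covariance Gaussians are fitted by maximum likelihood to stacked i-vector pairs, and a trial is scored by the log-likelihood ratio of the two densities.

```python
import numpy as np

def fit_gaussian(pairs):
    """Fit a full-covariance Gaussian to stacked i-vector pairs.

    pairs: (n, 2d) array, each row the concatenation of two i-vectors.
    """
    mu = pairs.mean(axis=0)
    cov = np.cov(pairs, rowvar=False)
    return mu, cov

def log_gaussian(x, mu, cov):
    """Log-density of a multivariate Gaussian (naive, for illustration only)."""
    d = x - mu
    sign, logdet = np.linalg.slogdet(cov)
    k = len(mu)
    return -0.5 * (k * np.log(2 * np.pi) + logdet + d @ np.linalg.solve(cov, d))

def pair_llr(x, same_params, diff_params):
    """Score a concatenated i-vector pair: 'same speaker' vs 'different'."""
    return log_gaussian(x, *same_params) - log_gaussian(x, *diff_params)
```

Because the two class distributions are estimated independently, each `fit_gaussian` call touches only its own class's pairs, mirroring the independence assumption in the abstract.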
NPLDA: A Deep Neural PLDA Model for Speaker Verification
The state-of-the-art approach for speaker verification consists of a neural network based embedding extractor along with a backend generative model such as Probabilistic Linear Discriminant Analysis (PLDA). In this work, we propose a neural network approach for backend modeling in speaker recognition. The likelihood ratio score of the generative PLDA model is posed as a discriminative similarity function, and the learnable parameters of the score function are optimized using a verification cost. The proposed model, termed neural PLDA (NPLDA), is initialized using the generative PLDA model parameters. The loss function for the NPLDA model is an approximation of the minimum detection cost function (DCF). The speaker recognition experiments using the NPLDA model are performed on the speaker verification task in the VOiCES datasets as well as the SITW challenge dataset. In these experiments, the NPLDA model optimized using the proposed loss function improves significantly over the state-of-the-art PLDA based speaker verification system.
Comment: Published in Odyssey 2020, the Speaker and Language Recognition Workshop (VOiCES Special Session). Link to GitHub implementation: https://github.com/iiscleap/NeuralPlda. arXiv admin note: substantial text overlap with arXiv:2001.0703
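A key ingredient in the abstract is the approximation of the minimum detection cost function so it can serve as a training loss. The sketch below is a generic soft-DCF in numpy, not the authors' exact formulation: the hard miss/false-alarm counts are replaced by sigmoids of the scores, making the cost differentiable in the score function's parameters (the sharpness `alpha` and operating point are illustrative choices).

```python
import numpy as np

def soft_detection_cost(scores, labels, threshold=0.0, alpha=10.0,
                        p_target=0.01, c_miss=1.0, c_fa=1.0):
    """Differentiable approximation of the NIST detection cost function.

    scores: trial scores; labels: 1 for target trials, 0 for non-targets.
    Hard step-function counts of misses and false alarms are smoothed
    with a sigmoid of slope `alpha`, so the cost admits gradients.
    """
    sig = 1.0 / (1.0 + np.exp(-alpha * (scores - threshold)))
    p_miss = np.mean(1.0 - sig[labels == 1])  # targets falling below threshold
    p_fa = np.mean(sig[labels == 0])          # non-targets above threshold
    return c_miss * p_target * p_miss + c_fa * (1 - p_target) * p_fa
```

As `alpha` grows, the soft cost converges to the hard DCF at the chosen threshold; in practice a framework with autodiff (e.g., PyTorch, as in the linked repository) would backpropagate through this quantity.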
Homomorphic Encryption for Speaker Recognition: Protection of Biometric Templates and Vendor Model Parameters
Data privacy is crucial when dealing with biometric data. Accounting for the latest European data privacy regulation and payment service directive, biometric template protection is essential for any commercial application. By ensuring unlinkability across biometric service operators, irreversibility of leaked encrypted templates, and renewability of, e.g., voice models following the i-vector paradigm, voice-based biometric systems are prepared for the latest EU data privacy legislation. Employing Paillier cryptosystems, Euclidean and cosine comparators are known to satisfy data privacy demands without loss of discrimination or calibration performance. Bridging the gap from template protection to speaker recognition, two architectures are proposed for the two-covariance comparator, which serves as a generative model in this study. The first architecture preserves the privacy of biometric data capture subjects. In the second architecture, the model parameters of the comparator are encrypted as well, so that biometric service providers can supply the same comparison modules, employing different key pairs, to multiple biometric service operators. An experimental proof-of-concept and complexity analysis is carried out on data from the 2013-2014 NIST i-vector machine learning challenge.
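The Paillier cryptosystem underpinning this work is additively homomorphic: the product of two ciphertexts decrypts to the sum of the plaintexts, which is what lets a comparator accumulate score terms without ever seeing the template. A textbook sketch (toy prime sizes, never secure for real use; the paper itself does not prescribe this code) illustrates the property:

```python
import math
import random

def paillier_keygen(p, q):
    """Textbook Paillier key generation from two primes (toy sizes here).

    Uses the common simplification g = n + 1. Requires Python 3.9+
    (math.lcm, three-argument pow with exponent -1).
    """
    n = p * q
    lam = math.lcm(p - 1, q - 1)
    g = n + 1
    # mu = L(g^lam mod n^2)^{-1} mod n, with L(x) = (x - 1) // n
    mu = pow((pow(g, lam, n * n) - 1) // n, -1, n)
    return (n, g), (lam, mu)

def encrypt(pub, m):
    """E(m) = g^m * r^n mod n^2 for a random unit r."""
    n, g = pub
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(g, m, n * n) * pow(r, n, n * n)) % (n * n)

def decrypt(pub, priv, c):
    """D(c) = L(c^lam mod n^2) * mu mod n."""
    n, _ = pub
    lam, mu = priv
    return ((pow(c, lam, n * n) - 1) // n * mu) % n
```

The homomorphic property is then `decrypt(E(a) * E(b) mod n^2) == a + b`, which generalizes to the encrypted inner products needed by Euclidean and cosine comparators.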
Exploring the Encoding Layer and Loss Function in End-to-End Speaker and Language Recognition System
In this paper, we explore the encoding/pooling layer and loss function in end-to-end speaker and language recognition systems. First, a unified and interpretable end-to-end system for both speaker and language recognition is developed. It accepts variable-length input and produces an utterance-level result. In the end-to-end system, the encoding layer aggregates the variable-length input sequence into an utterance-level representation. Besides basic temporal average pooling, we introduce a self-attentive pooling layer and a learnable dictionary encoding layer to obtain the utterance-level representation. In terms of the loss function for open-set speaker verification, to obtain more discriminative speaker embeddings, center loss and angular softmax loss are introduced in the end-to-end system. Experimental results on the VoxCeleb and NIST LRE 07 datasets show that the performance of the end-to-end learning system can be significantly improved by the proposed encoding layer and loss function.
Comment: Accepted for Speaker Odyssey 201
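The self-attentive pooling mentioned in the abstract replaces the uniform weights of temporal average pooling with learned, frame-dependent weights. A minimal numpy sketch of the general idea (the specific scoring MLP shapes here are illustrative assumptions, not the paper's architecture):

```python
import numpy as np

def self_attentive_pooling(frames, w, b, v):
    """Collapse a variable-length frame sequence (T, d) into one vector (d,).

    A small MLP scores each frame: tanh(frames @ w + b) @ v; the scores
    are softmax-normalized over time and used as weights for a weighted
    mean, yielding the utterance-level representation.
    """
    h = np.tanh(frames @ w + b)   # (T, hidden) per-frame hidden state
    e = h @ v                     # (T,) unnormalized attention scores
    a = np.exp(e - e.max())
    a /= a.sum()                  # softmax over the time axis
    return a @ frames             # attention-weighted utterance embedding
```

Note that when the score vector `v` is zero, all frames receive equal weight and the layer reduces exactly to temporal average pooling, which makes the attentive variant a strict generalization of the baseline.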