223 research outputs found
A Speaker Verification Backend with Robust Performance across Conditions
In this paper, we address the problem of speaker verification in conditions
unseen or unknown during development. A standard method for speaker
verification consists of extracting speaker embeddings with a deep neural
network and processing them through a backend composed of probabilistic linear
discriminant analysis (PLDA) and global logistic regression score calibration.
This method is known to result in systems that work poorly on conditions
different from those used to train the calibration model. We propose to modify
the standard backend, introducing an adaptive calibrator that uses duration and
other automatically extracted side-information to adapt to the conditions of
the inputs. The backend is trained discriminatively to optimize binary
cross-entropy. When trained on a number of diverse datasets that are labeled
only with respect to speaker, the proposed backend consistently and, in some
cases, dramatically improves calibration, compared to the standard PLDA
approach, on a number of held-out datasets, some of which are markedly
different from the training data. Discrimination performance is also
consistently improved. We show that joint training of the PLDA and the adaptive
calibrator is essential -- the same benefits cannot be achieved when freezing
PLDA and fine-tuning the calibrator. To our knowledge, the results in this
paper are the first evidence in the literature that it is possible to develop a
speaker verification system with robust out-of-the-box performance on a large
variety of conditions
NPLDA: A Deep Neural PLDA Model for Speaker Verification
The state-of-art approach for speaker verification consists of a neural
network based embedding extractor along with a backend generative model such as
the Probabilistic Linear Discriminant Analysis (PLDA). In this work, we propose
a neural network approach for backend modeling in speaker recognition. The
likelihood ratio score of the generative PLDA model is posed as a
discriminative similarity function and the learnable parameters of the score
function are optimized using a verification cost. The proposed model, termed as
neural PLDA (NPLDA), is initialized using the generative PLDA model parameters.
The loss function for the NPLDA model is an approximation of the minimum
detection cost function (DCF). The speaker recognition experiments using the
NPLDA model are performed on the speaker verificiation task in the VOiCES
datasets as well as the SITW challenge dataset. In these experiments, the NPLDA
model optimized using the proposed loss function improves significantly over
the state-of-art PLDA based speaker verification system.Comment: Published in Odyssey 2020, the Speaker and Language Recognition
Workshop (VOiCES Special Session). Link to GitHub Implementation:
https://github.com/iiscleap/NeuralPlda. arXiv admin note: substantial text
overlap with arXiv:2001.0703
I4U Submission to NIST SRE 2018: Leveraging from a Decade of Shared Experiences
The I4U consortium was established to facilitate a joint entry to NIST
speaker recognition evaluations (SRE). The latest edition of such joint
submission was in SRE 2018, in which the I4U submission was among the
best-performing systems. SRE'18 also marks the 10-year anniversary of I4U
consortium into NIST SRE series of evaluation. The primary objective of the
current paper is to summarize the results and lessons learned based on the
twelve sub-systems and their fusion submitted to SRE'18. It is also our
intention to present a shared view on the advancements, progresses, and major
paradigm shifts that we have witnessed as an SRE participant in the past decade
from SRE'08 to SRE'18. In this regard, we have seen, among others, a paradigm
shift from supervector representation to deep speaker embedding, and a switch
of research challenge from channel compensation to domain adaptation.Comment: 5 page
I4U System Description for NIST SRE'20 CTS Challenge
This manuscript describes the I4U submission to the 2020 NIST Speaker
Recognition Evaluation (SRE'20) Conversational Telephone Speech (CTS)
Challenge. The I4U's submission was resulted from active collaboration among
researchers across eight research teams - IR (Singapore), UEF (Finland),
VALPT (Italy, Spain), NEC (Japan), THUEE (China), LIA (France), NUS
(Singapore), INRIA (France) and TJU (China). The submission was based on the
fusion of top performing sub-systems and sub-fusion systems contributed by
individual teams. Efforts have been spent on the use of common development and
validation sets, submission schedule and milestone, minimizing inconsistency in
trial list and score file format across sites.Comment: SRE 2021, NIST Speaker Recognition Evaluation Workshop, CTS Speaker
Recognition Challenge, 14-12 December 202
The I4U Mega Fusion and Collaboration for NIST Speaker Recognition Evaluation 2016
The 2016 speaker recognition evaluation (SRE'16) is the latest edition in the series of benchmarking events conducted by the National Institute of Standards and Technology (NIST). I4U is a joint entry to SRE'16 as the result from the collaboration and active exchange of information among researchers from sixteen Institutes and Universities across 4 continents. The joint submission and several of its 32 sub-systems were among top-performing systems. A lot of efforts have been devoted to two major challenges, namely, unlabeled training data and dataset shift from Switchboard-Mixer to the new Call My Net dataset. This paper summarizes the lessons learned, presents our shared view from the sixteen research groups on recent advances, major paradigm shift, and common tool chain used in speaker recognition as we have witnessed in SRE'16. More importantly, we look into the intriguing question of fusing a large ensemble of sub-systems and the potential benefit of large-scale collaboration.Peer reviewe
- …