4,997 research outputs found
A Unified Framework for Score Normalization Techniques Applied to Text Independent Speaker Verification
The purpose of this paper is to unify several of the state-of-the-art score normalization techniques applied to text-independent speaker verification systems. We propose a new framework for this purpose. The two well-known Z- and T-normalization techniques can be easily interpreted in this framework as different ways to estimate score distributions. This is useful as it helps to understand the various assumptions behind these well-known score normalization techniques, and opens the door for yet more complex solutions. Finally, some experiments on the Switchboard database are performed in order to illustrate the validity of the new proposed framework
A Generative Model for Score Normalization in Speaker Recognition
We propose a theoretical framework for thinking about score normalization,
which confirms that normalization is not needed under (admittedly fragile)
ideal conditions. If, however, these conditions are not met, e.g. under
data-set shift between training and runtime, our theory reveals dependencies
between scores that could be exploited by strategies such as score
normalization. Indeed, it has been demonstrated over and over experimentally,
that various ad-hoc score normalization recipes do work. We present a first
attempt at using probability theory to design a generative score-space
normalization model which gives similar improvements to ZT-norm on the
text-dependent RSR 2015 database
Speaker verification using sequence discriminant support vector machines
This paper presents a text-independent speaker verification system using support vector machines (SVMs) with score-space kernels. Score-space kernels generalize Fisher kernels and are based on underlying generative models such as Gaussian mixture models (GMMs). This approach provides direct discrimination between whole sequences, in contrast with the frame-level approaches at the heart of most current systems. The resultant SVMs have a very high dimensionality since it is related to the number of parameters in the underlying generative model. To address problems that arise in the resultant optimization we introduce a technique called spherical normalization that preconditions the Hessian matrix. We have performed speaker verification experiments using the PolyVar database. The SVM system presented here reduces the relative error rates by 34% compared to a GMM likelihood ratio system
Speaker recognition by means of restricted Boltzmann machine adaptation
Restricted Boltzmann Machines (RBMs) have shown success in speaker recognition. In this paper, RBMs are investigated in a framework comprising a universal model training and model adaptation. Taking advantage of RBM unsupervised learning algorithm, a global model is trained based on all available background data. This general speaker-independent model, referred to as URBM, is further adapted to the data of a specific speaker to build speaker-dependent model. In order to show its effectiveness, we have applied this framework to two different tasks. It has been used to discriminatively model target and impostor spectral features for classification. It has been also utilized to produce a vector-based representation for speakers. This vector-based representation, similar to i-vector, can be further used for speaker recognition using either cosine scoring or Probabilistic Linear Discriminant Analysis (PLDA). The evaluation is performed on the core test condition of the NIST SRE 2006 database.Peer ReviewedPostprint (author's final draft
Attentive Statistics Pooling for Deep Speaker Embedding
This paper proposes attentive statistics pooling for deep speaker embedding
in text-independent speaker verification. In conventional speaker embedding,
frame-level features are averaged over all the frames of a single utterance to
form an utterance-level feature. Our method utilizes an attention mechanism to
give different weights to different frames and generates not only weighted
means but also weighted standard deviations. In this way, it can capture
long-term variations in speaker characteristics more effectively. An evaluation
on the NIST SRE 2012 and the VoxCeleb data sets shows that it reduces equal
error rates (EERs) from the conventional method by 7.5% and 8.1%, respectively.Comment: Proc. Interspeech 2018, pp2252--2256. arXiv admin note: text overlap
with arXiv:1809.0931
- …