Memory and computation trade-offs for efficient i-vector extraction
This work aims at reducing the memory demand of the data structures that are usually pre-computed and stored for fast computation of i-vectors, a compact representation of spoken utterances used by most state-of-the-art speaker recognition systems. We propose two new approaches that allow accurate i-vector extraction while requiring less memory, and show their relations to the standard computation method introduced for eigenvoices and to the recently proposed fast eigen-decomposition technique. The first approach computes an i-vector in a Variational Bayes (VB) framework by iteratively estimating one sub-block of i-vector elements at a time while keeping all the others fixed; it obtains i-vectors as accurate as those of the standard technique while requiring only 25% of its memory. The second technique is based on the Conjugate Gradient solution of a linear system; it is accurate and uses even less memory, but is slower than the VB approach. We analyze and compare the time and memory resources required by all these solutions, which are suited to different applications, and show that accurate results can be obtained with greatly reduced memory demand compared with the standard solution, at almost the same speed.
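The memory saving of the second approach comes from never materializing the pre-computed matrices: Conjugate Gradient only needs the linear operator applied to a vector. A minimal sketch of matrix-free CG on a toy symmetric positive-definite system (the actual i-vector precision matrix from the paper is not reproduced here, this stand-in only illustrates the solver):

```python
import numpy as np

def conjugate_gradient(matvec, b, tol=1e-8, max_iter=200):
    """Solve A w = b by CG, with A available only as matvec(v) = A @ v."""
    w = np.zeros_like(b)
    r = b - matvec(w)          # initial residual
    p = r.copy()               # initial search direction
    rs = r @ r
    for _ in range(max_iter):
        Ap = matvec(p)
        alpha = rs / (p @ Ap)  # optimal step along p
        w += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p  # new conjugate direction
        rs = rs_new
    return w

# Toy SPD system standing in for the i-vector posterior precision:
rng = np.random.default_rng(0)
M = rng.standard_normal((50, 50))
A = M @ M.T + 50 * np.eye(50)  # symmetric positive definite
b = rng.standard_normal(50)
w = conjugate_gradient(lambda v: A @ v, b)
print(np.allclose(A @ w, b, atol=1e-5))
```

Because `matvec` can be implemented by streaming over the mixture components, the full accumulator matrices never need to be stored, which is the source of the memory saving described above.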
I-vector transformation and scaling for PLDA based speaker recognition
This paper proposes a density model transformation for speaker recognition systems based on i-vectors and Probabilistic Linear Discriminant Analysis (PLDA) classification. The PLDA model assumes that the i-vectors are distributed according to the standard normal distribution, whereas it is well known that this is not the case. Experiments have shown that i-vectors are better modeled, for example, by a heavy-tailed distribution, and that significant improvement of the classification performance can be obtained by whitening and length-normalizing the i-vectors. In this work we propose to transform the i-vectors, extracted without regard to the classifier that will be used, so that their distribution becomes more suitable for discriminating speakers with PLDA. This is performed by means of a sequence of affine and non-linear transformations whose parameters are obtained by Maximum Likelihood (ML) estimation on the training set.
The second contribution of this work is the reduction of the mismatch between the development and test i-vector distributions by means of a scaling factor tuned to the estimated i-vector distribution, rather than by a blind length normalization.
Our tests, performed on the NIST SRE-2010 and SRE-2012 evaluation sets, show that improvements of the corresponding cost functions on the order of 10% can be obtained on both evaluation sets.
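The baseline this paper improves on, whitening followed by length normalization, can be sketched as below; the paper's own ML-trained affine and non-linear transforms are not reproduced here, and the array shapes are illustrative assumptions:

```python
import numpy as np

def whiten_and_length_normalize(dev_ivectors, ivectors):
    """Whiten using development-set statistics, then project to the unit sphere."""
    mu = dev_ivectors.mean(axis=0)
    cov = np.cov(dev_ivectors - mu, rowvar=False)
    # Whitening transform = inverse square root of the covariance.
    vals, vecs = np.linalg.eigh(cov)
    W = vecs @ np.diag(vals ** -0.5) @ vecs.T
    x = (ivectors - mu) @ W
    return x / np.linalg.norm(x, axis=1, keepdims=True)

# Synthetic stand-ins for development and test i-vectors:
rng = np.random.default_rng(1)
dev = rng.standard_normal((500, 20)) * rng.uniform(0.5, 2.0, 20)
test = rng.standard_normal((10, 20))
y = whiten_and_length_normalize(dev, test)
print(np.allclose(np.linalg.norm(y, axis=1), 1.0))
```

The scaling-factor contribution described above replaces the blind projection to unit length in the last line with a scale tuned to the estimated i-vector distribution.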
Memory and computation effective approaches for i-vector extraction
This paper focuses on the extraction of i-vectors, a compact representation of spoken utterances that is used by most state-of-the-art speaker recognition systems. This work was mainly motivated by the need to reduce the memory demand of the huge data structures that are usually pre-computed for fast computation of the i-vectors. We propose a set of new approaches that allow accurate i-vector extraction while requiring less memory, showing their relations to the standard computation method introduced for eigenvoices. We analyze the time and memory resources required by these solutions, which are suited to different fields of application, and we show that it is possible to obtain accurate results with solutions that reduce both computation time and memory demand compared with the standard solution.
Graph Neural Network Backend for Speaker Recognition
Currently, most speaker recognition backends, such as cosine scoring, linear discriminant analysis (LDA), or probabilistic linear discriminant analysis (PLDA), make decisions by computing the similarity or distance between enrollment and test embeddings already extracted from neural networks. However, each embedding has its own local structure, formed with its neighboring embeddings in the low-dimensional space, which may be helpful for recognition but is often ignored. To take advantage of it, we propose a graph neural network (GNN) backend that mines latent relationships among embeddings for classification. We treat all embeddings as nodes on a graph whose edges are computed with a similarity function, such as cosine, LDA+cosine, or LDA+PLDA. We study different graph settings and explore variants of GNNs to find better message-passing and aggregation schemes for the recognition task. Experimental results on the NIST SRE14 i-vector challenge, VoxCeleb1-O, VoxCeleb1-E, and VoxCeleb1-H datasets demonstrate that our proposed GNN backends significantly outperform current mainstream methods.
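The graph construction and a single aggregation step can be sketched as follows; this is a generic kNN-graph and mean-aggregation illustration, not the paper's specific GNN architecture, and all shapes and the neighbor count `k` are assumptions:

```python
import numpy as np

def cosine_adjacency(embeddings, k=3):
    """Build a k-nearest-neighbor graph from pairwise cosine similarities."""
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = x @ x.T                     # pairwise cosine similarity
    np.fill_diagonal(sim, -np.inf)    # exclude self from neighbor selection
    n = len(x)
    adj = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(sim[i])[-k:]  # k most similar embeddings
        adj[i, nbrs] = 1.0
    return np.maximum(adj, adj.T)       # symmetrize the edge set

def mean_aggregate(adj, features):
    """One message-passing step: average each node with its neighbors."""
    a = adj + np.eye(len(adj))          # include self-loop
    return (a @ features) / a.sum(axis=1, keepdims=True)

rng = np.random.default_rng(2)
emb = rng.standard_normal((8, 16))      # 8 embeddings of dimension 16
A = cosine_adjacency(emb, k=3)
h = mean_aggregate(A, emb)              # aggregated node representations
```

In the setting described above, the cosine similarity used for the edges could be replaced by LDA+cosine or LDA+PLDA scores, and the mean aggregation by a learned GNN layer.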