VAE-based regularization for deep speaker embedding
Deep speaker embedding has achieved state-of-the-art performance in speaker
recognition. A potential problem is that these embedded vectors (called
'x-vectors') are not Gaussian, causing performance degradation with the
popular PLDA back-end scoring. In this paper, we propose a regularization
approach based on the Variational Auto-Encoder (VAE). This model transforms
x-vectors to a latent space where the mapped latent codes are more Gaussian,
and hence more suitable for PLDA scoring.
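A minimal sketch of such a regularizer, assuming 512-dimensional x-vectors
and a standard-Gaussian prior; the paper's exact architecture and training
details are not given in the abstract.

```python
import torch
import torch.nn as nn

class XVectorVAE(nn.Module):
    """Hypothetical VAE that maps x-vectors to a more Gaussian latent space."""
    def __init__(self, dim=512, hidden=1024, latent=512):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh())
        self.mu = nn.Linear(hidden, latent)        # posterior mean
        self.logvar = nn.Linear(hidden, latent)    # posterior log-variance
        self.dec = nn.Sequential(nn.Linear(latent, hidden), nn.Tanh(),
                                 nn.Linear(hidden, dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterization trick: sample z while keeping gradients.
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return self.dec(z), mu, logvar

def vae_loss(x, recon, mu, logvar):
    # Reconstruction plus KL to a standard Gaussian prior; the KL term is
    # what pushes the latent codes toward Gaussianity for PLDA scoring.
    rec = ((recon - x) ** 2).sum(dim=1).mean()
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=1).mean()
    return rec + kl
```

At test time, PLDA scoring would be run on the posterior means rather than on
the raw x-vectors.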
Neural Discriminant Analysis for Deep Speaker Embedding
Probabilistic Linear Discriminant Analysis (PLDA) is a popular tool in
open-set classification/verification tasks. However, the Gaussian assumption
underlying PLDA prevents it from being applied to situations where the data is
clearly non-Gaussian. In this paper, we present a novel nonlinear version of
PLDA, named Neural Discriminant Analysis (NDA). This model employs an
invertible deep neural network to transform a complex distribution into a
simple Gaussian, so that the linear Gaussian model can be readily established
in the transformed space. We tested this NDA model on a speaker recognition
task where the deep speaker vectors (x-vectors) are presumably non-Gaussian.
Experimental results on two datasets demonstrate that NDA consistently
outperforms PLDA by handling the non-Gaussian distributions of the x-vectors.
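The abstract does not specify the invertible network, so the sketch below
uses an affine coupling layer, one common invertible building block, to
illustrate the core idea: train the transform to maximize the Gaussian
likelihood of its output plus the Jacobian log-determinant, so the mapped
x-vectors become Gaussian. All dimensions are illustrative.

```python
import math
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """One invertible layer: the first half of x parameterizes an affine
    transform of the second half, so the Jacobian is triangular."""
    def __init__(self, dim=512, hidden=256):
        super().__init__()
        self.d = dim // 2
        self.net = nn.Sequential(nn.Linear(self.d, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2 * (dim - self.d)))

    def forward(self, x):
        x1, x2 = x[:, :self.d], x[:, self.d:]
        s, t = self.net(x1).chunk(2, dim=1)
        s = torch.tanh(s)                         # keep the scales bounded
        z = torch.cat([x1, x2 * torch.exp(s) + t], dim=1)
        return z, s.sum(dim=1)                    # latent code, log|det J|

def nll(z, logdet):
    # Negative log-likelihood under a standard Gaussian in the transformed
    # space; minimizing it Gaussianizes the input distribution.
    log_pz = -0.5 * (z ** 2).sum(dim=1) - 0.5 * z.size(1) * math.log(2 * math.pi)
    return -(log_pz + logdet).mean()
```

In practice several such layers would be stacked (with the halves swapped
between layers) before fitting PLDA in the transformed space.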
A Robust Speaker Clustering Method Based on Discrete Tied Variational Autoencoder
Recently, speaker clustering based on agglomerative hierarchical clustering
(AHC) has become a common approach to two main problems: clustering with an
unknown number of categories and clustering with a fixed number of
categories. In the typical long-recording scenario, the model takes features
such as i-vectors as input to a probabilistic linear discriminant analysis
(PLDA) model that forms the distance matrix, and the clustering results are
then obtained through the clustering model. However, traditional AHC-based
speaker clustering suffers from long running times and remains sensitive to
environmental noise. In this paper, we propose a novel speaker clustering
method based on Mutual Information (MI) and a non-linear model with discrete
variables, inspired by the Tied Variational Autoencoder (TVAE), to enhance
robustness against noise. The proposed method, named Discrete Tied
Variational Autoencoder (DTVAE), shortens the elapsed time substantially.
Experimental results show that it outperforms the general model, yielding a
relative Accuracy (ACC) improvement and a significant time reduction.
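The abstract gives few architectural details, so the sketch below shows only
the generic ingredient it names, a discrete latent variable, using the
Gumbel-softmax relaxation so that the soft cluster assignments stay
differentiable. The encoder shape and the number of clusters are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiscreteEncoder(nn.Module):
    """Hypothetical encoder mapping a speaker vector to a soft cluster label."""
    def __init__(self, dim=400, n_clusters=64, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_clusters))

    def forward(self, x, tau=1.0):
        logits = self.net(x)
        # Differentiable sample from the categorical posterior; hard=True
        # would return one-hot assignments with straight-through gradients.
        y = F.gumbel_softmax(logits, tau=tau, hard=False)
        return y, logits
```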
VAE-based Domain Adaptation for Speaker Verification
Deep speaker embedding has achieved satisfactory performance in speaker
verification. By enforcing the neural model to discriminate the speakers in the
training set, deep speaker embedding (called 'x-vectors') can be derived from
the hidden layers. Despite its good performance, the present embedding model is
highly domain sensitive, which means that it often works well in domains whose
acoustic condition matches that of the training data (in-domain), but degrades
in mismatched domains (out-of-domain). In this paper, we present a domain
adaptation approach based on Variational Auto-Encoder (VAE). This model
transforms x-vectors to a regularized latent space; within this latent space, a
small amount of data from the target domain is sufficient to accomplish the
adaptation. Our experiments demonstrate that with this VAE-based adaptation
approach, speaker embeddings can be easily transformed to the target domain,
leading to a noticeable performance improvement.
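A hedged sketch of the adaptation step: starting from a VAE trained on
in-domain x-vectors (for instance the hypothetical XVectorVAE sketched
earlier), fine-tune it briefly on a small set of target-domain vectors. The
function name and hyperparameters are illustrative, not the paper's.

```python
import torch

def adapt(vae, vae_loss, target_xvectors, epochs=10, lr=1e-4):
    """Fine-tune a pre-trained VAE on a small target-domain set."""
    opt = torch.optim.Adam(vae.parameters(), lr=lr)
    for _ in range(epochs):
        for x in torch.split(target_xvectors, 128):   # small mini-batches
            recon, mu, logvar = vae(x)
            loss = vae_loss(x, recon, mu, logvar)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return vae   # PLDA scoring is then run in the adapted latent space
```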
Mixture factorized auto-encoder for unsupervised hierarchical deep factorization of speech signal
A speech signal is constituted by various informative factors, such as
linguistic content and speaker characteristics. There have been notable
recent studies attempting to factorize the speech signal into these
individual factors without requiring any annotation. These studies typically
assume a continuous representation for linguistic content, which is not in
accordance with general linguistic knowledge and may make the extraction of speaker
information less successful. This paper proposes the mixture factorized
auto-encoder (mFAE) for unsupervised deep factorization. The encoder part of
mFAE comprises a frame tokenizer and an utterance embedder. The frame tokenizer
models linguistic content of input speech with a discrete categorical
distribution. It performs frame clustering by assigning each frame a soft
mixture label. The utterance embedder generates an utterance-level vector
representation. A frame decoder serves to reconstruct speech features from the
encoders' outputs. The mFAE is evaluated on a speaker verification (SV) task
and an unsupervised subword modeling (USM) task. The SV experiments on
VoxCeleb 1 show that the utterance embedder is capable of extracting
speaker-discriminative embeddings with performance comparable to an x-vector
baseline. The USM experiments on the ZeroSpeech 2017 dataset verify that the
frame tokenizer is able to capture linguistic content and that the utterance
embedder can acquire speaker-related information.
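A hedged sketch of the structure described above: a frame tokenizer that
produces a soft categorical label per frame, an utterance embedder that
pools a single vector per utterance, and a frame decoder that reconstructs
frame features from both. Feature and layer dimensions are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MFAE(nn.Module):
    """Hypothetical mixture factorized auto-encoder for one utterance."""
    def __init__(self, feat=40, n_tokens=128, utt_dim=256, hidden=256):
        super().__init__()
        self.tokenizer = nn.Sequential(nn.Linear(feat, hidden), nn.ReLU(),
                                       nn.Linear(hidden, n_tokens))
        self.embedder = nn.Sequential(nn.Linear(feat, hidden), nn.ReLU(),
                                      nn.Linear(hidden, utt_dim))
        self.decoder = nn.Sequential(nn.Linear(n_tokens + utt_dim, hidden),
                                     nn.ReLU(), nn.Linear(hidden, feat))

    def forward(self, frames):                        # frames: (T, feat)
        tokens = F.softmax(self.tokenizer(frames), dim=-1)  # soft mixture labels
        utt = self.embedder(frames).mean(dim=0)       # utterance-level vector
        utt_tiled = utt.expand(frames.size(0), -1)    # repeat for every frame
        recon = self.decoder(torch.cat([tokens, utt_tiled], dim=-1))
        return recon, tokens, utt
```

Training would minimize the frame reconstruction error, forcing linguistic
content into the discrete tokens and residual speaker information into the
utterance vector.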
Masked Proxy Loss For Text-Independent Speaker Verification
Open-set speaker recognition can be regarded as a metric learning problem,
which is to maximize inter-class variance and minimize intra-class variance.
Supervised metric learning can be categorized into entity-based learning and
proxy-based learning. Most existing metric learning objectives, such as the
Contrastive, Triplet, Prototypical, and GE2E losses, belong to the former
category; their performance is either highly dependent on the sample mining
strategy or restricted by the insufficient label information in a mini-batch.
Proxy-based losses mitigate both shortcomings; however, fine-grained
connections among entities are leveraged either only indirectly or not at
all. This paper proposes a Masked Proxy (MP) loss which directly incorporates
both proxy-based and pair-based relationships. We further propose a
Multinomial Masked Proxy (MMP) loss to leverage the hardness of speaker
pairs. These methods are evaluated on the VoxCeleb test set and reach a
state-of-the-art Equal Error Rate (EER).
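The abstract does not define the MP loss itself, so the sketch below shows
only the generic proxy-based softmax that such losses build on: each class
owns a learnable proxy vector, and every embedding is pulled toward its class
proxy and pushed away from the others. The masking that mixes in pair-based
relationships is the paper's contribution and is not reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProxySoftmaxLoss(nn.Module):
    """Generic proxy-based softmax loss over cosine similarities."""
    def __init__(self, n_classes, dim=512, scale=30.0):
        super().__init__()
        self.proxies = nn.Parameter(torch.randn(n_classes, dim))
        self.scale = scale   # temperature on the cosine similarities

    def forward(self, emb, labels):
        # Cosine similarity between L2-normalized embeddings and proxies,
        # then a softmax cross-entropy against the class labels.
        sims = F.normalize(emb) @ F.normalize(self.proxies).t()
        return F.cross_entropy(self.scale * sims, labels)
```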
Deep Normalization for Speaker Vectors
Deep speaker embedding has demonstrated state-of-the-art performance in
speaker recognition tasks. However, one potential issue with this approach is
that the speaker vectors derived from deep embedding models tend to be
non-Gaussian for each individual speaker, and non-homogeneous for distributions
of different speakers. These irregular distributions can seriously impact
speaker recognition performance, especially with the popular PLDA scoring
method, which assumes a homogeneous Gaussian distribution. In this paper, we
argue that deep speaker vectors require deep normalization, and propose a deep
normalization approach based on a novel discriminative normalization flow (DNF)
model. We demonstrate the effectiveness of the proposed approach with
experiments using the widely used SITW and CNCeleb corpora. In these
experiments, the DNF-based normalization delivered substantial performance
gains and also showed strong generalization capability in out-of-domain tests.
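The abstract gives no equations, so the sketch below encodes one plausible
reading of a discriminative normalization flow: the flow output is scored
against a Gaussian whose mean is a learnable per-speaker parameter with
identity covariance, making every speaker's latent distribution homogeneous
Gaussian. Any invertible flow with a log-determinant, for example the
AffineCoupling sketched earlier, can play the role of the transform.

```python
import math
import torch
import torch.nn as nn

class DiscriminativePrior(nn.Module):
    """Class-conditional Gaussian prior for a normalizing flow."""
    def __init__(self, n_speakers, dim=512):
        super().__init__()
        self.means = nn.Parameter(torch.zeros(n_speakers, dim))  # one mean per speaker

    def nll(self, z, logdet, spk):
        # z: flow outputs (B, dim); logdet: log|det J| per example (B,);
        # spk: integer speaker labels (B,). Maximizing the likelihood pulls
        # each speaker's latent codes toward that speaker's own Gaussian.
        diff = z - self.means[spk]
        log_pz = (-0.5 * (diff ** 2).sum(dim=1)
                  - 0.5 * z.size(1) * math.log(2 * math.pi))
        return -(log_pz + logdet).mean()
```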