3 research outputs found
The VOiCES from a Distance Challenge 2019 Evaluation Plan
The "VOiCES from a Distance Challenge 2019" is designed to foster research in
the area of speaker recognition and automatic speech recognition (ASR) with the
special focus on single channel distant/far-field audio, under noisy
conditions. The main objectives of this challenge are to: (i) benchmark
state-of-the-art technology in the area of speaker recognition and automatic
speech recognition (ASR), (ii) support the development of new ideas and
technologies in speaker recognition and ASR, (iii) support new research groups
entering the field of distant/far-field speech processing, and (iv) provide a
new, publicly available dataset to the community that exhibits realistic
distance characteristics.
Comment: Special Session for Interspeech 201
Data augmentation versus noise compensation for x-vector speaker recognition systems in noisy environments
The explosion of available speech data and new speaker modeling methods based
on deep neural networks (DNN) have given the ability to develop more robust
speaker recognition systems. Among DNN speaker modeling techniques, the x-vector
system has shown a degree of robustness in noisy environments. Previous studies
suggest that by increasing the number of speakers in the training data and
using data augmentation, more robust speaker recognition systems are achievable
in noisy environments. In this work, we want to know if explicit noise
compensation techniques continue to be effective despite the general noise
robustness of these systems. For this study, we use two different x-vector
networks: the first is trained on VoxCeleb1 (Protocol1), and the second
is trained on VoxCeleb1+VoxCeleb2 (Protocol2). We propose to add a denoising
x-vector subsystem before scoring. Experimental results show that the x-vector
system used in Protocol2 is more robust than the one used in Protocol1.
Despite this observation, we show that explicit noise compensation gives
almost the same relative EER gain in both protocols. For example, in
Protocol2 we obtain a 21% to 66% improvement in EER with denoising techniques.
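
The abstract above reports results as equal error rate (EER) and relative EER gain. As a minimal sketch of how these two quantities are commonly computed from verification trial scores, the following may help. The function names `compute_eer` and `relative_improvement` are hypothetical, the threshold sweep is a simple illustration rather than the authors' scoring tool, and the scores in the usage lines are made up for demonstration only.

```python
import numpy as np


def compute_eer(target_scores, nontarget_scores):
    """Approximate the equal error rate (as a fraction) from trial scores.

    Assumes that a higher score means "more likely the same speaker".
    """
    target = np.asarray(target_scores, dtype=float)
    nontarget = np.asarray(nontarget_scores, dtype=float)
    # Use every observed score as a candidate decision threshold.
    thresholds = np.sort(np.concatenate([target, nontarget]))
    # Miss rate: fraction of target trials scoring below the threshold.
    miss = np.array([(target < t).mean() for t in thresholds])
    # False-alarm rate: fraction of non-target trials at or above the threshold.
    false_alarm = np.array([(nontarget >= t).mean() for t in thresholds])
    # The EER is taken where the two error rates are closest to each other.
    idx = np.argmin(np.abs(miss - false_alarm))
    return (miss[idx] + false_alarm[idx]) / 2.0


def relative_improvement(eer_baseline, eer_new):
    """Relative EER reduction in percent with respect to the baseline."""
    return 100.0 * (eer_baseline - eer_new) / eer_baseline


# Illustrative (made-up) scores, not data from the paper.
eer = compute_eer([0.9, 0.8, 0.7, 0.3], [0.6, 0.4, 0.2, 0.1])
print(f"EER = {100 * eer:.1f}%")
```

Under this definition, a relative gain compares the EER with and without the denoising step on the same trial list, so the same percentage gain can correspond to very different absolute EER values in the two protocols.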
An empirical analysis of information encoded in disentangled neural speaker representations
The primary characteristic of robust speaker representations is that they are
invariant to factors of variability not related to speaker identity.
Disentanglement of speaker representations is one of the techniques used to
improve robustness of speaker representations to both intrinsic factors that
are acquired during speech production (e.g., emotion, lexical content) and
extrinsic factors that are acquired during signal capture (e.g., channel,
noise). Disentanglement in neural speaker representations can be achieved
either in a supervised fashion with annotations of the nuisance factors
(factors not related to speaker identity) or in an unsupervised fashion without
labels of the factors to be removed. In either case it is important to
understand the extent to which the various factors of variability are entangled
in the representations. In this work, we examine speaker representations with
and without unsupervised disentanglement for the amount of information they
capture related to a suite of factors. Using classification experiments we
provide empirical evidence that disentanglement reduces the information with
respect to nuisance factors from speaker representations, while retaining
speaker information. This is further validated by speaker verification
experiments on the VOiCES corpus in several challenging acoustic conditions. We
also show improved robustness in speaker verification tasks using data
augmentation during training of disentangled speaker embeddings. Finally, based
on our findings, we provide insights into the factors that can be effectively
separated using the unsupervised disentanglement technique and discuss
potential future directions.
Comment: Submitted to Speaker Odyssey 202
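
The classification experiments described in the abstract above can be pictured as probing: a simple classifier is trained to predict a factor (for example noise type) from the embeddings, and lower accuracy suggests that less information about that factor is encoded. The sketch below is only an illustration under stated assumptions. The abstract does not say which classifier is used, so the nearest-centroid probe, the function name `nearest_centroid_probe`, and the toy two-dimensional embeddings are all hypothetical.

```python
import numpy as np


def nearest_centroid_probe(train_x, train_y, test_x, test_y):
    """Accuracy of a nearest-centroid classifier predicting labels from embeddings."""
    train_x = np.asarray(train_x, dtype=float)
    test_x = np.asarray(test_x, dtype=float)
    train_y = np.asarray(train_y)
    test_y = np.asarray(test_y)
    classes = np.unique(train_y)
    # One centroid per class: the mean embedding of that class.
    centroids = np.stack([train_x[train_y == c].mean(axis=0) for c in classes])
    # Squared Euclidean distance from each test embedding to each centroid.
    dists = ((test_x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    predictions = classes[np.argmin(dists, axis=1)]
    return float((predictions == test_y).mean())


# Toy embeddings (hypothetical): dimension 0 carries a nuisance factor,
# dimension 1 carries speaker identity.
entangled = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 0.0], [0.0, 1.0]])
# Idealized "disentangled" version: the nuisance dimension is zeroed out.
disentangled = entangled * np.array([0.0, 1.0])
nuisance_labels = np.array([1, 1, 0, 0])
speaker_labels = np.array([0, 1, 0, 1])

for name, emb in [("entangled", entangled), ("disentangled", disentangled)]:
    nuis = nearest_centroid_probe(emb, nuisance_labels, emb, nuisance_labels)
    spk = nearest_centroid_probe(emb, speaker_labels, emb, speaker_labels)
    print(f"{name}: nuisance accuracy={nuis:.2f}, speaker accuracy={spk:.2f}")
```

In this toy setup the probe is trained and tested on the same four points purely to keep the example short; a real probing experiment would use held-out data and a chance-level baseline for each factor.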