UNSUPERVISED DOMAIN ADAPTATION FOR SPEAKER VERIFICATION IN THE WILD
Performance of automatic speaker verification (ASV) systems is very sensitive
to mismatch between training (source) and testing (target) domains. The
best way to address domain mismatch is to perform matched condition training
– gather sufficient labeled samples from the target domain and use them in
training. However, in many cases this is too expensive or impractical. Usually,
gaining access to unlabeled target domain data, e.g., from open source online
media, and labeled data from other domains is more feasible. This work focuses
on making ASV systems robust to uncontrolled (‘wild’) conditions, with
the help of some unlabeled data acquired from such conditions.
Given acoustic features from both domains, we propose learning a mapping
function – a deep convolutional neural network (CNN) with an encoder-decoder
architecture – between features of both the domains. We explore training the
network in two different scenarios: training on paired speech samples from
both domains and training on unpaired data. In the former case, where the
paired data is usually obtained via simulation, the CNN is treated as a nonlinear
regression function and is trained to minimize the L2 loss between the original
and predicted features from the target domain. We provide empirical evidence that
this approach introduces distortions that degrade verification performance. To
address this, we explore training the CNN with an adversarial loss (in addition to
the L2 loss), which makes the predicted features indistinguishable from the original
ones and thus improves verification performance.
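As a rough sketch (not taken from the thesis itself), the combined objective for the paired-data case might be written as follows; the trade-off weight `lam` and the non-saturating form of the adversarial term are assumptions for illustration:

```python
import numpy as np

def l2_loss(predicted, target):
    # Mean squared error between predicted and original target-domain features.
    return np.mean((predicted - target) ** 2)

def adversarial_loss(disc_scores_on_mapped):
    # Generator-side adversarial term (assumed non-saturating form): push the
    # discriminator's scores on mapped features toward "real" (score 1).
    eps = 1e-12
    return -np.mean(np.log(disc_scores_on_mapped + eps))

def combined_loss(predicted, target, disc_scores_on_mapped, lam=1.0):
    # Weighted sum of the regression (L2) and adversarial terms; `lam` is a
    # hypothetical trade-off weight not specified in the abstract.
    return l2_loss(predicted, target) + lam * adversarial_loss(disc_scores_on_mapped)
```

In practice the mapping CNN and the discriminator would be trained jointly, with the discriminator trying to separate mapped features from real target-domain features.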
The above framework using simulated paired data, though effective, cannot
be used to train the network on unpaired data obtained by independently
sampling speech from both domains. In this case, we first train a CNN using
adversarial loss to map features from the target to the source domain. We then map the
predicted features back to the target domain using an auxiliary network, and
minimize a cycle-consistency loss between the original and reconstructed target
features.
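For the unpaired case, the cycle-consistency term can be sketched as follows; the function names and the L1 form of the penalty are illustrative assumptions, not details fixed by the abstract:

```python
import numpy as np

def cycle_consistency_loss(x_target, map_t2s, map_s2t):
    # Map target-domain features to the source domain with the main network
    # (map_t2s), map them back with the auxiliary network (map_s2t), and
    # penalize the reconstruction error between original and reconstructed
    # target features (L1 penalty assumed here).
    reconstructed = map_s2t(map_t2s(x_target))
    return np.mean(np.abs(reconstructed - x_target))
```

If the two mappings were exact inverses of each other, this loss would be zero; minimizing it constrains the adversarially trained mapping to preserve speaker-relevant information rather than collapsing to arbitrary source-like features.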
Our unsupervised adaptation approach complements its supervised counterpart,
where adaptation is done using labeled data from both domains. We
focus on three domain mismatch scenarios: (1) sampling frequency mismatch
between the domains, (2) channel mismatch, and (3) robustness to far-field and
noisy speech acquired from wild conditions.