Time-Contrastive Learning Based Deep Bottleneck Features for Text-Dependent Speaker Verification
There are a number of studies on extracting bottleneck (BN) features
from deep neural networks (DNNs) trained to discriminate speakers, pass-phrases
and triphone states for improving the performance of text-dependent speaker
verification (TD-SV). However, only moderate success has been achieved. A recent
study [1] presented a time contrastive learning (TCL) concept to explore the
non-stationarity of brain signals for classification of brain states. Speech
signals have similar non-stationarity property, and TCL further has the
advantage of having no need for labeled data. We therefore present a TCL based
BN feature extraction method. The method uniformly partitions each speech
utterance in a training dataset into a predefined number of multi-frame
segments. Each segment in an utterance corresponds to one class, and class
labels are shared across utterances. DNNs are then trained to discriminate all
speech frames among the classes to exploit the temporal structure of speech. In
addition, we propose a segment-based unsupervised clustering algorithm to
re-assign class labels to the segments. TD-SV experiments were conducted on the
RedDots challenge database. The TCL-DNNs were trained using speech data of
fixed pass-phrases that were excluded from the TD-SV evaluation set, so the
learned features can be considered phrase-independent. We compare the
performance of the proposed TCL bottleneck (BN) feature with those of
short-time cepstral features and BN features extracted from DNNs discriminating
speakers, pass-phrases, speaker+pass-phrase, as well as monophones whose labels
and boundaries are generated by three different automatic speech recognition
(ASR) systems. Experimental results show that the proposed TCL-BN outperforms
cepstral features and speaker+pass-phrase discriminant BN features, and its
performance is on par with those of ASR-derived BN features. Moreover, ...

Comment: Copyright (c) 2019 IEEE.
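The label-generation step of the TCL method above can be sketched as follows: each utterance's frames are uniformly partitioned into a fixed number of contiguous segments, and the segment index serves as the class label, shared across utterances. The function name and the example segment count are illustrative assumptions, not values from the paper.

```python
import numpy as np

def tcl_labels(num_frames, num_classes):
    """Uniformly partition an utterance's frames into `num_classes`
    contiguous segments; each segment's index becomes the class label
    of its frames, and the same label set is shared across utterances."""
    # np.array_split handles utterances whose length is not divisible
    # by the number of segments.
    segments = np.array_split(np.arange(num_frames), num_classes)
    labels = np.empty(num_frames, dtype=int)
    for cls, idx in enumerate(segments):
        labels[idx] = cls
    return labels

# Example: a 10-frame utterance split into 5 classes.
print(tcl_labels(10, 5))  # -> [0 0 1 1 2 2 3 3 4 4]
```

A DNN trained to classify frames among these segment labels then exploits the temporal (non-stationary) structure of speech without requiring any manual annotation.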
Adversarial Speaker Adaptation
We propose a novel adversarial speaker adaptation (ASA) scheme, in which
adversarial learning is applied to regularize the distribution of deep hidden
features in a speaker-dependent (SD) deep neural network (DNN) acoustic model
to be close to that of a fixed speaker-independent (SI) DNN acoustic model
during adaptation. An additional discriminator network is introduced to
distinguish the deep features generated by the SD model from those produced by
the SI model. In ASA, with a fixed SI model as the reference, an SD model is
jointly optimized with the discriminator network to minimize the senone
classification loss, and simultaneously to mini-maximize the SI/SD
discrimination loss on the adaptation data. With ASA, a senone-discriminative
deep feature is learned in the SD model with a similar distribution to that of
the SI model. With such a regularized and adapted deep feature, the SD model
can perform improved automatic speech recognition on the target speaker's
speech. Evaluated on the Microsoft short message dictation dataset, ASA
achieves 14.4% and 7.9% relative word error rate improvements for supervised
and unsupervised adaptation, respectively, over an SI model trained on 2,600
hours of data, with 200 adaptation utterances per speaker.

Comment: 5 pages, 2 figures, ICASSP 201
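The mini-max objective described in the ASA abstract can be illustrated with a minimal sketch: the SD model minimizes the senone classification loss while maximizing the discriminator's SI/SD loss (the sign flip plays the role of gradient reversal). The function names and the trade-off weight `lam` are assumptions for illustration, not the paper's notation.

```python
import numpy as np

def cross_entropy(probs, label):
    """Negative log-likelihood of the correct class."""
    return -np.log(probs[label])

def asa_sd_loss(senone_probs, senone_label, disc_probs, is_sd, lam=0.5):
    """Sketch of the SD model's mini-max objective in ASA:
    minimize senone classification loss while *maximizing* the SI/SD
    discrimination loss, pushing SD deep features toward the SI
    feature distribution. `lam` is an assumed trade-off weight."""
    senone_loss = cross_entropy(senone_probs, senone_label)
    disc_loss = cross_entropy(disc_probs, int(is_sd))
    # Subtracting the discriminator loss makes the SD model an
    # adversary of the discriminator (gradient reversal in effect).
    return senone_loss - lam * disc_loss
```

The discriminator itself is trained with the opposite sign, i.e. to minimize `disc_loss`, yielding the alternating adversarial game.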
Attentive Adversarial Learning for Domain-Invariant Training
Adversarial domain-invariant training (ADIT) proves to be effective in
suppressing the effects of domain variability in acoustic modeling and has led
to improved performance in automatic speech recognition (ASR). In ADIT, an
auxiliary domain classifier takes in equally-weighted deep features from a deep
neural network (DNN) acoustic model and is trained to improve their
domain-invariance by optimizing an adversarial loss function. In this work, we
propose an attentive ADIT (AADIT) in which we advance the domain classifier
with an attention mechanism to automatically weight the input deep features
according to their importance in domain classification. With this attentive
re-weighting, AADIT can focus on the domain normalization of phonetic
components that are more susceptible to domain variability and generates deep
features with improved domain-invariance and senone-discriminativity over ADIT.
Most importantly, the attention block serves only as an external component to
the DNN acoustic model and is not involved in ASR, so AADIT can be used to
improve the acoustic modeling with any DNN architectures. More generally, the
same methodology can improve any adversarial learning system with an auxiliary
discriminator. Evaluated on the CHiME-3 dataset, AADIT achieves 13.6% and 9.3%
relative WER improvements, respectively, over a multi-conditional model and a
strong ADIT baseline.

Comment: 5 pages, 1 figure, ICASSP 201
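The attentive re-weighting that AADIT adds in front of the domain classifier can be sketched as a simple attention pooling over frame-level deep features. The learned attention parameters are reduced here to a single query vector, which is an assumption for illustration.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attentive_pool(deep_feats, query):
    """Sketch of AADIT's attention block: score each frame-level deep
    feature (rows of `deep_feats`, shape (T, D)) against a query
    vector standing in for learned attention parameters, then return
    the attention-weighted sum. This replaces the equal weighting that
    plain ADIT feeds to the domain classifier."""
    scores = deep_feats @ query   # (T,) relevance score per frame
    weights = softmax(scores)     # normalize to a distribution over frames
    return weights @ deep_feats   # (D,) weighted combination of features
```

Because this block feeds only the auxiliary domain classifier, it adds no cost to the acoustic model at recognition time, which is why the abstract notes it is not involved in ASR.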
Adversarial Network Bottleneck Features for Noise Robust Speaker Verification
In this paper, we propose a noise robust bottleneck feature representation
which is generated by an adversarial network (AN). The AN includes two cascade
connected networks, an encoding network (EN) and a discriminative network (DN).
Mel-frequency cepstral coefficients (MFCCs) of clean and noisy speech are used
as input to the EN and the output of the EN is used as the noise robust
feature. The EN and DN are trained in turn, namely, when training the DN, noise
types are selected as the training labels and when training the EN, all labels
are set as the same, i.e., the clean speech label, which aims to make the AN
features invariant to noise and thus achieve noise robustness. We evaluate the
performance of the proposed feature on a Gaussian Mixture Model-Universal
Background Model based speaker verification system, and make comparison to MFCC
features of speech enhanced by short-time spectral amplitude minimum mean
square error (STSA-MMSE) and deep neural network-based speech enhancement
(DNN-SE) methods. Experimental results on the RSR2015 database show that the
proposed AN bottleneck feature (AN-BN) dramatically outperforms the STSA-MMSE
and DNN-SE based MFCCs for different noise types and signal-to-noise ratios.
Furthermore, the AN-BN feature is able to improve the speaker verification
performance under the clean condition.
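The alternating label scheme described for the EN/DN training can be sketched directly: in the DN phase the discriminator sees true noise-type labels, while in the EN phase every example is relabeled as clean so the encoder learns noise-invariant bottleneck features. The function name and the choice of `0` as the clean label are illustrative assumptions.

```python
def an_training_labels(noise_types, phase, clean_label=0):
    """Sketch of label assignment for the adversarial network's
    alternating updates. DN phase: the discriminator is trained on the
    true noise type of each utterance. EN phase: every utterance is
    labeled as clean, pushing the encoder's bottleneck output to be
    indistinguishable from clean speech."""
    if phase == "DN":
        return list(noise_types)                 # real noise-type targets
    if phase == "EN":
        return [clean_label] * len(noise_types)  # all forced to "clean"
    raise ValueError("phase must be 'DN' or 'EN'")
```

Iterating the two phases drives the encoder output toward a representation the discriminator cannot separate from clean speech, which is the stated goal of noise robustness.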
DNN adaptation by automatic quality estimation of ASR hypotheses
In this paper we propose to exploit the automatic Quality Estimation (QE) of
ASR hypotheses to perform the unsupervised adaptation of a deep neural network
modeling acoustic probabilities. Our hypothesis is that significant
improvements can be achieved by: i) automatically transcribing the evaluation
data we are currently trying to recognise, and ii) selecting from it a subset
of "good quality" instances based on the word error rate (WER) scores predicted
by a QE component. To validate this hypothesis, we run several experiments on
the evaluation data sets released for the CHiME-3 challenge. First, we operate
in oracle conditions in which manual transcriptions of the evaluation data are
available, thus allowing us to compute the "true" sentence WER. In this
scenario, we perform the adaptation with variable amounts of data, which are
characterised by different levels of quality. Then, we move to realistic
conditions in which the manual transcriptions of the evaluation data are not
available. In this case, the adaptation is performed on data selected according
to the WER scores "predicted" by a QE component. Our results indicate that: i)
QE predictions allow us to closely approximate the adaptation results obtained
in oracle conditions, and ii) the overall ASR performance based on the proposed
QE-driven adaptation method is significantly better than the strong, most
recent CHiME-3 baseline.

Comment: Computer Speech & Language, December 201
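The QE-driven selection step above amounts to keeping only the automatically transcribed utterances whose predicted sentence WER falls below a threshold and adapting on those. The function name and the threshold value are illustrative assumptions, not the paper's configuration.

```python
def select_for_adaptation(hypotheses, predicted_wer, threshold=0.3):
    """Sketch of QE-driven data selection: keep automatically
    transcribed utterances whose *predicted* sentence WER (from the QE
    component) is below a threshold, and use only those "good quality"
    instances to adapt the acoustic model."""
    return [hyp for hyp, wer in zip(hypotheses, predicted_wer)
            if wer < threshold]
```

In the oracle condition of the abstract, `predicted_wer` would be replaced by the true sentence WER computed from manual transcriptions; the reported result is that QE predictions closely approximate that oracle selection.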