1,052 research outputs found
Adversarial Speaker Adaptation
We propose a novel adversarial speaker adaptation (ASA) scheme, in which
adversarial learning is applied to regularize the distribution of deep hidden
features in a speaker-dependent (SD) deep neural network (DNN) acoustic model
to be close to that of a fixed speaker-independent (SI) DNN acoustic model
during adaptation. An additional discriminator network is introduced to
distinguish the deep features generated by the SD model from those produced by
the SI model. In ASA, with a fixed SI model as the reference, an SD model is
jointly optimized with the discriminator network to minimize the senone
classification loss, and simultaneously to mini-maximize the SI/SD
discrimination loss on the adaptation data. With ASA, a senone-discriminative
deep feature is learned in the SD model with a similar distribution to that of
the SI model. With such a regularized and adapted deep feature, the SD model
can perform improved automatic speech recognition on the target speaker's
speech. Evaluated on the Microsoft short message dictation dataset, ASA
achieves 14.4% and 7.9% relative word error rate improvements for supervised
and unsupervised adaptation, respectively, over an SI model trained from 2600
hours data, with 200 adaptation utterances per speaker.Comment: 5 pages, 2 figures, ICASSP 201
Conditional Teacher-Student Learning
The teacher-student (T/S) learning has been shown to be effective for a
variety of problems such as domain adaptation and model compression. One
shortcoming of the T/S learning is that a teacher model, not always perfect,
sporadically produces wrong guidance in form of posterior probabilities that
misleads the student model towards a suboptimal performance. To overcome this
problem, we propose a conditional T/S learning scheme, in which a "smart"
student model selectively chooses to learn from either the teacher model or the
ground truth labels conditioned on whether the teacher can correctly predict
the ground truth. Unlike a naive linear combination of the two knowledge
sources, the conditional learning is exclusively engaged with the teacher model
when the teacher model's prediction is correct, and otherwise backs off to the
ground truth. Thus, the student model is able to learn effectively from the
teacher and even potentially surpass the teacher. We examine the proposed
learning scheme on two tasks: domain adaptation on CHiME-3 dataset and speaker
adaptation on Microsoft short message dictation dataset. The proposed method
achieves 9.8% and 12.8% relative word error rate reductions, respectively, over
T/S learning for environment adaptation and speaker-independent model for
speaker adaptation.Comment: 5 pages, 1 figure, ICASSP 201
A target guided subband filter for acoustic event detection in noisy environments using wavelet packets
This paper deals with acoustic event detection (AED), such as screams, gunshots, and explosions, in noisy environments. The main aim is to improve the detection performance under adverse conditions with a very low signal-to-noise ratio (SNR). A novel filtering method combined with an energy detector is presented. The wavelet packet transform (WPT) is first used for time-frequency representation of the acoustic signals. The proposed filter in the wavelet packet domain then uses a priori knowledge of the target event and an estimate of noise features to selectively suppress the background noise. It is in fact a content-aware band-pass filter which can automatically pass the frequency bands that are more significant in the target than in the noise. Theoretical analysis shows that the proposed filtering method is capable of enhancing the target content while suppressing the background noise for signals with a low SNR. A condition to increase the probability of correct detection is also obtained. Experiments have been carried out on a large dataset of acoustic events that are contaminated by different types of environmental noise and white noise with varying SNRs. Results show that the proposed method is more robust and better adapted to noise than ordinary energy detectors, and it can work even with an SNR as low as -15 dB. A practical system for real time processing and multi-target detection is also proposed in this work
- …