Conditional Teacher-Student Learning
Teacher-student (T/S) learning has been shown to be effective for a
variety of problems such as domain adaptation and model compression. One
shortcoming of T/S learning is that the teacher model, which is not always
perfect, sporadically produces wrong guidance in the form of posterior
probabilities that misleads the student model towards suboptimal performance. To overcome this
problem, we propose a conditional T/S learning scheme, in which a "smart"
student model selectively chooses to learn from either the teacher model or the
ground truth labels conditioned on whether the teacher can correctly predict
the ground truth. Unlike a naive linear combination of the two knowledge
sources, conditional learning relies exclusively on the teacher model
when the teacher's prediction is correct, and otherwise backs off to the
ground truth. Thus, the student model is able to learn effectively from the
teacher and even potentially surpass the teacher. We examine the proposed
learning scheme on two tasks: domain adaptation on the CHiME-3 dataset and speaker
adaptation on a Microsoft short message dictation dataset. The proposed method
achieves 9.8% and 12.8% relative word error rate reductions, respectively, over
T/S learning for environment adaptation and over the speaker-independent model for
speaker adaptation.
Comment: 5 pages, 1 figure, ICASSP 2019
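To make the conditional switch concrete, below is a minimal sketch of such a loss, assuming a frame-level classification setup in PyTorch; the function and tensor names are illustrative, not the authors' implementation.

import torch
import torch.nn.functional as F

def conditional_ts_loss(student_logits, teacher_logits, labels):
    """Conditional T/S loss: learn from the teacher's soft posteriors on
    frames where the teacher predicts the ground-truth label, and back off
    to the hard ground-truth label elsewhere."""
    teacher_post = F.softmax(teacher_logits, dim=-1)
    log_student = F.log_softmax(student_logits, dim=-1)

    # Frames where the teacher's top hypothesis matches the ground truth.
    teacher_correct = teacher_post.argmax(dim=-1).eq(labels)

    # Soft loss against the teacher's posteriors (cross-entropy form).
    soft_loss = -(teacher_post * log_student).sum(dim=-1)
    # Hard cross-entropy loss against the ground-truth labels.
    hard_loss = F.nll_loss(log_student, labels, reduction='none')

    # Exclusive switch: teacher where correct, ground truth otherwise.
    return torch.where(teacher_correct, soft_loss, hard_loss).mean()

Because only one knowledge source is active per frame, no interpolation weight is needed, which is exactly how the scheme differs from a naive linear combination of the two losses.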
Speaker Adaptation for Attention-Based End-to-End Speech Recognition
We propose three regularization-based speaker adaptation approaches to adapt
the attention-based encoder-decoder (AED) model with very limited adaptation
data from target speakers for end-to-end automatic speech recognition. The
first method is Kullback-Leibler divergence (KLD) regularization, in which the
output distribution of a speaker-dependent (SD) AED is forced to be close to
that of the speaker-independent (SI) model by adding a KLD regularization to
the adaptation criterion. To compensate for the asymmetric deficiency in KLD
regularization, an adversarial speaker adaptation (ASA) method is proposed to
regularize the deep-feature distribution of the SD AED through the adversarial
learning of an auxiliary discriminator and the SD AED. The third approach is
multi-task learning, in which an SD AED is trained to jointly perform the
primary task of predicting a large number of output units and an auxiliary task
of predicting a small number of output units to alleviate the target sparsity
issue. Evaluated on a Microsoft short message dictation task, all three methods
are highly effective in adapting the AED model, achieving up to 12.2% and 3.0%
word error rate improvements over an SI AED trained on 3400 hours of data for
supervised and unsupervised adaptation, respectively.
Comment: 5 pages, 3 figures, Interspeech 2019
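As a rough illustration of the first approach, the sketch below shows a KLD-regularized adaptation criterion in PyTorch: cross-entropy on the adaptation data interpolated with a KL term that keeps the speaker-dependent output distribution close to the frozen SI model. The interpolation weight rho and all names are assumptions, not the paper's exact formulation.

import torch.nn.functional as F

def kld_regularized_loss(sd_logits, si_logits, labels, rho=0.5):
    """Adaptation loss with KLD regularization towards the SI model."""
    log_sd = F.log_softmax(sd_logits, dim=-1)
    si_post = F.softmax(si_logits, dim=-1).detach()  # SI model stays frozen

    ce = F.nll_loss(log_sd, labels)                  # fit the adaptation labels
    kld = F.kl_div(log_sd, si_post, reduction='batchmean')  # stay close to SI
    return (1.0 - rho) * ce + rho * kld

Larger values of rho regularize more heavily towards the SI model, which is the usual trade-off when the adaptation data from a target speaker is very limited.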
Relational Teacher Student Learning with Neural Label Embedding for Device Adaptation in Acoustic Scene Classification
In this paper, we propose a domain adaptation framework to address the device
mismatch issue in acoustic scene classification leveraging upon neural label
embedding (NLE) and relational teacher-student learning (RTSL). Our proposed
framework captures the structural relationships between acoustic scene classes,
which are intrinsically device-independent. In the training stage, transferable
knowledge is condensed into the NLE from the source domain. Next, in the
adaptation stage, a novel RTSL strategy is adopted to learn adapted target
models without the paired source-target data often required in conventional
teacher-student learning. The
proposed framework is evaluated on the DCASE 2018 Task1b data set. Experimental
results based on AlexNet-L deep classification models confirm the effectiveness
of our proposed approach in mismatch situations. NLE-alone adaptation compares
favourably with conventional device adaptation and teacher-student based
adaptation techniques. NLE combined with RTSL further improves the classification
accuracy.
Comment: Accepted by Interspeech 2020
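The abstract does not spell out the adaptation loss, so the sketch below is only one plausible reading of the relational idea, written in PyTorch: the target model's class representations are trained to reproduce the pairwise distance structure of the pre-learned label embeddings, which removes the need for paired source-target examples. The distance-matching form and all names are assumptions made for illustration.

import torch
import torch.nn.functional as F

def relational_ts_loss(target_class_embs, label_embs):
    """Hypothetical relational T/S loss: match the pairwise distance
    structure of the target model's class representations to that of the
    pre-learned neural label embeddings (NLE)."""
    def pdist(x):
        d = torch.cdist(x, x, p=2)       # pairwise Euclidean distances
        return d / (d.mean() + 1e-8)     # scale-normalise the relations
    return F.mse_loss(pdist(target_class_embs), pdist(label_embs.detach()))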
Convolutional Neural Networks for Speech Controlled Prosthetic Hands
Speech recognition is one of the key topics in artificial intelligence, as
speech is one of the most common forms of human communication. Researchers have
developed many speech-controlled prosthetic hands in the past decades,
utilizing conventional speech recognition systems that use a combination of
neural networks and hidden Markov models. Recent advancements in general-purpose
graphics processing units (GPGPUs) enable intelligent devices to run deep
neural networks in real-time. Thus, state-of-the-art speech recognition systems
have rapidly shifted from the paradigm of composite subsystems optimization to
the paradigm of end-to-end optimization. However, a low-power embedded GPGPU
cannot run these speech recognition systems in real-time. In this paper, we
show the development of deep convolutional neural networks (CNN) for speech
control of prosthetic hands that run in real-time on an NVIDIA Jetson TX2
developer kit. First, the device captures speech and converts it into 2D features
(such as a spectrogram). The CNN receives the 2D features and classifies the hand
gestures. Finally, the hand gesture classes are sent to the prosthetic hand
motion control system. The whole system is written in Python with Keras, a deep
learning library with a TensorFlow backend. Our experiments with the CNN
demonstrate 91% accuracy and a 2 ms running time for classifying hand gestures
(text output) from speech commands, which can be used to control the prosthetic
hands in real-time.
Comment: 2019 First International Conference on Transdisciplinary AI (TransAI), Laguna Hills, California, USA, 2019, pp. 35-4
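Since the paper states that the system is written in Python with Keras on a TensorFlow backend, a minimal sketch of the classification stage is given below: a small CNN that maps a 2D spectrogram-like input to a hand-gesture class. The input shape, layer sizes, and number of gesture classes are assumptions, not the paper's architecture.

from tensorflow.keras import layers, models

NUM_GESTURES = 5           # assumed number of hand-gesture classes
INPUT_SHAPE = (98, 40, 1)  # assumed time x frequency spectrogram patch

def build_gesture_cnn():
    """Small CNN classifying 2D spectrogram-like features into gestures."""
    model = models.Sequential([
        layers.Input(shape=INPUT_SHAPE),
        layers.Conv2D(32, (3, 3), activation='relu'),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation='relu'),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(128, activation='relu'),
        layers.Dropout(0.3),
        layers.Dense(NUM_GESTURES, activation='softmax'),
    ])
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    return model

A model built this way can be trained with model.fit on (spectrogram, gesture-label) pairs and then run for on-device inference on the Jetson TX2.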