Knowledge Distillation for Small-footprint Highway Networks
Deep learning has significantly advanced the state of the art in speech recognition over the past few years. However, compared to conventional Gaussian mixture acoustic models, neural network models are usually much larger and are therefore difficult to deploy on embedded devices. Previously, we investigated a compact highway deep neural network (HDNN) for acoustic modelling, which is a type of depth-gated feedforward neural network. We have shown that HDNN-based acoustic models can achieve recognition accuracy comparable to plain deep neural network (DNN) acoustic models with a much smaller number of parameters. In this paper, we push the boundary further by leveraging the knowledge distillation technique, also known as {\it teacher-student} training, i.e., we train the compact HDNN model with the supervision of a high-accuracy cumbersome model. Furthermore, we also investigate sequence training and adaptation in the context of teacher-student training. Our experiments were performed on the AMI meeting speech recognition corpus. With this technique, we significantly improved the recognition accuracy of the HDNN acoustic model with fewer than 0.8 million parameters, and narrowed the gap between this model and the plain DNN with 30 million parameters.
Comment: 5 pages, 2 figures, accepted to ICASSP 201
Conditional Teacher-Student Learning
Teacher-student (T/S) learning has been shown to be effective for a variety of problems such as domain adaptation and model compression. One shortcoming of T/S learning is that the teacher model, not always perfect, sporadically produces wrong guidance in the form of posterior probabilities that misleads the student model toward suboptimal performance. To overcome this problem, we propose a conditional T/S learning scheme, in which a "smart" student model selectively chooses to learn from either the teacher model or the ground truth labels, conditioned on whether the teacher can correctly predict the ground truth. Unlike a naive linear combination of the two knowledge sources, the conditional learning is exclusively engaged with the teacher model when the teacher model's prediction is correct, and otherwise backs off to the ground truth. Thus, the student model is able to learn effectively from the teacher and even potentially surpass the teacher. We examine the proposed learning scheme on two tasks: domain adaptation on the CHiME-3 dataset and speaker adaptation on the Microsoft short message dictation dataset. The proposed method achieves 9.8% and 12.8% relative word error rate reductions over T/S learning for environment adaptation and over the speaker-independent model for speaker adaptation, respectively.
Comment: 5 pages, 1 figure, ICASSP 201
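The conditional selection rule above can be sketched as a small target-construction step: per example, keep the teacher's posteriors when its argmax matches the ground truth, and back off to a one-hot label otherwise. This is an illustrative sketch of that rule only, not the authors' implementation; names and the list-of-lists posterior format are assumptions.

```python
def conditional_ts_targets(teacher_posteriors, labels):
    """Build per-example training targets for conditional T/S learning:
    teacher posteriors if the teacher predicts the ground truth,
    else a one-hot vector on the ground-truth label."""
    targets = []
    for post, y in zip(teacher_posteriors, labels):
        teacher_pred = max(range(len(post)), key=lambda k: post[k])
        if teacher_pred == y:
            targets.append(list(post))      # trust the (correct) teacher
        else:
            one_hot = [0.0] * len(post)     # back off to ground truth
            one_hot[y] = 1.0
            targets.append(one_hot)
    return targets
```

Training the student against these targets with an ordinary cross-entropy loss then realizes the "exclusively engaged with the teacher when correct" behavior, as opposed to a fixed linear interpolation of the two knowledge sources.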
Efficient Deep Learning in Network Compression and Acceleration
While deep learning delivers state-of-the-art accuracy on many artificial intelligence tasks, it comes at the cost of high computational complexity due to its large number of parameters. It is important to design or develop efficient methods to support deep learning toward enabling its scalable deployment, particularly on embedded devices such as mobile phones, Internet of Things (IoT) devices, and drones. In this chapter, I will present a comprehensive survey of several advanced approaches for efficient deep learning in network compression and acceleration. I will describe the central ideas behind each approach and explore the similarities and differences between the different methods. Finally, I will present some future directions in this field.