Improved training for online end-to-end speech recognition systems
Achieving high accuracy with end-to-end speech recognizers requires careful
parameter initialization prior to training. Otherwise, the networks may fail to
find a good local optimum. This is particularly true for online networks, such
as unidirectional LSTMs. Currently, the best strategy to train such systems is
to bootstrap the training from a tied-triphone system. However, this is
time-consuming and, more importantly, impossible for languages without a
high-quality pronunciation lexicon. In this work, we propose an initialization
strategy that uses teacher-student learning to transfer knowledge from a large,
well-trained, offline end-to-end speech recognition model to an online
end-to-end model, eliminating the need for a lexicon or any other linguistic
resources. We also explore curriculum learning and label smoothing and show how
they can be combined with the proposed teacher-student learning for further
improvements. We evaluate our methods on a Microsoft Cortana personal assistant
task and show that the proposed method results in a 19% relative improvement
in word error rate compared to a randomly-initialized baseline system.

Comment: Interspeech 201
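The teacher-student transfer described above can be illustrated with a minimal frame-level distillation loss. The PyTorch sketch below is an assumption about the general recipe rather than the paper's exact objective; tensor shapes, the temperature parameter, and the toy usage are illustrative only.

```python
import torch
import torch.nn.functional as F

def teacher_student_loss(student_logits, teacher_logits, temperature=1.0):
    """Frame-level KL divergence between teacher and student posteriors.

    Both tensors: (batch, time, num_labels).
    """
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    # Flatten (batch, time) so "batchmean" averages the KL over all frames
    kl = F.kl_div(log_p_student.flatten(0, 1), p_teacher.flatten(0, 1),
                  reduction="batchmean")
    return kl * temperature ** 2

# Toy usage: the frozen offline (e.g. bidirectional) teacher provides soft
# frame-level targets for the online unidirectional student.
teacher_logits = torch.randn(4, 100, 512)   # stand-in for teacher outputs
student_logits = torch.randn(4, 100, 512, requires_grad=True)
loss = teacher_student_loss(student_logits, teacher_logits.detach())
loss.backward()
```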
Guiding CTC Posterior Spike Timings for Improved Posterior Fusion and Knowledge Distillation
Conventional automatic speech recognition (ASR) systems trained from
frame-level alignments can easily leverage posterior fusion to improve ASR
accuracy and build a better single model with knowledge distillation.
End-to-end ASR systems trained using the Connectionist Temporal Classification
(CTC) loss do not require frame-level alignment and hence simplify model
training. However, sparse and arbitrary posterior spike timings from CTC models
pose a new set of challenges in posterior fusion from multiple models and
knowledge distillation between CTC models. We propose a method to train a CTC
model so that its spike timings are guided to align with those of a pre-trained
guiding CTC model. As a result, all models that share the same guiding model
have aligned spike timings. We show the advantage of our method in various
scenarios including posterior fusion of CTC models and knowledge distillation
between CTC models with different architectures. With the 300-hour Switchboard
training data, the single word-level CTC model distilled from multiple models
improved the word error rates to 13.7%/23.1% from 14.9%/24.1% on the Hub5 2000
Switchboard/CallHome test sets without using any data augmentation, language
model, or complex decoder.

Comment: Accepted to Interspeech 201
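One way to express the spike-timing guidance idea in code is to add a frame-wise term that pulls the student's per-frame posteriors toward those of a frozen guiding CTC model, alongside the usual CTC loss. This PyTorch sketch is a simplified approximation, not the paper's actual guiding mechanism; the weighting `alpha` and all names are hypothetical.

```python
import torch
import torch.nn.functional as F

ctc_criterion = torch.nn.CTCLoss(blank=0, zero_infinity=True)

def guided_ctc_loss(student_logits, guide_logits, targets,
                    input_lengths, target_lengths, alpha=0.5):
    """student_logits / guide_logits: (time, batch, num_labels), as CTCLoss expects."""
    log_probs = F.log_softmax(student_logits, dim=-1)
    # Standard CTC term against the reference transcript
    ctc = ctc_criterion(log_probs, targets, input_lengths, target_lengths)
    # Frame-wise KL term nudging the student's posterior spikes to occur at the
    # same frames as the frozen guiding model's spikes
    guide_probs = F.softmax(guide_logits, dim=-1).detach()
    frame_kl = F.kl_div(log_probs.flatten(0, 1), guide_probs.flatten(0, 1),
                        reduction="batchmean")
    return ctc + alpha * frame_kl
```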
ASR is all you need: cross-modal distillation for lip reading
The goal of this work is to train strong models for visual speech recognition
without requiring human annotated ground truth data. We achieve this by
distilling from an Automatic Speech Recognition (ASR) model that has been
trained on a large-scale audio-only corpus. We use a cross-modal distillation
method that combines Connectionist Temporal Classification (CTC) with a
frame-wise cross-entropy loss. Our contributions are fourfold: (i) we show that
ground truth transcriptions are not necessary to train a lip reading system;
(ii) we show how arbitrary amounts of unlabelled video data can be leveraged to
improve performance; (iii) we demonstrate that distillation significantly
speeds up training; and, (iv) we obtain state-of-the-art results on the
challenging LRS2 and LRS3 datasets for training only on publicly available
data.

Comment: ICASSP 202
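A hedged sketch of the combined objective described above: a CTC term computed on transcripts hypothesised by the audio ASR model, plus a frame-wise cross-entropy term against the ASR model's per-frame posteriors. The variable names and the weighting `beta` are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

ctc_criterion = torch.nn.CTCLoss(blank=0, zero_infinity=True)

def cross_modal_distillation_loss(video_logits, asr_frame_posteriors,
                                  asr_transcripts, input_lengths,
                                  transcript_lengths, beta=1.0):
    """video_logits, asr_frame_posteriors: (time, batch, num_labels)."""
    log_probs = F.log_softmax(video_logits, dim=-1)
    # Sequence-level supervision: CTC against the ASR model's transcript hypothesis
    ctc_term = ctc_criterion(log_probs, asr_transcripts,
                             input_lengths, transcript_lengths)
    # Frame-level supervision: cross-entropy against the ASR posterior distribution
    ce_term = -(asr_frame_posteriors * log_probs).sum(dim=-1).mean()
    return ctc_term + beta * ce_term
```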
Deep Lip Reading: a comparison of models and an online application
The goal of this paper is to develop state-of-the-art models for lip reading
-- visual speech recognition. We develop three architectures and compare their
accuracy and training times: (i) a recurrent model using LSTMs; (ii) a fully
convolutional model; and (iii) the recently proposed transformer model. The
recurrent and fully convolutional models are trained with a Connectionist
Temporal Classification loss and use an explicit language model for decoding,
while the transformer is a sequence-to-sequence model. Our best performing model
improves the state-of-the-art word error rate on the challenging BBC-Oxford Lip
Reading Sentences 2 (LRS2) benchmark dataset by over 20 percent.
As a further contribution we investigate the fully convolutional model when
used for online (real time) lip reading of continuous speech, and show that it
achieves high performance with low latency.

Comment: To appear in Interspeech 201
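For illustration, the two training regimes compared above might be sketched as the following generic skeletons; the dimensions and layer counts are assumptions, not the paper's architectures. The recurrent model emits per-frame character logits for a CTC loss, while the transformer is trained as a sequence-to-sequence model with teacher forcing.

```python
import torch
import torch.nn as nn

class RecurrentCTCLipReader(nn.Module):
    """Visual features -> BiLSTM -> per-frame character logits, trained with CTC."""
    def __init__(self, feat_dim=512, hidden=256, vocab=40):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers=3,
                            bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * hidden, vocab)

    def forward(self, visual_feats):           # (batch, time, feat_dim)
        out, _ = self.lstm(visual_feats)
        return self.proj(out)                  # (batch, time, vocab) for CTCLoss

class Seq2SeqTransformerLipReader(nn.Module):
    """Transformer decoder attends over visual features to predict characters."""
    def __init__(self, feat_dim=512, d_model=512, vocab=40):
        super().__init__()
        self.in_proj = nn.Linear(feat_dim, d_model)
        self.embed = nn.Embedding(vocab, d_model)
        self.transformer = nn.Transformer(d_model=d_model, batch_first=True)
        self.out = nn.Linear(d_model, vocab)

    def forward(self, visual_feats, prev_chars):   # teacher forcing at train time
        memory_in = self.in_proj(visual_feats)
        dec = self.transformer(memory_in, self.embed(prev_chars))
        return self.out(dec)                       # cross-entropy vs next characters
```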
Analysis of Automatic Speech Recognition Methods
This paper outlines the structures of different automatic speech recognition systems, both hybrid and end-to-end, and the pros and cons of each, including a comparison of their training data and computational resource requirements. Three main approaches to speech recognition are considered: the hybrid Hidden Markov Model - Deep Neural Network, end-to-end Connectionist Temporal Classification, and Sequence-to-Sequence. The Listen, Attend, and Spell approach is chosen as the example for the Sequence-to-Sequence model.
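Since Listen, Attend, and Spell (LAS) is the named Sequence-to-Sequence example, a rough sketch of its structure may help: a recurrent "listener" encodes the acoustic frames, and an attention-based "speller" decodes characters. The layer sizes and single-layer content attention below are placeholder assumptions, not the original LAS configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LASSketch(nn.Module):
    def __init__(self, feat_dim=80, hidden=256, vocab=30):
        super().__init__()
        self.listener = nn.LSTM(feat_dim, hidden, num_layers=2,
                                bidirectional=True, batch_first=True)
        self.speller = nn.LSTMCell(vocab + 2 * hidden, hidden)
        self.attn_query = nn.Linear(hidden, 2 * hidden)
        self.out = nn.Linear(hidden + 2 * hidden, vocab)

    def forward(self, feats, prev_onehot_chars):
        # Listen: encode the acoustic frames
        enc, _ = self.listener(feats)                     # (B, T, 2H)
        h = enc.new_zeros(feats.size(0), self.speller.hidden_size)
        c = torch.zeros_like(h)
        logits = []
        for t in range(prev_onehot_chars.size(1)):
            # Attend: content-based attention over encoder states
            query = self.attn_query(h).unsqueeze(1)       # (B, 1, 2H)
            scores = (query * enc).sum(-1)                # (B, T)
            context = (F.softmax(scores, dim=-1).unsqueeze(-1) * enc).sum(1)
            # Spell: predict the next character from decoder state + context
            h, c = self.speller(
                torch.cat([prev_onehot_chars[:, t], context], dim=-1), (h, c))
            logits.append(self.out(torch.cat([h, context], dim=-1)))
        return torch.stack(logits, dim=1)                 # (B, U, vocab)
```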
VPN: Learning Video-Pose Embedding for Activities of Daily Living
In this paper, we focus on the spatio-temporal aspect of recognizing
Activities of Daily Living (ADL). ADL have two specific properties (i) subtle
spatio-temporal patterns and (ii) similar visual patterns varying with time.
Therefore, ADL may look very similar, and distinguishing them often requires
examining their fine-grained details. Because recent spatio-temporal 3D
ConvNets are too rigid to capture the subtle visual patterns across an action,
we propose a novel Video-Pose Network: VPN. The two key components of this VPN
are a spatial embedding and an attention network. The spatial embedding
projects the 3D poses and RGB cues in a common semantic space. This enables the
action recognition framework to learn better spatio-temporal features
exploiting both modalities. In order to discriminate similar actions, the
attention network provides two functionalities - (i) an end-to-end learnable
pose backbone exploiting the topology of the human body, and (ii) a coupler to
provide joint spatio-temporal attention weights across a video. Experiments
show that VPN outperforms the state of the art for action classification on a
large-scale human activity dataset, NTU-RGB+D 120; its subset NTU-RGB+D 60; a
real-world challenging human activity dataset, Toyota Smarthome; and a
small-scale human-object interaction dataset, Northwestern UCLA.

Comment: Accepted in ECCV 202
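The two VPN components described above, a joint video-pose embedding and a pose-driven attention coupler, might be caricatured as follows. This is a loose PyTorch sketch under assumed feature dimensions (e.g. 2048-d RGB features from a 3D ConvNet backbone and 25 joints x 3 coordinates for the pose); the real VPN is considerably more elaborate.

```python
import torch
import torch.nn as nn

class VPNSketch(nn.Module):
    def __init__(self, rgb_dim=2048, pose_dim=75, embed_dim=512, num_classes=120):
        super().__init__()
        # Spatial embedding: project RGB cues and 3D poses into a common space
        self.rgb_embed = nn.Linear(rgb_dim, embed_dim)
        self.pose_embed = nn.Sequential(nn.Linear(pose_dim, embed_dim), nn.ReLU(),
                                        nn.Linear(embed_dim, embed_dim))
        # Attention coupler: pose features produce per-frame attention weights
        self.attn = nn.Sequential(nn.Linear(embed_dim, embed_dim), nn.Sigmoid())
        self.classifier = nn.Linear(embed_dim, num_classes)

    def forward(self, rgb_feats, pose_feats):
        # rgb_feats: (batch, time, rgb_dim) from a 3D ConvNet backbone
        # pose_feats: (batch, time, pose_dim) flattened 3D joint coordinates
        rgb = self.rgb_embed(rgb_feats)
        pose = self.pose_embed(pose_feats)
        weights = self.attn(pose)                # per-frame, per-channel attention
        attended = rgb * weights                 # modulate RGB cues with pose
        return self.classifier(attended.mean(dim=1))   # temporal average pooling
```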