Hierarchical Multi Task Learning With CTC
In Automatic Speech Recognition it is still challenging to learn useful
intermediate representations when using high-level (or abstract) target units
such as words. For that reason, character- or phoneme-based systems tend to
outperform word-based systems when only a few hundred hours of training data
are used. In this paper, we first show how hierarchical multi-task
training can encourage the formation of useful intermediate representations. We
achieve this by performing Connectionist Temporal Classification at different
levels of the network with targets of different granularity. Our model thus
makes predictions at multiple scales for the same input. On the standard
300h Switchboard training setup, our hierarchical multi-task architecture
exhibits improvements over single-task architectures with the same number of
parameters. Our model obtains 14.0% Word Error Rate on the Eval2000 Switchboard
subset without any decoder or language model, outperforming the current
state-of-the-art on acoustic-to-word models.
Comment: In Proceedings of SLT 2018
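
The recipe described above, CTC losses over units of different granularity attached at different depths of a single encoder, fits in a short PyTorch sketch. Layer counts, unit inventory sizes, and the auxiliary weight below are illustrative assumptions, not the paper's exact configuration.

```python
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalCTCModel(nn.Module):
    """Sketch: one stacked encoder with CTC supervision at two depths."""
    def __init__(self, feat_dim=80, hidden=320, n_phones=46, n_words=10000):
        super().__init__()
        self.lower = nn.LSTM(feat_dim, hidden, num_layers=3, batch_first=True)
        self.upper = nn.LSTM(hidden, hidden, num_layers=3, batch_first=True)
        self.phone_head = nn.Linear(hidden, n_phones)  # fine-grained targets
        self.word_head = nn.Linear(hidden, n_words)    # abstract targets

    def forward(self, feats):                          # feats: (B, T, feat_dim)
        low, _ = self.lower(feats)
        high, _ = self.upper(low)
        # nn.CTCLoss expects (T, B, C) log-probabilities.
        phone_lp = F.log_softmax(self.phone_head(low), dim=-1).transpose(0, 1)
        word_lp = F.log_softmax(self.word_head(high), dim=-1).transpose(0, 1)
        return phone_lp, word_lp

def hierarchical_ctc_loss(phone_lp, word_lp, batch, aux_weight=0.3):
    ctc = nn.CTCLoss(blank=0, zero_infinity=True)
    main = ctc(word_lp, batch["words"], batch["feat_lens"], batch["word_lens"])
    aux = ctc(phone_lp, batch["phones"], batch["feat_lens"], batch["phone_lens"])
    return main + aux_weight * aux  # aux_weight is an assumed hyperparameter
```
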
Hierarchical Multitask Learning for CTC-based Speech Recognition
Previous work has shown that neural encoder-decoder speech recognition can be
improved with hierarchical multitask learning, where auxiliary tasks are added
at intermediate layers of a deep encoder. We explore the effect of hierarchical
multitask learning in the context of connectionist temporal classification
(CTC)-based speech recognition, and investigate several aspects of this
approach. Consistent with previous work, we observe performance improvements on
telephone conversational speech recognition (specifically the Eval2000 test
sets) when training a subword-level CTC model with an auxiliary phone loss at
an intermediate layer. We analyze the effects of a number of experimental
variables (like interpolation constant and position of the auxiliary loss
function), performance in lower-resource settings, and the relationship between
pretraining and multitask learning. We observe that the hierarchical multitask
approach improves over standard multitask training in our higher-data
experiments, while in the low-resource settings standard multitask training
works well. The best results are obtained by combining hierarchical multitask
learning and pretraining, which improves word error rates by 3.4% absolute on
the Eval2000 test sets.
Comment: Technical Report
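
Two of the experimental variables mentioned above reduce to a few lines: the interpolation constant is a convex weight between the subword and phone CTC losses, and pretraining simply optimizes the phone branch alone before multitask training begins. Below is a hedged sketch reusing the two-head model shape from the previous entry; lambda_ and the staging are assumptions.

```python
def interpolated_step(model, batch, ctc, lambda_=0.3, pretraining=False):
    """Total loss: (1 - lambda) * L_subword + lambda * L_phone."""
    phone_lp, subword_lp = model(batch["feats"])
    phone_loss = ctc(phone_lp, batch["phones"],
                     batch["feat_lens"], batch["phone_lens"])
    if pretraining:            # phone-only pretraining stage
        return phone_loss
    subword_loss = ctc(subword_lp, batch["subwords"],
                       batch["feat_lens"], batch["subword_lens"])
    return (1.0 - lambda_) * subword_loss + lambda_ * phone_loss
```
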
Multi-encoder multi-resolution framework for end-to-end speech recognition
Attention-based methods and Connectionist Temporal Classification (CTC)
network have been promising research directions for end-to-end Automatic Speech
Recognition (ASR). The joint CTC/Attention model has achieved great success by
utilizing both architectures during multi-task training and joint decoding. In
this work, we present a novel Multi-Encoder Multi-Resolution (MEMR) framework
based on the joint CTC/Attention model. Two heterogeneous encoders with
different architectures, temporal resolutions and separate CTC networks work in
parallel to extract complementary acoustic information. A hierarchical
attention mechanism is then used to combine the encoder-level information. To
demonstrate the effectiveness of the proposed model, experiments are conducted
on Wall Street Journal (WSJ) and CHiME-4, resulting in relative Word Error Rate
(WER) reduction of 18.0-32.1%. Moreover, the proposed MEMR model achieves 3.6%
WER on the WSJ eval92 test set, which is the best WER reported for an
end-to-end system on this benchmark.
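
The encoder-level fusion is the distinctive piece: frame-level attention first summarizes each encoder into a context vector, then a second attention weighs the two contexts. Below is a minimal sketch under assumed dimensions; both encoders are taken to share an output size, and a projection would handle a mismatch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalAttention(nn.Module):
    """Sketch: fuse two encoder streams with two-level attention."""
    def __init__(self, enc_dim=320, dec_dim=320):
        super().__init__()
        self.frame_score = nn.Linear(enc_dim + dec_dim, 1)  # frame level
        self.enc_score = nn.Linear(enc_dim + dec_dim, 1)    # encoder level

    def _context(self, dec_state, enc_out):
        # enc_out: (B, T, enc_dim); dec_state: (B, dec_dim)
        q = dec_state.unsqueeze(1).expand(-1, enc_out.size(1), -1)
        w = F.softmax(self.frame_score(torch.cat([enc_out, q], -1)), dim=1)
        return (w * enc_out).sum(1)                         # (B, enc_dim)

    def forward(self, dec_state, enc_out1, enc_out2):
        c = torch.stack([self._context(dec_state, e)
                         for e in (enc_out1, enc_out2)], dim=1)  # (B, 2, D)
        q = dec_state.unsqueeze(1).expand(-1, 2, -1)
        beta = F.softmax(self.enc_score(torch.cat([c, q], -1)), dim=1)
        return (beta * c).sum(1)                            # fused context
```
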
On the Inductive Bias of Word-Character-Level Multi-Task Learning for Speech Recognition
End-to-end automatic speech recognition (ASR) commonly transcribes audio
signals into sequences of characters while its performance is evaluated by
measuring the word-error rate (WER). This suggests that predicting sequences of
words directly may be helpful instead. However, training with word-level
supervision can be more difficult due to the sparsity of examples per label
class. In this paper, we analyze an end-to-end ASR model that combines
word- and character-level representations in a multi-task learning (MTL) framework. We
show that it improves on the WER and study how the word-level model can benefit
from character-level supervision by analyzing the learned inductive preference
bias of each model component empirically. We find that by adding
character-level supervision, the MTL model interpolates between recognizing
more frequent words (preferred by the word-level model) and shorter words
(preferred by the character-level model).
Comment: Accepted at the IRASL workshop at NeurIPS 2018
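
Structurally, the analyzed model amounts to one shared encoder with a word head and a character head trained jointly. The sketch below is a hedged reading with placeholder sizes and CTC-style heads, which may differ from the paper's exact losses.

```python
import torch.nn as nn
import torch.nn.functional as F

class WordCharMTL(nn.Module):
    """Sketch: shared encoder, word and character heads at the same depth."""
    def __init__(self, feat_dim=80, hidden=320, n_chars=30, n_words=10000):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, num_layers=5, batch_first=True)
        self.char_head = nn.Linear(hidden, n_chars)
        self.word_head = nn.Linear(hidden, n_words)

    def forward(self, feats):
        h, _ = self.encoder(feats)
        char_lp = F.log_softmax(self.char_head(h), -1).transpose(0, 1)
        word_lp = F.log_softmax(self.word_head(h), -1).transpose(0, 1)
        # Word branch is decoded; character branch acts as auxiliary supervision.
        return char_lp, word_lp
```
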
Enhancing Handwritten Text Recognition with N-gram sequence decomposition and Multitask Learning
Current state-of-the-art approaches in the field of Handwritten Text
Recognition are predominantly single-task with unigram, character-level target
units. In our work, we utilize a Multi-task Learning scheme, training the model
to perform decompositions of the target sequence with target units of different
granularity, from fine to coarse. We consider this method as a way to utilize
n-gram information, implicitly, in the training process, while the final
recognition is performed using only the unigram output. Unigram decoding of
such a multi-task approach highlights the capability of the learned internal
representations imposed by the different n-grams at the training step. We
select n-grams as our target units and experiment from unigrams to four-grams,
i.e., subword-level granularities. These multiple decompositions are learned
by the network with
task-specific CTC losses. Concerning network architectures, we propose two
alternatives, namely the Hierarchical and the Block Multi-task. Overall, our
proposed model, even though evaluated only on the unigram task, outperforms
its single-task counterpart by an absolute 2.52% WER and 1.02% CER under
greedy decoding, without any computational overhead during inference, hinting
at a successfully imposed implicit language model.
Comment: ICPR 2020
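
The target decompositions themselves are easy to picture. The snippet below shows one plausible sliding-window reading of the n-gram units (the paper's exact decomposition may differ), with each granularity feeding its own task-specific CTC loss.

```python
def ngram_decomposition(chars, n):
    """Sliding-window n-gram units over a character transcription.
    One plausible reading of 'decompositions of the target sequence';
    the paper's exact scheme may differ."""
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]

# The same transcription viewed at unigram, bigram, and trigram granularity.
for n in (1, 2, 3):
    print(n, ngram_decomposition(list("speech"), n))
```
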
Joint Modeling of Accents and Acoustics for Multi-Accent Speech Recognition
The performance of automatic speech recognition systems degrades with
increasing mismatch between the training and testing scenarios. Differences in
speaker accents are a significant source of such mismatch. The traditional
approach to deal with multiple accents involves pooling data from several
accents during training and building a single model in multi-task fashion,
where tasks correspond to individual accents. In this paper, we explore an
alternate model where we jointly learn an accent classifier and a multi-task
acoustic model. Experiments on the American English Wall Street Journal and
British English Cambridge corpora demonstrate that our joint model outperforms
the strong multi-task acoustic model baseline. We obtain a 5.94% relative
improvement in word error rate on British English, and 9.47% relative
improvement on American English. This illustrates that jointly modeling with
accent information improves acoustic model performance.
Comment: Accepted at the 43rd IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP 2018)
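
One way to realize such a joint model is to pool the encoder into an utterance-level accent posterior and use it to mix accent-specific output heads. The sketch below is an assumption-laden illustration of that idea, not the authors' exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointAccentAM(nn.Module):
    """Sketch: accent classifier posterior mixes accent-specific heads."""
    def __init__(self, feat_dim=80, hidden=320, n_units=42, n_accents=2):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, num_layers=4, batch_first=True)
        self.accent_head = nn.Linear(hidden, n_accents)
        self.unit_heads = nn.ModuleList(
            [nn.Linear(hidden, n_units) for _ in range(n_accents)])

    def forward(self, feats):
        h, _ = self.encoder(feats)                   # (B, T, H)
        accent_logits = self.accent_head(h.mean(1))  # utterance pooling, (B, A)
        p_accent = F.softmax(accent_logits, -1)
        per_accent = torch.stack([head(h) for head in self.unit_heads], 1)
        # Mix accent-specific acoustic logits by the accent posterior.
        acoustic = (p_accent[:, :, None, None] * per_accent).sum(1)
        return accent_logits, acoustic               # both branches supervised
```
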
Exploring Architectures, Data and Units For Streaming End-to-End Speech Recognition with RNN-Transducer
We investigate training end-to-end speech recognition models with the
recurrent neural network transducer (RNN-T): a streaming, all-neural,
sequence-to-sequence architecture which jointly learns acoustic and language
model components from transcribed acoustic data. We explore various model
architectures and demonstrate how the model can be improved further if
additional text or pronunciation data are available. The model consists of an
'encoder', which is initialized from a connectionist temporal
classification (CTC)-based acoustic model, and a 'decoder', which is partially
initialized from a recurrent neural network language model trained on text data
alone. The entire neural network is trained with the RNN-T loss and directly
outputs the recognized transcript as a sequence of graphemes, thus performing
end-to-end speech recognition. We find that performance can be improved further
through the use of sub-word units ('wordpieces'), which capture longer context
and significantly reduce substitution errors. The best RNN-T system, a
twelve-layer LSTM encoder with a two-layer LSTM decoder trained with 30,000
wordpieces as output targets, achieves a word error rate of 8.5% on
voice-search and 5.2% on voice-dictation tasks, and is comparable to a
state-of-the-art baseline at 8.3% on voice-search and 5.4% on voice-dictation.
Comment: In Proceedings of IEEE ASRU 2017
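
For orientation, the RNN-T joiner combines each encoder frame with each prediction-network state into a (T, U+1) grid of logits that the transducer loss marginalizes over alignments. Here is a minimal runnable sketch with random tensors using torchaudio's rnnt_loss; the additive tanh joiner, sizes, and blank id are common choices, not necessarily this system's.

```python
import torch
import torchaudio.functional as taf

B, T, U, H, V = 2, 50, 10, 320, 1000    # batch, frames, tokens, hidden, vocab
enc = torch.randn(B, T, H)              # 'encoder' (acoustic) states
pred = torch.randn(B, U + 1, H)         # 'decoder'/prediction-network states
joint = torch.tanh(enc[:, :, None, :] + pred[:, None, :, :])  # (B, T, U+1, H)
logits = torch.nn.Linear(H, V)(joint)   # (B, T, U+1, V)

targets = torch.randint(1, V, (B, U), dtype=torch.int32)
logit_lengths = torch.full((B,), T, dtype=torch.int32)
target_lengths = torch.full((B,), U, dtype=torch.int32)
loss = taf.rnnt_loss(logits, targets, logit_lengths, target_lengths, blank=0)
print(loss)
```
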
Massively Multilingual Adversarial Speech Recognition
We report on adaptation of multilingual end-to-end speech recognition models
trained on as many as 100 languages. Our findings shed light on the relative
importance of similarity between the target and pretraining languages along the
dimensions of phonetics, phonology, language family, geographical location, and
orthography. In this context, experiments demonstrate the effectiveness of two
additional pretraining objectives in encouraging language-independent encoder
representations: a context-independent phoneme objective paired with a
language-adversarial classification objective.
Comment: Accepted at NAACL-HLT 2019
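
A language-adversarial objective of this kind is typically implemented with a gradient-reversal layer: the language classifier trains normally, while reversed gradients push the encoder toward language-independent features. Below is a standard sketch; the scale factor is an assumed hyperparameter.

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; negated, scaled gradient backward."""
    @staticmethod
    def forward(ctx, x, scale):
        ctx.scale = scale
        return x.clone()

    @staticmethod
    def backward(ctx, grad):
        return -ctx.scale * grad, None

def language_adversarial_logits(encoder_out, lang_classifier, scale=1.0):
    # The classifier learns to predict the language; reversed gradients
    # drive the encoder toward language-independent representations.
    return lang_classifier(GradReverse.apply(encoder_out, scale))
```
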
Advancing Multi-Accented LSTM-CTC Speech Recognition using a Domain Specific Student-Teacher Learning Paradigm
Non-native speech causes automatic speech recognition systems to degrade in
performance. Past strategies to address this challenge have considered model
adaptation, accent classification with model selection, alternate
pronunciation lexicon, etc. In this study, we consider a recurrent neural
network (RNN) with connectionist temporal classification (CTC) cost function
trained on multi-accent English data including US (Native), Indian and Hispanic
accents. We exploit dark knowledge from a model trained with the multi-accent
data to train student models under the guidance of both a teacher model and CTC
cost of the target transcription. We show that transferring knowledge from a
single trained RNN-CTC model to a student model yields better performance than
the stand-alone teacher model. Since the outputs of different trained CTC
models are not necessarily aligned, it is not possible to simply use an
ensemble of CTC teacher models. To address this problem, we train accent
specific models under the guidance of a single multi-accent teacher, which
results in multiple aligned, trained CTC models. Furthermore, we
train a student model under the supervision of the accent-specific teachers,
resulting in an even more complementary model, which achieves a +20.1%
relative Character Error Rate (CER) reduction compared to the baseline trained
without any teacher. With this effective multi-accent model, we can achieve
further improvement for each accent by adapting the model to each accent. Using
the accent-specific model's outputs to regularize the adaptation process (i.e.,
a knowledge-distillation form of the Kullback-Leibler (KL) divergence) results
in even better performance than the conventional approach using general
teacher models.
Comment: Accepted at SLT 2018
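
The distillation objective sketched in the abstract can be written as an interpolation of the hard CTC loss with a frame-level KL term between teacher and student posteriors. In the sketch below, alpha and the temperature are assumed hyperparameters, not values from the paper.

```python
import torch.nn.functional as F

def kd_ctc_loss(student_logits, teacher_logits, ctc_loss_value,
                alpha=0.5, temperature=1.0):
    """Interpolate the hard CTC loss with a soft teacher-student KL term."""
    t = temperature
    kl = F.kl_div(F.log_softmax(student_logits / t, dim=-1),
                  F.softmax(teacher_logits / t, dim=-1),
                  reduction="batchmean") * (t * t)
    return alpha * ctc_loss_value + (1.0 - alpha) * kl
```
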
Acoustically Grounded Word Embeddings for Improved Acoustics-to-Word Speech Recognition
Direct acoustics-to-word (A2W) systems for end-to-end automatic speech
recognition are simpler to train, and more efficient to decode with, than
sub-word systems. However, A2W systems can have difficulties at training time
when data is limited, and at decoding time when recognizing words outside the
training vocabulary. To address these shortcomings, we investigate the use of
recently proposed acoustic and acoustically grounded word embedding techniques
in A2W systems. The idea is based on treating the final pre-softmax weight
matrix of an A2W recognizer as a matrix of word embedding vectors, and using an
externally trained set of word embeddings to improve the quality of this
matrix. In particular, we introduce two ideas: (1) enforcing similarity at
training time between the external embeddings and the recognizer weights, and
(2) using the word embeddings at test time for predicting out-of-vocabulary
words. Our word embedding model is acoustically grounded, that is, it is learned
jointly with acoustic embeddings so as to encode the words' acoustic-phonetic
content; and it is parametric, so that it can embed any arbitrary (potentially
out-of-vocabulary) sequence of characters. We find that both techniques improve
the performance of an A2W recognizer on conversational telephone speech.
Comment: To appear at ICASSP 2019
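
Idea (1) admits a very small sketch: an auxiliary penalty pulling each row of the recognizer's pre-softmax weight matrix toward the externally trained embedding of the same word. The squared-distance form and weight below are assumptions; the paper's exact similarity criterion may differ.

```python
def embedding_regularizer(softmax_weight, external_emb, weight=1.0):
    """Auxiliary loss tying recognizer weights to external word embeddings.
    softmax_weight, external_emb: (vocab_size, emb_dim), row i = word i."""
    return weight * (softmax_weight - external_emb).pow(2).sum(dim=1).mean()
```
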