Multitask Learning with CTC and Segmental CRF for Speech Recognition
Segmental conditional random fields (SCRFs) and connectionist temporal
classification (CTC) are two sequence labeling methods used for end-to-end
training of speech recognition models. Both models define a transcription
probability by marginalizing over latent segmentation alternatives: the former
uses a globally normalized joint model of segment labels and durations, while
the latter classifies each frame as either an output symbol or a
"continuation" of the previous label. In this
paper, we train a recognition model by optimizing an interpolation between the
SCRF and CTC losses, where the same recurrent neural network (RNN) encoder is
used for feature extraction for both outputs. We find that this multitask
objective improves recognition accuracy when decoding with either the SCRF or
CTC models. Additionally, we show that CTC can also be used to pretrain the RNN
encoder, which improves the convergence rate when learning the joint model.
Comment: 5 pages, 2 figures, camera-ready version at Interspeech 201
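As a rough illustration of the interpolated objective, the following is a
minimal PyTorch sketch, not the authors' code: a shared bidirectional RNN
encoder feeds both a CTC head and an SCRF head, and the two losses are mixed
with a weight lam. The scrf_criterion below is a zero-valued stand-in so the
sketch executes; a real implementation would return the globally normalized
segmental-CRF negative log-likelihood. All names, dimensions, and the 0.5
weight are assumptions.

    import torch
    import torch.nn as nn

    class MultitaskEncoder(nn.Module):
        """Shared bidirectional RNN encoder with a CTC head and an SCRF head."""
        def __init__(self, feat_dim=40, hidden=256, vocab=30):
            super().__init__()
            self.rnn = nn.LSTM(feat_dim, hidden, num_layers=3,
                               bidirectional=True, batch_first=True)
            self.ctc_head = nn.Linear(2 * hidden, vocab + 1)  # +1 for the CTC blank
            self.scrf_head = nn.Linear(2 * hidden, vocab)

        def forward(self, feats):
            enc, _ = self.rnn(feats)                      # (B, T, 2*hidden)
            return self.ctc_head(enc), self.scrf_head(enc)

    ctc_criterion = nn.CTCLoss(blank=0, zero_infinity=True)

    def scrf_criterion(scores, targets, target_lens):
        # Zero-valued stand-in so the sketch runs; a real segmental CRF
        # criterion would marginalize over latent segmentations and return
        # the globally normalized negative log-likelihood.
        return scores.sum() * 0.0

    def multitask_loss(model, feats, feat_lens, targets, target_lens, lam=0.5):
        ctc_logits, scrf_scores = model(feats)
        log_probs = ctc_logits.log_softmax(-1).transpose(0, 1)  # (T, B, V+1)
        l_ctc = ctc_criterion(log_probs, targets, feat_lens, target_lens)
        l_scrf = scrf_criterion(scrf_scores, targets, target_lens)
        return lam * l_ctc + (1.0 - lam) * l_scrf         # interpolated loss

Setting lam to 1.0 recovers pure CTC training, which matches the abstract's
pretraining observation: the CTC branch alone can be optimized first, then the
joint objective switched on.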
Leveraging native language information for improved accented speech recognition
Recognition of accented speech is a long-standing challenge for automatic
speech recognition (ASR) systems, given the increasing worldwide population of
bilingual speakers who use English as their second language. If we view
foreign-accented speech as an interpolation of the native language (L1) and
English (L2), a model that can simultaneously address both languages should
perform better at the acoustic level on accented speech. In this study, we
explore how an end-to-end recurrent neural network (RNN) system trained on
English together with native languages (Spanish and Indian languages) can
leverage native-language data to improve performance on accented English
speech. To this
end, we examine pre-training with native languages, as well as multi-task
learning (MTL) in which the main task is trained with native English and the
secondary task is trained with Spanish or Indian languages. We show that the
proposed MTL model performs better than the pre-training approach and
outperforms a baseline model trained simply with English data. We suggest a new
setting for MTL in which the secondary task is trained with both English and
the native language, using the same output set. This proposed scenario yields
better performance, with character error rate gains of 11.95% and 17.55% over
the baseline for Hispanic and Indian accents, respectively.
Comment: Accepted at Interspeech 201
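To make the shared-encoder idea concrete, here is a hedged PyTorch sketch of
one possible MTL training step, not the paper's implementation. In the
proposed setting the secondary task (English plus the native language) reuses
the same output set, so a single softmax head can serve both tasks; the task
weight alpha and all dimensions are illustrative assumptions.

    import torch
    import torch.nn as nn

    class SharedEncoderASR(nn.Module):
        """One encoder and, since both tasks share an output set, one head."""
        def __init__(self, feat_dim=40, hidden=256, vocab=30):
            super().__init__()
            self.encoder = nn.LSTM(feat_dim, hidden, num_layers=4,
                                   bidirectional=True, batch_first=True)
            self.head = nn.Linear(2 * hidden, vocab + 1)  # +1 for the CTC blank

        def forward(self, feats):
            enc, _ = self.encoder(feats)
            return self.head(enc).log_softmax(-1).transpose(0, 1)  # (T, B, V+1)

    ctc = nn.CTCLoss(blank=0, zero_infinity=True)

    def mtl_step(model, main_batch, aux_batch, alpha=0.7):
        """Main task: English speech. Auxiliary task: native-language (L1)
        speech mapped onto the same output symbols. alpha is a hypothetical
        interpolation weight between the two CTC losses."""
        feats_m, flen_m, tgt_m, tlen_m = main_batch
        feats_a, flen_a, tgt_a, tlen_a = aux_batch
        loss_main = ctc(model(feats_m), tgt_m, flen_m, tlen_m)
        loss_aux = ctc(model(feats_a), tgt_a, flen_a, tlen_a)
        return alpha * loss_main + (1.0 - alpha) * loss_aux

In the pre-training alternative examined in the abstract, the same encoder
would instead first be trained on native-language data and then fine-tuned on
English.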
Hierarchical Multitask Learning With CTC
In automatic speech recognition, it is still challenging to learn useful
intermediate representations when using high-level (or abstract) target units
such as words. For that reason, character- or phoneme-based systems tend to
outperform word-based systems when only a few hundred hours of training data
are available. In this paper, we first show how hierarchical multi-task
training can encourage the formation of useful intermediate representations. We
achieve this by performing Connectionist Temporal Classification at different
levels of the network with targets of different granularity. Our model thus
makes predictions at multiple scales for the same input. On the standard
300h Switchboard training setup, our hierarchical multi-task architecture
exhibits improvements over single-task architectures with the same number of
parameters. Our model obtains a 14.0% word error rate on the Eval2000
Switchboard subset without any decoder or language model, outperforming the
current state of the art among acoustic-to-word models.
Comment: In Proceedings of SLT 201
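A minimal sketch of the hierarchical idea, assuming a PyTorch setup and two
granularities (characters low in the network, words at the top): CTC is
applied at two depths over the same input. The layer sizes, unit inventories,
and the 0.5 mixing weight are assumptions, not the paper's exact
configuration.

    import torch
    import torch.nn as nn

    class HierarchicalCTC(nn.Module):
        """CTC applied at two depths: a lower block predicts fine-grained
        units (e.g., characters), the top block predicts coarse units
        (e.g., words)."""
        def __init__(self, feat_dim=40, hidden=256, n_chars=30, n_words=10000):
            super().__init__()
            self.lower = nn.LSTM(feat_dim, hidden, num_layers=2,
                                 bidirectional=True, batch_first=True)
            self.upper = nn.LSTM(2 * hidden, hidden, num_layers=2,
                                 bidirectional=True, batch_first=True)
            self.char_head = nn.Linear(2 * hidden, n_chars + 1)  # +1 blank
            self.word_head = nn.Linear(2 * hidden, n_words + 1)

        def forward(self, feats):
            low, _ = self.lower(feats)
            high, _ = self.upper(low)
            return (self.char_head(low).log_softmax(-1).transpose(0, 1),
                    self.word_head(high).log_softmax(-1).transpose(0, 1))

    ctc = nn.CTCLoss(blank=0, zero_infinity=True)

    def hierarchical_loss(model, feats, feat_lens, chars, char_lens,
                          words, word_lens, w_char=0.5):
        char_lp, word_lp = model(feats)
        # The fine-grained CTC shapes the intermediate representation; the
        # top-level word CTC is the main task.
        l_char = ctc(char_lp, chars, feat_lens, char_lens)
        l_word = ctc(word_lp, words, feat_lens, word_lens)
        return w_char * l_char + (1.0 - w_char) * l_word

In this reading, decoding would use only the top-level word head, consistent
with the abstract's decoder-free, language-model-free acoustic-to-word setup,
while the lower-level CTC serves purely as an auxiliary training signal.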