2,647 research outputs found
Transfer Learning for Speech and Language Processing
Transfer learning is a vital technique that generalizes models trained for
one setting or task to other settings or tasks. For example in speech
recognition, an acoustic model trained for one language can be used to
recognize speech in another language, with little or no re-training data.
Transfer learning is closely related to multi-task learning (cross-lingual vs.
multilingual), and is traditionally studied in the name of `model adaptation'.
Recent advance in deep learning shows that transfer learning becomes much
easier and more effective with high-level abstract features learned by deep
models, and the `transfer' can be conducted not only between data distributions
and data types, but also between model structures (e.g., shallow nets and deep
nets) or even model types (e.g., Bayesian models and neural models). This
review paper summarizes some recent prominent research towards this direction,
particularly for speech and language processing. We also report some results
from our group and highlight the potential of this very interesting research
field.Comment: 13 pages, APSIPA 201
Leveraging native language information for improved accented speech recognition
Recognition of accented speech is a long-standing challenge for automatic
speech recognition (ASR) systems, given the increasing worldwide population of
bi-lingual speakers with English as their second language. If we consider
foreign-accented speech as an interpolation of the native language (L1) and
English (L2), using a model that can simultaneously address both languages
would perform better at the acoustic level for accented speech. In this study,
we explore how an end-to-end recurrent neural network (RNN) trained system with
English and native languages (Spanish and Indian languages) could leverage data
of native languages to improve performance for accented English speech. To this
end, we examine pre-training with native languages, as well as multi-task
learning (MTL) in which the main task is trained with native English and the
secondary task is trained with Spanish or Indian Languages. We show that the
proposed MTL model performs better than the pre-training approach and
outperforms a baseline model trained simply with English data. We suggest a new
setting for MTL in which the secondary task is trained with both English and
the native language, using the same output set. This proposed scenario yields
better performance with +11.95% and +17.55% character error rate gains over
baseline for Hispanic and Indian accents, respectively.Comment: Accepted at Interspeech 201
Multilingual Speech Recognition With A Single End-To-End Model
Training a conventional automatic speech recognition (ASR) system to support
multiple languages is challenging because the sub-word unit, lexicon and word
inventories are typically language specific. In contrast, sequence-to-sequence
models are well suited for multilingual ASR because they encapsulate an
acoustic, pronunciation and language model jointly in a single network. In this
work we present a single sequence-to-sequence ASR model trained on 9 different
Indian languages, which have very little overlap in their scripts.
Specifically, we take a union of language-specific grapheme sets and train a
grapheme-based sequence-to-sequence model jointly on data from all languages.
We find that this model, which is not explicitly given any information about
language identity, improves recognition performance by 21% relative compared to
analogous sequence-to-sequence models trained on each language individually. By
modifying the model to accept a language identifier as an additional input
feature, we further improve performance by an additional 7% relative and
eliminate confusion between different languages.Comment: Accepted in ICASSP 201
Symbolic inductive bias for visually grounded learning of spoken language
A widespread approach to processing spoken language is to first automatically
transcribe it into text. An alternative is to use an end-to-end approach:
recent works have proposed to learn semantic embeddings of spoken language from
images with spoken captions, without an intermediate transcription step. We
propose to use multitask learning to exploit existing transcribed speech within
the end-to-end setting. We describe a three-task architecture which combines
the objectives of matching spoken captions with corresponding images, speech
with text, and text with images. We show that the addition of the speech/text
task leads to substantial performance improvements on image retrieval when
compared to training the speech/image task in isolation. We conjecture that
this is due to a strong inductive bias transcribed speech provides to the
model, and offer supporting evidence for this.Comment: ACL 201
- …