Toward Cross-Domain Speech Recognition with End-to-End Models
In multi-domain speech recognition, past research has focused
on hybrid acoustic models to build cross-domain and domain-invariant speech
recognition systems. In this paper, we empirically examine the difference in
behavior between hybrid acoustic models and neural end-to-end systems when
mixing acoustic training data from several domains. For these experiments we
composed a multi-domain dataset from public sources, with the different domains
in the corpus covering a wide variety of topics and acoustic conditions such as
telephone conversations, lectures, read speech and broadcast news. We show that
for the hybrid models, supplying additional training data from other domains
with mismatched acoustic conditions does not increase the performance on
specific domains. However, our end-to-end models optimized with a sequence-based
criterion generalize better than the hybrid models on diverse domains. In terms
of word-error-rate performance, our experimental acoustic-to-word and
attention-based models trained on a multi-domain dataset reach the performance of
domain-specific long short-term memory (LSTM) hybrid models, thus resulting in
multi-domain speech recognition systems that do not lose performance relative to
domain-specific ones. Moreover, the use of neural end-to-end models eliminates
the need for domain-adapted language models during recognition, which is a great
advantage when the input domain is unknown.
Comment: Presented in Life-Long Learning for Spoken Language Systems Workshop
- ASRU 201
Towards Lifelong Learning of End-to-end ASR
Automatic speech recognition (ASR) technologies today are primarily optimized
for given datasets; thus, any changes in the application environment (e.g.,
acoustic conditions or topic domains) may degrade performance.
We can collect new data describing the new environment and fine-tune the
system, but this naturally leads to higher error rates for the earlier
datasets, a phenomenon referred to as catastrophic forgetting. Lifelong
learning (LLL), which aims to enable a machine to sequentially learn new tasks
from new datasets describing the changing real world without forgetting
previously learned knowledge, has thus attracted attention. This paper reports,
to our knowledge, the first effort to extensively consider and analyze the use
of various LLL approaches in end-to-end (E2E) ASR, including novel
methods for saving data from past domains to mitigate the catastrophic forgetting
problem. An overall relative reduction of 28.7% in WER was achieved compared to
the fine-tuning baseline when sequentially learning on three very different
benchmark corpora. This can be the first step toward the highly desired ASR
technologies capable of synchronizing with the continuously changing real
world.
Comment: Interspeech 2021. We acknowledge the support of Salesforce Research
Deep Learning Gran
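
Both abstracts report results as word error rate (WER), and the second quotes a relative WER reduction against a fine-tuning baseline. As a minimal sketch of how these two quantities are commonly computed (not the authors' scoring pipeline), the Python below derives WER from a word-level edit distance and then the relative reduction between two WER values. The function names and the example sentences and WER figures are hypothetical, chosen only for illustration; they do not come from either paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting i reference words to match an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which new_wer improves on baseline_wer."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


if __name__ == "__main__":
    # Hypothetical transcripts: one substitution and one deletion over six words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
    # Hypothetical WER values (in percent), not figures from the papers.
    print(relative_wer_reduction(20.0, 14.26))
```

A relative reduction is a ratio of WER values, so it does not by itself reveal the absolute WER of either system: a reduction of 28.7% relative to the baseline is consistent with many different baseline error rates.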