Bootstrap an end-to-end ASR system by multilingual training, transfer learning, text-to-text mapping and synthetic audio
Bootstrapping speech recognition on limited data resources has long been an
area of active research. The recent transition to all-neural models and
end-to-end (E2E) training brought along particular challenges as these models
are known to be data hungry, but also came with opportunities around
language-agnostic representations derived from multilingual data as well as
shared word-piece output representations across languages that share script and
roots. We investigate here the effectiveness of different strategies to
bootstrap an RNN-Transducer (RNN-T) based automatic speech recognition (ASR)
system in the low resource regime, while exploiting the abundant resources
available in other languages as well as the synthetic audio from a
text-to-speech (TTS) engine. Our experiments demonstrate that transfer learning
from a multilingual model, using a post-ASR text-to-text mapping and synthetic
audio deliver additive improvements, allowing us to bootstrap a model for a new
language with a fraction of the data that would otherwise be needed. The best
system achieved a 46% relative word error rate (WER) reduction compared to the
monolingual baseline, of which 25% relative WER improvement is attributable to
the post-ASR text-to-text mappings and the TTS synthetic data.
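For clarity, the relative WER reduction reported above is conventionally computed as (baseline WER − system WER) / baseline WER. A minimal sketch, using hypothetical absolute WER values (the abstract reports only relative figures):

```python
def relative_wer_reduction(wer_baseline: float, wer_system: float) -> float:
    """Relative WER reduction: (baseline - system) / baseline."""
    return (wer_baseline - wer_system) / wer_baseline

# Hypothetical example: a monolingual baseline at 30.0% WER and a best
# system at 16.2% WER would correspond to a 46% relative reduction.
print(round(relative_wer_reduction(0.30, 0.162), 2))  # 0.46
```

The absolute WER values here are illustrative only; the paper's reported numbers are the 46% (overall) and 25% (text-to-text mapping plus TTS data) relative reductions.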