Contextual Language Model Adaptation for Conversational Agents
Statistical language models (LMs) play a key role in Automatic Speech
Recognition (ASR) systems used by conversational agents. These ASR systems
should provide high accuracy across a variety of speaking styles, domains,
vocabularies, and argots. In this paper, we present a DNN-based method to adapt
the LM to each user-agent interaction based on generalized contextual
information, by predicting an optimal, context-dependent set of LM
interpolation weights. We show that this framework for contextual adaptation
provides accuracy improvements under different possible mixture LM partitions
that are relevant for both (1) goal-oriented conversational agents, where it is
natural to partition the data by the requested application, and (2)
non-goal-oriented conversational agents, where the data can be partitioned
using topic labels predicted by a topic classifier. We obtain a relative
WER improvement of 3% with a 1-pass decoding strategy and 6% in a 2-pass
decoding framework, over an unadapted model. We also show up to a 15% relative
improvement in recognizing named entities, which is of significant value for
conversational ASR systems.
Comment: Interspeech 2018 (accepted)
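As an illustration of the interpolation idea in this abstract, here is a minimal sketch of mixing component LM probabilities with context-dependent weights, together with the relative WER calculation used to report the gains. The function names are hypothetical, and the logits stand in for the output of the DNN weight predictor, which the abstract does not specify in detail; this is not the authors' implementation.

```python
import math


def softmax(logits):
    """Turn raw scores into non-negative weights that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def interpolate(component_probs, weights):
    """Mixture LM probability: sum_i w_i * p_i(word | history)."""
    return sum(w * p for w, p in zip(weights, component_probs))


def relative_wer_improvement(baseline_wer, adapted_wer):
    """Relative improvement as a fraction, e.g. 0.06 for 6%."""
    return (baseline_wer - adapted_wer) / baseline_wer


# Hypothetical logits for three component LMs (assumed predictor output).
weights = softmax([2.0, 0.5, -1.0])
# Hypothetical per-component probabilities for one word given its history.
p_mixture = interpolate([0.02, 0.10, 0.001], weights)
```

Because the weights come from a softmax, the mixture stays a valid probability whenever each component is one.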
Dynamic Acoustic Unit Augmentation With BPE-Dropout for Low-Resource End-to-End Speech Recognition
With the rapid development of speech assistants, adapting server-oriented
automatic speech recognition (ASR) solutions to run directly on the device has become
crucial. Researchers and industry prefer to use end-to-end ASR systems for
on-device speech recognition tasks. This is because end-to-end systems can be
made resource-efficient while maintaining a higher quality compared to hybrid
systems. However, building end-to-end models requires a significant amount of
speech data. Another challenge associated with speech assistants is
personalization, which mainly involves handling out-of-vocabulary (OOV) words.
In this work, we consider building an effective end-to-end ASR system in
low-resource setups with a high OOV rate, represented by the Babel Turkish and
Babel Georgian tasks. To address the aforementioned problems, we propose a method of
dynamic acoustic unit augmentation based on the BPE-dropout technique. It
non-deterministically tokenizes utterances to extend the tokens' contexts and
to regularize their distribution, helping the model recognize unseen words.
It also reduces the need for optimal subword vocabulary size search. The
technique provides a steady improvement in regular and personalized
(OOV-oriented) speech recognition tasks (at least 6% relative WER and 25%
relative F-score) at no additional computational cost. Owing to the use of
BPE-dropout, our monolingual Turkish Conformer established a competitive result
with 22.2% character error rate (CER) and 38.9% word error rate (WER), which is
close to the best published multilingual system.
Comment: 16 pages, 7 figures
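The non-deterministic tokenization described in this abstract can be sketched as follows. This is a minimal illustration of the general BPE-dropout idea (each applicable merge is skipped with some probability at every step), not the authors' pipeline; the function name, the toy merge table, and the dropout value are assumptions made for the example.

```python
import random


def bpe_dropout_tokenize(word, merges, dropout=0.1, rng=None):
    """Segment `word` with BPE merges, randomly dropping candidate merges.

    merges: list of (left, right) pairs, ordered from highest to lowest priority.
    dropout: probability of skipping each applicable merge at each step.
    """
    rng = rng or random.Random()
    ranks = {pair: i for i, pair in enumerate(merges)}
    tokens = list(word)  # start from single characters
    while len(tokens) > 1:
        # Adjacent pairs that are known merges and survive dropout this step.
        candidates = [
            (ranks[(a, b)], i)
            for i, (a, b) in enumerate(zip(tokens, tokens[1:]))
            if (a, b) in ranks and rng.random() >= dropout
        ]
        if not candidates:
            break  # nothing left to merge in this step
        _, i = min(candidates)  # apply the highest-priority surviving merge
        tokens[i:i + 2] = [tokens[i] + tokens[i + 1]]
    return tokens


# Toy merge table (hypothetical, for illustration only).
toy_merges = [("l", "o"), ("lo", "w"), ("e", "r"), ("low", "er")]
# Repeated calls may yield different segmentations of the same word.
samples = [bpe_dropout_tokenize("lower", toy_merges, dropout=0.3) for _ in range(5)]
```

With `dropout=0.0` the function reduces to ordinary deterministic BPE, and with `dropout=1.0` it leaves the word split into characters, so a single parameter spans the range between the two.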