Audio-attention discriminative language model for ASR rescoring
End-to-end approaches for automatic speech recognition (ASR) benefit from
directly modeling the probability of the word sequence given the input audio
stream in a single neural network. However, compared to conventional ASR
systems, these models typically require more data to achieve comparable
results. Well-known model adaptation techniques, to account for domain and
style adaptation, are not easily applicable to end-to-end systems. Conventional
HMM-based systems, on the other hand, have been optimized for various
production environments and use cases. In this work, we propose to combine the
benefits of end-to-end approaches with a conventional system using an
attention-based discriminative language model that learns to rescore the output
of a first-pass ASR system. We show that learning to rescore a list of
potential ASR outputs is much simpler than learning to generate the hypothesis.
The proposed model results in an 8% improvement in word error rate even when the amount of training data is a fraction of the data used to train the first-pass system.
Comment: 4 pages, 1 figure, Accepted at ICASSP 202
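As a rough illustration of the rescoring idea (a minimal sketch, not the authors' attention-based model), the Python snippet below re-ranks an n-best list by interpolating each hypothesis's first-pass score with a score from a second-pass model; the `second_pass_score` callable and the interpolation weight are placeholders for the learned components.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Hypothesis:
    words: List[str]         # candidate word sequence from the first pass
    first_pass_score: float  # log-score assigned by the first-pass ASR system

def rescore_nbest(nbest: List[Hypothesis],
                  second_pass_score: Callable[[List[str]], float],
                  weight: float = 0.5) -> List[Hypothesis]:
    """Re-rank an n-best list by interpolating first- and second-pass log-scores."""
    def combined(h: Hypothesis) -> float:
        return (1.0 - weight) * h.first_pass_score + weight * second_pass_score(h.words)
    return sorted(nbest, key=combined, reverse=True)

if __name__ == "__main__":
    # Toy second-pass scorer (hypothetical): mildly favors shorter hypotheses.
    toy_lm = lambda words: -0.1 * len(words)
    nbest = [
        Hypothesis(["turn", "on", "the", "lights"], first_pass_score=-2.3),
        Hypothesis(["turn", "on", "the", "light"], first_pass_score=-2.4),
    ]
    best = rescore_nbest(nbest, toy_lm)[0]
    print(" ".join(best.words))
```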
Constrained Discriminative Training of N-gram Language Models
In this paper, we present a novel version of discriminative training for N-gram language models. Language models impose language-specific constraints on the acoustic hypothesis and are crucial in discriminating between competing acoustic hypotheses. As reported in the literature, discriminative training of acoustic models has yielded significant improvements in the performance of a speech recognition system; however, discriminative training of N-gram language models (LMs) has not yielded the same impact. In this paper, we present three techniques to improve the discriminative training of LMs: updating the back-off probability of unseen events, normalizing the N-gram updates to ensure a valid probability distribution, and imposing a relative-entropy-based global constraint on the N-gram probability updates. We also present a framework for discriminative adaptation of LMs to a new domain and compare it to existing linear interpolation methods. Results are reported on the Broadcast News and the MIT lecture corpora. A modest improvement of 0.2% absolute (on Broadcast News) and 0.3% absolute (on MIT lectures) was observed with discriminatively trained LMs over state-of-the-art systems.
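The normalization and relative-entropy constraints can be pictured with a small sketch (illustrative only, not the paper's exact update rule): discriminative updates are applied to the N-gram probabilities of one history, the result is renormalized to remain a distribution, and the update is rejected if it drifts too far, in KL divergence, from the original distribution. The update values and the KL threshold here are hypothetical.

```python
import math
from typing import Dict

def constrained_update(probs: Dict[str, float],
                       updates: Dict[str, float],
                       max_kl: float = 0.1) -> Dict[str, float]:
    """Apply discriminative updates to P(w | h) for one history h,
    renormalize to keep a valid distribution, and enforce a KL bound
    relative to the original distribution (illustrative only)."""
    # Additive update, floored at a small positive value.
    raw = {w: max(p + updates.get(w, 0.0), 1e-8) for w, p in probs.items()}
    total = sum(raw.values())
    new = {w: v / total for w, v in raw.items()}            # normalization step
    kl = sum(p * math.log(p / new[w]) for w, p in probs.items() if p > 0)
    return new if kl <= max_kl else probs                   # global KL constraint

if __name__ == "__main__":
    p = {"cat": 0.5, "dog": 0.3, "fish": 0.2}
    print(constrained_update(p, {"cat": 0.05, "dog": -0.05}))
```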
Streaming Speech-to-Confusion Network Speech Recognition
In interactive automatic speech recognition (ASR) systems, low-latency
requirements limit the amount of search space that can be explored during
decoding, particularly in end-to-end neural ASR. In this paper, we present a
novel streaming ASR architecture that outputs a confusion network while
maintaining limited latency, as needed for interactive applications. We show
that the 1-best results of our model are on par with a comparable RNN-T system, while the richer hypothesis set allows second-pass rescoring to achieve 10-20% lower word error rate on the LibriSpeech task. We also show that our model outperforms a strong RNN-T baseline on a far-field voice assistant task.
Comment: Submitted to Interspeech 202
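A confusion network can be represented as a sequence of word slots, each holding alternative words with posterior probabilities. The sketch below (an assumed toy representation, not the proposed streaming architecture) shows 1-best extraction and a simple second-pass rescoring over paths expanded from the top alternatives in each slot.

```python
import itertools
import math
from typing import Callable, List, Tuple

# A confusion network as a sequence of word slots; each slot holds
# (word, posterior) alternatives whose posteriors sum to roughly 1.
ConfusionNetwork = List[List[Tuple[str, float]]]

def one_best(cn: ConfusionNetwork) -> List[str]:
    """Pick the highest-posterior word in every slot (the 1-best path)."""
    return [max(slot, key=lambda wp: wp[1])[0] for slot in cn]

def rescore(cn: ConfusionNetwork,
            lm_score: Callable[[List[str]], float],
            top_k: int = 2) -> List[str]:
    """Expand the top-k alternatives per slot and pick the path that
    maximizes posterior log-probability plus a second-pass LM score."""
    pruned = [sorted(slot, key=lambda wp: -wp[1])[:top_k] for slot in cn]
    best_words, best_score = [], float("-inf")
    for path in itertools.product(*pruned):
        words = [w for w, _ in path]
        score = sum(math.log(p) for _, p in path) + lm_score(words)
        if score > best_score:
            best_words, best_score = words, score
    return best_words

if __name__ == "__main__":
    cn = [[("turn", 0.9), ("torn", 0.1)],
          [("on", 0.6), ("of", 0.4)],
          [("the", 0.95), ("a", 0.05)],
          [("lights", 0.5), ("light", 0.5)]]
    print(one_best(cn))
    # Toy second-pass scorer (hypothetical) used only to break ties.
    print(rescore(cn, lambda words: -0.1 * len(words)))
```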
Speech To Semantics: Improve ASR and NLU Jointly via All-Neural Interfaces
We consider the spoken language understanding (SLU) problem of extracting natural language intents and associated slot arguments or named entities from speech that is primarily directed at voice assistants. Such a system subsumes both automatic speech recognition (ASR) and natural language understanding (NLU). An end-to-end joint SLU model can be built to a required specification, opening up the opportunity to deploy in hardware-constrained scenarios such as on-device use, which enables voice assistants to work offline in a privacy-preserving manner while also reducing server costs.
We first present models that extract utterance intent directly from speech
without intermediate text output. We then present a compositional model, which
generates the transcript using the Listen Attend Spell ASR system and then
extracts the interpretation using a neural NLU model. Finally, we contrast these methods with a jointly trained end-to-end SLU model that produces transcripts as well as NLU interpretations, consisting of ASR and NLU subsystems connected by a neural network based interface instead of text. We show that the jointly trained model improves ASR by incorporating semantic information from NLU, and also improves NLU by exposing it to the ASR confusion encoded in the hidden layer.
Comment: Proceedings of INTERSPEECH
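One way to picture the neural interface is that the NLU component consumes the ASR model's hidden states rather than a discrete transcript. The PyTorch-style sketch below is a minimal illustration under that assumption; the layer sizes, mean-pooling interface, and intent-only NLU head are simplifications, not the architecture evaluated in the paper.

```python
import torch
import torch.nn as nn

class JointSLU(nn.Module):
    """Illustrative joint SLU model: an ASR encoder whose hidden states feed
    an NLU intent classifier through a neural interface instead of text
    (a sketch under stated assumptions, not the paper's architecture)."""
    def __init__(self, feat_dim=80, hidden=256, vocab=1000, n_intents=20):
        super().__init__()
        self.asr_encoder = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.asr_output = nn.Linear(hidden, vocab)       # per-frame transcript logits
        self.nlu_head = nn.Linear(hidden, n_intents)     # consumes hidden states, not text

    def forward(self, audio_feats):
        enc, _ = self.asr_encoder(audio_feats)           # (B, T, hidden)
        transcript_logits = self.asr_output(enc)         # ASR branch
        pooled = enc.mean(dim=1)                         # neural interface: pooled hidden states
        intent_logits = self.nlu_head(pooled)            # NLU branch sees ASR uncertainty
        return transcript_logits, intent_logits

if __name__ == "__main__":
    model = JointSLU()
    feats = torch.randn(2, 50, 80)                       # batch of 2, 50 frames, 80-dim features
    transcript_logits, intent_logits = model(feats)
    print(transcript_logits.shape, intent_logits.shape)
```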
Contextual Language Model Adaptation for Conversational Agents
Statistical language models (LMs) play a key role in Automatic Speech Recognition (ASR) systems used by conversational agents. These ASR systems should provide high accuracy across a variety of speaking styles, domains, vocabularies and argots. In this paper, we present a DNN-based method to adapt the LM to each user-agent interaction based on generalized contextual information, by predicting an optimal, context-dependent set of LM interpolation weights. We show that this framework for contextual adaptation provides accuracy improvements under different possible mixture LM partitions that are relevant for both (1) goal-oriented conversational agents, where it is natural to partition the data by the requested application, and (2) non-goal-oriented conversational agents, where the data can be partitioned using topic labels predicted by a topic classifier. We obtain a relative WER improvement of 3% with a 1-pass decoding strategy and 6% in a 2-pass decoding framework, over an unadapted model. We also show up to a 15% relative improvement in recognizing named entities, which is of significant value for conversational ASR systems.
Comment: Interspeech 2018 (accepted)
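The underlying mechanism is linear interpolation of several component LMs with context-dependent weights. A minimal sketch follows, assuming each component LM exposes a word probability given a history and that the weights have already been predicted by the DNN from contextual features (they are simply passed in here).

```python
from typing import Callable, List

def interpolated_prob(word: str,
                      history: List[str],
                      component_lms: List[Callable[[str, List[str]], float]],
                      weights: List[float]) -> float:
    """P(word | history) as a context-dependent mixture of component LMs.
    In the adapted system the weights would come from a DNN over generalized
    contextual features (application, topic, ...); here they are given."""
    assert abs(sum(weights) - 1.0) < 1e-6, "interpolation weights must sum to 1"
    return sum(w * lm(word, history) for w, lm in zip(weights, component_lms))

if __name__ == "__main__":
    # Two toy unigram "LMs" (hypothetical): one weather-like, one music-like.
    weather_lm = lambda w, h: {"rain": 0.3, "play": 0.05}.get(w, 0.01)
    music_lm   = lambda w, h: {"rain": 0.02, "play": 0.4}.get(w, 0.01)
    # Weights as a context/topic predictor might produce for a music request.
    print(interpolated_prob("play", [], [weather_lm, music_lm], [0.2, 0.8]))
```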