Search CORE

54 research outputs found

Audio-attention discriminative language model for ASR rescoring

Author: Gandhe Ankur
Rastrow Ariya
Publication venue
Publication date: 18/02/2020
Field of study

End-to-end approaches for automatic speech recognition (ASR) benefit from directly modeling the probability of the word sequence given the input audio stream in a single neural network. However, compared to conventional ASR systems, these models typically require more data to achieve comparable results. Well-known model adaptation techniques, to account for domain and style adaptation, are not easily applicable to end-to-end systems. Conventional HMM-based systems, on the other hand, have been optimized for various production environments and use cases. In this work, we propose to combine the benefits of end-to-end approaches with a conventional system using an attention-based discriminative language model that learns to rescore the output of a first-pass ASR system. We show that learning to rescore a list of potential ASR outputs is much simpler than learning to generate the hypothesis. The proposed model results in 8% improvement in word error rate even when the amount of training data is a fraction of data used for training the first-pass system.Comment: 4 pages, 1 figure, Accepted at ICASSP 202

arXiv.org e-Print Archive

Crossref

Constrained Discriminative Training of N-gram Language Models

Author: Abhinav Sethy
Ariya Rastrow
Bhuvana Ramabhadran
Publication venue
Publication date: 01/01/2009
Field of study

Abstract—In this paper, we present a novel version of discriminative training for N-gram language models. Language models impose language specific constraints on the acoustic hypothesis and are crucial in discriminating between competing acoustic hypotheses. As reported in the literature, discriminative training of acoustic models has yielded significant improvements in the performance of a speech recognition system, however, discriminative training for N-gram language models (LMs) has not yielded the same impact. In this paper, we present three techniques to improve the discriminative training of LMs, namely updating the back-off probability of unseen events, normalization of the N-gram updates to ensure a probability distribution and a relative-entropy based global constraint on the N-gram probability updates. We also present a framework for discriminative adaptation of LMs to a new domain and compare it to existing linear interpolation methods. Results are reported on the Broadcast News and the MIT lecture corpora. A modest improvement of 0.2 % absolute (on Broadcast News) and 0.3% absolute (on MIT lectures) was observed with discriminatively trained LMs over state-of-the-art systems. I

CiteSeerX

Crossref

Streaming Speech-to-Confusion Network Speech Recognition

Author: Filimonov Denis
Gandhe Ankur
Pandey Prabhat
Rastrow Ariya
Stolcke Andreas
Publication venue
Publication date: 02/06/2023
Field of study

In interactive automatic speech recognition (ASR) systems, low-latency requirements limit the amount of search space that can be explored during decoding, particularly in end-to-end neural ASR. In this paper, we present a novel streaming ASR architecture that outputs a confusion network while maintaining limited latency, as needed for interactive applications. We show that 1-best results of our model are on par with a comparable RNN-T system, while the richer hypothesis set allows second-pass rescoring to achieve 10-20\% lower word error rate on the LibriSpeech task. We also show that our model outperforms a strong RNN-T baseline on a far-field voice assistant task.Comment: Submitted to Interspeech 202

arXiv.org e-Print Archive

Speech To Semantics: Improve ASR and NLU Jointly via All-Neural Interfaces

Author: Bui Bach
Dheram Pranav
Raju Anirudh
Rao Milind
Rastrow Ariya
Publication venue: 'International Speech Communication Association'
Publication date: 13/08/2020
Field of study

We consider the problem of spoken language understanding (SLU) of extracting natural language intents and associated slot arguments or named entities from speech that is primarily directed at voice assistants. Such a system subsumes both automatic speech recognition (ASR) as well as natural language understanding (NLU). An end-to-end joint SLU model can be built to a required specification opening up the opportunity to deploy on hardware constrained scenarios like devices enabling voice assistants to work offline, in a privacy preserving manner, whilst also reducing server costs. We first present models that extract utterance intent directly from speech without intermediate text output. We then present a compositional model, which generates the transcript using the Listen Attend Spell ASR system and then extracts interpretation using a neural NLU model. Finally, we contrast these methods to a jointly trained end-to-end joint SLU model, consisting of ASR and NLU subsystems which are connected by a neural network based interface instead of text, that produces transcripts as well as NLU interpretation. We show that the jointly trained model shows improvements to ASR incorporating semantic information from NLU and also improves NLU by exposing it to ASR confusion encoded in the hidden layer.Comment: Proceedings of INTERSPEEC

arXiv.org e-Print Archive

Crossref

Contextual Language Model Adaptation for Conversational Agents

Author: Gandhe Ankur
Hedayatnia Behnam
Khatri Chandra
Liu Linda
Metallinou Angeliki
Raju Anirudh
Rastrow Ariya
Venkatesh Anu
Publication venue: 'International Speech Communication Association'
Publication date: 31/07/2018
Field of study

Statistical language models (LM) play a key role in Automatic Speech Recognition (ASR) systems used by conversational agents. These ASR systems should provide a high accuracy under a variety of speaking styles, domains, vocabulary and argots. In this paper, we present a DNN-based method to adapt the LM to each user-agent interaction based on generalized contextual information, by predicting an optimal, context-dependent set of LM interpolation weights. We show that this framework for contextual adaptation provides accuracy improvements under different possible mixture LM partitions that are relevant for both (1) Goal-oriented conversational agents where it's natural to partition the data by the requested application and for (2) Non-goal oriented conversational agents where the data can be partitioned using topic labels that come from predictions of a topic classifier. We obtain a relative WER improvement of 3% with a 1-pass decoding strategy and 6% in a 2-pass decoding framework, over an unadapted model. We also show up to a 15% relative improvement in recognizing named entities which is of significant value for conversational ASR systems.Comment: Interspeech 2018 (accepted

arXiv.org e-Print Archive

Crossref