Streaming End-to-end Speech Recognition For Mobile Devices
End-to-end (E2E) models, which directly predict output character sequences
given input speech, are good candidates for on-device speech recognition. E2E
models, however, present numerous challenges: In order to be truly useful, such
models must decode speech utterances in a streaming fashion, in real time; they
must be robust to the long tail of use cases; they must be able to leverage
user-specific context (e.g., contact lists); and above all, they must be
extremely accurate. In this work, we describe our efforts at building an E2E
speech recognizer using a recurrent neural network transducer. In experimental
evaluations, we find that the proposed approach can outperform a conventional
CTC-based model in terms of both latency and accuracy in a number of evaluation
categories.
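As a sketch of the training objective behind such a recognizer: an RNN transducer is typically trained with the RNN-T loss over a lattice of (frame, token-prefix) alignments. Below is a minimal, hedged example using torchaudio's rnnt_loss; the tensor shapes, toy logits, and blank index are illustrative assumptions, not the paper's configuration.

    import torch
    import torchaudio

    # Toy dimensions: batch of 2 utterances, 50 encoder frames, 10 target
    # tokens, 32-symbol vocabulary (index 0 reserved for blank here).
    batch, frames, tokens, vocab = 2, 50, 10, 32

    # Joint-network output: one logit per (frame, token-prefix, symbol) cell.
    logits = torch.randn(batch, frames, tokens + 1, vocab, requires_grad=True)
    targets = torch.randint(1, vocab, (batch, tokens), dtype=torch.int32)
    logit_lengths = torch.full((batch,), frames, dtype=torch.int32)
    target_lengths = torch.full((batch,), tokens, dtype=torch.int32)

    loss = torchaudio.functional.rnnt_loss(
        logits, targets, logit_lengths, target_lengths, blank=0
    )
    loss.backward()  # gradients flow back into the encoder/predictor/joint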
CB-Conformer: Contextual biasing Conformer for biased word recognition
Because of the mismatch between source and target domains, how to better utilize biased-word information to improve the performance of automatic speech recognition models in the target domain has become an active research topic.
Previous approaches either decode with a fixed external language model or
introduce a sizeable biasing module, which leads to poor adaptability and slow
inference. In this work, we propose CB-Conformer to improve biased word
recognition by introducing the Contextual Biasing Module and the Self-Adaptive
Language Model to vanilla Conformer. The Contextual Biasing Module combines
audio fragments and contextual information, with only 0.2% model parameters of
the original Conformer. The Self-Adaptive Language Model modifies the internal
weights of biased words based on their recall and precision, resulting in a
greater focus on biased words and more successful integration with the
automatic speech recognition model than the standard fixed language model. In
addition, we construct and release an open-source Mandarin biased-word dataset
based on WenetSpeech. Experiments indicate that our proposed method brings a 15.34% character error rate reduction, a 14.13% biased-word recall increase, and a 6.80% biased-word F1-score increase compared with the base Conformer.
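The paper specifies its own module; as a generic illustration of contextual biasing, one can let encoder frames attend over embeddings of bias phrases via cross-attention, with a residual path preserving the unbiased features. The class below is a hypothetical sketch, not CB-Conformer's actual architecture.

    import torch
    import torch.nn as nn

    class BiasingCrossAttention(nn.Module):
        """Generic contextual-biasing block (illustrative, not CB-Conformer)."""

        def __init__(self, d_model: int, n_heads: int = 4):
            super().__init__()
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.norm = nn.LayerNorm(d_model)

        def forward(self, frames: torch.Tensor, bias_emb: torch.Tensor) -> torch.Tensor:
            # frames:   (batch, time, d_model) acoustic encoder output
            # bias_emb: (batch, n_phrases, d_model), one embedding per bias phrase
            biased, _ = self.attn(query=frames, key=bias_emb, value=bias_emb)
            return self.norm(frames + biased)  # residual keeps the unbiased path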
Improved Contextual Recognition In Automatic Speech Recognition Systems By Semantic Lattice Rescoring
Automatic Speech Recognition (ASR) has attracted profound research interest. Recent breakthroughs have given ASR systems new prospects, such as faithfully transcribing spoken language, a pivotal advancement in building conversational agents. However, accurately discerning context-dependent words and phrases remains a persistent challenge. In this work, we propose a novel approach for enhancing contextual recognition within ASR systems via semantic lattice processing, leveraging the power of deep learning models to deliver accurate transcriptions across a wide variety of vocabularies and speaking styles. Our solution uses Hidden Markov Models and Gaussian Mixture Models (HMM-GMM) along with Deep Neural Network (DNN) models, integrating both language and acoustic modeling for better accuracy. We rescore the word lattice with a transformer-based model, achieving a palpable reduction in Word Error Rate (WER). We demonstrate the effectiveness of our proposed framework on the LibriSpeech dataset with empirical analyses.
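For a flavor of second-pass rescoring: the simplest special case of lattice rescoring is rescoring an n-best list, interpolating first-pass scores with a transformer language model's log-probabilities. The sketch below uses an off-the-shelf GPT-2 from Hugging Face transformers purely for illustration; the paper rescores full word lattices, and the interpolation weight lam is an assumed hyperparameter.

    import torch
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    tok = GPT2TokenizerFast.from_pretrained("gpt2")
    lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()

    def lm_logprob(text: str) -> float:
        ids = tok(text, return_tensors="pt").input_ids
        with torch.no_grad():
            loss = lm(ids, labels=ids).loss  # mean per-token negative log-likelihood
        return -loss.item() * ids.size(1)    # scale to an (approximate) total log-prob

    def rescore(nbest: list[tuple[str, float]], lam: float = 0.3) -> tuple[str, float]:
        # nbest: (hypothesis text, first-pass score); pick the best interpolated score
        return max(nbest, key=lambda h: (1 - lam) * h[1] + lam * lm_logprob(h[0]))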
On Biasing Transformer Attention Towards Monotonicity
Many sequence-to-sequence tasks in natural language processing are roughly
monotonic in the alignment between source and target sequence, and previous
work has facilitated or enforced learning of monotonic attention behavior via
specialized attention functions or pretraining. In this work, we introduce a
monotonicity loss function that is compatible with standard attention
mechanisms and test it on several sequence-to-sequence tasks:
grapheme-to-phoneme conversion, morphological inflection, transliteration, and
dialect normalization. Experiments show that we can achieve largely monotonic
behavior. Performance is mixed, with larger gains on top of RNN baselines.
General monotonicity does not benefit transformer multi-head attention; however, we see isolated improvements when only a subset of heads is biased towards monotonic behavior.
Comment: To be published in: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2021)
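One plausible form such a loss can take (an illustrative formulation, not necessarily the paper's exact definition): compute the expected source position under each target step's attention distribution and penalize any backward movement between consecutive steps.

    import torch

    def monotonicity_loss(attn: torch.Tensor) -> torch.Tensor:
        # attn: (batch, target_len, source_len); each row is a distribution.
        src_pos = torch.arange(attn.size(-1), dtype=attn.dtype, device=attn.device)
        expected = (attn * src_pos).sum(dim=-1)            # (batch, target_len)
        backward = (expected[:, :-1] - expected[:, 1:]).clamp(min=0)
        return backward.mean()  # zero iff expected positions never move backward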
Streaming Speech-to-Confusion Network Speech Recognition
In interactive automatic speech recognition (ASR) systems, low-latency
requirements limit the amount of search space that can be explored during
decoding, particularly in end-to-end neural ASR. In this paper, we present a
novel streaming ASR architecture that outputs a confusion network while
maintaining limited latency, as needed for interactive applications. We show
that 1-best results of our model are on par with a comparable RNN-T system,
while the richer hypothesis set allows second-pass rescoring to achieve 10-20\%
lower word error rate on the LibriSpeech task. We also show that our model
outperforms a strong RNN-T baseline on a far-field voice assistant task.Comment: Submitted to Interspeech 202
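A confusion network can be pictured as a sequence of "bins," each holding word alternatives with posterior probabilities; the 1-best hypothesis simply takes the top word per bin, while rescoring can search over the alternatives. The toy structure below is illustrative only; the paper's networks are produced directly by the streaming model.

    from dataclasses import dataclass

    @dataclass
    class Bin:
        alternatives: dict[str, float]  # word -> posterior (roughly sums to 1)

    def one_best(confnet: list[Bin]) -> list[str]:
        return [max(b.alternatives, key=b.alternatives.get) for b in confnet]

    net = [
        Bin({"flights": 0.7, "fights": 0.3}),
        Bin({"to": 0.9, "two": 0.1}),
        Bin({"boston": 0.6, "austin": 0.4}),
    ]
    print(one_best(net))  # ['flights', 'to', 'boston']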