Streaming End-to-end Speech Recognition For Mobile Devices
End-to-end (E2E) models, which directly predict output character sequences
given input speech, are good candidates for on-device speech recognition. E2E
models, however, present numerous challenges: In order to be truly useful, such
models must decode speech utterances in a streaming fashion, in real time; they
must be robust to the long tail of use cases; they must be able to leverage
user-specific context (e.g., contact lists); and above all, they must be
extremely accurate. In this work, we describe our efforts at building an E2E
speech recognizer using a recurrent neural network transducer. In experimental
evaluations, we find that the proposed approach can outperform a conventional
CTC-based model in terms of both latency and accuracy in a number of evaluation
categories.
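As a sketch of the training objective behind such a recognizer: an RNN transducer is typically trained with the RNN-T loss over a lattice of (frame, token-prefix) alignments. Below is a minimal, hedged example using torchaudio's rnnt_loss; the tensor shapes, toy logits, and blank index are illustrative assumptions, not the paper's configuration.

    import torch
    import torchaudio

    # Toy dimensions: batch of 2 utterances, 50 encoder frames, 10 target
    # tokens, 32-symbol vocabulary (index 0 reserved for blank here).
    batch, frames, tokens, vocab = 2, 50, 10, 32

    # Joint-network output: one logit per (frame, token-prefix, symbol) cell.
    logits = torch.randn(batch, frames, tokens + 1, vocab, requires_grad=True)
    targets = torch.randint(1, vocab, (batch, tokens), dtype=torch.int32)
    logit_lengths = torch.full((batch,), frames, dtype=torch.int32)
    target_lengths = torch.full((batch,), tokens, dtype=torch.int32)

    loss = torchaudio.functional.rnnt_loss(
        logits, targets, logit_lengths, target_lengths, blank=0
    )
    loss.backward()  # gradients flow back into the encoder/predictor/joint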
CB-Conformer: Contextual biasing Conformer for biased word recognition
Because of the mismatch between source and target domains, how to better utilize biased-word information to improve the performance of automatic speech recognition models in the target domain has become an active research topic.
Previous approaches either decode with a fixed external language model or
introduce a sizeable biasing module, which leads to poor adaptability and slow
inference. In this work, we propose CB-Conformer to improve biased word
recognition by introducing the Contextual Biasing Module and the Self-Adaptive
Language Model to vanilla Conformer. The Contextual Biasing Module combines
audio fragments and contextual information, with only 0.2% model parameters of
the original Conformer. The Self-Adaptive Language Model modifies the internal
weights of biased words based on their recall and precision, resulting in a
greater focus on biased words and more successful integration with the
automatic speech recognition model than the standard fixed language model. In
addition, we construct and release an open-source Mandarin biased-word dataset
based on WenetSpeech. Experiments indicate that our proposed method brings a 15.34% character error rate reduction, a 14.13% biased-word recall increase, and a 6.80% biased-word F1-score increase compared with the base Conformer.
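The paper specifies its own module; as a generic illustration of contextual biasing, one can let encoder frames attend over embeddings of bias phrases via cross-attention, with a residual path preserving the unbiased features. The class below is a hypothetical sketch, not CB-Conformer's actual architecture.

    import torch
    import torch.nn as nn

    class BiasingCrossAttention(nn.Module):
        """Generic contextual-biasing block (illustrative, not CB-Conformer)."""

        def __init__(self, d_model: int, n_heads: int = 4):
            super().__init__()
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.norm = nn.LayerNorm(d_model)

        def forward(self, frames: torch.Tensor, bias_emb: torch.Tensor) -> torch.Tensor:
            # frames:   (batch, time, d_model) acoustic encoder output
            # bias_emb: (batch, n_phrases, d_model), one embedding per bias phrase
            biased, _ = self.attn(query=frames, key=bias_emb, value=bias_emb)
            return self.norm(frames + biased)  # residual keeps the unbiased path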
Improved Contextual Recognition In Automatic Speech Recognition Systems By Semantic Lattice Rescoring
Automatic Speech Recognition (ASR) has attracted profound research interest. Recent breakthroughs have given ASR systems new prospects, such as faithfully transcribing spoken language, a pivotal advancement in building conversational agents. However, accurately discerning context-dependent words and phrases remains a persistent challenge. In this work, we propose a novel approach for enhancing contextual recognition within ASR systems via semantic lattice processing, leveraging the power of deep learning models to deliver accurate transcriptions across a wide variety of vocabularies and speaking styles. Our solution uses Hidden Markov Models and Gaussian Mixture Models (HMM-GMM) along with Deep Neural Network (DNN) models, integrating both language and acoustic modeling for better accuracy. We rescore the word lattice with a transformer-based model, achieving a palpable reduction in Word Error Rate (WER). We demonstrate the effectiveness of our proposed framework on the LibriSpeech dataset with empirical analyses.
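For a flavor of second-pass rescoring: the simplest special case of lattice rescoring is rescoring an n-best list, interpolating first-pass scores with a transformer language model's log-probabilities. The sketch below uses an off-the-shelf GPT-2 from Hugging Face transformers purely for illustration; the paper rescores full word lattices, and the interpolation weight lam is an assumed hyperparameter.

    import torch
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    tok = GPT2TokenizerFast.from_pretrained("gpt2")
    lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()

    def lm_logprob(text: str) -> float:
        ids = tok(text, return_tensors="pt").input_ids
        with torch.no_grad():
            loss = lm(ids, labels=ids).loss  # mean per-token negative log-likelihood
        return -loss.item() * ids.size(1)    # scale to an (approximate) total log-prob

    def rescore(nbest: list[tuple[str, float]], lam: float = 0.3) -> tuple[str, float]:
        # nbest: (hypothesis text, first-pass score); pick the best interpolated score
        return max(nbest, key=lambda h: (1 - lam) * h[1] + lam * lm_logprob(h[0]))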
On Biasing Transformer Attention Towards Monotonicity
Many sequence-to-sequence tasks in natural language processing are roughly
monotonic in the alignment between source and target sequence, and previous
work has facilitated or enforced learning of monotonic attention behavior via
specialized attention functions or pretraining. In this work, we introduce a
monotonicity loss function that is compatible with standard attention
mechanisms and test it on several sequence-to-sequence tasks:
grapheme-to-phoneme conversion, morphological inflection, transliteration, and
dialect normalization. Experiments show that we can achieve largely monotonic
behavior. Performance is mixed, with larger gains on top of RNN baselines.
General monotonicity does not benefit transformer multi-head attention; however, we see isolated improvements when only a subset of heads is biased towards monotonic behavior.
Comment: To be published in: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2021)
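One plausible form such a loss can take (an illustrative formulation, not necessarily the paper's exact definition): compute the expected source position under each target step's attention distribution and penalize any backward movement between consecutive steps.

    import torch

    def monotonicity_loss(attn: torch.Tensor) -> torch.Tensor:
        # attn: (batch, target_len, source_len); each row is a distribution.
        src_pos = torch.arange(attn.size(-1), dtype=attn.dtype, device=attn.device)
        expected = (attn * src_pos).sum(dim=-1)            # (batch, target_len)
        backward = (expected[:, :-1] - expected[:, 1:]).clamp(min=0)
        return backward.mean()  # zero iff expected positions never move backward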
Streaming Speech-to-Confusion Network Speech Recognition
In interactive automatic speech recognition (ASR) systems, low-latency
requirements limit the amount of search space that can be explored during
decoding, particularly in end-to-end neural ASR. In this paper, we present a
novel streaming ASR architecture that outputs a confusion network while
maintaining limited latency, as needed for interactive applications. We show
that 1-best results of our model are on par with a comparable RNN-T system,
while the richer hypothesis set allows second-pass rescoring to achieve 10-20\%
lower word error rate on the LibriSpeech task. We also show that our model
outperforms a strong RNN-T baseline on a far-field voice assistant task.Comment: Submitted to Interspeech 202
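A confusion network can be pictured as a sequence of "bins," each holding word alternatives with posterior probabilities; the 1-best hypothesis simply takes the top word per bin, while rescoring can search over the alternatives. The toy structure below is illustrative only; the paper's networks are produced directly by the streaming model.

    from dataclasses import dataclass

    @dataclass
    class Bin:
        alternatives: dict[str, float]  # word -> posterior (roughly sums to 1)

    def one_best(confnet: list[Bin]) -> list[str]:
        return [max(b.alternatives, key=b.alternatives.get) for b in confnet]

    net = [
        Bin({"flights": 0.7, "fights": 0.3}),
        Bin({"to": 0.9, "two": 0.1}),
        Bin({"boston": 0.6, "austin": 0.4}),
    ]
    print(one_best(net))  # ['flights', 'to', 'boston']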