Dialog act guided contextual adapter for personalized speech recognition
Personalization in multi-turn dialogs has been a long standing challenge for
end-to-end automatic speech recognition (E2E ASR) models. Recent work on
contextual adapters has tackled rare word recognition using user catalogs. This
adaptation, however, does not incorporate an important cue, the dialog act,
which is available in a multi-turn dialog scenario. In this work, we propose a
dialog act guided contextual adapter network. Specifically, it leverages dialog
acts to select the most relevant user catalogs and creates queries based on
both the audio and the semantic relationship between the carrier
phrase and user catalogs to better guide the contextual biasing. On industrial
voice assistant datasets, our model outperforms both baselines (the dialog act
encoder-only model and the contextual adaptation), leading to the largest
improvement over the no-context model: 58% average relative word error rate
reduction (WERR) in the multi-turn dialog scenario, in comparison to the
prior-art contextual adapter, which has achieved 39% WERR over the no-context
model. Comment: Accepted at ICASSP 202
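The 58% and 39% figures above are relative word error rate reductions (WERR) over the no-context model. As a minimal sketch, assuming the usual definition (baseline WER minus model WER, divided by baseline WER), the arithmetic looks like this. The WER values in the example are hypothetical and are not taken from the paper.

```python
def relative_wer_reduction(baseline_wer: float, model_wer: float) -> float:
    """Return the relative WER reduction of a model over a baseline.

    Assumes the common definition (baseline - model) / baseline.
    Both WERs must be in the same unit (e.g. percent).
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - model_wer) / baseline_wer


# Hypothetical WERs (percent), chosen only to illustrate the arithmetic.
no_context_wer = 10.0
adapted_wer = 4.2
werr = relative_wer_reduction(no_context_wer, adapted_wer)
print(f"WERR over no-context model: {werr:.0%}")
```

A negative return value would indicate that the model is worse than the baseline.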
Contextualized Automatic Speech Recognition with Attention-Based Bias Phrase Boosted Beam Search
End-to-end (E2E) automatic speech recognition (ASR) methods exhibit
remarkable performance. However, since the performance of such methods is
intrinsically linked to the context present in the training data, E2E-ASR
methods do not perform as desired for unseen user contexts (e.g., technical
terms, personal names, and playlists). Thus, E2E-ASR methods must be easily
contextualized by the user or developer. This paper proposes an attention-based
contextual biasing method that can be customized using an editable phrase list
(referred to as a bias list). The proposed method can be trained effectively by
combining a bias phrase index loss and special tokens to detect the bias
phrases in the input speech data. In addition, to further improve the
contextualization performance during inference, we propose a bias phrase boosted (BPB)
beam search algorithm based on the bias phrase index probability. Experimental
results demonstrate that the proposed method consistently improves the word
error rate and the character error rate of the target phrases in the bias list
on both the LibriSpeech-960 (English) and our in-house (Japanese) datasets,
respectively. Comment: Accepted by ICASSP 2022
SpellMapper: A non-autoregressive neural spellchecker for ASR customization with candidate retrieval based on n-gram mappings
Contextual spelling correction models are an alternative to shallow fusion to
improve automatic speech recognition (ASR) quality given user vocabulary. To
deal with large user vocabularies, most of these models include candidate
retrieval mechanisms, usually based on minimum edit distance between fragments
of ASR hypothesis and user phrases. However, the edit-distance approach is
slow, non-trainable, and may have low recall as it relies only on common
letters. We propose: 1) a novel algorithm for candidate retrieval, based on
misspelled n-gram mappings, which gives up to 90% recall with just the top 10
candidates on Spoken Wikipedia; 2) a non-autoregressive neural model based on
BERT architecture, where the initial transcript and ten candidates are combined
into one input. The experiments on Spoken Wikipedia show 21.4% word error rate
improvement compared to a baseline ASR system. Comment: Accepted by INTERSPEECH 202
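For context, the edit-distance candidate retrieval that the abstract describes as the usual baseline can be sketched as follows. This is not SpellMapper's n-gram mapping algorithm. It is a minimal illustration of the baseline idea, assuming word-level fragments of the ASR hypothesis are compared with each user phrase by character-level Levenshtein distance. The function names and the example phrases are hypothetical.

```python
from typing import List, Tuple


def edit_distance(a: str, b: str) -> int:
    """Character-level Levenshtein distance using a rolling DP row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def retrieve_candidates(hypothesis: str, user_phrases: List[str],
                        top_k: int = 10) -> List[Tuple[str, int]]:
    """Rank user phrases by their best edit distance to any hypothesis fragment.

    Fragments are word n-grams with the same word count as the phrase
    (an assumption made for this sketch).
    """
    words = hypothesis.split()
    scored = []
    for phrase in user_phrases:
        n = len(phrase.split())
        fragments = [" ".join(words[i:i + n])
                     for i in range(max(1, len(words) - n + 1))]
        best = min(edit_distance(frag, phrase) for frag in fragments)
        scored.append((phrase, best))
    scored.sort(key=lambda item: item[1])
    return scored[:top_k]


# Hypothetical hypothesis and user vocabulary.
print(retrieve_candidates("play songs by jon bon jovy", ["bon jovi", "beyonce"], top_k=1))
```

Every phrase is compared against every fragment, which hints at why the abstract calls this approach slow for large user vocabularies.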
PROCTER: PROnunciation-aware ConTextual adaptER for personalized speech recognition in neural transducers
End-to-End (E2E) automatic speech recognition (ASR) systems used in voice
assistants often have difficulties recognizing infrequent words personalized to
the user, such as names and places. Rare words often have non-trivial
pronunciations, and in such cases, human knowledge in the form of a
pronunciation lexicon can be useful. We propose a PROnunCiation-aware
conTextual adaptER (PROCTER) that dynamically injects lexicon knowledge into an
RNN-T model by adding a phonemic embedding along with a textual embedding. The
experimental results show that the proposed PROCTER architecture outperforms
the baseline RNN-T model by improving the word error rate (WER) by 44% and 57%
when measured on personalized entities and personalized rare entities,
respectively, while increasing the model size (number of trainable parameters)
by only 1%. Furthermore, when evaluated in a zero-shot setting to recognize
personalized device names, we observe a 7% WER improvement with PROCTER, as
compared to only a 1% WER improvement with text-only contextual attention. Comment: To appear in Proc. IEEE ICASS
Can Contextual Biasing Remain Effective with Whisper and GPT-2?
End-to-end automatic speech recognition (ASR) and large language models, such
as Whisper and GPT-2, have recently been scaled to use vast amounts of training
data. Despite the large amount of training data, infrequent content words that
occur in a particular task may still exhibit poor ASR performance, with
contextual biasing a possible remedy. This paper investigates the effectiveness
of neural contextual biasing for Whisper combined with GPT-2. Specifically,
this paper proposes integrating an adapted tree-constrained pointer generator
(TCPGen) component for Whisper and a dedicated training scheme to dynamically
adjust the final output without modifying any Whisper model parameters.
Experiments across three datasets show a considerable reduction in errors on
biasing words with a biasing list of 1000 words. Contextual biasing was more
effective when applied to domain-specific data and can boost the performance of
Whisper and GPT-2 without losing their generality. Comment: To appear in Interspeech 202
Contextual Biasing of Named-Entities with Large Language Models
This paper studies contextual biasing with Large Language Models (LLMs),
where during second-pass rescoring additional contextual information is
provided to an LLM to boost Automatic Speech Recognition (ASR) performance. We
propose to leverage prompts for an LLM without fine-tuning during rescoring;
the prompts incorporate a biasing list and few-shot examples that serve as
additional information when calculating the score for the hypothesis. In addition to
few-shot prompt learning, we propose multi-task training of the LLM to predict
both the entity class and the next token. To improve the efficiency for
contextual biasing and to avoid exceeding LLMs' maximum sequence lengths, we
propose dynamic prompting, where we select the most likely class using the
class tag prediction, and only use entities in this class as contexts for next
token prediction. Word Error Rate (WER) evaluation is performed on i) an
internal calling, messaging, and dictation dataset, and ii) the SLUE-Voxpopuli
dataset. Results indicate that biasing lists and few-shot examples can achieve
17.8% and 9.6% relative improvement compared to first pass ASR, and that
multi-task training and dynamic prompting can achieve 20.0% and 11.3% relative
WER improvement, respectively. Comment: 5 pages, 4 figures. Conference: ICASSP 202
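The dynamic prompting step described above (pick the most likely entity class, then use only that class's entities as context) can be sketched roughly as follows. The class scores, entity lists, and prompt wording are hypothetical placeholders. The abstract does not specify the prompt format or how the class tag prediction is produced.

```python
from typing import Dict, List


def select_entities(class_scores: Dict[str, float],
                    entities_by_class: Dict[str, List[str]]) -> List[str]:
    """Return the entities of the highest-scoring class only."""
    best_class = max(class_scores, key=class_scores.get)
    return entities_by_class.get(best_class, [])


def build_prompt(entities: List[str], hypothesis: str) -> str:
    """Assemble a rescoring prompt (format is an assumption for illustration)."""
    context = ", ".join(entities)
    return f"Relevant entities: {context}\nTranscript: {hypothesis}"


# Hypothetical class tag scores and per-class biasing lists.
scores = {"contact": 0.7, "song": 0.2, "app": 0.1}
catalog = {
    "contact": ["Ana Souza", "Bo Lindqvist"],
    "song": ["Yesterday"],
    "app": ["Maps"],
}
print(build_prompt(select_entities(scores, catalog), "call ana souza"))
```

Restricting the context to one class keeps the prompt short, which is the stated motivation for staying within the LLM's maximum sequence length.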
Streaming Speech-to-Confusion Network Speech Recognition
In interactive automatic speech recognition (ASR) systems, low-latency
requirements limit the amount of search space that can be explored during
decoding, particularly in end-to-end neural ASR. In this paper, we present a
novel streaming ASR architecture that outputs a confusion network while
maintaining limited latency, as needed for interactive applications. We show
that 1-best results of our model are on par with a comparable RNN-T system,
while the richer hypothesis set allows second-pass rescoring to achieve 10-20%
lower word error rate on the LibriSpeech task. We also show that our model
outperforms a strong RNN-T baseline on a far-field voice assistant task. Comment: Submitted to Interspeech 202
Phoneme-aware Encoding for Prefix-tree-based Contextual ASR
In speech recognition applications, it is important to recognize
context-specific rare words, such as proper nouns. The Tree-constrained Pointer
Generator (TCPGen), which efficiently biases such words with a prefix tree, has
shown promise for this purpose. While the original TCPGen relies on
grapheme-based encoding, we propose extending it with phoneme-aware encoding to
better recognize words with unusual pronunciations. As TCPGen handles biasing
words as subword units, we propose obtaining subword-level phoneme-aware
encoding by using alignment between phonemes and subwords. Furthermore, we
propose injecting phoneme-level predictions from CTC into queries of TCPGen so
that the model better interprets the phoneme-aware encodings. We conducted ASR
experiments with TCPGen for an RNN transducer. We observed that the proposed
phoneme-aware encoding outperformed ordinary grapheme-based encoding on both
the English LibriSpeech and Japanese CSJ datasets, demonstrating the robustness
of our approach across linguistically diverse languages. Comment: Accepted to ICASSP202
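The prefix tree over subword units mentioned above can be illustrated with a small sketch. It covers only the tree lookup (which subword tokens may continue a partially decoded biasing word), not the pointer generator network or the phoneme-aware encoding. The class name, method names, and the subword segmentations are hypothetical.

```python
from typing import Dict, Iterable, Sequence, Set


class PrefixTree:
    """Trie over subword tokens built from a biasing list."""

    def __init__(self) -> None:
        self.children: Dict[str, "PrefixTree"] = {}
        self.is_word_end = False

    def insert(self, tokens: Sequence[str]) -> None:
        """Add one biasing word, given as a sequence of subword tokens."""
        node = self
        for tok in tokens:
            node = node.children.setdefault(tok, PrefixTree())
        node.is_word_end = True

    def valid_next_tokens(self, prefix: Sequence[str]) -> Set[str]:
        """Return the tokens that extend `prefix` toward some biasing word."""
        node = self
        for tok in prefix:
            node = node.children.get(tok)
            if node is None:
                return set()  # prefix is not on any path in the tree
        return set(node.children)


def build_tree(biasing_words: Iterable[Sequence[str]]) -> PrefixTree:
    tree = PrefixTree()
    for tokens in biasing_words:
        tree.insert(tokens)
    return tree


# Hypothetical subword segmentations of three biasing words.
tree = build_tree([["tur", "ner"], ["tur", "ing"], ["tcp", "gen"]])
print(sorted(tree.valid_next_tokens(["tur"])))
```

During decoding, a biasing component would consult a lookup of this kind at each step to restrict which subwords it can point to.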