4 research outputs found
Joint Contextual Modeling for ASR Correction and Language Understanding
The quality of automatic speech recognition (ASR) is critical to Dialogue
Systems as ASR errors propagate to and directly impact downstream tasks such as
language understanding (LU). In this paper, we propose multi-task neural
approaches to perform contextual language correction on ASR outputs jointly
with LU to improve the performance of both tasks simultaneously. To measure the
effectiveness of this approach, we used a public benchmark, the 2nd Dialogue
State Tracking Challenge (DSTC2) corpus. As a baseline approach, we trained task-specific
Statistical Language Models (SLMs) and fine-tuned a state-of-the-art Generative
Pre-training (GPT) language model to re-rank the n-best ASR hypotheses,
followed by a model to identify the dialogue act and slots. i) We further
trained ranker models using GPT and hierarchical CNN-RNN models with
discriminative losses to select the best output from the n-best hypotheses. We
extended these ranker models to first select the best ASR output and then
identify the dialogue act and slots in an end-to-end fashion. ii) We also proposed a novel
joint ASR error correction and LU model, a word confusion pointer network
(WCN-Ptr) with multi-head self-attention on top, which consumes the word
confusions populated from the n-best hypotheses. We show that the error rates
of off-the-shelf ASR and downstream LU systems can be reduced significantly, by
14% relative, with joint models trained using small amounts of in-domain data.
Comment: Accepted at IEEE ICASSP 202
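The re-ranking baseline described above can be sketched as follows. This is a minimal illustration, not the paper's model: `lm_score` is a hypothetical stand-in for the GPT/SLM scorer, and the linear interpolation weight `lam` is an illustrative assumption.

```python
# Sketch of n-best ASR hypothesis re-ranking. Each hypothesis carries an
# ASR confidence; a language-model scorer (the hypothetical callable
# `lm_score`) supplies a fluency score, and the two are interpolated.
def rerank_nbest(hypotheses, lm_score, lam=0.5):
    """hypotheses: list of (text, asr_confidence) pairs.
    Returns the text with the highest combined score."""
    def combined(hyp):
        text, asr_conf = hyp
        return lam * asr_conf + (1.0 - lam) * lm_score(text)
    return max(hypotheses, key=combined)[0]
```

With `lam=1.0` this reduces to trusting the ASR 1-best; with `lam=0.0` it ranks purely by the language model.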
Warped Language Models for Noise Robust Language Understanding
Masked Language Models (MLM) are self-supervised neural networks trained to
fill in the blanks in a given sentence with masked tokens. Despite the
tremendous success of MLMs on various text-based tasks, they are not robust
for spoken language understanding, especially to the recognition noise of
spontaneous conversational speech. In this work we introduce Warped Language Models
(WLM) in which input sentences at training time go through the same
modifications as in MLM, plus two additional modifications, namely inserting
and dropping random tokens. These two modifications extend and contract the
sentence in addition to the modifications in MLMs, hence the word "warped" in
the name. The insertion and drop modifications of the input text during WLM
training resemble the types of noise caused by Automatic Speech Recognition
(ASR) errors, and as a result WLMs are likely to be more robust to ASR noise. Through
computational results we show that natural language understanding systems built
on top of WLMs perform better compared to those built based on MLMs, especially
in the presence of ASR errors.
Comment: To appear at IEEE SLT 202
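The three input corruptions described above can be sketched as a token-level transform. This is a minimal sketch under stated assumptions: the probabilities, the "[MASK]" symbol, and the insertion vocabulary are illustrative choices, not values from the paper.

```python
import random

# Minimal sketch of the WLM input corruptions: masking (as in a standard
# MLM), plus random insertion and random dropping of tokens, which extend
# and contract the sentence respectively.
def warp_tokens(tokens, vocab, p_mask=0.15, p_insert=0.05, p_drop=0.05,
                rng=None):
    rng = rng or random.Random(0)
    out = []
    for tok in tokens:
        if rng.random() < p_drop:
            continue                           # drop: contracts the sentence
        if rng.random() < p_insert:
            out.append(rng.choice(vocab))      # insert: extends the sentence
        out.append("[MASK]" if rng.random() < p_mask else tok)  # mask as in MLM
    return out
```

Independent draws per token keep the three corruptions composable: a token can be preceded by an inserted token and still be masked or kept.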
Phoneme-BERT: Joint Language Modelling of Phoneme Sequence and ASR Transcript
Recent years have witnessed significant improvement in ASR systems to
recognize spoken utterances. However, it is still a challenging task for noisy
and out-of-domain data, where substitution and deletion errors are prevalent in
the transcribed text. These errors significantly degrade the performance of
downstream tasks. In this work, we propose a BERT-style language model,
referred to as PhonemeBERT, trained jointly on phoneme sequences and ASR
transcripts to learn phoneme-aware representations that are robust to ASR
errors. We show that PhonemeBERT can be used on downstream tasks with phoneme
sequences as additional features, and also in a low-resource setup where we
only have ASR transcripts for the downstream tasks, with no phoneme
information available. We evaluate our approach extensively by generating noisy
data for three benchmark datasets (Stanford Sentiment Treebank, TREC, and
ATIS) for sentiment, question, and intent classification, respectively. The
proposed approach comprehensively beats the state-of-the-art baselines on each
dataset.
Comment: Accepted to the Interspeech 2021 conference
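Pairing a transcript with its phoneme sequence in one input can be sketched in the style of BERT's sentence-pair encoding. This is an assumption for illustration: the [CLS]/[SEP] convention follows standard BERT, and treating the phoneme sequence as the second segment is not necessarily the paper's exact input layout.

```python
# Sketch of a joint input for a BERT-style model: the ASR transcript as
# segment 0 and its phoneme sequence as segment 1, separated by [SEP].
def build_joint_input(transcript_tokens, phoneme_tokens):
    tokens = (["[CLS]"] + transcript_tokens + ["[SEP]"]
              + phoneme_tokens + ["[SEP]"])
    # Segment ids let the model distinguish word tokens (0) from
    # phoneme tokens (1), as with BERT sentence pairs.
    segment_ids = ([0] * (len(transcript_tokens) + 2)
                   + [1] * (len(phoneme_tokens) + 1))
    return tokens, segment_ids
```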
An Approach to Improve Robustness of NLP Systems against ASR Errors
Speech-enabled systems typically first convert audio to text through an
automatic speech recognition (ASR) model and then feed the text to downstream
natural language processing (NLP) modules. Errors from the ASR system can
seriously degrade the performance of the NLP modules, so it is essential to
make them robust to ASR errors. Previous work has shown that it is
effective to employ data augmentation methods to solve this problem by
injecting ASR noise during the training process. In this paper, we utilize the
prevalent pre-trained language model to generate training samples with
ASR-plausible noise. Compared to previous methods, our approach generates ASR
noise that better fits the real-world error distribution. Experimental results
on spoken language translation (SLT) and spoken language understanding (SLU)
show that our approach effectively improves system robustness against ASR
errors and achieves state-of-the-art results on both tasks.
Comment: 9 pages, 3 figures
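The general data-augmentation idea, injecting ASR-plausible noise into clean training text, can be sketched as below. Note the simplification: the hand-written confusion table here is a toy assumption, whereas the paper samples substitutions from a pre-trained language model to match the real-world error distribution.

```python
import random

# Toy confusion table of acoustically confusable words; an assumption
# for illustration only, not the paper's LM-based substitution model.
CONFUSIONS = {"their": ["there"], "to": ["two", "too"], "for": ["four"]}

# Corrupt a clean token sequence with substitution noise resembling
# ASR errors, for use in training-time data augmentation.
def inject_asr_noise(tokens, p_sub=0.3, rng=None):
    rng = rng or random.Random(0)
    out = []
    for tok in tokens:
        alts = CONFUSIONS.get(tok)
        if alts and rng.random() < p_sub:
            out.append(rng.choice(alts))   # substitute a confusable word
        else:
            out.append(tok)                # keep the original token
    return out
```

A downstream model trained on a mix of clean and noised text sees ASR-like corruption at training time, which is the mechanism the data-augmentation line of work relies on.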