4 research outputs found
Gated Embeddings in End-to-End Speech Recognition for Conversational-Context Fusion
We present a novel conversational-context aware end-to-end speech recognizer
based on a gated neural network that incorporates
conversational-context/word/speech embeddings. Unlike conventional speech
recognition models, our model learns longer conversational-context information
that spans across sentences and is consequently better at recognizing long
conversations. Specifically, we propose to use text-based external word
and/or sentence embeddings (e.g., fastText, BERT) within an end-to-end
framework, yielding a significant improvement in word error rate with better
conversational-context representation. We evaluated the models on the
Switchboard conversational speech corpus and show that our model outperforms
standard end-to-end speech recognition models.
Comment: ACL 2019
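The abstract describes a gated neural network that blends the decoder's own representation with an external conversational-context embedding. Below is a minimal PyTorch sketch of that kind of sigmoid-gated fusion; the module name, dimensions, and projection are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch (not the paper's code): sigmoid-gated fusion of a decoder
# state with a conversational-context embedding (e.g., from fastText/BERT).
import torch
import torch.nn as nn

class GatedContextFusion(nn.Module):
    def __init__(self, hidden_dim: int, context_dim: int):
        super().__init__()
        # Project the external context embedding into the decoder's space.
        self.proj = nn.Linear(context_dim, hidden_dim)
        # Gate decides, per dimension, how much context to let through.
        self.gate = nn.Linear(2 * hidden_dim, hidden_dim)

    def forward(self, h: torch.Tensor, c: torch.Tensor) -> torch.Tensor:
        # h: (batch, hidden_dim) decoder state; c: (batch, context_dim) context.
        c = torch.tanh(self.proj(c))
        g = torch.sigmoid(self.gate(torch.cat([h, c], dim=-1)))
        return g * h + (1.0 - g) * c  # gated blend of state and context

# Usage: fuse a 512-dim decoder state with a 768-dim BERT sentence embedding.
fusion = GatedContextFusion(hidden_dim=512, context_dim=768)
out = fusion(torch.randn(4, 512), torch.randn(4, 768))  # (4, 512)
```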
Cross-Attention End-to-End ASR for Two-Party Conversations
We present an end-to-end speech recognition model that learns interaction
between two speakers based on turn-changing information. Unlike
conventional speech recognition models, our model exploits the two speakers'
histories of conversational-context information spanning multiple turns
within an end-to-end framework. Specifically, we propose a speaker-specific
cross-attention mechanism that can look at the output of the other speaker
as well as that of the current speaker, to better recognize long
conversations. We evaluated the models on the Switchboard conversational speech
corpus and show that our model outperforms standard end-to-end speech
recognition models.
Comment: Interspeech 2019
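The core mechanism here is cross-attention from the current speaker's states onto the other speaker's conversation history. A minimal PyTorch sketch follows; the residual connection, layer norm, and tensor shapes are my own assumptions for a self-contained example, not the paper's exact architecture.

```python
# Minimal sketch (assumed shapes/names, not the paper's code): the current
# speaker's decoder states attend over the other speaker's encoded history.
import torch
import torch.nn as nn

class SpeakerCrossAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, cur: torch.Tensor, other: torch.Tensor) -> torch.Tensor:
        # cur:   (batch, T_cur, dim)   current speaker's decoder states
        # other: (batch, T_other, dim) encoded history of the other speaker
        ctx, _ = self.attn(query=cur, key=other, value=other)
        return self.norm(cur + ctx)  # residual connection + normalization

# Usage with toy tensors: 10 current-speaker steps, 30 other-speaker steps.
layer = SpeakerCrossAttention(dim=256)
out = layer(torch.randn(2, 10, 256), torch.randn(2, 30, 256))  # (2, 10, 256)
```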
Distilling the Knowledge of BERT for Sequence-to-Sequence ASR
Attention-based sequence-to-sequence (seq2seq) models have achieved promising
results in automatic speech recognition (ASR). However, as these models decode
in a left-to-right way, they do not have access to context on the right. We
leverage both left and right context by applying BERT as an external language
model to seq2seq ASR through knowledge distillation. In our proposed method,
BERT generates soft labels to guide the training of seq2seq ASR. Furthermore,
we leverage context beyond the current utterance as input to BERT. Experimental
evaluations show that our method significantly improves the ASR performance
from the seq2seq baseline on the Corpus of Spontaneous Japanese (CSJ).
Knowledge distillation from BERT outperforms that from a transformer LM that
only looks at left context. We also show the effectiveness of leveraging
context beyond the current utterance. Our method outperforms other LM
application approaches such as n-best rescoring and shallow fusion, while
requiring no extra inference cost.
Comment: Accepted at INTERSPEECH 2020
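The abstract's recipe, BERT soft labels guiding seq2seq training, is an instance of standard knowledge distillation. Here is a minimal sketch of such a loss under assumed tensor shapes; the temperature/alpha weighting is the common formulation, not necessarily the paper's exact objective.

```python
# Minimal sketch (standard distillation recipe, not the paper's code):
# combine cross-entropy on reference transcripts with a KL term toward
# soft labels produced by BERT, which sees both left and right context.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, bert_logits, targets,
                      temperature: float = 2.0, alpha: float = 0.5):
    # student_logits, bert_logits: (batch, seq, vocab); targets: (batch, seq)
    ce = F.cross_entropy(student_logits.transpose(1, 2), targets)
    t = temperature
    # Soften both distributions; scale by t^2 to keep gradient magnitudes
    # comparable across temperatures (Hinton et al.'s convention).
    kd = F.kl_div(F.log_softmax(student_logits / t, dim=-1),
                  F.softmax(bert_logits / t, dim=-1),
                  reduction="batchmean") * (t * t)
    return alpha * ce + (1.0 - alpha) * kd

# Usage with toy tensors: vocabulary of 8, sequence length 5.
s = torch.randn(2, 5, 8)    # seq2seq ASR (student) logits
b = torch.randn(2, 5, 8)    # BERT (teacher) logits over the same vocab
y = torch.randint(0, 8, (2, 5))
loss = distillation_loss(s, b, y)
```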
Cross-Utterance Language Models with Acoustic Error Sampling
The effective exploitation of richer contextual information in language
models (LMs) is a long-standing research problem for automatic speech
recognition (ASR). A cross-utterance LM (CULM) is proposed in this paper, which
augments the input to a standard long short-term memory (LSTM) LM with a
context vector derived from past and future utterances using an extraction
network. The extraction network uses another LSTM to encode surrounding
utterances into vectors which are integrated into a context vector using either
a projection of LSTM final hidden states, or a multi-head self-attentive layer.
In addition, an acoustic error sampling technique is proposed to reduce the
mismatch between training and test time. This is achieved by incorporating
possible ASR errors into the model training procedure, which can therefore
improve the word error rate (WER). Experiments performed on both the AMI and
Switchboard datasets show that CULMs outperform the LSTM LM baseline in WER. In
particular, the CULM with a self-attentive layer-based extraction network and
acoustic error sampling achieves 0.6% absolute WER reduction on AMI, 0.3% WER
reduction on the Switchboard part and 0.9% WER reduction on the Callhome part
of the Eval2000 test set over the respective baselines.
Comment: 5 pages
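The extraction network described above (an LSTM over surrounding utterances, pooled by a self-attentive layer into one context vector) can be sketched as follows. Dimensions, the learned query, and single-layer choices are my assumptions for a runnable example, not the CULM implementation; acoustic error sampling is not covered here.

```python
# Minimal sketch (assumed dimensions, not the CULM code): encode each
# surrounding utterance with an LSTM, then pool the utterance vectors into
# a single context vector via a self-attentive layer with a learned query.
import torch
import torch.nn as nn

class ContextExtractor(nn.Module):
    def __init__(self, vocab_size: int, embed_dim: int = 128,
                 hidden_dim: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads=4,
                                          batch_first=True)
        self.query = nn.Parameter(torch.randn(1, 1, hidden_dim))

    def forward(self, utterances: torch.Tensor) -> torch.Tensor:
        # utterances: (batch, n_utts, max_len) token ids of past/future utts
        b, n, l = utterances.shape
        x = self.embed(utterances.view(b * n, l))
        _, (h, _) = self.lstm(x)          # final hidden state per utterance
        utt_vecs = h[-1].view(b, n, -1)   # (batch, n_utts, hidden_dim)
        # Learned query attends over utterance vectors -> one context vector.
        q = self.query.expand(b, -1, -1)
        ctx, _ = self.attn(q, utt_vecs, utt_vecs)
        return ctx.squeeze(1)             # (batch, hidden_dim), fed to the LM

# Usage: a context vector from 4 surrounding utterances of up to 20 tokens.
extractor = ContextExtractor(vocab_size=1000)
ctx = extractor(torch.randint(0, 1000, (2, 4, 20)))  # (2, 256)
```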