Traditionally, research in automatic speech recognition (ASR) has focused on
local-first encoding of audio representations to predict the spoken phonemes in
an utterance. Unfortunately, approaches that rely on such hyper-local information
tend to be vulnerable both to local corruption (such as dropped audio frames or
loud transient noises) and to global noise (such as environmental or background
noise) that has not been seen during training. In this work, we
introduce a novel approach which leverages a self-supervised learning technique
based on masked language modeling to compute a global, multi-modal encoding of
the environment in which the utterance occurs. We then use a new deep-fusion
framework to integrate this global context into a traditional ASR method, and
demonstrate that the resulting method can outperform baseline methods by up to
7% on LibriSpeech; gains on internal datasets range from 6% (on larger models)
to 45% (on smaller models).
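
To make the deep-fusion step concrete, below is a minimal sketch (in PyTorch) of one way a gated fusion layer could inject a single global context vector into frame-level acoustic features. The class and parameter names (GatedDeepFusion, feat_dim, ctx_dim) are illustrative assumptions, not the implementation described in the paper.

# Illustrative sketch only: a gated deep-fusion layer that mixes a global
# context embedding into frame-level acoustic features. Names are hypothetical.
import torch
import torch.nn as nn


class GatedDeepFusion(nn.Module):
    def __init__(self, feat_dim: int, ctx_dim: int):
        super().__init__()
        # Project the global context into the acoustic feature space.
        self.ctx_proj = nn.Linear(ctx_dim, feat_dim)
        # Gate decides, per frame and per dimension, how much context to mix in.
        self.gate = nn.Linear(feat_dim * 2, feat_dim)

    def forward(self, feats: torch.Tensor, ctx: torch.Tensor) -> torch.Tensor:
        # feats: (batch, time, feat_dim) frame-level encoder outputs
        # ctx:   (batch, ctx_dim) global multi-modal context embedding
        ctx = self.ctx_proj(ctx).unsqueeze(1).expand_as(feats)
        g = torch.sigmoid(self.gate(torch.cat([feats, ctx], dim=-1)))
        return feats + g * ctx


# Usage: fuse a 256-dim global context into 512-dim acoustic features.
fusion = GatedDeepFusion(feat_dim=512, ctx_dim=256)
fused = fusion(torch.randn(2, 100, 512), torch.randn(2, 256))

The additive, gated form lets the model fall back to the unmodified acoustic features when the global context is uninformative, which is one common design choice for this kind of fusion.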