Attention-based Audio-Visual Fusion for Robust Automatic Speech Recognition
Automatic speech recognition can potentially benefit from lip motion
patterns, which complement acoustic speech to improve overall recognition
performance, particularly in noise. In this paper we propose an audio-visual
fusion strategy that goes beyond simple feature concatenation and learns to
automatically align the two modalities, leading to enhanced representations
which increase the recognition accuracy in both clean and noisy conditions. We
test our strategy on the TCD-TIMIT and LRS2 datasets, designed for large
vocabulary continuous speech recognition, applying three types of noise at
different power ratios. We also exploit state-of-the-art Sequence-to-Sequence
architectures, showing that our method can be easily integrated with them. Results show
relative improvements from 7% up to 30% on TCD-TIMIT over the acoustic modality
alone, depending on the acoustic noise level. We anticipate that the fusion
strategy can easily generalise to many other multimodal tasks which involve
correlated modalities. Code available online on GitHub:
https://github.com/georgesterpu/Sigmedia-AVSR

Comment: In ICMI'18, October 16-20, 2018, Boulder, CO, USA. Equation (2) corrected in this version.
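To make the fusion idea concrete, below is a minimal NumPy sketch of attention-based audio-visual fusion: each audio frame soft-aligns to the video feature sequence via dot-product attention, and the attended video context is concatenated to the audio features instead of naive frame-by-frame concatenation. The function names, shapes, and scoring are illustrative assumptions, not the authors' implementation (see the GitHub link above for that).

    import numpy as np

    def softmax(x, axis=-1):
        # Numerically stable softmax
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def attention_fusion(audio, video):
        # audio: (Ta, d) acoustic features; video: (Tv, d) visual features
        d = audio.shape[-1]
        scores = audio @ video.T / np.sqrt(d)  # (Ta, Tv) alignment scores
        weights = softmax(scores, axis=-1)     # soft alignment over video frames
        context = weights @ video              # (Ta, d) attended video context
        # Concatenate each audio frame with its aligned video context: (Ta, 2d)
        return np.concatenate([audio, context], axis=-1)

    rng = np.random.default_rng(0)
    audio = rng.standard_normal((100, 64))  # e.g. 100 audio frames, 64-dim
    video = rng.standard_normal((25, 64))   # e.g. 25 video frames, 64-dim
    print(attention_fusion(audio, video).shape)  # (100, 128)

Because the attention weights are learned (here they are just dot products of random features), the model can cope with the differing frame rates and imperfect synchrony of the two modalities, which plain concatenation cannot.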