Multimodal Grounding for Sequence-to-Sequence Speech Recognition
Humans are capable of processing speech by making use of multiple sensory
modalities. For example, the environment where a conversation takes place
generally provides semantic and/or acoustic context that helps us to resolve
ambiguities or to recall named entities. Motivated by this, there have been
many works studying the integration of visual information into the speech
recognition pipeline. Specifically, in our previous work, we proposed a
multistep visual adaptive training approach that improves the accuracy of an
audio-based Automatic Speech Recognition (ASR) system. This approach, however,
is not end-to-end as it requires fine-tuning the whole model with an adaptation
layer. In this paper, we propose novel end-to-end multimodal ASR systems and
compare them to the adaptive approach by using a range of visual
representations obtained from state-of-the-art convolutional neural networks.
We show that adaptive training is effective for S2S models, leading to an
absolute improvement of 1.4% in word error rate (WER). As for the end-to-end
systems, although they perform better than the baseline, their improvements are
slightly smaller than those of adaptive training: a 0.8% absolute WER reduction
in single-best models. Using ensemble decoding, end-to-end models reach a WER
of 15%, the lowest score among all systems.
Comment: ICASSP 2019
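As a concrete illustration of how a global visual feature can be injected into an S2S recognizer, here is a minimal PyTorch sketch: a fixed CNN image descriptor (assumed 2048-dimensional, e.g. from a ResNet) is projected and tiled across the acoustic frames before the encoder. The fusion scheme, module names, and dimensions are assumptions for illustration; the abstract does not specify the paper's exact architecture.

```python
# Hypothetical sketch of fusing a global visual feature into an S2S ASR
# encoder. Tiling a projected CNN feature onto every acoustic frame is one
# common grounding scheme, not necessarily the paper's exact model.
import torch
import torch.nn as nn

class VisuallyGroundedEncoder(nn.Module):
    def __init__(self, n_mels=80, visual_dim=2048, hidden=256):
        super().__init__()
        self.audio_proj = nn.Linear(n_mels, hidden)
        self.visual_proj = nn.Linear(visual_dim, hidden)  # map CNN feature to hidden size
        self.rnn = nn.LSTM(2 * hidden, hidden, num_layers=2,
                           batch_first=True, bidirectional=True)

    def forward(self, mels, visual_feat):
        # mels: (B, T, n_mels) log-mel frames; visual_feat: (B, visual_dim)
        a = self.audio_proj(mels)                               # (B, T, H)
        v = self.visual_proj(visual_feat)                       # (B, H)
        v = v.unsqueeze(1).expand(-1, a.size(1), -1)            # tile over time
        fused = torch.cat([a, v], dim=-1)                       # early fusion by concatenation
        out, _ = self.rnn(fused)                                # (B, T, 2H) encoder states
        return out

enc = VisuallyGroundedEncoder()
states = enc(torch.randn(4, 120, 80), torch.randn(4, 2048))
print(states.shape)  # torch.Size([4, 120, 512])
```

The adaptive-training alternative discussed in the abstract would instead fine-tune a pretrained audio-only model through an adaptation layer rather than fusing modalities end to end.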
Visually grounded learning of keyword prediction from untranscribed speech
During language acquisition, infants have the benefit of visual cues to
ground spoken language. Robots similarly have access to audio and visual
sensors. Recent work has shown that images and spoken captions can be mapped
into a meaningful common space, allowing images to be retrieved using speech
and vice versa. In this setting of images paired with untranscribed spoken
captions, we consider whether computer vision systems can be used to obtain
textual labels for the speech. Concretely, we use an image-to-words multi-label
visual classifier to tag images with soft textual labels, and then train a
neural network to map from the speech to these soft targets. We show that the
resulting speech system is able to predict which words occur in an
utterance---acting as a spoken bag-of-words classifier---without seeing any
parallel speech and text. We find that the model often confuses semantically
related words, e.g. "man" and "person", making it even more effective as a
semantic keyword spotter.
Comment: 5 pages, 3 figures, 5 tables; small updates, added link to code; accepted to Interspeech 2017
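The training recipe described above (soft visual tags as targets for a speech network) can be sketched in a few lines of PyTorch. The vocabulary size, network shape, and names below are assumptions; the essential point is that the loss compares the speech model's per-word logits against the image tagger's soft label vector, with no transcriptions involved.

```python
# Minimal sketch of training a speech network against soft word targets
# produced by an image tagger; all names and dimensions are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB = 1000  # size of the visual tagger's word vocabulary (assumed)

class SpeechBoW(nn.Module):
    """Maps a spoken utterance to per-word probabilities: a spoken bag-of-words."""
    def __init__(self, n_mels=40, hidden=256):
        super().__init__()
        self.rnn = nn.GRU(n_mels, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, VOCAB)

    def forward(self, mels):                  # mels: (B, T, n_mels)
        h, _ = self.rnn(mels)
        pooled = h.mean(dim=1)                # pool over time to an utterance vector
        return self.out(pooled)               # (B, VOCAB) logits

model = SpeechBoW()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

mels = torch.randn(8, 200, 40)                # batch of spoken captions
soft_targets = torch.rand(8, VOCAB)           # tagger's word probabilities for the paired images

# Per-word binary cross-entropy against the soft visual targets:
# no parallel speech and text is used anywhere in this loop.
opt.zero_grad()
loss = F.binary_cross_entropy_with_logits(model(mels), soft_targets)
loss.backward()
opt.step()
print(float(loss))
```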
Transformer-Based Video Front-Ends for Audio-Visual Speech Recognition for Single and Multi-Person Video
Audio-visual automatic speech recognition (AV-ASR) extends speech recognition
by introducing the video modality as an additional source of information. In
this work, the information contained in the motion of the speaker's mouth is
used to augment the audio features. The video modality is traditionally
processed with a 3D convolutional neural network (e.g. a 3D version of VGG).
Recently, image transformer networks (arXiv:2010.11929) demonstrated the ability
to extract rich visual features for image classification tasks. Here, we
propose to replace the 3D convolution with a video transformer to extract
visual features. We train our baselines and the proposed model on a large scale
corpus of YouTube videos. The performance of our approach is evaluated on a
labeled subset of YouTube videos as well as on the LRS3-TED public corpus. Our
best video-only model obtains 31.4% WER on YTDEV18 and 17.0% WER on LRS3-TED,
10% and 15% relative improvements over our convolutional baseline. After
fine-tuning our model, we achieve state-of-the-art audio-visual recognition
performance on LRS3-TED (1.6% WER). In addition, in a series of experiments on
multi-person AV-ASR, we obtain an average relative WER reduction of 2% over our
convolutional video front-end.
Comment: 5 pages, 3 figures, published at Interspeech 2022
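A minimal sketch of the kind of substitution described, in PyTorch: a Conv3d "tubelet" embedding turns the video into tokens, and a standard Transformer encoder replaces the 3D-conv feature extractor. Hyperparameters and shapes are illustrative assumptions, not the paper's configuration.

```python
# Rough sketch of a video transformer front-end replacing a 3D-conv stem:
# tubelet embedding via Conv3d, then a standard Transformer encoder.
# (Positional embeddings are omitted for brevity; a real model would add them.)
import torch
import torch.nn as nn

class VideoTransformerFrontEnd(nn.Module):
    def __init__(self, d_model=256, n_heads=4, n_layers=4, tubelet=(2, 16, 16)):
        super().__init__()
        # Each (2 frames x 16 x 16 pixels) tubelet becomes one token.
        self.embed = nn.Conv3d(3, d_model, kernel_size=tubelet, stride=tubelet)
        layer = nn.TransformerEncoderLayer(d_model, nhead=n_heads,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, video):                  # video: (B, 3, T, H, W)
        tok = self.embed(video)                # (B, D, T', H', W')
        tok = tok.flatten(2).transpose(1, 2)   # (B, T'*H'*W', D) token sequence
        return self.encoder(tok)               # contextualized visual features

frontend = VideoTransformerFrontEnd()
feats = frontend(torch.randn(2, 3, 8, 64, 64))  # 8-frame mouth-region clips
print(feats.shape)  # torch.Size([2, 64, 256]) -> 4*4*4 tokens
```

The resulting per-token visual features would then be combined with audio features downstream, as in the 3D-conv pipeline they replace.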
Multimodal Language Analysis with Recurrent Multistage Fusion
Computational modeling of human multimodal language is an emerging research
area in natural language processing spanning the language, visual and acoustic
modalities. Comprehending multimodal language requires modeling not only the
interactions within each modality (intra-modal interactions) but more
importantly the interactions between modalities (cross-modal interactions). In
this paper, we propose the Recurrent Multistage Fusion Network (RMFN) which
decomposes the fusion problem into multiple stages, each focusing on a
subset of multimodal signals for specialized, effective fusion. Cross-modal
interactions are modeled using this multistage fusion approach which builds
upon intermediate representations of previous stages. Temporal and intra-modal
interactions are modeled by integrating our proposed fusion approach with a
system of recurrent neural networks. The RMFN displays state-of-the-art
performance in modeling human multimodal language across three public datasets
relating to multimodal sentiment analysis, emotion recognition, and speaker
traits recognition. We provide visualizations to show that each stage of fusion
focuses on a different subset of multimodal signals, learning increasingly
discriminative multimodal representations.
Comment: EMNLP 2018
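A condensed sketch of the multistage idea for a single timestep, in PyTorch: at each stage, an attention mask highlights a subset of the concatenated modality signals conditioned on the previous stage's fused representation, and a recurrent cell updates that representation. The feature dimensions (300-d language, 35-d visual, 74-d acoustic) and the number of stages are assumptions; RMFN additionally integrates this process with a system of recurrent networks over time, which is omitted here.

```python
# Simplified multistage fusion step in the spirit of RMFN: each stage
# "highlights" a subset of the concatenated modality signals and builds on
# the previous stage's fused representation. Sizes are illustrative.
import torch
import torch.nn as nn

class MultistageFusion(nn.Module):
    def __init__(self, lang_dim=300, vis_dim=35, ac_dim=74,
                 fused_dim=128, n_stages=3):
        super().__init__()
        concat = lang_dim + vis_dim + ac_dim
        self.n_stages = n_stages
        self.highlight = nn.ModuleList(
            [nn.Linear(concat + fused_dim, concat) for _ in range(n_stages)])
        self.fuse = nn.ModuleList(
            [nn.GRUCell(concat, fused_dim) for _ in range(n_stages)])

    def forward(self, lang, vis, ac):
        x = torch.cat([lang, vis, ac], dim=-1)       # (B, concat), one timestep
        z = x.new_zeros(x.size(0), self.fuse[0].hidden_size)
        for k in range(self.n_stages):
            # Attention weights select which signals this stage focuses on,
            # conditioned on the previous stage's intermediate representation.
            attn = torch.sigmoid(self.highlight[k](torch.cat([x, z], dim=-1)))
            z = self.fuse[k](attn * x, z)            # fuse, building on z
        return z                                      # fused multimodal vector

fusion = MultistageFusion()
z = fusion(torch.randn(4, 300), torch.randn(4, 35), torch.randn(4, 74))
print(z.shape)  # torch.Size([4, 128])
```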