Modality Dropout for Improved Performance-driven Talking Faces
We describe our novel deep learning approach for driving animated faces using
both acoustic and visual information. In particular, speech-related facial
movements are generated using audiovisual information, and non-speech facial
movements are generated using only visual information. To ensure that our model
exploits both modalities during training, batches are generated that contain
audio-only, video-only, and audiovisual input features. The probability of
dropping a modality allows control over the degree to which the model exploits
audio and visual information during training. Our trained model runs in
real-time on resource-limited hardware (e.g. a smartphone), it is user-agnostic,
and it is not dependent on a potentially error-prone transcription of
the speech. We use subjective testing to demonstrate: 1) the improvement of
audiovisual-driven animation over the equivalent video-only approach, and 2)
the improvement in the animation of speech-related facial movements after
introducing modality dropout. Before introducing dropout, viewers prefer
audiovisual-driven animation in 51% of the test sequences compared with only
18% for video-driven. After introducing dropout, viewer preference for
audiovisual-driven animation increases to 74%, while preference for the
video-only approach drops to 8%.
Comment: Pre-print
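The abstract does not spell out the batching scheme, so here is a minimal sketch of how modality dropout can be applied when assembling a training batch; the PyTorch framework, the function name, and the drop probabilities are assumptions for illustration, not details from the paper.

```python
import torch

def modality_dropout(audio_feats, video_feats, p_drop_audio=0.3, p_drop_video=0.3):
    """Randomly zero out one modality per example so the model must learn to
    work from audio-only, video-only, or full audiovisual input.

    audio_feats: (batch, T, audio_dim) tensor
    video_feats: (batch, T, video_dim) tensor
    The drop probabilities are illustrative, not values from the paper.
    """
    batch = audio_feats.shape[0]
    drop_a = torch.rand(batch, device=audio_feats.device) < p_drop_audio
    drop_v = torch.rand(batch, device=video_feats.device) < p_drop_video
    # Never drop both modalities for the same example.
    both = drop_a & drop_v
    drop_v[both] = False

    audio_out = audio_feats.clone()
    video_out = video_feats.clone()
    audio_out[drop_a] = 0.0
    video_out[drop_v] = 0.0
    return audio_out, video_out
```

The two probabilities are the knob the abstract refers to: raising them forces the model to rely more often on a single modality during training.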
3D Convolutional Neural Networks for Cross Audio-Visual Matching Recognition
Audio-visual recognition (AVR) has been considered a solution for speech
recognition tasks when the audio is corrupted, as well as a visual recognition
method for speaker verification in multi-speaker scenarios. The approach
of AVR systems is to leverage the extracted information from one modality to
improve the recognition ability of the other modality by complementing the
missing information. The essential problem is to find the correspondence
between the audio and visual streams, which is the goal of this work. We
propose the use of a coupled 3D Convolutional Neural Network (3D-CNN)
architecture that can map both modalities into a representation space to
evaluate the correspondence of audio-visual streams using the learned
multimodal features. The proposed architecture will incorporate both spatial
and temporal information jointly to effectively find the correlation between
temporal information for different modalities. By using a relatively small
network architecture and a much smaller training dataset, our proposed method
surpasses the performance of the existing similar methods for audio-visual
matching which use 3D CNNs for feature representation. We also demonstrate that
an effective pair selection method can significantly increase the performance.
The proposed method achieves relative improvements of over 20% in Equal Error
Rate (EER) and over 7% in Average Precision (AP) compared to the
state-of-the-art method.
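As a rough illustration of the coupled-network idea, the sketch below maps a video clip and an audio feature "cube" through two small 3D-CNN branches into a shared embedding space and scores correspondence by cosine similarity; the layer sizes, the use of PyTorch, and the similarity-based scoring are assumptions, not the paper's exact architecture or training loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Branch3DCNN(nn.Module):
    """A small 3D-CNN that maps a (batch, channels, frames, H, W) clip
    to a fixed-size, unit-length embedding. Layer sizes are illustrative."""
    def __init__(self, in_channels, embed_dim=256):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(in_channels, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),   # global spatiotemporal pooling
        )
        self.proj = nn.Linear(64, embed_dim)

    def forward(self, x):
        h = self.features(x).flatten(1)
        return F.normalize(self.proj(h), dim=-1)

class CoupledAVNet(nn.Module):
    """Coupled branches for video frames and an audio 'cube' (e.g. stacked
    spectrogram slices); the similarity of the two embeddings scores
    audio-visual correspondence."""
    def __init__(self):
        super().__init__()
        self.video_branch = Branch3DCNN(in_channels=3)
        self.audio_branch = Branch3DCNN(in_channels=1)

    def forward(self, video_clip, audio_cube):
        v = self.video_branch(video_clip)
        a = self.audio_branch(audio_cube)
        return (v * a).sum(dim=-1)  # cosine similarity per pair
```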
Multiresolution and Multimodal Speech Recognition with Transformers
This paper presents an audio visual automatic speech recognition (AV-ASR)
system using a Transformer-based architecture. We particularly focus on the
scene context provided by the visual information, to ground the ASR. We extract
representations for audio features in the encoder layers of the transformer and
fuse video features using an additional crossmodal multihead attention layer.
Additionally, we incorporate a multitask training criterion for multiresolution
ASR, where we train the model to generate both character and subword level
transcriptions.
Experimental results on the How2 dataset indicate that multiresolution
training can speed up convergence by around 50% and gives a relative
improvement in word error rate (WER) of up to 18% over subword prediction
models. Further, incorporating visual information improves performance with
relative gains of up to 3.76% over audio-only models.
Our results are comparable to state-of-the-art Listen, Attend and Spell-based
architectures.
Comment: Accepted for ACL 2020
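A hedged sketch of the two ingredients named in the abstract: a crossmodal multi-head attention layer that lets audio encoder states attend over video features, and a multitask loss combining character- and subword-level predictions. The dimensions, the padding-index assumption, and the loss weighting `alpha` are illustrative, not values from the paper.

```python
import torch
import torch.nn as nn

class CrossmodalFusion(nn.Module):
    """Fuses video features into audio encoder states with one additional
    multi-head attention layer: audio states act as queries, video features
    as keys and values. Dimensions are illustrative."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, audio_states, video_feats):
        # audio_states: (batch, T_audio, d_model), video_feats: (batch, T_video, d_model)
        fused, _ = self.attn(query=audio_states, key=video_feats, value=video_feats)
        return self.norm(audio_states + fused)  # residual connection

def multiresolution_loss(char_logits, char_targets, sub_logits, sub_targets, alpha=0.5):
    """Multitask criterion over character- and subword-level decoders.
    Assumes index 0 is padding; the weighting alpha is illustrative."""
    ce = nn.CrossEntropyLoss(ignore_index=0)
    return alpha * ce(char_logits.transpose(1, 2), char_targets) + \
           (1 - alpha) * ce(sub_logits.transpose(1, 2), sub_targets)
```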
Uncertainty aware audiovisual activity recognition using deep Bayesian variational inference
Deep neural networks (DNNs) provide state-of-the-art results for a multitude
of applications, but the approaches using DNNs for multimodal audiovisual
applications do not consider predictive uncertainty associated with individual
modalities. Bayesian deep learning methods provide principled confidence and
quantify predictive uncertainty. Our contribution in this work is to propose an
uncertainty aware multimodal Bayesian fusion framework for activity
recognition. We demonstrate a novel approach that combines deterministic and
variational layers to scale Bayesian DNNs to deeper architectures. Our
experiments using in- and out-of-distribution samples selected from a subset of
the Moments-in-Time (MiT) dataset show a more reliable confidence measure
compared to the non-Bayesian baseline and to Monte Carlo dropout (MC dropout)
approximate Bayesian inference. We also demonstrate that the uncertainty
estimates obtained from the proposed framework can identify
out-of-distribution data on the UCF101 and MiT datasets. In the multimodal
setting, the proposed framework improved precision-recall AUC by 10.2% on the
MiT subset compared to the non-Bayesian baseline.
Comment: Accepted at ICCV 2019 for Oral presentation
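The abstract mentions combining deterministic and variational layers; the sketch below shows one common way to do that: a deterministic backbone followed by a mean-field Gaussian (variational) output layer, with predictive uncertainty estimated from repeated stochastic forward passes. The KL regularisation term and the actual multimodal fusion architecture are omitted, and all names and sizes are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VariationalLinear(nn.Module):
    """Mean-field Gaussian linear layer: weights are sampled with the
    reparameterisation trick on every forward pass."""
    def __init__(self, in_features, out_features):
        super().__init__()
        self.w_mu = nn.Parameter(torch.zeros(out_features, in_features))
        self.w_rho = nn.Parameter(torch.full((out_features, in_features), -5.0))
        self.b = nn.Parameter(torch.zeros(out_features))
        nn.init.xavier_normal_(self.w_mu)

    def forward(self, x):
        w_sigma = F.softplus(self.w_rho)
        w = self.w_mu + w_sigma * torch.randn_like(w_sigma)
        return F.linear(x, w, self.b)

def predict_with_uncertainty(backbone, var_head, x, n_samples=20):
    """Deterministic backbone plus variational head; predictive mean and
    variance are estimated from repeated stochastic forward passes."""
    feats = backbone(x)                        # deterministic features
    probs = torch.stack([F.softmax(var_head(feats), dim=-1)
                         for _ in range(n_samples)])
    return probs.mean(0), probs.var(0)         # mean prediction, uncertainty
```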
Pushing the boundaries of audiovisual word recognition using Residual Networks and LSTMs
Visual and audiovisual speech recognition are witnessing a renaissance which
is largely due to the advent of deep learning methods. In this paper, we
present a deep learning architecture for lipreading and audiovisual word
recognition, which combines Residual Networks equipped with spatiotemporal
input layers and Bidirectional LSTMs. The lipreading architecture attains
11.92% misclassification rate on the challenging Lipreading-In-The-Wild
database, which is composed of excerpts from BBC-TV, each containing one of the
500 target words. Audiovisual experiments are performed using both intermediate
and late integration, as well as several types and levels of environmental
noise, and notable improvements over the audio-only network are reported, even
in the case of clean speech. A further analysis on the utility of target word
boundaries is provided, as well as on the capacity of the network in modeling
the linguistic context of the target word. Finally, we examine difficult word
pairs and discuss how visual information helps towards attaining higher
recognition accuracy.
Comment: Accepted to Computer Vision and Image Understanding (Elsevier)
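A compact sketch of the overall pipeline described in the abstract: a spatiotemporal convolutional front-end, a per-frame trunk standing in for the Residual Network, and a bidirectional LSTM classifying over the 500 target words. The layer sizes and the simplified trunk are placeholders, not the published architecture.

```python
import torch
import torch.nn as nn

class LipreadingNet(nn.Module):
    """Spatiotemporal front-end + per-frame trunk + bidirectional LSTM.
    Sizes are illustrative."""
    def __init__(self, num_words=500, hidden=256):
        super().__init__()
        # 3D convolution over (frames, H, W) captures short-range motion.
        self.frontend = nn.Sequential(
            nn.Conv3d(1, 64, kernel_size=(5, 7, 7), stride=(1, 2, 2), padding=(2, 3, 3)),
            nn.BatchNorm3d(64), nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1)),
        )
        # Placeholder for the residual trunk applied to every frame.
        self.trunk = nn.Sequential(
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.BatchNorm2d(128), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.blstm = nn.LSTM(128, hidden, num_layers=2,
                             bidirectional=True, batch_first=True)
        self.head = nn.Linear(2 * hidden, num_words)

    def forward(self, x):                      # x: (batch, 1, T, H, W)
        h = self.frontend(x)                   # (batch, 64, T, H', W')
        b, c, t = h.shape[:3]
        h = h.transpose(1, 2).reshape(b * t, c, *h.shape[3:])
        h = self.trunk(h).flatten(1).reshape(b, t, -1)   # (batch, T, 128)
        h, _ = self.blstm(h)
        return self.head(h.mean(dim=1))        # average over time, then classify
```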
Vision-Guided Robot Hearing
Natural human-robot interaction in complex and unpredictable environments is
one of the main research lines in robotics. In typical real-world scenarios,
humans are at some distance from the robot and the acquired signals are
strongly impaired by noise, reverberations and other interfering sources. In
this context, the detection and localisation of speakers plays a key role since
it is the pillar on which several tasks (e.g., speech recognition and speaker
tracking) rely. We address the problem of how to detect and localize people
that are both seen and heard by a humanoid robot. We introduce a hybrid
deterministic/probabilistic model. Indeed, the deterministic component allows
us to map the visual information into the auditory space. By means of the
probabilistic component, the visual features guide the grouping of the auditory
features in order to form AV objects. The proposed model and the associated
algorithm are implemented in real-time (17 FPS) using a stereoscopic camera
pair and two microphones embedded into the head of the humanoid robot NAO. We
performed experiments on (i) synthetic data, (ii) a publicly available data set
and (iii) data acquired using the robot. The results we obtained validate the
approach and encourage us to further investigate how vision can help robot
hearing.
Comment: 26 pages, many figures and tables, journal
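As a toy illustration of the deterministic/probabilistic split described above (and not the model from the paper), the snippet below deterministically maps a visually observed 3-D position to the interaural time difference two microphones would measure, then softly assigns observed time differences to visually detected speakers under a Gaussian observation model. The microphone geometry, the noise level `sigma`, and the use of ITDs as the auditory feature are all assumptions made for illustration.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s

def visual_to_itd(position, mic_left, mic_right):
    """Deterministic mapping from a 3-D position (seen by the cameras)
    to the interaural time difference the microphones would observe."""
    d_left = np.linalg.norm(position - mic_left)
    d_right = np.linalg.norm(position - mic_right)
    return (d_left - d_right) / SPEED_OF_SOUND

def assign_sounds_to_speakers(observed_itds, speaker_positions,
                              mic_left, mic_right, sigma=1e-4):
    """Probabilistic grouping: each observed ITD is softly assigned to the
    visually detected speaker whose predicted ITD explains it best,
    using a Gaussian observation model."""
    predicted = np.array([visual_to_itd(p, mic_left, mic_right)
                          for p in speaker_positions])          # (num_speakers,)
    diff = observed_itds[:, None] - predicted[None, :]
    log_lik = -0.5 * (diff / sigma) ** 2
    log_lik -= log_lik.max(axis=1, keepdims=True)
    resp = np.exp(log_lik)
    return resp / resp.sum(axis=1, keepdims=True)               # responsibilities

# Hypothetical setup: two microphones 20 cm apart, two visually detected speakers.
mic_l, mic_r = np.array([-0.1, 0.0, 0.0]), np.array([0.1, 0.0, 0.0])
speakers = [np.array([1.0, 0.5, 2.0]), np.array([-1.0, 0.2, 1.5])]
itds = np.array([visual_to_itd(speakers[0], mic_l, mic_r) + 1e-5])
print(assign_sounds_to_speakers(itds, speakers, mic_l, mic_r))
```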
Tensor Fusion Network for Multimodal Sentiment Analysis
Multimodal sentiment analysis is an increasingly popular research area, which
extends the conventional language-based definition of sentiment analysis to a
multimodal setup where other relevant modalities accompany language. In this
paper, we pose the problem of multimodal sentiment analysis as modeling
intra-modality and inter-modality dynamics. We introduce a novel model, termed
Tensor Fusion Network, which learns both such dynamics end-to-end. The proposed
approach is tailored for the volatile nature of spoken language in online
videos as well as accompanying gestures and voice. In the experiments, our
model outperforms state-of-the-art approaches for both multimodal and unimodal
sentiment analysis.
Comment: Accepted as full paper in EMNLP 2017
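Tensor fusion is usually described as appending a constant 1 to each unimodal embedding and taking the outer product of the results, so that unimodal, bimodal, and trimodal interaction terms all appear in a single tensor. The sketch below assumes that formulation and PyTorch; the unimodal sub-networks that produce the embeddings, and the downstream sentiment head, are omitted.

```python
import torch

def tensor_fusion(z_language, z_audio, z_visual):
    """Append a constant 1 to each unimodal embedding and take the 3-way
    outer product, capturing all unimodal, bimodal, and trimodal terms."""
    ones = torch.ones(z_language.shape[0], 1, device=z_language.device)
    zl = torch.cat([z_language, ones], dim=1)   # (batch, dl + 1)
    za = torch.cat([z_audio, ones], dim=1)      # (batch, da + 1)
    zv = torch.cat([z_visual, ones], dim=1)     # (batch, dv + 1)
    # Outer products via einsum, then flatten into one fusion vector.
    fused = torch.einsum('bi,bj,bk->bijk', zl, za, zv)
    return fused.flatten(start_dim=1)           # (batch, (dl+1)*(da+1)*(dv+1))

# Example with toy embedding sizes
fused = tensor_fusion(torch.randn(4, 128), torch.randn(4, 32), torch.randn(4, 16))
print(fused.shape)   # torch.Size([4, 72369]) = [4, 129*33*17]
```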
DAVE: A Deep Audio-Visual Embedding for Dynamic Saliency Prediction
This paper studies audio-visual deep saliency prediction. It introduces a
conceptually simple and effective Deep Audio-Visual Embedding for dynamic
saliency prediction dubbed "DAVE", in conjunction with our efforts towards
building an Audio-Visual Eye-tracking corpus named "AVE". Despite the
strong relation between auditory and visual cues in guiding gaze during
perception, existing video saliency models consider only visual cues and
neglect the auditory information that is ubiquitous in dynamic scenes. Here, we investigate
the applicability of audio cues in conjunction with visual ones in predicting
saliency maps using deep neural networks. To this end, the proposed model is
intentionally designed to be simple. Two baseline models are developed on the
same architecture which consists of an encoder-decoder. The encoder projects
the input into a feature space followed by a decoder that infers saliency. We
conduct an extensive analysis of different modalities and various aspects of
multimodal dynamic saliency prediction. Our results suggest that (1) audio is
a strong contributing cue for saliency prediction, (2) a salient visible
sound source is the natural cause of the superiority of our Audio-Visual model,
(3) richer feature representations for the input space lead to more powerful
predictions even in the absence of more sophisticated saliency decoders, and (4)
the Audio-Visual model improves over 53.54% of the frames predicted by the best
Visual model (our baseline). Our endeavour demonstrates that audio is an
important cue that boosts dynamic video saliency prediction and helps models to
approach human performance. The code is available at
https://github.com/hrtavakoli/DAV
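A minimal encoder-decoder sketch in the spirit of the description above: separate video and audio encoders, their features concatenated and decoded into a per-pixel saliency map. Every layer size, the log-mel audio input, and the bilinear upsampling are illustrative assumptions rather than DAVE's actual configuration.

```python
import torch
import torch.nn as nn

class AudioVisualSaliency(nn.Module):
    """Encoder-decoder sketch: separate audio and video encoders, features
    concatenated and decoded into a saliency map. Sizes are illustrative."""
    def __init__(self):
        super().__init__()
        self.video_enc = nn.Sequential(                  # (B, 3, T, H, W) -> (B, 64, T', H/4, W/4)
            nn.Conv3d(3, 32, 3, stride=(1, 2, 2), padding=1), nn.ReLU(),
            nn.Conv3d(32, 64, 3, stride=(2, 2, 2), padding=1), nn.ReLU(),
        )
        self.audio_enc = nn.Sequential(                  # log-mel spectrogram -> vector
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 64),
        )
        self.decoder = nn.Sequential(                    # fused features -> saliency map
            nn.Conv2d(128, 64, 3, padding=1), nn.ReLU(),
            nn.Upsample(scale_factor=4, mode='bilinear', align_corners=False),
            nn.Conv2d(64, 1, 1),
        )

    def forward(self, video, audio):
        v = self.video_enc(video).mean(dim=2)            # pool over time: (B, 64, h, w)
        a = self.audio_enc(audio)                        # (B, 64)
        a = a[:, :, None, None].expand(-1, -1, v.shape[2], v.shape[3])
        return torch.sigmoid(self.decoder(torch.cat([v, a], dim=1)))
```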
An Attempt towards Interpretable Audio-Visual Video Captioning
Automatically generating a natural language sentence to describe the content
of an input video is a very challenging problem. It is an essential multimodal
task in which auditory and visual contents are equally important. Although
audio information has been exploited to improve video captioning in previous
works, it is usually regarded as an additional feature fed into a black box
fusion machine. How are the words in the generated sentences associated with
the auditory and visual modalities? This question has not yet been investigated. In
this paper, we make the first attempt to design an interpretable audio-visual
video captioning network to discover the association between words in sentences
and audio-visual sequences. To achieve this, we propose a multimodal
convolutional neural network-based audio-visual video captioning framework and
introduce a modality-aware module for exploring modality selection during
sentence generation. Besides, we collect new audio captioning and visual
captioning datasets for further exploring the interactions between auditory and
visual modalities for high-level video understanding. Extensive experiments
demonstrate that the modality-aware module makes our model interpretable on
modality selection during sentence generation. Even with the added
interpretability, our video captioning network can still achieve comparable
performance with recent state-of-the-art methods.
Comment: 11 pages, 4 figures
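One plausible (and hedged) reading of a "modality-aware module" is a gate that, at each word-generation step, weighs the audio and visual context vectors against the decoder state and exposes those weights for inspection. The sketch below implements that reading; the scoring function, names, and dimensions are assumptions, not the module from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityAwareGate(nn.Module):
    """At each word-generation step, scores the audio and visual context
    vectors against the decoder state and mixes them with a softmax;
    returning the weights makes the modality choice inspectable."""
    def __init__(self, d_model=512):
        super().__init__()
        self.score = nn.Linear(2 * d_model, 1)

    def forward(self, decoder_state, audio_ctx, visual_ctx):
        # decoder_state, audio_ctx, visual_ctx: (batch, d_model)
        s_a = self.score(torch.cat([decoder_state, audio_ctx], dim=-1))
        s_v = self.score(torch.cat([decoder_state, visual_ctx], dim=-1))
        weights = F.softmax(torch.cat([s_a, s_v], dim=-1), dim=-1)  # (batch, 2)
        fused = weights[:, :1] * audio_ctx + weights[:, 1:] * visual_ctx
        return fused, weights   # weights show which modality drove this word
```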
Exploring the contextual factors affecting multimodal emotion recognition in videos
Emotional expressions form a key part of user behavior on today's digital
platforms. While multimodal emotion recognition techniques are gaining research
attention, there is a lack of deeper understanding of how visual and non-visual
features can be used to better recognize emotions in certain contexts, but not
others. This study analyzes the interplay between the effects of multimodal
emotion features derived from facial expressions, tone and text in conjunction
with two key contextual factors: i) gender of the speaker, and ii) duration of
the emotional episode. Using a large public dataset of 2,176 manually annotated
YouTube videos, we found that while multimodal features consistently
outperformed bimodal and unimodal features, their performance varied
significantly across different emotion, gender, and duration contexts.
Multimodal features performed notably better for male speakers in
recognizing most emotions. Furthermore, multimodal features performed
notably better for shorter videos than for longer ones in recognizing
neutrality and happiness, but not sadness and anger. These findings offer new insights
towards the development of more context-aware emotion recognition and
empathetic systems.
Comment: Accepted version at IEEE Transactions on Affective Computing
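The study's breakdown by contextual factors amounts to slicing predictions by gender or duration bucket and scoring each slice separately. A small, hypothetical helper for that kind of analysis; the dataset, labels, and metric choice are all assumptions, not the study's protocol.

```python
from collections import defaultdict
from sklearn.metrics import f1_score

def score_by_context(y_true, y_pred, contexts):
    """Slice predictions by a contextual factor (e.g. speaker gender or a
    short/long duration bucket) and report per-slice weighted F1."""
    buckets = defaultdict(lambda: ([], []))
    for t, p, c in zip(y_true, y_pred, contexts):
        buckets[c][0].append(t)
        buckets[c][1].append(p)
    return {c: f1_score(t, p, average='weighted')
            for c, (t, p) in buckets.items()}

# Hypothetical usage: labels and contexts are illustrative only.
print(score_by_context(['happy', 'sad', 'happy', 'angry'],
                       ['happy', 'happy', 'happy', 'angry'],
                       ['male', 'female', 'male', 'female']))
```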