3D Convolutional Neural Networks for Cross Audio-Visual Matching Recognition
Audio-visual recognition (AVR) has been considered a solution for speech
recognition tasks when the audio is corrupted, as well as a visual recognition
method for speaker verification in multi-speaker scenarios. The approach of AVR
systems is to leverage the information extracted from one modality to improve
the recognition ability of the other modality by complementing the
missing information. The essential problem is to find the correspondence
between the audio and visual streams, which is the goal of this work. We
propose the use of a coupled 3D Convolutional Neural Network (3D-CNN)
architecture that can map both modalities into a representation space to
evaluate the correspondence of audio-visual streams using the learned
multimodal features. The proposed architecture incorporates spatial and
temporal information jointly to effectively find the correlation between the
temporal information of the two modalities. Using a relatively small network
architecture and a much smaller training dataset, our proposed method surpasses
the performance of similar existing methods for audio-visual matching that use
3D CNNs for feature representation. We also demonstrate that an effective pair
selection method can significantly increase performance. The proposed method
achieves relative improvements of over 20% on the Equal Error Rate (EER) and
over 7% on the Average Precision (AP) compared to the state-of-the-art method.
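As a concrete illustration of the kind of coupled architecture described above, the following PyTorch sketch maps a video clip and an audio representation into a shared embedding space and scores their correspondence; the layer sizes, input shapes, and pair loss are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch (not the authors' exact architecture): a coupled two-stream
# 3D-CNN that maps video and an audio representation into a shared embedding
# space, where correspondence is scored by a cosine similarity.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Stream3D(nn.Module):
    def __init__(self, in_channels, embed_dim=256):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(in_channels, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),            # global pooling over (T, H, W)
        )
        self.proj = nn.Linear(64, embed_dim)

    def forward(self, x):                        # x: (B, C, T, H, W)
        h = self.features(x).flatten(1)
        return F.normalize(self.proj(h), dim=1)  # unit-length embedding

class CoupledAVNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.visual = Stream3D(in_channels=1)    # e.g. grayscale mouth crops
        self.audio = Stream3D(in_channels=1)     # e.g. a stacked spectrogram "cube"

    def forward(self, video, audio):
        v, a = self.visual(video), self.audio(audio)
        return (v * a).sum(dim=1)                # cosine similarity per pair

# Contrastive-style loss on matching (y=1) / non-matching (y=0) pairs.
def pair_loss(sim, y, margin=0.5):
    return (y * (1 - sim) + (1 - y) * F.relu(sim - margin)).mean()
```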
A Bimodal Learning Approach to Assist Multi-sensory Effects Synchronization
In mulsemedia applications, traditional media content (text, image, audio,
video, etc.) can be related to media objects that target other human senses
(e.g., smell, haptics, taste). Such applications aim at bridging the virtual
and real worlds through sensors and actuators. Actuators are responsible for
the execution of sensory effects (e.g., wind, heat, light), which produce
sensory stimulations on the users. In these applications, sensory stimulation
must happen in a timely manner with respect to the other traditional media
content being presented. For example, at the moment when an explosion is
presented in the audiovisual content, it may be appropriate to activate
actuators that produce heat and light. It is common to use a declarative
multimedia authoring language to relate the timestamp at which each media
object is to be presented to the execution of some sensory effect. One problem
in this setting is that the synchronization of media objects and sensory
effects is done manually by the author(s) of the application, a process which
is time-consuming and error-prone. In this paper, we present a bimodal neural network
architecture to assist the synchronization task in mulsemedia applications. Our
approach is based on the idea that audio and video signals can be used
simultaneously to identify the timestamps in which some sensory effect should
be executed. Our learning architecture combines audio and video signals for the
prediction of scene components. For evaluation purposes, we construct a dataset
based on Google's AudioSet. We provide experiments to validate our bimodal
architecture. Our results show that the bimodal approach outperforms several
variants of unimodal architectures.
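A minimal sketch of the kind of bimodal late-fusion network described above, assuming per-segment audio and video feature vectors; the dimensions and layers are illustrative, not the paper's exact architecture.

```python
# Illustrative bimodal fusion network: per-segment audio and video features are
# encoded, concatenated, and classified into scene components (e.g. explosion,
# wind) that should trigger a sensory effect.
import torch
import torch.nn as nn

class BimodalDetector(nn.Module):
    def __init__(self, audio_dim=128, video_dim=1024, n_classes=10):
        super().__init__()
        self.audio_branch = nn.Sequential(nn.Linear(audio_dim, 256), nn.ReLU())
        self.video_branch = nn.Sequential(nn.Linear(video_dim, 256), nn.ReLU())
        self.classifier = nn.Sequential(
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, n_classes),           # one score per scene component
        )

    def forward(self, audio_feat, video_feat):
        fused = torch.cat([self.audio_branch(audio_feat),
                           self.video_branch(video_feat)], dim=-1)
        return self.classifier(fused)            # logits; threshold per segment

# A segment whose predicted score crosses a threshold yields a candidate
# timestamp at which the corresponding sensory effect should be executed.
```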
Continuous Multimodal Emotion Recognition Approach for AVEC 2017
This paper reports the analysis of audio and visual features in predicting
the continuous emotion dimensions under the seventh Audio/Visual Emotion
Challenge (AVEC 2017), which was done as part of a B.Tech. 2nd year internship
project. For visual features we used HOG (Histogram of Oriented Gradients)
features, Fisher encodings of SIFT (Scale-Invariant Feature Transform) features
based on a Gaussian mixture model (GMM), and the outputs of some pretrained
Convolutional Neural Network layers, all extracted for each video clip. For audio features
we used the Bag-of-audio-words (BoAW) representation of the LLDs (low-level
descriptors) generated by openXBOW provided by the organisers of the event.
Then we trained a fully connected neural network regression model on the
dataset for each of these modalities. We applied multimodal fusion to the
outputs of these models to obtain the Concordance Correlation Coefficient (CCC)
on the Development set as well as the Test set.
Comment: 4 pages, 3 figures, arXiv:1605.06778, arXiv:1512.0338
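For reference, the Concordance Correlation Coefficient used as the AVEC metric can be computed as follows (a standard NumPy implementation, not code from the paper):

```python
# Concordance Correlation Coefficient (CCC):
# CCC = 2*cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))**2)
import numpy as np

def ccc(pred, gold):
    pred, gold = np.asarray(pred, float), np.asarray(gold, float)
    mp, mg = pred.mean(), gold.mean()
    vp, vg = pred.var(), gold.var()
    cov = ((pred - mp) * (gold - mg)).mean()
    return 2 * cov / (vp + vg + (mp - mg) ** 2)
```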
Updating the silent speech challenge benchmark with deep learning
The 2010 Silent Speech Challenge benchmark is updated with new results
obtained in a Deep Learning strategy, using the same input features and
decoding strategy as in the original article. A Word Error Rate of 6.4% is
obtained, compared to the published value of 17.4%. Additional results
comparing new auto-encoder-based features with the original features at reduced
dimensionality, as well as decoding scenarios on two different language models,
are also presented. The Silent Speech Challenge archive has been updated to
contain both the original and the new auto-encoder features, in addition to the
original raw data.
Comment: 25 pages, 6 page
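For context, the reported Word Error Rate is the word-level Levenshtein distance between the decoded hypothesis and the reference, divided by the reference length; a minimal implementation of this standard definition (not code from the article) is:

```python
# Word Error Rate: edit distance between hypothesis and reference word
# sequences, normalized by the reference length.
def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```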
Modality Dropout for Improved Performance-driven Talking Faces
We describe our novel deep learning approach for driving animated faces using
both acoustic and visual information. In particular, speech-related facial
movements are generated using audiovisual information, and non-speech facial
movements are generated using only visual information. To ensure that our model
exploits both modalities during training, batches are generated that contain
audio-only, video-only, and audiovisual input features. The probability of
dropping a modality allows control over the degree to which the model exploits
audio and visual information during training. Our trained model runs in real
time on resource-limited hardware (e.g., a smartphone), is user-agnostic, and
does not depend on a potentially error-prone transcription of the speech. We
use subjective testing to demonstrate: 1) the improvement of
audiovisual-driven animation over the equivalent video-only approach, and 2)
the improvement in the animation of speech-related facial movements after
introducing modality dropout. Before introducing dropout, viewers prefer
audiovisual-driven animation in 51% of the test sequences compared with only
18% for video-driven. After introducing dropout, viewer preference for
audiovisual-driven animation increases to 74%, while preference for video-only
decreases to 8%.
Comment: Pre-print
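A minimal sketch of modality dropout at batch-construction time, in PyTorch; the dropout probabilities and tensor shapes are illustrative assumptions, not the paper's values:

```python
# Modality dropout sketch: with some probability the audio or the video
# features of a training example are zeroed, so the model sees audio-only,
# video-only, and audiovisual inputs within the same batch.
import torch

def apply_modality_dropout(audio, video, p_drop_audio=0.25, p_drop_video=0.25):
    # audio: (B, Ta, Da), video: (B, Tv, Dv)
    batch = audio.shape[0]
    drop_audio = torch.rand(batch) < p_drop_audio
    drop_video = torch.rand(batch) < p_drop_video
    # Never drop both modalities for the same example.
    both = drop_audio & drop_video
    drop_audio = drop_audio & ~both
    audio = audio.clone()
    video = video.clone()
    audio[drop_audio] = 0.0
    video[drop_video] = 0.0
    return audio, video
```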
Weakly Supervised Representation Learning for Unsynchronized Audio-Visual Events
Audio-visual representation learning is an important task from the
perspective of designing machines with the ability to understand complex
events. To this end, we propose a novel multimodal framework that instantiates
multiple instance learning. We show that the learnt representations are useful
for classifying events and localizing their characteristic audio-visual
elements. The system is trained using only video-level event labels without any
timing information. An important feature of our method is its capacity to learn
from unsynchronized audio-visual events. We achieve state-of-the-art results on
a large-scale dataset of weakly-labeled audio event videos. Visualizations of
localized visual regions and audio segments substantiate our system's efficacy,
especially when dealing with noisy situations where modality-specific cues
appear asynchronously.
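One common way to instantiate multiple instance learning with only video-level labels is to score each segment and aggregate by a max over segments; the sketch below is an illustrative instantiation, not necessarily the paper's exact aggregation.

```python
# Multiple-instance-learning sketch for weak (video-level) labels: each
# audio/visual segment gets its own class scores, and the video-level
# prediction is the max over segments, so only video-level labels are needed.
import torch
import torch.nn as nn

class SegmentMIL(nn.Module):
    def __init__(self, feat_dim=512, n_events=10):
        super().__init__()
        self.scorer = nn.Linear(feat_dim, n_events)

    def forward(self, segments):                 # segments: (B, S, feat_dim)
        seg_logits = self.scorer(segments)       # (B, S, n_events)
        video_logits, which = seg_logits.max(dim=1)
        # `which` indexes the segment most responsible for each event, which
        # is what supports localizing the characteristic audio-visual elements.
        return video_logits, which

criterion = nn.BCEWithLogitsLoss()               # multi-label, video-level targets
```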
Deep Learning for Sentiment Analysis: A Survey
Deep learning has emerged as a powerful machine learning technique that
learns multiple layers of representations or features of the data and produces
state-of-the-art prediction results. Along with its success in many other
application domains, deep learning has also been widely applied to sentiment
analysis in recent years. This paper first gives an overview of deep learning
and then provides a comprehensive survey of its current applications in
sentiment analysis.
Comment: 34 pages, 9 figures, 2 tables
Multi-Modal Emotion recognition on IEMOCAP Dataset using Deep Learning
Emotion recognition has become an important field of research in Human-Computer
Interaction as we improve the techniques for modelling the various aspects of
behaviour. As technology and our understanding of emotions advance, there is a
growing need for automatic emotion recognition systems. One direction this
research is taking is the use of neural networks, which are adept at estimating
complex functions that depend on a large number of diverse input sources. In
this paper, we attempt to exploit this effectiveness of neural networks to
perform multimodal emotion recognition on the IEMOCAP dataset using speech,
text, and motion-capture data from facial expressions, rotation, and hand
movements. Prior research has concentrated on emotion detection from speech on
the IEMOCAP dataset, but our approach is the first to use the multiple
modalities offered by IEMOCAP for more robust and accurate emotion detection.
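An illustrative PyTorch sketch of late fusion over the three modalities (speech, text, and motion-capture features); the feature dimensions and layer sizes are assumptions, not the authors' reported configuration.

```python
# Illustrative tri-modal late-fusion classifier for IEMOCAP-style emotion
# recognition: each modality is encoded separately, then concatenated.
import torch
import torch.nn as nn

class TriModalEmotionNet(nn.Module):
    def __init__(self, speech_dim=100, text_dim=300, mocap_dim=165, n_emotions=4):
        super().__init__()
        self.speech = nn.Sequential(nn.Linear(speech_dim, 128), nn.ReLU())
        self.text = nn.Sequential(nn.Linear(text_dim, 128), nn.ReLU())
        self.mocap = nn.Sequential(nn.Linear(mocap_dim, 128), nn.ReLU())
        self.head = nn.Sequential(
            nn.Linear(3 * 128, 128), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(128, n_emotions),
        )

    def forward(self, speech_feat, text_feat, mocap_feat):
        fused = torch.cat([self.speech(speech_feat),
                           self.text(text_feat),
                           self.mocap(mocap_feat)], dim=-1)
        return self.head(fused)                  # emotion logits
```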
An Attempt towards Interpretable Audio-Visual Video Captioning
Automatically generating a natural language sentence to describe the content
of an input video is a very challenging problem. It is an essential multimodal
task in which auditory and visual contents are equally important. Although
audio information has been exploited to improve video captioning in previous
works, it is usually regarded as an additional feature fed into a black box
fusion machine. How are the words in the generated sentences associated with
the auditory and visual modalities? This question has not yet been investigated. In
this paper, we make the first attempt to design an interpretable audio-visual
video captioning network to discover the association between words in sentences
and audio-visual sequences. To achieve this, we propose a multimodal
convolutional neural network-based audio-visual video captioning framework and
introduce a modality-aware module for exploring modality selection during
sentence generation. In addition, we collect new audio captioning and visual
captioning datasets for further exploring the interactions between auditory and
visual modalities for high-level video understanding. Extensive experiments
demonstrate that the modality-aware module makes our model interpretable on
modality selection during sentence generation. Even with the added
interpretability, our video captioning network can still achieve comparable
performance with recent state-of-the-art methods.
Comment: 11 pages, 4 figures
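One plausible form for such a modality-aware module is a soft gate over the audio and visual contexts at each decoding step, which makes the modality choice per word inspectable; the sketch below is an assumption about its shape, not the paper's exact design.

```python
# Illustrative modality-aware gate: the decoder state produces a soft weight
# over the audio and visual context vectors at each decoding step, so each
# generated word can be attributed to a modality.
import torch
import torch.nn as nn

class ModalityAwareGate(nn.Module):
    def __init__(self, hidden_dim, ctx_dim):
        super().__init__()
        self.score = nn.Linear(hidden_dim + ctx_dim, 1)

    def forward(self, dec_state, audio_ctx, visual_ctx):
        # dec_state: (B, hidden_dim); audio_ctx, visual_ctx: (B, ctx_dim)
        s_a = self.score(torch.cat([dec_state, audio_ctx], dim=-1))
        s_v = self.score(torch.cat([dec_state, visual_ctx], dim=-1))
        weights = torch.softmax(torch.cat([s_a, s_v], dim=-1), dim=-1)  # (B, 2)
        fused = weights[:, :1] * audio_ctx + weights[:, 1:] * visual_ctx
        return fused, weights    # `weights` exposes the per-word modality choice
```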
Deep Learning in Robotics: A Review of Recent Research
Advances in deep learning over the last decade have led to a flurry of
research in the application of deep artificial neural networks to robotic
systems, with at least thirty papers published on the subject between 2014 and
the present. This review discusses the applications, benefits, and limitations
of deep learning vis-à-vis physical robotic systems, using contemporary
research as exemplars. It is intended to communicate recent advances to the
wider robotics community and inspire additional interest in and application of
deep learning in robotics.
Comment: 41 pages, 135 references