Structure Optimization for Deep Multimodal Fusion Networks using Graph-Induced Kernels
A popular testbed for deep learning has been multimodal recognition of human
activity or gesture involving diverse inputs such as video, audio, skeletal
pose and depth images. Deep learning architectures have excelled on such
problems due to their ability to combine modality representations at different
levels of nonlinear feature extraction. However, designing an optimal
architecture in which to fuse such learned representations has largely been a
non-trivial human engineering effort. We treat fusion structure optimization as
a hyper-parameter search and cast it as a discrete optimization problem under
the Bayesian optimization framework. We propose a novel graph-induced kernel to
compute structural similarities in the search space of tree-structured
multimodal architectures and demonstrate its effectiveness using two
challenging multimodal human activity recognition datasets.
Comment: Proceedings of the 25th European Symposium on Artificial Neural
Networks, Computational Intelligence and Machine Learning, April 2017,
Bruges, Belgium
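The abstract does not spell out the form of the graph-induced kernel; as a rough, hypothetical illustration of a structural kernel over tree-shaped fusion architectures (the kind of similarity score a Bayesian optimizer could consume), the sketch below compares two candidate fusion trees through simple depth/fan-out histograms. The tree representation and the Gaussian form are assumptions, not the paper's kernel.

```python
import numpy as np

def tree_features(tree, depth=0, feats=None):
    """Collect (depth, fan-out) counts for a nested-tuple fusion tree.

    A leaf is a string naming a modality, e.g. "rgb"; an internal node is a
    tuple of subtrees and represents one fusion point. Purely illustrative.
    """
    if feats is None:
        feats = {}
    if isinstance(tree, tuple):                      # fusion node
        key = (depth, len(tree))
        feats[key] = feats.get(key, 0) + 1
        for child in tree:
            tree_features(child, depth + 1, feats)
    else:                                            # modality leaf
        feats[(depth, 0)] = feats.get((depth, 0), 0) + 1
    return feats

def fusion_tree_kernel(t1, t2, gamma=0.5):
    """Gaussian kernel on the structural histograms of two fusion trees."""
    f1, f2 = tree_features(t1), tree_features(t2)
    keys = set(f1) | set(f2)
    d = sum((f1.get(k, 0) - f2.get(k, 0)) ** 2 for k in keys)
    return np.exp(-gamma * d)

# Two candidate fusion structures over the same four modalities.
early  = ("rgb", "depth", "audio", "pose")              # fuse everything at once
staged = (("rgb", "depth"), ("audio", "pose"))          # fuse in two stages
print(fusion_tree_kernel(early, staged))
```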
DeepStyle: Multimodal Search Engine for Fashion and Interior Design
In this paper, we propose a multimodal search engine that combines visual and
textual cues to retrieve items from a multimedia database aesthetically similar
to the query. The goal of our engine is to enable intuitive retrieval of
fashion merchandise such as clothes or furniture. Existing search engines treat
textual input only as an additional source of information about the query image
and do not address the real-life scenario where the user looks for 'the
same shirt but of denim'. Our novel method, dubbed DeepStyle, mitigates those
shortcomings by using a joint neural network architecture to model contextual
dependencies between features of different modalities. We prove the robustness
of this approach on two different challenging datasets of fashion items and
furniture where our DeepStyle engine outperforms baseline methods by 18-21% on
the tested datasets. Our search engine is commercially deployed and available
through a Web-based application.
Comment: Copyright held by IEEE.
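As a loose illustration of the joint image-text modelling described above (not the deployed DeepStyle architecture), the sketch below fuses precomputed visual and textual features into one normalized embedding used for cosine-similarity retrieval; all module names, layer sizes, and feature dimensions are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointStyleEmbedding(nn.Module):
    """Fuse precomputed visual and textual features into one retrieval space."""

    def __init__(self, img_dim=2048, txt_dim=300, emb_dim=256):
        super().__init__()
        self.img_proj = nn.Sequential(nn.Linear(img_dim, emb_dim), nn.ReLU())
        self.txt_proj = nn.Sequential(nn.Linear(txt_dim, emb_dim), nn.ReLU())
        # Joint branch models interactions between the two modalities.
        self.joint = nn.Linear(2 * emb_dim, emb_dim)

    def forward(self, img_feat, txt_feat):
        z = torch.cat([self.img_proj(img_feat), self.txt_proj(txt_feat)], dim=-1)
        return F.normalize(self.joint(z), dim=-1)   # unit norm for cosine search

# Query: image features of a shirt plus text features of the modifier "denim".
model = JointStyleEmbedding()
query = model(torch.randn(1, 2048), torch.randn(1, 300))
catalog = F.normalize(torch.randn(1000, 256), dim=-1)   # database embeddings
scores = catalog @ query.squeeze(0)                     # cosine similarities
top5 = scores.topk(5).indices                           # retrieved item indices
```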
Alternating Diffusion Map Based Fusion of Multimodal Brain Connectivity Networks for IQ Prediction
To explain individual differences in development, behavior, and cognition,
most previous studies focused on projecting resting-state functional MRI (fMRI)
based functional connectivity (FC) data into a low-dimensional space via linear
dimensionality reduction techniques, followed by further analysis in that space.
However, linear dimensionality reduction techniques may fail to capture the
nonlinearity of brain activity. Moreover, besides resting-state FC, FC based on
task fMRI can be expected to provide complementary information.
Motivated by these considerations, we nonlinearly fuse resting-state and
task-based FC networks (FCNs) to seek a better representation in this paper. We
propose a framework based on alternating diffusion map (ADM), which extracts
geometry-preserving low-dimensional embeddings that successfully parameterize
the intrinsic variables driving the phenomenon of interest. Specifically, we
first separately build resting-state and task-based FCNs by symmetric positive
definite matrices using sparse inverse covariance estimation for each subject,
and then utilize the ADM to fuse them in order to extract significant
low-dimensional embeddings, which are used as fingerprints to identify
individuals. The proposed framework is validated on the Philadelphia
Neurodevelopmental Cohort data, where we conduct an extensive experimental study
on resting-state and fractal n-back task fMRI for the classification of
intelligence quotient (IQ). The fusion of resting-state and n-back task fMRI
by the proposed framework achieves better classification accuracy than any
single fMRI, and the proposed framework is shown to outperform several other
data fusion methods. To our knowledge, this paper is the first to demonstrate a
successful extension of the ADM to fuse resting-state and task-based fMRI data
for accurate prediction of IQ.
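For orientation, a minimal NumPy sketch of the generic alternating-diffusion embedding follows: two row-stochastic affinity matrices (one per view) are multiplied, and the leading non-trivial eigenvectors of the product give the fused low-dimensional coordinates. The Gaussian affinities over vectorized FC matrices below are a simplification; the paper builds SPD matrices via sparse inverse covariance estimation and uses its own scaling.

```python
import numpy as np

def affinity(X, eps):
    """Row-stochastic Gaussian affinity between subjects (rows of X)."""
    d2 = np.square(X[:, None, :] - X[None, :, :]).sum(-1)
    K = np.exp(-d2 / eps)
    return K / K.sum(axis=1, keepdims=True)

def alternating_diffusion_embedding(X_rest, X_task, eps=1.0, dim=3):
    """Fuse two views with the alternating-diffusion operator M = M_rest @ M_task."""
    M = affinity(X_rest, eps) @ affinity(X_task, eps)
    evals, evecs = np.linalg.eig(M)          # M is non-symmetric, so use eig
    order = np.argsort(-evals.real)
    # Skip the trivial constant eigenvector (eigenvalue 1), keep `dim` coordinates.
    return (evecs[:, order[1:dim + 1]] * evals[order[1:dim + 1]]).real

# Toy data: 50 subjects, each FC matrix vectorized to 100 numbers per view.
rng = np.random.default_rng(0)
emb = alternating_diffusion_embedding(rng.normal(size=(50, 100)),
                                      rng.normal(size=(50, 100)))
print(emb.shape)   # (50, 3) fused embedding used as a subject fingerprint
```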
CCL: Cross-modal Correlation Learning with Multi-grained Fusion by Hierarchical Network
Cross-modal retrieval has become a prominent research topic for retrieval
across multimedia data such as image and text. A two-stage learning framework
is widely adopted by most existing methods based on Deep Neural Networks (DNN):
the first learning stage generates a separate representation for each
modality, and the second learning stage learns the cross-modal common
representation. However, the existing methods have three limitations: (1) In
the first learning stage, they only model intra-modality correlation, but
ignore inter-modality correlation with rich complementary context. (2) In the
second learning stage, they only adopt shallow networks with single-loss
regularization, but ignore the intrinsic relevance of intra-modality and
inter-modality correlation. (3) Only original instances are considered while
the complementary fine-grained clues provided by their patches are ignored. To
address the above problems, this paper proposes a cross-modal correlation
learning (CCL) approach with multi-grained fusion by a hierarchical network, and
the contributions are as follows: (1) In the first learning stage, CCL exploits
multi-level association with joint optimization to preserve the complementary
context from intra-modality and inter-modality correlation simultaneously. (2)
In the second learning stage, a multi-task learning strategy is designed to
adaptively balance the intra-modality semantic category constraints and
inter-modality pairwise similarity constraints. (3) CCL adopts multi-grained
modeling, which fuses the coarse-grained instances and fine-grained patches to
make cross-modal correlation more precise. Compared with 13 state-of-the-art
methods on 6 widely-used cross-modal datasets, the experimental results show that
our CCL approach achieves the best performance.
Comment: 16 pages, accepted by IEEE Transactions on Multimedia
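A minimal sketch of the kind of multi-task objective described in contribution (2), balancing intra-modality category supervision against an inter-modality pairwise ranking term, is shown below; the particular losses, margin, and weights are illustrative assumptions, not CCL's exact formulation.

```python
import torch
import torch.nn.functional as F

def multitask_crossmodal_loss(img_emb, txt_emb, img_logits, txt_logits, labels,
                              margin=0.2, w_cls=1.0, w_pair=1.0):
    """Balance intra-modality category constraints with inter-modality pairing."""
    # Intra-modality semantic category constraints (one classifier head per modality).
    cls_loss = F.cross_entropy(img_logits, labels) + F.cross_entropy(txt_logits, labels)

    # Inter-modality pairwise constraint: matched image/text pairs should score
    # higher than mismatched pairs by at least `margin` (triplet-style ranking).
    sim = F.normalize(img_emb, dim=-1) @ F.normalize(txt_emb, dim=-1).t()   # (B, B)
    pos = sim.diag().unsqueeze(1)                       # similarity of matched pairs
    B = sim.size(0)
    hinge = F.relu(margin + sim - pos) * (1 - torch.eye(B))
    pair_loss = hinge.sum() / (B * (B - 1))

    return w_cls * cls_loss + w_pair * pair_loss

# Random batch of 8 paired items over 10 semantic categories, for illustration only.
loss = multitask_crossmodal_loss(torch.randn(8, 128), torch.randn(8, 128),
                                 torch.randn(8, 10), torch.randn(8, 10),
                                 torch.randint(0, 10, (8,)))
```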
Relation-Aware Graph Attention Network for Visual Question Answering
In order to answer semantically-complicated questions about an image, a
Visual Question Answering (VQA) model needs to fully understand the visual
scene in the image, especially the interactive dynamics between different
objects. We propose a Relation-aware Graph Attention Network (ReGAT), which
encodes each image into a graph and models multi-type inter-object relations
via a graph attention mechanism, to learn question-adaptive relation
representations. Two types of visual object relations are explored: (i)
Explicit Relations that represent geometric positions and semantic interactions
between objects; and (ii) Implicit Relations that capture the hidden dynamics
between image regions. Experiments demonstrate that ReGAT outperforms prior
state-of-the-art approaches on both VQA 2.0 and VQA-CP v2 datasets. We further
show that ReGAT is compatible with existing VQA architectures, and can be used as
a generic relation encoder to boost model performance for VQA.
Comment: To appear in ICCV 2019
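A minimal sketch of question-conditioned graph attention over object features, in the spirit of the description above, is given below; the relation embeddings, dimensions, and scoring function are assumptions rather than the paper's exact ReGAT layers.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelationAwareAttention(nn.Module):
    """One graph-attention layer whose edge scores depend on the question."""

    def __init__(self, obj_dim=2048, q_dim=512, rel_dim=64, hid=512):
        super().__init__()
        self.proj = nn.Linear(obj_dim, hid)
        self.edge_score = nn.Linear(2 * hid + q_dim + rel_dim, 1)

    def forward(self, objs, rel_emb, q):
        """objs: (N, obj_dim) region features; rel_emb: (N, N, rel_dim) per-pair
        relation embeddings (explicit or implicit); q: (q_dim,) question vector."""
        h = self.proj(objs)                                           # (N, hid)
        N = h.size(0)
        pair = torch.cat([h.unsqueeze(1).expand(N, N, -1),            # sender
                          h.unsqueeze(0).expand(N, N, -1),            # receiver
                          q.expand(N, N, -1),                         # question
                          rel_emb], dim=-1)
        alpha = F.softmax(self.edge_score(pair).squeeze(-1), dim=-1)  # (N, N) weights
        return alpha @ h                       # question-adaptive object features

layer = RelationAwareAttention()
out = layer(torch.randn(36, 2048), torch.randn(36, 36, 64), torch.randn(512))
print(out.shape)   # torch.Size([36, 512])
```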
A Simple Baseline for Audio-Visual Scene-Aware Dialog
The recently proposed audio-visual scene-aware dialog task paves the way to a
more data-driven way of learning virtual assistants, smart speakers and car
navigation systems. However, very little is known to date about how to
effectively extract meaningful information from a plethora of sensors that
pound the computational engine of those devices. Therefore, in this paper, we
provide and carefully analyze a simple baseline for audio-visual scene-aware
dialog which is trained end-to-end. Our method uses an attention mechanism to
distinguish, in a data-driven manner, useful signals from distracting ones. We
evaluate the proposed approach on the recently introduced and challenging
audio-visual scene-aware dialog dataset, and demonstrate the key features that
permit it to outperform the current state-of-the-art by more than 20% on CIDEr.
Comment: Accepted to CVPR 2019
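As a rough sketch of the attention idea, weighting per-modality features by their relevance to the current question or dialog state, consider the module below; the dimensions and the one-vector-per-stream simplification are assumptions, not the paper's baseline.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityAttention(nn.Module):
    """Weight audio, video and dialog-history features by a query-dependent score."""

    def __init__(self, feat_dim=512, q_dim=512):
        super().__init__()
        self.score = nn.Linear(feat_dim + q_dim, 1)

    def forward(self, modality_feats, query):
        """modality_feats: (M, feat_dim), one vector per input stream;
        query: (q_dim,) encoding of the current question / dialog state."""
        M = modality_feats.size(0)
        pairs = torch.cat([modality_feats, query.expand(M, -1)], dim=-1)
        weights = F.softmax(self.score(pairs).squeeze(-1), dim=0)   # (M,) relevance
        return weights @ modality_feats                             # fused context

att = ModalityAttention()
fused = att(torch.randn(3, 512), torch.randn(512))   # e.g. audio, video, history
```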
Deep Multimodal Subspace Clustering Networks
We present convolutional neural network (CNN) based approaches for
unsupervised multimodal subspace clustering. The proposed framework consists of
three main stages - multimodal encoder, self-expressive layer, and multimodal
decoder. The encoder takes multimodal data as input and fuses them to a latent
space representation. The self-expressive layer is responsible for enforcing
the self-expressiveness property and acquiring an affinity matrix corresponding
to the data points. The decoder reconstructs the original input data. The
network uses the distance between the decoder's reconstruction and the original
input in its training. We investigate early, late and intermediate fusion
techniques and propose three different encoders corresponding to them for
spatial fusion. The self-expressive layers and multimodal decoders are
essentially the same for different spatial fusion-based approaches. In addition
to various spatial fusion-based methods, an affinity fusion-based network is
also proposed in which the self-expressive layers corresponding to different
modalities are enforced to be the same. Extensive experiments on three datasets
show that the proposed methods significantly outperform the state-of-the-art
multimodal subspace clustering methods.
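The self-expressive layer can be sketched as a trainable coefficient matrix C with zero diagonal that reconstructs each latent code from the others; the minimal PyTorch version below illustrates that mechanism only, not the paper's full training objective or affinity post-processing.

```python
import torch
import torch.nn as nn

class SelfExpressiveLayer(nn.Module):
    """Learn coefficients C so each latent code is a combination of the others."""

    def __init__(self, num_samples):
        super().__init__()
        self.C = nn.Parameter(1e-4 * torch.randn(num_samples, num_samples))

    def forward(self, Z):
        # Zero the diagonal so a point cannot trivially reconstruct itself.
        C = self.C - torch.diag(torch.diag(self.C))
        return C @ Z, C

def self_expressive_loss(Z, Z_hat, C, lam=1.0):
    """Reconstruction term plus an L1 penalty encouraging sparse coefficients."""
    return ((Z - Z_hat) ** 2).sum() + lam * C.abs().sum()

# Z stands in for the fused latent codes produced by a multimodal encoder.
Z = torch.randn(100, 64)
layer = SelfExpressiveLayer(num_samples=100)
Z_hat, C = layer(Z)
loss = self_expressive_loss(Z, Z_hat, C)
# After training, an affinity such as |C| + |C|.T is fed to spectral clustering.
```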
Modality-specific Cross-modal Similarity Measurement with Recurrent Attention Network
Nowadays, cross-modal retrieval plays an indispensable role in flexibly finding
information across different modalities of data. Effectively measuring the
similarity between different modalities of data is the key to cross-modal
retrieval. Different modalities such as image and text have imbalanced and
complementary relationships, and contain unequal amounts of information when
describing the same semantics. For example, images often contain more details
that cannot be demonstrated by textual descriptions and vice versa. Existing
works based on Deep Neural Networks (DNN) mostly construct one common space for
different modalities to find the latent alignments between them, which loses
the exclusive modality-specific characteristics of each modality. Different from
existing works, we propose a modality-specific cross-modal similarity measurement
(MCSM) approach that constructs an independent semantic space for each modality
and adopts an end-to-end framework to directly generate modality-specific
cross-modal similarity without an explicit common representation. For each
semantic space, modality-specific characteristics within one modality are fully
exploited by a recurrent attention network, while the data of the other modality
is projected into this space with attention-based joint embedding to utilize the learned
attention weights for guiding the fine-grained cross-modal correlation
learning, which can capture the imbalanced and complementary relationships
between different modalities. Finally, the complementarity between the semantic
spaces for different modalities is explored by adaptive fusion of the
modality-specific cross-modal similarities to perform cross-modal retrieval.
Experiments on the widely-used Wikipedia and Pascal Sentence datasets as well
as our constructed large-scale XMediaNet dataset verify the effectiveness of
our proposed approach, outperforming 9 state-of-the-art methods.
Comment: 13 pages, submitted to IEEE Transactions on Image Processing
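A minimal sketch of the final adaptive-fusion step, combining the similarities computed in each modality-specific semantic space with learned weights, is given below; the softmax weighting is an assumed stand-in for the paper's fusion scheme, not its exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveSimilarityFusion(nn.Module):
    """Combine modality-specific cross-modal similarities with learned weights."""

    def __init__(self, num_spaces=2):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(num_spaces))

    def forward(self, sims):
        """sims: (num_spaces, Nq, Nd), similarity of each query/item pair computed
        in each modality-specific space (e.g. the image space and the text space)."""
        w = F.softmax(self.logits, dim=0)            # adaptive non-negative weights
        return torch.einsum('s,sqd->qd', w, sims)    # fused (Nq, Nd) similarity

fuse = AdaptiveSimilarityFusion()
sim_img_space = torch.randn(5, 100)   # queries scored in the image semantic space
sim_txt_space = torch.randn(5, 100)   # queries scored in the text semantic space
fused = fuse(torch.stack([sim_img_space, sim_txt_space]))
ranking = fused.argsort(dim=1, descending=True)      # retrieval order per query
```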
Multimodal Local-Global Ranking Fusion for Emotion Recognition
Emotion recognition is a core research area at the intersection of artificial
intelligence and human communication analysis. It is a significant technical
challenge since humans display their emotions through complex idiosyncratic
combinations of the language, visual and acoustic modalities. In contrast to
traditional multimodal fusion techniques, we approach emotion recognition from
both direct person-independent and relative person-dependent perspectives. The
direct person-independent perspective follows the conventional emotion
recognition approach which directly infers absolute emotion labels from
observed multimodal features. The relative person-dependent perspective
approaches emotion recognition in a relative manner by comparing partial video
segments to determine if there was an increase or decrease in emotional
intensity. Our proposed model integrates these direct and relative prediction
perspectives by dividing the emotion recognition task into three easier
subtasks. The first subtask involves a multimodal local ranking of relative
emotion intensities between two short segments of a video. The second subtask
uses local rankings to infer global relative emotion ranks with a Bayesian
ranking algorithm. The third subtask incorporates both direct predictions from
observed multimodal behaviors and relative emotion ranks from local-global
rankings for final emotion prediction. Our approach displays excellent
performance on an audio-visual emotion recognition benchmark and improves over
other algorithms for multimodal fusion.
Comment: ACM International Conference on Multimodal Interaction (ICMI 2018)
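As a rough illustration of the second subtask, the sketch below turns local pairwise intensity judgments into global scores with a Bradley-Terry maximum-likelihood estimate; this is a simpler stand-in for, not a reproduction of, the Bayesian ranking algorithm the paper uses.

```python
import numpy as np

def bradley_terry_ranks(num_segments, comparisons, iters=100):
    """Infer global intensity scores from local pairwise outcomes.

    comparisons: list of (i, j) meaning segment i was judged MORE intense than j.
    Returns scores whose ordering gives a global emotion rank (Bradley-Terry MM
    updates; a fully Bayesian ranking model would add priors and uncertainty).
    """
    wins = np.zeros((num_segments, num_segments))
    for i, j in comparisons:
        wins[i, j] += 1
    p = np.ones(num_segments)
    for _ in range(iters):
        total = wins + wins.T                                   # how often pairs met
        denom = total / (p[:, None] + p[None, :] + 1e-12)
        p = wins.sum(axis=1) / np.maximum(denom.sum(axis=1), 1e-12)
        p = p / p.sum()                                         # fix the scale
    return p

# Local rankings over 4 video segments (e.g. output of the multimodal local ranker).
scores = bradley_terry_ranks(4, [(1, 0), (2, 1), (2, 0), (3, 2), (3, 1)])
print(np.argsort(-scores))   # segments ordered from most to least intense
```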
Multi-view Laplacian Eigenmaps Based on Bag-of-Neighbors For RGBD Human Emotion Recognition
Human emotion recognition is an important direction in the field of biometrics
and information forensics. However, most existing human emotion research is
based on a single RGB view. In this paper, we introduce an RGBD video-emotion
dataset and an RGBD face-emotion dataset for research. To the best of our
knowledge, this may be the first RGBD video-emotion dataset. We propose a new
supervised nonlinear multi-view Laplacian eigenmaps (MvLE) approach and a
multi-hidden-layer out-of-sample network (MHON) for RGB-D human emotion
recognition. To get better representations of the RGB view and depth view, MvLE is
used to map the training set of both views from original space into the common
subspace. As the RGB view and depth view lie in different spaces, a new distance
metric, bag of neighbors (BON), is used in MvLE to obtain similar distributions
of the two views. Finally, MHON is used to get the low-dimensional
representations of test data and predict their labels. MvLE can handle cases in
which the RGB view and depth view have different feature dimensions, or even
different numbers of samples and classes, and our methods can be easily extended
to more than two views. The experimental results indicate the effectiveness of
our methods over several state-of-the-art methods.
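For orientation, a minimal sketch of plain (single-view, unsupervised) Laplacian eigenmaps follows; MvLE's supervision, bag-of-neighbors metric, and common-subspace mapping are omitted, and the embed-each-view-then-concatenate step at the end is purely illustrative.

```python
import numpy as np

def laplacian_eigenmaps(X, k=10, dim=2):
    """Plain Laplacian eigenmaps: kNN graph -> graph Laplacian -> smallest eigvecs."""
    n = X.shape[0]
    d2 = np.square(X[:, None, :] - X[None, :, :]).sum(-1)
    idx = np.argsort(d2, axis=1)[:, 1:k + 1]          # k nearest neighbours (skip self)
    W = np.zeros((n, n))
    rows = np.repeat(np.arange(n), k)
    W[rows, idx.ravel()] = 1.0
    W = np.maximum(W, W.T)                            # symmetrize the adjacency
    L = np.diag(W.sum(axis=1)) - W                    # unnormalized graph Laplacian
    evals, evecs = np.linalg.eigh(L)
    return evecs[:, 1:dim + 1]                        # skip the constant eigenvector

rng = np.random.default_rng(0)
rgb_feats   = rng.normal(size=(200, 128))             # toy RGB-view features
depth_feats = rng.normal(size=(200, 64))              # toy depth-view features
# Crude stand-in for the multi-view step: embed each view, then concatenate.
joint = np.hstack([laplacian_eigenmaps(rgb_feats), laplacian_eigenmaps(depth_feats)])
print(joint.shape)   # (200, 4)
```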