Deep Multimodal Feature Analysis for Action Recognition in RGB+D Videos
Single modality action recognition on RGB or depth sequences has been
extensively explored recently. It is generally accepted that each of these two
modalities has different strengths and limitations for the task of action
recognition. Therefore, analyzing RGB+D videos can help us better study the
complementary properties of these two modalities and achieve
higher levels of performance. In this paper, we propose a new deep autoencoder
based shared-specific feature factorization network to separate input
multimodal signals into a hierarchy of components. Further, based on the
structure of the features, a structured sparsity learning machine is proposed
which utilizes mixed norms to apply regularization within components and group
selection between them for better classification performance. Our experimental
results show the effectiveness of our cross-modality feature analysis framework
by achieving state-of-the-art accuracy for action classification on five
challenging benchmark datasets.
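To make the factorization idea concrete, here is a minimal PyTorch sketch of a shared-specific autoencoder with a mixed-norm penalty. The layer sizes, the L2-within/L1-across penalty, and the shared-alignment term are illustrative assumptions, not the paper's exact formulation.

```python
# Sketch of a shared-specific factorization autoencoder with a mixed-norm
# (group-sparsity) penalty. Sizes and loss weights are assumptions.
import torch
import torch.nn as nn

class SharedSpecificAE(nn.Module):
    def __init__(self, dim_rgb=512, dim_depth=512, dim_shared=128, dim_spec=128):
        super().__init__()
        # modality-specific encoders
        self.enc_rgb = nn.Sequential(nn.Linear(dim_rgb, 256), nn.ReLU())
        self.enc_depth = nn.Sequential(nn.Linear(dim_depth, 256), nn.ReLU())
        # factorize each modality into a shared and a specific component
        self.to_shared_rgb = nn.Linear(256, dim_shared)
        self.to_shared_depth = nn.Linear(256, dim_shared)
        self.to_spec_rgb = nn.Linear(256, dim_spec)
        self.to_spec_depth = nn.Linear(256, dim_spec)
        # decoders reconstruct each modality from (shared, specific)
        self.dec_rgb = nn.Linear(dim_shared + dim_spec, dim_rgb)
        self.dec_depth = nn.Linear(dim_shared + dim_spec, dim_depth)

    def forward(self, x_rgb, x_depth):
        h_r, h_d = self.enc_rgb(x_rgb), self.enc_depth(x_depth)
        s_r, s_d = self.to_shared_rgb(h_r), self.to_shared_depth(h_d)
        p_r, p_d = self.to_spec_rgb(h_r), self.to_spec_depth(h_d)
        rec_r = self.dec_rgb(torch.cat([s_r, p_r], dim=1))
        rec_d = self.dec_depth(torch.cat([s_d, p_d], dim=1))
        return (s_r, s_d, p_r, p_d), (rec_r, rec_d)

def mixed_norm(components):
    # L2 norm within each component, summed (L1) across components:
    # encourages whole components to switch off (group selection).
    return sum(c.norm(p=2, dim=1).mean() for c in components)

model = SharedSpecificAE()
x_rgb, x_depth = torch.randn(8, 512), torch.randn(8, 512)
comps, (rec_r, rec_d) = model(x_rgb, x_depth)
loss = (nn.functional.mse_loss(rec_r, x_rgb)
        + nn.functional.mse_loss(rec_d, x_depth)
        + 0.01 * mixed_norm(comps)                     # group sparsity
        + nn.functional.mse_loss(comps[0], comps[1]))  # align shared parts
```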
Deep Facial Expression Recognition: A Survey
With the transition of facial expression recognition (FER) from
laboratory-controlled to challenging in-the-wild conditions and the recent
success of deep learning techniques in various fields, deep neural networks
have increasingly been leveraged to learn discriminative representations for
automatic FER. Recent deep FER systems generally focus on two important issues:
overfitting caused by a lack of sufficient training data and
expression-unrelated variations, such as illumination, head pose and identity
bias. In this paper, we provide a comprehensive survey on deep FER, including
datasets and algorithms that provide insights into these intrinsic problems.
First, we describe the standard pipeline of a deep FER system with the related
background knowledge and suggestions of applicable implementations for each
stage. We then introduce the available datasets that are widely used in the
literature and provide accepted data selection and evaluation principles for
these datasets. For the state of the art in deep FER, we review existing novel
deep neural networks and related training strategies that are designed for FER
based on both static images and dynamic image sequences, and discuss their
advantages and limitations. Competitive performances on widely used benchmarks
are also summarized in this section. We then extend our survey to additional
related issues and application scenarios. Finally, we review the remaining
challenges and corresponding opportunities in this field as well as future
directions for the design of robust deep FER systems.
Deep Sparse Coding for Invariant Multimodal Halle Berry Neurons
Deep feed-forward convolutional neural networks (CNNs) have become ubiquitous
in virtually all machine learning and computer vision challenges; however,
advancements in CNNs have arguably reached an engineering saturation point
where incremental novelty results in minor performance gains. Although there is
evidence that object classification has reached human levels on narrowly
defined tasks, for general applications, the biological visual system is far
superior to that of any computer. Research reveals there are numerous missing
components in feed-forward deep neural networks that are critical in mammalian
vision. The brain does not work solely in a feed-forward fashion, but rather
all of the neurons are in competition with each other; neurons are integrating
information in a bottom-up and top-down fashion and incorporating expectation
and feedback in the modeling process. Furthermore, our visual cortex is working
in tandem with our parietal lobe, integrating sensory information from various
modalities.
In our work, we sought to improve upon standard feed-forward deep
learning models by augmenting them with biologically inspired concepts of
sparsity, top-down feedback, and lateral inhibition. We define our model as a
sparse coding problem using hierarchical layers. We solve the sparse coding
problem with an additional top-down feedback error driving the dynamics of the
neural network. While building and observing the behavior of our model, we
were fascinated to find that multimodal, invariant neurons naturally emerged,
mimicking the "Halle Berry neurons" found in the human brain. Furthermore, our sparse
representation of multimodal signals demonstrates qualitative and quantitative
superiority to the standard feed-forward joint embedding in common vision and
machine learning tasks.
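As a rough illustration of sparse coding with top-down feedback, the following sketch runs ISTA-style updates over two dictionary layers, with the upper layer's prediction error fed back into the lower layer's dynamics. The two-layer setup, step sizes, and thresholds are assumptions for illustration, not the authors' published model.

```python
# Hierarchical sparse coding with a top-down feedback term (illustrative).
import torch
import torch.nn.functional as F

def soft_threshold(x, lam):
    return torch.sign(x) * torch.clamp(x.abs() - lam, min=0.0)

def sparse_code(x, D1, D2, steps=50, lr=0.1, lam=0.1, fb=0.5):
    # a1 reconstructs the input via dictionary D1; a2 reconstructs a1 via D2.
    a1 = torch.zeros(x.size(0), D1.size(0))
    a2 = torch.zeros(x.size(0), D2.size(0))
    for _ in range(steps):
        # bottom-up reconstruction error at the input layer
        e0 = x - a1 @ D1
        # top-down feedback error: layer 2's prediction of layer 1
        e1 = a1 - a2 @ D2
        # gradient steps with feedback coupling, then shrinkage (ISTA-style)
        a1 = soft_threshold(a1 + lr * (e0 @ D1.T - fb * e1), lam * lr)
        a2 = soft_threshold(a2 + lr * (e1 @ D2.T), lam * lr)
    return a1, a2

D1 = F.normalize(torch.randn(256, 784), dim=1)  # layer-1 dictionary
D2 = F.normalize(torch.randn(64, 256), dim=1)   # layer-2 dictionary
x = torch.randn(8, 784)
a1, a2 = sparse_code(x, D1, D2)
```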
Dense Multimodal Fusion for Hierarchically Joint Representation
Multiple modalities can provide more valuable information than a single one by
describing the same contents in various ways. Hence, it is highly desirable to
learn an effective joint representation by fusing the features of different
modalities. However, previous methods mainly focus on fusing the shallow
features or high-level representations generated by unimodal deep networks,
which only capture part of the hierarchical correlations across modalities. In
this paper, we propose to densely integrate the representations by greedily
stacking multiple shared layers between different modality-specific networks,
which we name Dense Multimodal Fusion (DMF). The joint representations in
different shared layers can capture the correlations in different levels, and
the connection between shared layers also provides an efficient way to learn
the dependence among hierarchical correlations. These two properties jointly
contribute to the multiple learning paths in DMF, which results in faster
convergence, lower training loss, and better performance. We evaluate our model
on three typical multimodal learning tasks, including audiovisual speech
recognition, cross-modal retrieval, and multimodal classification. The
noticeable performance in the experiments demonstrates that our model can learn
a more effective joint representation.
Comment: 10 pages, 4 figures
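A minimal sketch of the dense-fusion idea described above: a shared layer at every depth receives the current modality-specific states together with the previous shared state, so correlations are captured at multiple levels. The layer sizes and the audio/video naming are assumptions.

```python
# Dense multimodal fusion sketch: one shared layer per depth, each fed by
# both modality branches plus the previous shared state.
import torch
import torch.nn as nn

class DMFBlock(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.audio = nn.Sequential(nn.Linear(d, d), nn.ReLU())
        self.video = nn.Sequential(nn.Linear(d, d), nn.ReLU())
        # shared layer fuses audio, video, and the previous shared state
        self.shared = nn.Sequential(nn.Linear(3 * d, d), nn.ReLU())

    def forward(self, a, v, s):
        a, v = self.audio(a), self.video(v)
        s = self.shared(torch.cat([a, v, s], dim=1))
        return a, v, s

class DMF(nn.Module):
    def __init__(self, d=128, depth=4):
        super().__init__()
        self.blocks = nn.ModuleList(DMFBlock(d) for _ in range(depth))

    def forward(self, a, v):
        s = torch.zeros(a.size(0), a.size(1))
        for blk in self.blocks:
            a, v, s = blk(a, v, s)
        return s  # hierarchically fused joint representation

joint = DMF()(torch.randn(8, 128), torch.randn(8, 128))
```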
Multimodal Deep Network Embedding with Integrated Structure and Attribute Information
Network embedding is the process of learning low-dimensional representations
for nodes in a network, while preserving node features. Existing studies only
leverage network structure information and focus on preserving structural
features. However, nodes in real-world networks often have a rich set of
attributes providing extra semantic information. It has been demonstrated that
both structural and attribute features are important for network analysis
tasks. To preserve both features, we investigate the problem of integrating
structure and attribute information to perform network embedding and propose a
Multimodal Deep Network Embedding (MDNE) method. MDNE captures the non-linear
network structures and the complex interactions among structures and
attributes, using a deep model consisting of multiple layers of non-linear
functions. Since structures and attributes are two different types of
information, a multimodal learning method is adopted to pre-process them and
help the model to better capture the correlations between node structure and
attribute information. We employ both structural proximity and attribute
proximity in the loss function to preserve the respective features and the
representations are obtained by minimizing the loss function. Results of
extensive experiments on four real-world datasets show that the proposed method
performs significantly better than baselines on a variety of tasks, which
demonstrates the effectiveness and generality of our method.
Comment: 15 pages, 10 figures
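The sketch below illustrates one way such a joint loss could look: an autoencoder over each node's adjacency row and attributes (preserving structural and attribute proximity via weighted reconstruction), plus a first-order term pulling linked nodes together. The architecture and weights are assumptions, not MDNE's published details.

```python
# Embedding that preserves structural and attribute proximity (illustrative).
import torch
import torch.nn as nn

class MDNESketch(nn.Module):
    def __init__(self, n_nodes=100, n_attr=16, dim=32):
        super().__init__()
        # separate branches pre-process structure and attributes, then fuse
        self.struct = nn.Sequential(nn.Linear(n_nodes, 64), nn.ReLU())
        self.attr = nn.Sequential(nn.Linear(n_attr, 32), nn.ReLU())
        self.fuse = nn.Linear(64 + 32, dim)
        self.decode = nn.Linear(dim, n_nodes + n_attr)

    def forward(self, adj, attrs):
        z = self.fuse(torch.cat([self.struct(adj), self.attr(attrs)], dim=1))
        return z, self.decode(z)

def mdne_loss(z, recon, adj, attrs, edges, beta=5.0, alpha=1.0):
    target = torch.cat([adj, attrs], dim=1)
    # proximity via reconstruction: observed links/attributes (non-zero
    # entries) are weighted more heavily than absent ones
    w = torch.where(target != 0, torch.full_like(target, beta),
                    torch.ones_like(target))
    l_recon = ((recon - target) ** 2 * w).mean()
    # first-order proximity: linked nodes get nearby embeddings
    zi, zj = z[edges[:, 0]], z[edges[:, 1]]
    l_first = ((zi - zj) ** 2).sum(dim=1).mean()
    return l_recon + alpha * l_first

adj = (torch.rand(100, 100) < 0.05).float()
attrs = torch.rand(100, 16)
z, recon = MDNESketch()(adj, attrs)
loss = mdne_loss(z, recon, adj, attrs, adj.nonzero())
```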
Modality-specific Cross-modal Similarity Measurement with Recurrent Attention Network
Nowadays, cross-modal retrieval plays an indispensable role in flexibly finding
information across different modalities of data. Effectively measuring the
similarity between different modalities of data is the key of cross-modal
retrieval. Different modalities such as image and text have imbalanced and
complementary relationships, which contain unequal amounts of information when
describing the same semantics. For example, images often contain more details
that cannot be demonstrated by textual descriptions and vice versa. Existing
works based on Deep Neural Networks (DNNs) mostly construct one common space
for different modalities to find the latent alignments between them, which
loses their exclusive modality-specific characteristics. Different from these
works, we propose a modality-specific cross-modal similarity measurement (MCSM)
approach that constructs an independent semantic space for each modality and
adopts an end-to-end framework to directly generate modality-specific cross-modal
similarity without an explicit common representation. For each semantic space,
modality-specific characteristics within one modality are fully exploited by a
recurrent attention network, while the data of the other modality is projected
into this space with attention-based joint embedding. The learned attention
weights guide fine-grained cross-modal correlation learning, which can capture
the imbalanced and complementary relationships between different modalities.
Finally, the complementarity between the semantic
spaces for different modalities is explored by adaptive fusion of the
modality-specific cross-modal similarities to perform cross-modal retrieval.
Experiments on the widely-used Wikipedia and Pascal Sentence datasets as well
as our constructed large-scale XMediaNet dataset verify the effectiveness of
our proposed approach, outperforming 9 state-of-the-art methods.
Comment: 13 pages, submitted to IEEE Transactions on Image Processing
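The following sketch captures only the adaptive-fusion step: a similarity is computed in an image-specific space and in a text-specific space, then the two are combined with learned weights. The linear encoders stand in for the paper's recurrent attention networks, and all dimensions are assumptions.

```python
# Adaptive fusion of two modality-specific similarities (illustrative).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MCSMSketch(nn.Module):
    def __init__(self, d_img=512, d_txt=300, d=128):
        super().__init__()
        # image-specific space: image encoder + projection of text into it
        self.img_enc = nn.Linear(d_img, d)
        self.txt_to_img = nn.Linear(d_txt, d)
        # text-specific space: text encoder + projection of image into it
        self.txt_enc = nn.Linear(d_txt, d)
        self.img_to_txt = nn.Linear(d_img, d)
        # learned fusion weights over the two modality-specific similarities
        self.fuse = nn.Parameter(torch.zeros(2))

    def forward(self, img, txt):
        sim_img = F.cosine_similarity(self.img_enc(img), self.txt_to_img(txt))
        sim_txt = F.cosine_similarity(self.txt_enc(txt), self.img_to_txt(img))
        w = torch.softmax(self.fuse, dim=0)
        return w[0] * sim_img + w[1] * sim_txt  # fused cross-modal similarity

sim = MCSMSketch()(torch.randn(8, 512), torch.randn(8, 300))
```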
Deep Multimodal Representation Learning from Temporal Data
In recent years, Deep Learning has been successfully applied to multimodal
learning problems, with the aim of learning useful joint representations in
data fusion applications. When the available modalities consist of time series
data such as video, audio and sensor signals, it becomes imperative to consider
their temporal structure during the fusion process. In this paper, we propose
the Correlational Recurrent Neural Network (CorrRNN), a novel temporal fusion
model for fusing multiple input modalities that are inherently temporal in
nature. Key features of our proposed model include: (i) simultaneous learning
of the joint representation and temporal dependencies between modalities, (ii)
use of multiple loss terms in the objective function, including a maximum
correlation loss term to enhance learning of cross-modal information, and (iii)
the use of an attention model to dynamically adjust the contribution of
different input modalities to the joint representation. We validate our model
via experimentation on two different tasks: video- and sensor-based activity
classification, and audio-visual speech recognition. We empirically analyze the
contributions of different components of the proposed CorrRNN model, and
demonstrate its robustness, effectiveness and state-of-the-art performance on
multiple datasets.
Comment: To appear in CVPR 2017
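As an illustration of the maximum-correlation term, the sketch below encodes each modality with a GRU and maximizes the per-dimension Pearson correlation of the final hidden states across a batch. The encoders and the batch-level correlation estimate are simplifying assumptions, not CorrRNN's full objective.

```python
# Maximum-correlation loss between two modality encodings (illustrative).
import torch
import torch.nn as nn

def correlation_loss(h1, h2, eps=1e-8):
    # negative mean per-dimension Pearson correlation across the batch:
    # minimizing this pushes the two modality encodings to co-vary
    h1 = h1 - h1.mean(dim=0)
    h2 = h2 - h2.mean(dim=0)
    corr = (h1 * h2).sum(dim=0) / (h1.norm(dim=0) * h2.norm(dim=0) + eps)
    return -corr.mean()

audio_rnn = nn.GRU(input_size=40, hidden_size=64, batch_first=True)
video_rnn = nn.GRU(input_size=128, hidden_size=64, batch_first=True)

audio = torch.randn(8, 20, 40)   # (batch, time, features)
video = torch.randn(8, 20, 128)
_, ha = audio_rnn(audio)         # final hidden states as encodings
_, hv = video_rnn(video)
loss = correlation_loss(ha.squeeze(0), hv.squeeze(0))
```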
A Comprehensive Survey on Cross-modal Retrieval
In recent years, cross-modal retrieval has drawn much attention due to the
rapid growth of multimodal data. It takes one type of data as the query to
retrieve relevant data of another type. For example, a user can use a text to
retrieve relevant pictures or videos. Since the query and its retrieved results
can be of different modalities, how to measure the content similarity between
different modalities of data remains a challenge. Various methods have been
proposed to deal with such a problem. In this paper, we first review a number
of representative methods for cross-modal retrieval and classify them into two
main groups: 1) real-valued representation learning, and 2) binary
representation learning. Real-valued representation learning methods aim to
learn real-valued common representations for different modalities of data. To
speed up the cross-modal retrieval, a number of binary representation learning
methods are proposed to map different modalities of data into a common Hamming
space. Then, we introduce several multimodal datasets in the community, and
show the experimental results on two commonly used multimodal datasets. The
comparison reveals the characteristic of different kinds of cross-modal
retrieval methods, which is expected to benefit both practical applications and
future research. Finally, we discuss open problems and future research
directions.
Comment: 20 pages, 11 figures, 9 tables
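A toy sketch of retrieval in a common Hamming space, as described above: both modalities are mapped to binary codes and candidates are ranked by Hamming distance. The random projections stand in for learned hashing functions and are purely illustrative.

```python
# Cross-modal retrieval in a common Hamming space (illustrative).
import torch

def to_binary_codes(x, proj):
    return x @ proj > 0  # sign of a projection -> binary code bits

def hamming_rank(query_code, db_codes):
    dist = (query_code ^ db_codes).sum(dim=1)  # Hamming distance
    return dist.argsort()                      # nearest codes first

proj_txt = torch.randn(300, 64)  # text features -> 64-bit codes
proj_img = torch.randn(512, 64)  # image features -> 64-bit codes
db_imgs = to_binary_codes(torch.randn(1000, 512), proj_img)
query = to_binary_codes(torch.randn(1, 300), proj_txt)
print(hamming_rank(query, db_imgs)[:5])  # top-5 retrieved image indices
```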
Learning semantic sentence representations from visually grounded language without lexical knowledge
Current approaches to learning semantic representations of sentences often
use prior word-level knowledge. The current study aims to leverage visual
information in order to capture sentence level semantics without the need for
word embeddings. We use a multimodal sentence encoder trained on a corpus of
images with matching text captions to produce visually grounded sentence
embeddings. Deep Neural Networks are trained to map the two modalities to a
common embedding space such that for an image the corresponding caption can be
retrieved and vice versa. We show that our model achieves results comparable to
the current state-of-the-art on two popular image-caption retrieval benchmark
data sets: MSCOCO and Flickr8k. We evaluate the semantic content of the
resulting sentence embeddings using the data from the Semantic Textual
Similarity benchmark task and show that the multimodal embeddings correlate
well with human semantic similarity judgements. The system achieves
state-of-the-art results on several of these benchmarks, which shows that a
system trained solely on multimodal data, without assuming any word
representations, is able to capture sentence level semantics. Importantly, this
result shows that we do not need prior knowledge of lexical level semantics in
order to model sentence level semantics. These findings demonstrate the
importance of visual information in semantics.
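A minimal sketch of the kind of bidirectional ranking objective commonly used to align caption and image embeddings in a common space so that each can retrieve the other; the margin, cosine similarity, and placeholder encoders are assumptions rather than this paper's exact training setup.

```python
# Bidirectional hinge loss for image-caption retrieval (illustrative).
import torch
import torch.nn.functional as F

def bidirectional_hinge_loss(img, cap, margin=0.2):
    img, cap = F.normalize(img, dim=1), F.normalize(cap, dim=1)
    sim = img @ cap.t()            # cosine similarities, (B, B)
    pos = sim.diag().unsqueeze(1)  # matching image-caption pairs
    # caption retrieval (rows) and image retrieval (columns), both hinged
    loss_c = (margin + sim - pos).clamp(min=0)
    loss_i = (margin + sim - pos.t()).clamp(min=0)
    mask = 1 - torch.eye(sim.size(0))  # ignore the positive pairs
    return ((loss_c + loss_i) * mask).mean()

loss = bidirectional_hinge_loss(torch.randn(8, 256), torch.randn(8, 256))
```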
Beneath the Tip of the Iceberg: Current Challenges and New Directions in Sentiment Analysis Research
Sentiment analysis as a field has come a long way since it was first
introduced as a task nearly 20 years ago. It has widespread commercial
applications in various domains like marketing, risk management, market
research, and politics, to name a few. Given its saturation in specific
subtasks -- such as sentiment polarity classification -- and datasets, there is
an underlying perception that this field has reached its maturity. In this
article, we discuss this perception by pointing out the shortcomings and
under-explored, yet key aspects of this field that are necessary to attain true
sentiment understanding. We analyze the significant leaps responsible for its
current relevance. Further, we attempt to chart a possible course for this
field that covers many overlooked and unanswered questions.
Comment: Published in the IEEE Transactions on Affective Computing (TAFFC)