Large Margin Multi-modal Multi-task Feature Extraction for Image Classification
The features used in many image analysis-based applications are frequently of
very high dimension. Feature extraction offers several advantages in
high-dimensional cases, and many recent studies have used multi-task feature
extraction approaches, which often outperform single-task feature extraction
approaches. However, most of these methods are limited in that they only
consider data represented by a single type of feature, even though images are usually described by features from multiple modalities. We therefore propose a novel
large margin multi-modal multi-task feature extraction (LM3FE) framework for
handling multi-modal features for image classification. In particular, LM3FE
simultaneously learns the feature extraction matrix for each modality and the
modality combination coefficients. In this way, LM3FE not only handles
correlated and noisy features, but also utilizes the complementarity of
different modalities to further help reduce feature redundancy in each
modality. The large margin principle employed also helps to extract strongly
predictive features so that they are more suitable for prediction (e.g.,
classification). An alternating algorithm is developed for problem optimization
and each sub-problem can be efficiently solved. Experiments on two challenging
real-world image datasets demonstrate the effectiveness and superiority of the
proposed method
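To make the alternating scheme concrete, here is a minimal NumPy sketch under stated assumptions: a squared-hinge loss stands in for the paper's large-margin term, the updates are plain gradient steps rather than LM3FE's exact sub-problem solvers, and the simplex projection of the combination coefficients is hypothetical.

```python
import numpy as np

def lm3fe_sketch(Xs, Y, iters=50, lr=1e-2, lam=1e-3):
    """Hypothetical alternating optimization in the spirit of LM3FE.

    Xs : list of per-modality feature matrices, each (n, d_m).
    Y  : (n, c) matrix of +/-1 labels.
    """
    M = len(Xs)
    beta = np.full(M, 1.0 / M)                    # combination coefficients
    Ws = [1e-2 * np.random.randn(X.shape[1], Y.shape[1]) for X in Xs]

    def score_grad():
        # gradient of the squared-hinge proxy w.r.t. the combined scores
        scores = sum(b * X @ W for b, X, W in zip(beta, Xs, Ws))
        return -Y * np.maximum(1.0 - Y * scores, 0.0)

    for _ in range(iters):
        g = score_grad()
        # (a) fix beta, update each modality's extraction matrix
        for m in range(M):
            Ws[m] -= lr * (beta[m] * Xs[m].T @ g + lam * Ws[m])
        # (b) fix the matrices, update beta on the probability simplex
        g = score_grad()
        g_beta = np.array([np.sum((Xs[m] @ Ws[m]) * g) for m in range(M)])
        beta = np.maximum(beta - lr * g_beta, 0.0)
        beta /= beta.sum() + 1e-12
    return Ws, beta
```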
Multimodal Emotion Recognition Using Multimodal Deep Learning
To enhance the performance of affective models and reduce the cost of
acquiring physiological signals for real-world applications, we adopt a multimodal deep learning approach to construct affective models from multiple physiological signals. For the unimodal enhancement task, we show that the best recognition accuracy of 82.11% on the SEED dataset is achieved with shared representations generated by the Deep AutoEncoder (DAE) model. For multimodal facilitation tasks, we demonstrate that the Bimodal Deep AutoEncoder (BDAE) achieves mean accuracies of 91.01% and 83.25% on the SEED and DEAP datasets, respectively, substantially outperforming state-of-the-art approaches. For the cross-modal learning task, our experiments show that a mean accuracy of 66.34% is achieved on the SEED dataset when training on shared representations generated by the EEG-based DAE and testing on shared representations generated by the eye-based DAE, and vice versa.
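The bimodal autoencoder described here is straightforward to sketch. The following PyTorch stand-in assumes specific layer sizes (e.g. 310-dimensional EEG features and 33-dimensional eye-movement features) and a single fusion layer; the published BDAE may be deeper, and the shared code is subsequently fed to a classifier.

```python
import torch
import torch.nn as nn

class BDAE(nn.Module):
    """Minimal bimodal deep autoencoder: two modality encoders feed a
    shared representation that must reconstruct both inputs."""
    def __init__(self, d_eeg=310, d_eye=33, d_shared=64):
        super().__init__()
        self.enc_eeg = nn.Sequential(nn.Linear(d_eeg, 128), nn.ReLU())
        self.enc_eye = nn.Sequential(nn.Linear(d_eye, 128), nn.ReLU())
        self.fuse = nn.Linear(256, d_shared)       # shared representation
        self.dec_eeg = nn.Linear(d_shared, d_eeg)
        self.dec_eye = nn.Linear(d_shared, d_eye)

    def forward(self, x_eeg, x_eye):
        h = torch.cat([self.enc_eeg(x_eeg), self.enc_eye(x_eye)], dim=-1)
        z = torch.relu(self.fuse(h))
        return z, self.dec_eeg(z), self.dec_eye(z)

model = BDAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x_eeg, x_eye = torch.randn(32, 310), torch.randn(32, 33)
z, r_eeg, r_eye = model(x_eeg, x_eye)   # z later feeds an emotion classifier
loss = (nn.functional.mse_loss(r_eeg, x_eeg)
        + nn.functional.mse_loss(r_eye, x_eye))
opt.zero_grad(); loss.backward(); opt.step()
```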
Multi-Modal Graph Interaction for Multi-Graph Convolution Network in Urban Spatiotemporal Forecasting
Graph convolution network based approaches have been recently used to model
region-wise relationships in region-level prediction problems in urban
computing. Each relationship represents a kind of spatial dependency, like
region-wise distance or functional similarity. To incorporate multiple
relationships into spatial feature extraction, we define the problem as a
multi-modal machine learning problem on multi-graph convolution networks.
Leveraging the advantage of multi-modal machine learning, we propose to develop
modality interaction mechanisms for this problem, in order to reduce
generalization error by reinforcing the learning of multimodal coordinated
representations. In this work, we propose two interaction techniques for
handling features in lower layers and higher layers respectively. In lower
layers, we propose grouped GCN to combine the graph connectivity from different
modalities for more complete spatial feature extraction. In higher layers, we
adapt multi-linear relationship networks to GCN by exploring the dimension
transformation and freezing part of the covariance structure. The adapted
approach, called multi-linear relationship GCN, learns more generalized
features to overcome the train-test divergence induced by time shifting. We
evaluated our model on the ride-hailing demand forecasting problem using two real-world datasets. The proposed technique outperforms state-of-the-art baselines in terms of prediction accuracy, training efficiency, interpretability, and model robustness.
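A grouped graph convolution of the kind described for the lower layers can be sketched as follows; the per-graph weight matrices and the summation rule for combining modalities are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class GroupedGCNLayer(nn.Module):
    """One weight matrix per relationship graph; the per-graph outputs
    are summed so the layer mixes connectivity from all modalities."""
    def __init__(self, n_graphs, d_in, d_out):
        super().__init__()
        self.weights = nn.ParameterList(
            [nn.Parameter(0.01 * torch.randn(d_in, d_out))
             for _ in range(n_graphs)])

    def forward(self, adjs, x):
        # adjs: list of normalized (n_regions, n_regions) adjacencies
        # x:    (n_regions, d_in) region features
        return torch.relu(sum(a @ x @ w for a, w in zip(adjs, self.weights)))

# e.g. a distance graph and a functional-similarity graph over 50 regions
layer = GroupedGCNLayer(n_graphs=2, d_in=16, d_out=32)
adjs = [torch.softmax(torch.randn(50, 50), dim=1) for _ in range(2)]
out = layer(adjs, torch.randn(50, 16))
```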
Adaptive Regularization of Ill-Posed Problems: Application to Non-rigid Image Registration
We introduce an adaptive regularization approach. In contrast to conventional
Tikhonov regularization, which specifies a fixed regularization operator, we
estimate it simultaneously with the model parameters. From a Bayesian perspective, we estimate the prior distribution on the parameters, assuming that it is close to some
given model distribution. We constrain the prior distribution to be a
Gauss-Markov random field (GMRF), which allows us to solve for the prior
distribution analytically and provides a fast optimization algorithm. We apply
our approach to non-rigid image registration to estimate the spatial
transformation between two images. Our evaluation shows that the adaptive
regularization approach significantly outperforms standard variational methods.
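The paper solves for the GMRF prior analytically; the toy sketch below only illustrates the general pattern of re-estimating the regularizer alongside the parameters, on a linear least-squares problem with a per-coefficient reweighting rule that is purely an assumption.

```python
import numpy as np

def adaptive_tikhonov(A, b, iters=20, alpha=1.0):
    """Alternate between solving a weighted-Tikhonov problem for x and
    updating the regularization weights w from the current solution."""
    n = A.shape[1]
    w = np.ones(n)
    for _ in range(iters):
        x = np.linalg.solve(A.T @ A + alpha * np.diag(w), A.T @ b)
        w = 1.0 / (x ** 2 + 1e-3)   # strong coefficients get less smoothing
    return x

x = adaptive_tikhonov(np.random.randn(40, 10), np.random.randn(40))
```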
SCH-GAN: Semi-supervised Cross-modal Hashing by Generative Adversarial Network
Cross-modal hashing aims to map heterogeneous multimedia data into a common
Hamming space, which can realize fast and flexible retrieval across different
modalities. Supervised cross-modal hashing methods have achieved considerable
progress by incorporating semantic side information. However, they mainly have
two limitations: (1) They rely heavily on large-scale labeled cross-modal training data, which are labor-intensive and hard to obtain. (2) They ignore the rich information contained in the large amount of unlabeled data across different modalities, especially the margin examples that are easily retrieved incorrectly and that can help to model the correlations. To address these problems,
in this paper we propose a novel Semi-supervised Cross-Modal Hashing approach
by Generative Adversarial Network (SCH-GAN). We aim to take advantage of GAN's
ability for modeling data distributions to promote cross-modal hashing learning
in an adversarial way. The main contributions can be summarized as follows: (1)
We propose a novel generative adversarial network for cross-modal hashing. In
our proposed SCH-GAN, the generative model tries to select margin examples of one modality from unlabeled data given a query of another modality, while the discriminative model tries to distinguish the selected examples from true positive examples of the query. These two models play a minimax game so that
the generative model can promote the hashing performance of discriminative
model. (2) We propose a reinforcement learning based algorithm to drive the
training of proposed SCH-GAN. The generative model takes the correlation score
predicted by discriminative model as a reward, and tries to select the examples
close to the margin to promote discriminative model by maximizing the margin
between positive and negative data. Experiments on 3 widely-used datasets
verify the effectiveness of our proposed approach.
Comment: 12 pages, submitted to IEEE Transactions on Cybernetics
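A compressed sketch of the described minimax game and its REINFORCE-style update is given below; the bilinear scoring heads, code dimensions, and margin loss are hypothetical stand-ins for the paper's hashing networks.

```python
import torch
import torch.nn as nn

G = nn.Bilinear(64, 64, 1)   # generator's query-candidate scorer (assumed)
D = nn.Bilinear(64, 64, 1)   # discriminator's correlation scorer (assumed)
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)

query = torch.randn(1, 64)          # e.g. a text query's representation
unlabeled = torch.randn(100, 64)    # unlabeled image candidates
positives = torch.randn(5, 64)      # true positives for the query

# generator samples hard ("margin") examples from the unlabeled pool
probs = torch.softmax(G(query.expand(100, -1), unlabeled).squeeze(-1), dim=0)
idx = torch.multinomial(probs, 5)
selected = unlabeled[idx]

# discriminator: rank true positives above the generator's selections
d_loss = torch.relu(1.0 - D(query.expand(5, -1), positives)
                    + D(query.expand(5, -1), selected)).mean()
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# generator: REINFORCE, with the discriminator's score as the reward
reward = D(query.expand(5, -1), selected).squeeze(-1).detach()
g_loss = -(torch.log(probs[idx] + 1e-12) * reward).mean()
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```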
A Distance Map Regularized CNN for Cardiac Cine MR Image Segmentation
Cardiac image segmentation is a critical process for generating personalized
models of the heart and for quantifying cardiac performance parameters. Several
convolutional neural network (CNN) architectures have been proposed to segment
the heart chambers from cardiac cine MR images. Here we propose a multi-task
learning (MTL)-based regularization framework for cardiac MR image
segmentation. The network is trained to perform the main task of semantic
segmentation, along with a simultaneous, auxiliary task of pixel-wise distance
map regression. The proposed distance map regularizer is a decoder network
added to the bottleneck layer of an existing CNN architecture, facilitating the
network to learn robust global features. The regularizer block is removed after
training, so that the original number of network parameters does not change. We
show that the proposed regularization method improves both binary and
multi-class segmentation performance over the corresponding state-of-the-art
CNN architectures on two publicly available cardiac cine MRI datasets,
obtaining average Dice coefficients of 0.84 ± 0.03 and 0.91 ± 0.04,
respectively. Furthermore, we also demonstrate improved generalization
performance of the distance map regularized network on cross-dataset
segmentation, showing as much as a 42% improvement in the myocardium Dice coefficient, from 0.56 ± 0.28 to 0.80 ± 0.14.
Comment: 11 pages manuscript, 5 pages supplementary material
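A toy version of the multi-task loss is easy to write down. The sketch below puts both heads on one shared feature map, whereas the paper attaches the distance-map decoder at the bottleneck of a full encoder-decoder; shapes, the class count, and the 0.1 weighting are assumptions.

```python
import torch
import torch.nn as nn

enc = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
seg_head = nn.Conv2d(32, 4, 1)    # main task: 4-class segmentation
dist_head = nn.Conv2d(32, 1, 1)   # auxiliary: pixel-wise distance regression

img = torch.randn(8, 1, 128, 128)
seg_gt = torch.randint(0, 4, (8, 128, 128))
dist_gt = torch.randn(8, 1, 128, 128)   # precomputed distance maps

feats = enc(img)
loss = (nn.functional.cross_entropy(seg_head(feats), seg_gt)
        + 0.1 * nn.functional.mse_loss(dist_head(feats), dist_gt))
loss.backward()   # at test time only enc and seg_head are kept
```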
Exploring Auxiliary Context: Discrete Semantic Transfer Hashing for Scalable Image Retrieval
Unsupervised hashing is well suited to scalable content-based image retrieval (SCBIR) owing to its semantic label independence and its memory and search efficiency. However, the learned hash codes carry
limited discriminative semantics due to the intrinsic limitation of image
representation. To address the problem, in this paper, we propose a novel
hashing approach, dubbed as \emph{Discrete Semantic Transfer Hashing} (DSTH).
The key idea is to \emph{directly} augment the semantics of discrete image hash
codes by exploring auxiliary contextual modalities. To this end, a unified
hashing framework is formulated to simultaneously preserve visual similarities
of images and perform semantic transfer from contextual modalities. Further, to
guarantee direct semantic transfer and avoid information loss, we explicitly
impose the discrete constraint, the bit-uncorrelation constraint, and the bit-balance constraint on the hash codes. A novel and effective discrete optimization method based on the augmented Lagrangian multiplier is developed to iteratively solve the optimization problem. The whole learning process has linear computational
complexity and desirable scalability. Experiments on three benchmark datasets
demonstrate the superiority of DSTH compared with several state-of-the-art
approaches.
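For intuition, a single illustrative discrete update is sketched below; it is not the paper's augmented-Lagrangian procedure, and the penalized combination and diagnostics are assumptions.

```python
import numpy as np

def discrete_step_sketch(Z, S, rho=1.0):
    """Choose hash bits B in {-1, +1} closest to a weighted combination of
    a visual embedding Z and a contextual semantic target S; sign() solves
    this binary subproblem in closed form."""
    B = np.sign(Z + rho * S)
    B[B == 0] = 1                            # break ties toward +1
    balance = np.abs(B.sum(axis=0)).max()    # 0 when each bit is balanced
    gram = B.T @ B / len(B)                  # ~identity when uncorrelated
    return B, balance, gram
```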
Deep Collective Matrix Factorization for Augmented Multi-View Learning
Learning by integrating multiple heterogeneous data sources is a common
requirement in many tasks. Collective Matrix Factorization (CMF) is a technique
to learn shared latent representations from arbitrary collections of matrices.
It can be used to simultaneously complete one or more matrices, for predicting
the unknown entries. Classical CMF methods assume linearity in the interaction of latent factors, an assumption that can be restrictive and fails to capture complex
non-linear interactions. In this paper, we develop the first deep-learning
based method, called dCMF, for unsupervised learning of multiple shared
representations that can model such non-linear interactions from an arbitrary
collection of matrices. We address optimization challenges that arise due to
dependencies between shared representations through Multi-Task Bayesian
Optimization and design an acquisition function adapted for collective learning
of hyperparameters. Our experiments show that dCMF significantly outperforms
previous CMF algorithms in integrating heterogeneous data for predictive
modeling. Further, on two tasks - recommendation and prediction of gene-disease
association - dCMF outperforms state-of-the-art matrix completion algorithms
that can utilize auxiliary sources of information.
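The core idea of replacing linear factors with entity encoders can be sketched in a few lines; the encoder sizes and the use of a single matrix (rather than an arbitrary collection with shared encoders) are simplifications.

```python
import torch
import torch.nn as nn

# One matrix X relating row entities to column entities; each entity set
# gets a non-linear encoder, and X is reconstructed from the product of
# the two latent matrices.
X = torch.randn(100, 80)
enc_rows = nn.Sequential(nn.Linear(80, 32), nn.Tanh())    # encodes rows of X
enc_cols = nn.Sequential(nn.Linear(100, 32), nn.Tanh())   # encodes columns
opt = torch.optim.Adam([*enc_rows.parameters(),
                        *enc_cols.parameters()], lr=1e-3)

for _ in range(200):
    U = enc_rows(X)        # (100, 32) latent row representations
    V = enc_cols(X.t())    # (80, 32) latent column representations
    loss = nn.functional.mse_loss(U @ V.t(), X)
    opt.zero_grad(); loss.backward(); opt.step()
```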
JECL: Joint Embedding and Cluster Learning for Image-Text Pairs
We propose JECL, a method for clustering image-caption pairs by training
parallel encoders with regularized clustering and alignment objectives,
simultaneously learning both representations and cluster assignments. These
image-caption pairs arise frequently in high-value applications where
structured training data is expensive to produce, but free-text descriptions
are common. JECL trains by minimizing the Kullback-Leibler divergence from the distributions of the images and text to a combined joint target distribution, and by optimizing the Jensen-Shannon divergence between the soft cluster assignments of the images and text. Regularizers are also applied to
JECL to prevent trivial solutions. Experiments show that JECL outperforms both
single-view and multi-view methods on large benchmark image-caption datasets,
and is remarkably robust to missing captions and varying data sizes.
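The two divergence objectives can be sketched as follows; the Student-t soft assignment (as in DEC) and the squared-sharpening rule for the joint target are assumptions about the exact forms used.

```python
import torch
import torch.nn.functional as F

def jecl_losses(z_img, z_txt, centroids, alpha=1.0):
    """KL from each view's soft assignments to a shared sharpened target,
    plus a Jensen-Shannon term aligning the two views' assignments."""
    def soft_assign(z):
        q = (1 + torch.cdist(z, centroids) ** 2 / alpha) ** (-(alpha + 1) / 2)
        return q / q.sum(dim=1, keepdim=True)

    q_img, q_txt = soft_assign(z_img), soft_assign(z_txt)
    m = 0.5 * (q_img + q_txt)            # combined joint distribution
    p = m ** 2 / m.sum(dim=0)            # sharpened joint target
    p = (p / p.sum(dim=1, keepdim=True)).detach()

    kl = (F.kl_div(q_img.log(), p, reduction='batchmean')
          + F.kl_div(q_txt.log(), p, reduction='batchmean'))
    js = 0.5 * (F.kl_div(m.log(), q_img, reduction='batchmean')
                + F.kl_div(m.log(), q_txt, reduction='batchmean'))
    return kl + js

loss = jecl_losses(torch.randn(16, 10), torch.randn(16, 10),
                   centroids=torch.randn(5, 10))
```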
Latent Variable Algorithms for Multimodal Learning and Sensor Fusion
Multimodal learning has lacked principled ways of combining information
from different modalities and learning a low-dimensional manifold of meaningful
representations. We study multimodal learning and sensor fusion from a latent
variable perspective. We first present a regularized recurrent attention filter
for sensor fusion. This algorithm can dynamically combine information from
different types of sensors in a sequential decision-making task. Each sensor is paired with a modular neural network to maximize the utility of its own
information. A gating modular neural network dynamically generates a set of
mixing weights for outputs from sensor networks by balancing utility of all
sensors' information. We design a co-learning mechanism to encourage
co-adaptation and independent learning of each sensor at the same time, and propose a regularization-based co-learning method. In the second part, we focus
on recovering the manifold of latent representation. We propose a co-learning
approach using a probabilistic graphical model that imposes a structural prior on the generative model, the multimodal variational RNN (MVRNN), and derive a variational lower bound for its objective function. In the third part, we
extend the siamese structure to sensor fusion for robust acoustic event
detection. We perform experiments to investigate the extracted latent representations; further work will be completed in the following months. Our experiments
show that the recurrent attention filter can dynamically combine different
sensor inputs according to the information carried in the inputs. We believe MVRNN can identify latent representations that are useful for many downstream
tasks such as speech synthesis, activity recognition, and control and planning.
Both algorithms are general frameworks which can be applied to other tasks
where different types of sensors are jointly used for decision making.
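The gating idea behind the recurrent attention filter can be sketched without the recurrence; the softmax gate, layer sizes, and feed-forward sensor modules below are assumptions standing in for the paper's recurrent architecture.

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """One module per sensor plus a gating network that emits mixing
    weights; outputs are combined as a weighted sum."""
    def __init__(self, sensor_dims, d_out=32):
        super().__init__()
        self.sensor_nets = nn.ModuleList(
            [nn.Sequential(nn.Linear(d, d_out), nn.ReLU())
             for d in sensor_dims])
        self.gate = nn.Linear(sum(sensor_dims), len(sensor_dims))

    def forward(self, xs):   # xs: list of per-sensor feature tensors
        w = torch.softmax(self.gate(torch.cat(xs, dim=-1)), dim=-1)
        hs = [net(x) for net, x in zip(self.sensor_nets, xs)]
        return sum(w[..., i:i + 1] * h for i, h in enumerate(hs))

fusion = AttentionFusion(sensor_dims=[12, 8])
fused = fusion([torch.randn(4, 12), torch.randn(4, 8)])
```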