Deep Multi-View Learning using Neuron-Wise Correlation-Maximizing Regularizers
Many machine learning problems concern discovering or associating common
patterns in data of multiple views or modalities. Multi-view learning is one of
the methods to achieve such goals. Recent methods propose deep multi-view networks
via adaptation of generic Deep Neural Networks (DNNs), which concatenate
features of individual views at intermediate network layers (i.e., fusion
layers). In this work, we study the problem of multi-view learning in such
end-to-end networks. We take a regularization approach via multi-view learning
criteria, and propose a novel, effective, and efficient neuron-wise
correlation-maximizing regularizer. We implement our proposed regularizers
collectively as a correlation-regularized network layer (CorrReg). CorrReg can
be applied to either fully-connected or convolutional fusion layers, simply by
replacing them with their CorrReg counterparts. By partitioning neurons of a
hidden layer in generic DNNs into multiple subsets, we also consider a
multi-view feature learning perspective of generic DNNs. Such a perspective
enables us to study deep multi-view learning in the context of regularized
network training, for which we present control experiments of benchmark image
classification to show the efficacy of our proposed CorrReg. To investigate how
CorrReg is useful for practical multi-view learning problems, we conduct
experiments of RGB-D object/scene recognition and multi-view based 3D object
recognition, using networks with fusion layers that concatenate intermediate
features of individual modalities or views for subsequent classification.
Applying CorrReg to fusion layers of these networks consistently improves
classification performance. In particular, we achieve the new state of the art
on the benchmark RGB-D object and RGB-D scene datasets. We make the
implementation of CorrReg publicly available.
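A minimal sketch of the neuron-wise correlation idea, assuming two views whose
fusion-layer activations are paired neuron by neuron and a PyTorch training
loop; the function name corrreg_penalty and the exact normalization are
illustrative, not the authors' released implementation:

import torch

def corrreg_penalty(h1, h2, eps=1e-8):
    # h1, h2: (batch, d) fusion-layer activations of the two views, with
    # neuron i of h1 paired with neuron i of h2. Minimizing the returned
    # value maximizes per-neuron correlation across the batch.
    h1c = h1 - h1.mean(dim=0, keepdim=True)
    h2c = h2 - h2.mean(dim=0, keepdim=True)
    cov = (h1c * h2c).mean(dim=0)
    std = (h1c.pow(2).mean(0) * h2c.pow(2).mean(0)).sqrt() + eps
    return -(cov / std).mean()

# usage in a training step, with lambda_corr a tuning knob:
# loss = task_loss + lambda_corr * corrreg_penalty(feat_rgb, feat_depth)

In this reading, CorrReg amounts to adding such a penalty at the layer that
concatenates the two views' features.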
Equivariant Multi-View Networks
Several popular approaches to 3D vision tasks process multiple views of the
input independently with deep neural networks pre-trained on natural images,
achieving view permutation invariance through a single round of pooling over
all views. We argue that this operation discards important information and
leads to subpar global descriptors. In this paper, we propose a group
convolutional approach to multiple view aggregation where convolutions are
performed over a discrete subgroup of the rotation group, thus enabling joint
reasoning over all views in an equivariant (instead of invariant) fashion, up
to the very last layer. We further develop this idea to operate on smaller
discrete homogeneous spaces of the rotation group, where a polar view
representation is used to maintain equivariance with only a fraction of the
number of input views. We set the new state of the art in several large scale
3D shape retrieval tasks, and show additional applications to panoramic scene
classification.
Comment: Camera-ready. Accepted to ICCV'19 as oral presentation.
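As a rough illustration of group convolution over views, here is a sketch over
the cyclic group C_N of view indices (the paper works with discrete subgroups
of the 3D rotation group and homogeneous spaces of it, which this simplifies);
cyclic_group_conv is a hypothetical name:

import torch

def cyclic_group_conv(x, w):
    # x: (batch, N, d) per-view features; w: (N, d, d_out) filter bank.
    # y[:, g] = sum_h x[:, h] @ w[(h - g) % N], so cyclically shifting the
    # input views cyclically shifts the output: equivariance is preserved
    # instead of being destroyed by a single round of pooling.
    N = x.shape[1]
    return torch.stack(
        [sum(x[:, h] @ w[(h - g) % N] for h in range(N)) for g in range(N)],
        dim=1)

Stacking several such layers and pooling over the group only at the end matches
the "equivariant up to the very last layer" design the abstract describes.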
Deep Multimodal Feature Analysis for Action Recognition in RGB+D Videos
Single modality action recognition on RGB or depth sequences has been
extensively explored recently. It is generally accepted that each of these two
modalities has different strengths and limitations for the task of action
recognition. Therefore, analysis of the RGB+D videos can help us to better
study the complementary properties of these two types of modalities and achieve
higher levels of performance. In this paper, we propose a new deep
autoencoder-based shared-specific feature factorization network to separate input
multimodal signals into a hierarchy of components. Further, based on the
structure of the features, a structured sparsity learning machine is proposed
which utilizes mixed norms to apply regularization within components and group
selection between them for better classification performance. Our experimental
results show the effectiveness of our cross-modality feature analysis framework
by achieving state-of-the-art accuracy for action classification on five
challenging benchmark datasets.
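To make the structured sparsity concrete, a minimal sketch of an L1/L2
mixed-norm (group-lasso-style) penalty, assuming PyTorch and a hypothetical
partition of classifier rows into shared/specific component groups; the paper's
factorization network and exact mixed norms may differ:

import torch

def mixed_norm_penalty(weight, groups):
    # weight: (d_in, d_out) classifier weights; groups: list of row-index
    # tensors, one per shared/specific component. The inner L2 norm
    # regularizes within a component; the outer sum (an L1 over groups)
    # drives whole components to zero, i.e., group selection.
    return sum(weight[g].norm(p=2) for g in groups)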
A Method Based on Convex Cone Model for Image-Set Classification with CNN Features
In this paper, we propose a method for image-set classification based on
convex cone models, focusing on the effectiveness of convolutional neural
network (CNN) features as inputs. CNN features have non-negative values when
using the rectified linear unit as an activation function. This naturally leads
us to model a set of CNN features by a convex cone and measure the geometric
similarity of convex cones for classification. To establish this framework, we
sequentially define multiple angles between two convex cones by repeating the
alternating least squares method and then define the geometric similarity
between the cones using the obtained angles. Moreover, to enhance our method,
we introduce a discriminant space that maximizes the between-class variance
(gaps) and minimizes the within-class variance of the convex cones projected
onto the discriminant space, similar to Fisher discriminant analysis. Finally,
classification is based on the similarity between projected convex cones. The
effectiveness of the proposed method was demonstrated experimentally using a
private, multi-view hand shape dataset and two public databases.
Comment: Accepted at the International Joint Conference on Neural Networks,
IJCNN, 201
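A sketch of the alternating least squares step for the first (smallest) angle
between two convex cones, assuming cones spanned by nonnegative combinations of
the columns of U and V; SciPy's nnls is one plausible realization of each
nonnegative projection, and the paper's multiple angles would repeat this with
a deflation step not shown here:

import numpy as np
from scipy.optimize import nnls

def first_cone_angle_cos(U, V, iters=30):
    # Alternate: project v onto cone(U), renormalize, project back onto
    # cone(V). Converges to the closest pair of unit vectors, whose dot
    # product is the cosine of the first angle between the cones.
    v = V.mean(axis=1)
    v /= np.linalg.norm(v)
    for _ in range(iters):
        a, _ = nnls(U, v)
        u = U @ a
        u /= np.linalg.norm(u) + 1e-12
        b, _ = nnls(V, u)
        v = V @ b
        v /= np.linalg.norm(v) + 1e-12
    return float(u @ v)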
A Comprehensive Survey on Cross-modal Retrieval
In recent years, cross-modal retrieval has drawn much attention due to the
rapid growth of multimodal data. It takes one type of data as the query to
retrieve relevant data of another type. For example, a user can use a text to
retrieve relevant pictures or videos. Since the query and its retrieved results
can be of different modalities, how to measure the content similarity between
different modalities of data remains a challenge. Various methods have been
proposed to deal with such a problem. In this paper, we first review a number
of representative methods for cross-modal retrieval and classify them into two
main groups: 1) real-valued representation learning, and 2) binary
representation learning. Real-valued representation learning methods aim to
learn real-valued common representations for different modalities of data. To
speed up the cross-modal retrieval, a number of binary representation learning
methods are proposed to map different modalities of data into a common Hamming
space. Then, we introduce several multimodal datasets in the community, and
show the experimental results on two commonly used multimodal datasets. The
comparison reveals the characteristic of different kinds of cross-modal
retrieval methods, which is expected to benefit both practical applications and
future research. Finally, we discuss open problems and future research
directions.
Comment: 20 pages, 11 figures, 9 tables.
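The speed advantage of binary representation learning is easy to see in code; a
minimal sketch, assuming both modalities have already been hashed into a shared
Hamming space (hamming_retrieve is an illustrative name, not a method from the
survey):

import numpy as np

def hamming_retrieve(query_code, db_codes, k=5):
    # query_code: (nbits,) binary code from one modality's hashing model;
    # db_codes: (n, nbits) codes of the other modality. Because both live
    # in one Hamming space, ranking reduces to bit counting.
    dists = np.count_nonzero(db_codes != query_code, axis=1)
    return np.argsort(dists)[:k]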
An Overview of Cross-media Retrieval: Concepts, Methodologies, Benchmarks and Challenges
Multimedia retrieval plays an indispensable role in big data utilization.
Past efforts mainly focused on single-media retrieval. However, user
requirements are highly flexible, such as retrieving relevant audio clips with
an image query. Consequently, challenges stemming from the "media
gap", which means that representations of different media types are
inconsistent, have attracted increasing attention. Cross-media retrieval is
designed for the scenarios where the queries and retrieval results are of
different media types. As a relatively new research topic, its concepts,
methodologies and benchmarks are still not clear in the literature. To address
these issues, we review more than 100 references, give an overview including
the concepts, methodologies, major challenges and open issues, as well as build
up the benchmarks including datasets and experimental results. Researchers can
directly adopt the benchmarks to promptly evaluate their proposed methods. This
will help them focus on algorithm design rather than on the time-consuming
reproduction of compared methods and results. Note that we have constructed a new
dataset XMedia, which is the first publicly available dataset with up to five
media types (text, image, video, audio and 3D model). We believe this overview
will attract more researchers to focus on cross-media retrieval and be helpful
to them.
Comment: 14 pages, accepted by IEEE Transactions on Circuits and Systems for
Video Technology.
Jointly Deep Multi-View Learning for Clustering Analysis
In this paper, we propose a novel Joint framework for Deep Multi-view
Clustering (DMJC), where multiple deep embedded features, multi-view fusion
mechanism and clustering assignments can be learned simultaneously. Our key
idea is that the joint learning strategy can sufficiently exploit
clustering-friendly multi-view features and useful multi-view complementary
information to improve the clustering performance. How to realize the
multi-view fusion in such a joint framework is the primary challenge. To do so,
we design two ingenious variants of deep multi-view joint clustering models
under the proposed framework, where multi-view fusion is implemented by two
different schemes. The first model, called DMJC-S, performs multi-view fusion
in an implicit way via a novel multi-view soft assignment distribution. The
second model, termed DMJC-T, defines a novel multi-view auxiliary target
distribution to conduct the multi-view fusion explicitly. Both DMJC-S and
DMJC-T are optimized under a KL-divergence-like clustering objective.
Experiments on six challenging image datasets demonstrate the superiority of
both DMJC-S and DMJC-T over single/multi-view baselines and the
state-of-the-art multi-view clustering methods, which proves the effectiveness
of the proposed DMJC framework. To the best of our knowledge, this is the first
work to model multi-view clustering in a deep joint framework, and we hope it
will provide meaningful insights for unsupervised multi-view learning.
Comment: 10 pages, 4 figures.
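For intuition, a sketch of the single-view building blocks such KL-based deep
clustering typically rests on (a DEC-style soft assignment and auxiliary
target); DMJC's two fusion schemes extend pieces like these to multiple views,
and the exact forms below are assumptions, not the paper's equations:

import numpy as np

def soft_assign(z, mu, alpha=1.0):
    # Student's t similarity between embedded points z (n, d) and cluster
    # centers mu (k, d), normalized into soft assignments q.
    d2 = ((z[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
    q = (1.0 + d2 / alpha) ** (-(alpha + 1.0) / 2.0)
    return q / q.sum(1, keepdims=True)

def target_distribution(q):
    # Sharpened auxiliary target p; training minimizes KL(p || q).
    w = q ** 2 / q.sum(0)
    return w / w.sum(1, keepdims=True)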
An Efficient Approach for Geo-Multimedia Cross-Modal Retrieval
Due to the rapid development of mobile Internet techniques and cloud computing,
and the popularity of online social networking and location-based services, a
massive amount of multimedia data with geographical information is generated and
uploaded to the Internet. In this paper, we propose a novel type of cross-modal
multimedia retrieval, called geo-multimedia cross-modal retrieval, which aims to
retrieve a set of geo-multimedia objects based on geographical distance
proximity and semantic similarity between different modalities. Previous
studies for cross-modal retrieval and spatial keyword search cannot address
this problem effectively because they do not consider multimedia data with
geo-tags and do not focus on this type of query. In order to address this
problem efficiently, we present the definition of the kNN geo-multimedia
cross-modal query for the first time and introduce relevant concepts such as
cross-modal semantic representation space. To bridge the semantic gap between
different modalities, we propose a method named cross-modal semantic matching
which contains two important components, CorrProj and LogsTran, and aims
to construct a common semantic representation space for cross-modal semantic
similarity measurement. Besides, we design a framework based on deep learning
techniques to implement the construction of this common semantic representation
space. In addition, a novel hybrid indexing structure named GMR-Tree, combining
geo-multimedia data and the R-Tree, is presented, and an efficient kNN search
algorithm called GMCMS is designed. Comprehensive experimental evaluation on
real and synthetic datasets clearly demonstrates that our solution outperforms
state-of-the-art methods.
Comment: 27 pages.
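A sketch of what the combined relevance for such a query could look like,
assuming each object carries a location and an embedding in the learned common
semantic space; the blending weight lam and the exact score form are
assumptions, and the GMR-Tree would prune candidates before any exact scoring:

import numpy as np

def geo_semantic_score(q_loc, q_emb, obj_locs, obj_embs, lam=0.5):
    # Normalized geographic proximity blended with cosine similarity in
    # the common semantic space; the top-k scores answer the kNN query.
    geo = np.linalg.norm(obj_locs - q_loc, axis=1)
    geo = 1.0 - geo / (geo.max() + 1e-12)
    sem = obj_embs @ q_emb / (
        np.linalg.norm(obj_embs, axis=1) * np.linalg.norm(q_emb) + 1e-12)
    return lam * geo + (1.0 - lam) * sem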
Deep Canonically Correlated LSTMs
We examine Deep Canonically Correlated LSTMs as a way to learn nonlinear
transformations of variable-length sequences and embed them into a correlated,
fixed dimensional space. We use LSTMs to transform multi-view time-series data
non-linearly while learning temporal relationships within the data. We then
perform correlation analysis on the outputs of these neural networks to find a
correlated subspace through which we get our final representation via
projection. This work follows from previous work on Deep Canonical Correlation
Analysis (DCCA), in which deep feed-forward neural networks were used to learn
nonlinear transformations of data while maximizing correlation.
Comment: 8 pages, 3 figures, accepted as the undergraduate honors thesis for
Neil Mallinar by The Johns Hopkins University.
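The correlation objective itself is standard DCCA machinery; a minimal NumPy
sketch of the total canonical correlation between two views' final LSTM states,
with a small ridge term for stability (the projection to the final
representation is omitted):

import numpy as np

def cca_correlation(H1, H2, reg=1e-4):
    # H1, H2: (n, d) paired sequence embeddings. Whiten each view, then
    # sum the singular values of the whitened cross-covariance;
    # DCCA-style training maximizes this (minimizes its negative).
    H1 = H1 - H1.mean(0)
    H2 = H2 - H2.mean(0)
    n = H1.shape[0]
    S11 = H1.T @ H1 / (n - 1) + reg * np.eye(H1.shape[1])
    S22 = H2.T @ H2 / (n - 1) + reg * np.eye(H2.shape[1])
    S12 = H1.T @ H2 / (n - 1)

    def inv_sqrt(S):
        w, V = np.linalg.eigh(S)
        return V @ np.diag(w ** -0.5) @ V.T

    T = inv_sqrt(S11) @ S12 @ inv_sqrt(S22)
    return np.linalg.svd(T, compute_uv=False).sum()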
Tracking in Aerial Hyperspectral Videos using Deep Kernelized Correlation Filters
Hyperspectral imaging holds enormous potential to improve the
state-of-the-art in aerial vehicle tracking with low spatial and temporal
resolutions. Recently, adaptive multi-modal hyperspectral sensors have
attracted growing interest due to their ability to record extended data quickly
from aerial platforms. In this study, we apply popular concepts from
traditional object tracking, namely (1) Kernelized Correlation Filters (KCF)
and (2) Deep Convolutional Neural Network (CNN) features, to aerial tracking in
the hyperspectral domain. We propose the Deep Hyperspectral Kernelized Correlation
Filter based tracker (DeepHKCF) to efficiently track aerial vehicles using an
adaptive multi-modal hyperspectral sensor. We address low temporal resolution
by designing a single KCF-in-multiple Regions-of-Interest (ROIs) approach to
cover a reasonably large area. To speed up the extraction of deep convolutional
features from multiple ROIs, we design an effective ROI mapping
strategy. The proposed tracker also provides the flexibility to couple with
more advanced correlation filter trackers. The DeepHKCF tracker performs
exceptionally well with deep features set up in a synthetic hyperspectral video
generated by the Digital Imaging and Remote Sensing Image Generation (DIRSIG)
software. Additionally, we generate a large, synthetic, single-channel dataset
using DIRSIG to perform vehicle classification in the Wide Area Motion Imagery
(WAMI) platform. In this way, the high fidelity of the DIRSIG software is
demonstrated, and a large-scale aerial vehicle classification dataset is
released to support studies on vehicle detection and tracking in the WAMI
platform.
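For reference, the correlation filter at the heart of such trackers; a sketch
of the linear (MOSSE-style) special case on a single feature channel, while KCF
proper adds a kernel trick and DeepHKCF runs this over deep-feature channels in
multiple ROIs:

import numpy as np

def train_filter(x, y, lam=1e-2):
    # x: (H, W) feature patch; y: (H, W) desired Gaussian response peaked
    # on the target. Ridge regression over all cyclic shifts of x has the
    # closed form below in the Fourier domain, at O(HW log HW) cost.
    X, Y = np.fft.fft2(x), np.fft.fft2(y)
    return Y * np.conj(X) / (X * np.conj(X) + lam)

def detect(h, z):
    # Response map over a new patch z; the argmax gives the translation.
    return np.real(np.fft.ifft2(h * np.fft.fft2(z)))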