Unsupervised Generative Adversarial Cross-modal Hashing
Cross-modal hashing aims to map heterogeneous multimedia data into a common
Hamming space, which can realize fast and flexible retrieval across different
modalities. Unsupervised cross-modal hashing is more flexible and applicable
than supervised methods, since no labor-intensive labeling is involved. However,
existing unsupervised methods learn hashing functions by preserving inter- and
intra-modality correlations, while ignoring the underlying manifold structure
across modalities, which is highly valuable for capturing meaningful nearest
neighbors of different modalities in cross-modal retrieval. To address the
above problem, in this paper we propose an Unsupervised Generative Adversarial
Cross-modal Hashing approach (UGACH), which makes full use of GAN's ability for
unsupervised representation learning to exploit the underlying manifold
structure of cross-modal data. The main contributions can be summarized as
follows: (1) We propose a generative adversarial network to model cross-modal
hashing in an unsupervised fashion. In the proposed UGACH, given data of one
modality, the generative model tries to fit the distribution over the manifold
structure, and selects informative data of another modality to challenge the
discriminative model. The discriminative model learns to distinguish the
generated data from the true positive data sampled from the correlation graph
to achieve better retrieval accuracy. These two models are trained in an
adversarial way to improve each other and promote hashing function learning.
(2) We propose a correlation graph based approach to capture the underlying
manifold structure across different modalities, so that data of different
modalities but within the same manifold have smaller Hamming distances,
promoting retrieval accuracy. Extensive experiments against 6 state-of-the-art
methods verify the effectiveness of our proposed approach.

Comment: 8 pages, accepted by the 32nd AAAI Conference on Artificial
Intelligence (AAAI), 2018
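The retrieval step described above depends on Hamming distance between binary hash codes in the shared space. As a minimal sketch of that idea (the codes, names, and code length here are invented for illustration, not taken from the paper):

```python
def hamming(a: int, b: int) -> int:
    """Hamming distance between two binary hash codes stored as ints."""
    return bin(a ^ b).count("1")

# Hypothetical 8-bit codes: an image query and two text candidates
# mapped into the same Hamming space by the learned hashing functions.
query = 0b10110010
candidates = {"text_A": 0b10110110, "text_B": 0b01001101}

# Rank candidates by Hamming distance; smaller distance = closer match,
# so data within the same manifold should come first.
ranked = sorted(candidates, key=lambda k: hamming(query, candidates[k]))
```

Here `ranked` places `text_A` first, since it differs from the query in only one bit.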
Kernel Multivariate Analysis Framework for Supervised Subspace Learning: A Tutorial on Linear and Kernel Multivariate Methods
Feature extraction and dimensionality reduction are important tasks in many
fields of science dealing with signal processing and analysis. The relevance of
these techniques is increasing as current sensory devices are developed with
ever higher resolution, and problems involving multimodal data sources become
more common. A plethora of feature extraction methods are available in the
literature collectively grouped under the field of Multivariate Analysis (MVA).
This paper provides a uniform treatment of several methods: Principal Component
Analysis (PCA), Partial Least Squares (PLS), Canonical Correlation Analysis
(CCA) and Orthonormalized PLS (OPLS), as well as their non-linear extensions
derived by means of the theory of reproducing kernel Hilbert spaces. We also
review their connections to other methods for classification and statistical
dependence estimation, and introduce some recent developments to deal with the
extreme cases of large-scale and low-sized problems. To illustrate the wide
applicability of these methods in both classification and regression problems,
we analyze their performance in a benchmark of publicly available data sets,
and pay special attention to specific real applications involving audio
processing for music genre prediction and hyperspectral satellite images for
Earth and climate monitoring.
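As a small, self-contained illustration of the linear MVA setting the tutorial covers, the sketch below computes PCA via the singular value decomposition on synthetic data (my own example, not code from the paper; `n_components` and the data are arbitrary):

```python
import numpy as np

def pca(X, n_components):
    """Project data onto its top principal components (linear MVA)."""
    Xc = X - X.mean(axis=0)          # center each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T  # scores in the reduced subspace

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))        # 100 samples, 5 features
Z = pca(X, 2)                        # reduced to 2 dimensions
```

By construction the first projected dimension captures at least as much variance as the second, which is the defining property of PCA among the MVA methods surveyed.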
Fine-grained Image Classification via Combining Vision and Language
Fine-grained image classification is a challenging task due to the large
intra-class variance and small inter-class variance, aiming at recognizing
hundreds of sub-categories belonging to the same basic-level category. Most
existing fine-grained image classification methods generally learn part
detection models to obtain the semantic parts for better classification
accuracy. Despite achieving promising results, these methods mainly have two
limitations: (1) not all the parts obtained through the part detection
models are beneficial and indispensable for classification, and (2)
fine-grained image classification requires more detailed visual descriptions
that cannot be provided by part locations or attribute annotations. To
address these two limitations, this paper proposes a two-stream model
combining vision and language (CVL) for learning latent semantic
representations. The vision stream learns deep representations from the
original visual information via deep convolutional neural network. The language
stream utilizes the natural language descriptions which could point out the
discriminative parts or characteristics for each image, and provides a flexible
and compact way of encoding the salient visual aspects for distinguishing
sub-categories. Since the two streams are complementary, combining them
further improves classification accuracy. Compared with 12 state-of-the-art
methods on the widely used CUB-200-2011 dataset for fine-grained image
classification, the experimental results demonstrate that our CVL approach
achieves the best performance.

Comment: 9 pages, to appear in CVPR 2017
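A minimal sketch of the late-fusion idea behind combining two complementary streams (the per-class scores and equal fusion weights below are invented for illustration; the paper's actual combination mechanism may differ):

```python
import numpy as np

# Hypothetical per-class scores from each stream for one image,
# e.g. over three fine-grained sub-categories.
vision_scores = np.array([0.6, 0.3, 0.1])
language_scores = np.array([0.2, 0.7, 0.1])

# Late fusion: average the two streams' predictions.
# Equal weights are an assumption, not the paper's setting.
fused = 0.5 * vision_scores + 0.5 * language_scores
predicted_class = int(np.argmax(fused))
```

In this toy case the vision stream alone would pick class 0 and the language stream class 1; the fused scores favor class 1, showing how a complementary stream can flip an uncertain decision.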