Supervised Matrix Factorization for Cross-Modality Hashing
Matrix factorization has been recently utilized for the task of multi-modal
hashing for cross-modality visual search, where basis functions are learned to
map data from different modalities to the same Hamming embedding. In this
paper, we propose a novel cross-modality hashing algorithm termed Supervised
Matrix Factorization Hashing (SMFH) which tackles the multi-modal hashing
problem with a collective non-negative matrix factorization across the different
modalities. In particular, SMFH employs a well-designed binary code learning
algorithm to preserve the similarities among multi-modal original features
through a graph regularization. At the same time, semantic labels, when
available, are incorporated into the learning procedure. We conjecture that
these designs help to preserve the most relevant information during the
binary quantization process, and hence improve the retrieval accuracy. We
demonstrate the superior performance of SMFH on three cross-modality visual
search benchmarks, i.e., the PASCAL-Sentence, Wiki, and NUS-WIDE, with
quantitative comparison to various state-of-the-art methods.
Comment: 7 pages, 4 figures
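The collective factorization with graph regularization described above can be sketched in a few lines. This is a minimal illustration, not the paper's algorithm: two modality matrices share one latent factor V, the objective ||X1 - V W1||^2 + ||X2 - V W2||^2 + lam * tr(V^T L V) is minimized by alternating exact updates, and V is thresholded into codes. All data, sizes, and the rounding rule are made up for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d1, d2, k, lam = 60, 8, 6, 4, 0.1

# Toy features for two modalities describing the same n items
# (stand-ins for image and text descriptors; all values are synthetic).
X1 = rng.normal(size=(n, d1))
X2 = rng.normal(size=(n, d2))

# Gaussian-affinity graph on modality 1; L is its graph Laplacian,
# used for the graph regularization term lam * tr(V.T @ L @ V).
sq = np.square(X1[:, None, :] - X1[None, :, :]).sum(-1)
A = np.exp(-sq / sq.mean())
np.fill_diagonal(A, 0.0)
L = np.diag(A.sum(1)) - A
evals, Q = np.linalg.eigh(L)          # reused by the exact V-update below

V = rng.normal(size=(n, k))           # shared latent factor for both modalities
losses = []
for _ in range(30):
    # Basis updates: ordinary least squares given the shared factor V.
    W1, *_ = np.linalg.lstsq(V, X1, rcond=None)
    W2, *_ = np.linalg.lstsq(V, X2, rcond=None)
    # Shared-factor update: the stationarity condition
    #   lam * L @ V + V @ (W1 @ W1.T + W2 @ W2.T) = X1 @ W1.T + X2 @ W2.T
    # is a Sylvester equation, solved row by row in the eigenbasis of L.
    B = W1 @ W1.T + W2 @ W2.T
    Ct = Q.T @ (X1 @ W1.T + X2 @ W2.T)
    Vt = np.stack([np.linalg.solve(lam * ev * np.eye(k) + B, row)
                   for ev, row in zip(evals, Ct)])
    V = Q @ Vt
    losses.append(np.square(X1 - V @ W1).sum()
                  + np.square(X2 - V @ W2).sum()
                  + lam * np.trace(V.T @ L @ V))

# Binarize the shared factor into unified +/-1 hash codes.
codes = np.where(V - V.mean(0) >= 0, 1, -1)
```

Because each sub-step minimizes the objective exactly with the other factors fixed, the recorded loss is monotone non-increasing, which is a quick sanity check on the updates.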
A Comprehensive Survey on Cross-modal Retrieval
In recent years, cross-modal retrieval has drawn much attention due to the
rapid growth of multimodal data. It takes one type of data as the query to
retrieve relevant data of another type. For example, a user can use a text to
retrieve relevant pictures or videos. Since the query and its retrieved results
can be of different modalities, how to measure the content similarity between
different modalities of data remains a challenge. Various methods have been
proposed to deal with such a problem. In this paper, we first review a number
of representative methods for cross-modal retrieval and classify them into two
main groups: 1) real-valued representation learning, and 2) binary
representation learning. Real-valued representation learning methods aim to
learn real-valued common representations for different modalities of data. To
speed up the cross-modal retrieval, a number of binary representation learning
methods are proposed to map different modalities of data into a common Hamming
space. Then, we introduce several multimodal datasets in the community, and
show the experimental results on two commonly used multimodal datasets. The
comparison reveals the characteristic of different kinds of cross-modal
retrieval methods, which is expected to benefit both practical applications and
future research. Finally, we discuss open problems and future research
directions.
Comment: 20 pages, 11 figures, 9 tables
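The speed argument for binary representations comes down to Hamming-space search: codes pack into machine words, and distance is an XOR plus a popcount. A minimal sketch with synthetic +/-1 codes (the packing scheme, noise level, and item index are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy +/-1 codes for a database of items and one cross-modal query,
# as any binary representation learning method might produce.
db = rng.choice([-1, 1], size=(1000, 64))
query = db[42] * np.where(rng.random(64) < 0.05, -1, 1)  # noisy copy of item 42

def pack(codes):
    # Pack +/-1 codes into one unsigned 64-bit word per item.
    bits = (codes > 0).astype(np.uint64)
    return (bits << np.arange(64, dtype=np.uint64)).sum(1)

def hamming(a, b):
    # Popcount of the XOR gives the number of differing bits.
    return bin(int(a) ^ int(b)).count("1")

packed = pack(db)
q = pack(query[None])[0]
dists = np.array([hamming(q, p) for p in packed])
top = int(np.argmin(dists))   # item 42 should come back as the nearest code
```

The whole database comparison is one word-sized XOR and popcount per item, which is the reason the survey's binary methods trade a little accuracy for large speedups.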
SCH-GAN: Semi-supervised Cross-modal Hashing by Generative Adversarial Network
Cross-modal hashing aims to map heterogeneous multimedia data into a common
Hamming space, which can realize fast and flexible retrieval across different
modalities. Supervised cross-modal hashing methods have achieved considerable
progress by incorporating semantic side information. However, they mainly have
two limitations: (1) They rely heavily on large-scale labeled cross-modal
training data, which is labor-intensive and hard to obtain. (2) They ignore the
rich information contained in the large amount of unlabeled data across
different modalities, especially the margin examples that are easily retrieved
incorrectly, which could help to model the correlations. To address these problems,
in this paper we propose a novel Semi-supervised Cross-Modal Hashing approach
by Generative Adversarial Network (SCH-GAN). We aim to take advantage of GAN's
ability for modeling data distributions to promote cross-modal hashing learning
in an adversarial way. The main contributions can be summarized as follows: (1)
We propose a novel generative adversarial network for cross-modal hashing. In
our proposed SCH-GAN, the generative model tries to select margin examples of
one modality from unlabeled data given a query of another modality, while the
discriminative model tries to distinguish the selected examples from the true
positive examples of the query. These two models play a minimax game so that
the generative model can promote the hashing performance of the discriminative
model. (2) We propose a reinforcement learning based algorithm to drive the
training of the proposed SCH-GAN. The generative model takes the correlation
score predicted by the discriminative model as a reward, and tries to select
examples close to the margin to promote the discriminative model by maximizing
the margin between positive and negative data. Experiments on three widely-used datasets
verify the effectiveness of our proposed approach.
Comment: 12 pages, submitted to IEEE Transactions on Cybernetics
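The generator-selects, discriminator-scores loop can be illustrated with a bare REINFORCE update on a toy candidate pool. This is not SCH-GAN itself: the discriminator is replaced by a fixed, made-up score vector, and only the policy-gradient mechanics of "reward drives selection" are shown.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical discriminator scores for 20 unlabeled candidates of the other
# modality, given one fixed query (values are invented for this sketch).
reward = np.linspace(0.0, 1.0, 20)
rng.shuffle(reward)

theta = np.zeros(20)                  # generator's unnormalized selection scores
baseline, lr = 0.0, 0.1
for _ in range(5000):
    p = np.exp(theta - theta.max()); p /= p.sum()
    a = rng.choice(20, p=p)           # generator samples a candidate
    r = reward[a]                     # discriminator's score acts as the reward
    # REINFORCE: the gradient of log p[a] w.r.t. theta is onehot(a) - p.
    g = -p
    g[a] += 1.0
    theta += lr * (r - baseline) * g
    baseline = 0.9 * baseline + 0.1 * r   # moving-average variance reducer

p_final = np.exp(theta - theta.max()); p_final /= p_final.sum()
```

After training, the generator's selection distribution should put more weight on high-reward candidates than a uniform policy would, which is the mechanism SCH-GAN uses to mine hard margin examples from unlabeled data.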
Learning Discriminative Representations for Semantic Cross Media Retrieval
The heterogeneous gap among different modalities has emerged as one of the
critical issues in modern AI problems. Unlike traditional uni-modal cases,
where raw features are extracted and directly measured, the heterogeneous
nature of cross-modal tasks requires the intrinsic semantic representations to be compared in a
unified framework. This paper studies the learning of different representations
that can be retrieved across different modality contents. A novel approach for
mining cross-modal representations is proposed by incorporating explicit linear
semantic projection in Hilbert space. The insight is that the discriminative
structures of different modality data can be linearly represented in
appropriate high-dimensional Hilbert spaces, where linear operations can be
used to approximate nonlinear decisions in the original spaces. As a result, an
efficient linear semantic down-mapping is jointly learned for multimodal data,
leading to a common space where they can be compared. The mechanism of "feature
up-lifting and down-projecting" works seamlessly as a whole, which accomplishes
cross-modal retrieval tasks very well. The proposed method, named shared
discriminative semantic representation learning (\textbf{SDSRL}), is tested on
two public multimodal datasets for both within- and inter-modal retrieval. The
experiments demonstrate that it outperforms several state-of-the-art methods in
most scenarios.
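A rough sketch of the "feature up-lifting and down-projecting" idea, under stated assumptions: random Fourier features stand in for the Hilbert-space lift, and ridge regression onto one-hot labels stands in for the learned linear semantic projection. The paper's actual objective differs; all data and parameters here are synthetic.

```python
import numpy as np

rng = np.random.default_rng(3)
n, classes = 200, 4
y = rng.integers(0, classes, size=n)

# Toy paired modalities: class-dependent centers plus modality-specific noise.
centers1 = rng.normal(size=(classes, 5))
centers2 = rng.normal(size=(classes, 7))
X1 = centers1[y] + 0.3 * rng.normal(size=(n, 5))
X2 = centers2[y] + 0.3 * rng.normal(size=(n, 7))

def lift(X, D=300, gamma=1.0, seed=0):
    # Random Fourier features: an explicit approximate RBF-kernel lift
    # ("feature up-lifting" into a high-dimensional space).
    r = np.random.default_rng(seed)
    W = r.normal(scale=np.sqrt(2 * gamma), size=(X.shape[1], D))
    b = r.uniform(0.0, 2 * np.pi, D)
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

def project(Phi, y, reg=1e-2):
    # Ridge regression onto one-hot labels: the linear "down-projection"
    # into a shared semantic space.
    T = np.eye(classes)[y]
    return np.linalg.solve(Phi.T @ Phi + reg * np.eye(Phi.shape[1]), Phi.T @ T)

P1, P2 = lift(X1, seed=10), lift(X2, seed=20)
Z1, Z2 = P1 @ project(P1, y), P2 @ project(P2, y)   # common semantic space

# Cross-modal 1-NN: query with modality 1, retrieve from modality 2,
# excluding each query's own paired sample.
d = np.square(Z1[:, None, :] - Z2[None, :, :]).sum(-1)
np.fill_diagonal(d, np.inf)
acc = (y[d.argmin(1)] == y).mean()
```

Because linear operations in the lifted space approximate nonlinear decisions in the original space, the class structure becomes linearly recoverable and cross-modal nearest-neighbor accuracy lands well above the 25% chance level on this toy data.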
Unsupervised Cross-Media Hashing with Structure Preservation
Recent years have seen the exponential growth of heterogeneous multimedia
data. The need for effective and accurate data retrieval from heterogeneous
data sources has attracted much research interest in cross-media retrieval.
Here, given a query of any media type, cross-media retrieval seeks to find
relevant results of different media types from heterogeneous data sources. To
facilitate large-scale cross-media retrieval, we propose a novel unsupervised
cross-media hashing method. Our method incorporates local affinity and distance
repulsion constraints into a matrix factorization framework. Correspondingly,
the proposed method learns hash functions that generate unified hash codes
from different media types, while ensuring that the intrinsic geometric structure of the
data distribution is preserved. These hash codes empower the similarity between
data of different media types to be evaluated directly. Experimental results on
two large-scale multimedia datasets demonstrate the effectiveness of the
proposed method, which outperforms the state-of-the-art methods.
Deep Cross-Modal Hashing
Due to its low storage cost and fast query speed, cross-modal hashing (CMH)
has been widely used for similarity search in multimedia retrieval
applications. However, almost all existing CMH methods are based on
hand-crafted features which might not be optimally compatible with the
hash-code learning procedure. As a result, existing CMH methods with
handcrafted features may not achieve satisfactory performance. In this paper,
we propose a novel cross-modal hashing method, called deep cross-modal hashing
(DCMH), by integrating feature learning and hash-code learning into the same
framework. DCMH is an end-to-end learning framework with deep neural networks,
one for each modality, to perform feature learning from scratch. Experiments on
two real datasets with text-image modalities show that DCMH can outperform
other baselines to achieve the state-of-the-art performance in cross-modal
retrieval applications.
Comment: 12 pages
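The end-to-end idea can be shown with each modality's deep network shrunk to a single linear map. This is an illustration of joint training against a shared similarity target, not DCMH's actual negative-log-likelihood objective; the data, loss, and step sizes are invented for the sketch.

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 120, 16
y = rng.integers(0, 3, size=n)

# Synthetic paired image/text features with shared class structure.
c1, c2 = rng.normal(size=(3, 10)), rng.normal(size=(3, 12))
X1 = c1[y] + 0.2 * rng.normal(size=(n, 10))
X2 = c2[y] + 0.2 * rng.normal(size=(n, 12))
S = np.where(y[:, None] == y[None, :], 1.0, -1.0)   # cross-modal similarity

U = 0.1 * rng.normal(size=(10, k))   # "network" for modality 1 (one linear map)
W = 0.1 * rng.normal(size=(12, k))   # "network" for modality 2
lr = 0.1
for _ in range(500):
    F, G = X1 @ U, X2 @ W
    E = F @ G.T / k - S              # residual of the similarity fit
    # Joint gradient steps: both branches are trained against one loss,
    # which is what integrating feature and code learning amounts to.
    U -= lr * X1.T @ (E @ G) / (n * n)
    W -= lr * X2.T @ (E.T @ F) / (n * n)

B1, B2 = np.sign(X1 @ U), np.sign(X2 @ W)
agree = B1 @ B2.T / k                # mean bit agreement, in [-1, 1]
same, diff = agree[S > 0].mean(), agree[S < 0].mean()
```

After training, same-class cross-modal code pairs agree on more bits than different-class pairs, which is exactly the property hand-crafted features cannot guarantee when the codes are learned separately.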
Cross-Modality Hashing with Partial Correspondence
Learning a hashing function for cross-media search is very desirable due to
its low storage cost and fast query speed. However, data crawled from the
Internet cannot always guarantee good correspondence among different
modalities, which affects the learning of the hashing function. In this paper,
we focus on cross-modal hashing with partially corresponded data. Data without
full correspondence are put to use to enhance the hashing performance.
Experiments on the Wiki and NUS-WIDE datasets demonstrate that the proposed
method outperforms some state-of-the-art hashing approaches with less
correspondence information.
Shared Predictive Cross-Modal Deep Quantization
With explosive growth of data volume and ever-increasing diversity of data
modalities, cross-modal similarity search, which conducts nearest neighbor
search across different modalities, has been attracting increasing interest.
This paper presents a deep compact code learning solution for efficient
cross-modal similarity search. Many recent studies have proven that
quantization-based approaches perform generally better than hashing-based
approaches on single-modal similarity search. In this paper, we propose a deep
quantization approach, which is among the early attempts of leveraging deep
neural networks into quantization-based cross-modal similarity search. Our
approach, dubbed shared predictive deep quantization (SPDQ), explicitly
formulates a shared subspace across different modalities and two private
subspaces for individual modalities, and representations in the shared subspace
and the private subspaces are learned simultaneously by embedding them to a
reproducing kernel Hilbert space, where the mean embedding of different
modality distributions can be explicitly compared. In addition, in the shared
subspace, a quantizer is learned to produce the semantics preserving compact
codes with the help of label alignment. Thanks to this novel network
architecture in cooperation with supervised quantization training, SPDQ can
preserve intramodal and intermodal similarities as much as possible and greatly
reduce quantization error. Experiments on two popular benchmarks corroborate
that our approach outperforms state-of-the-art methods.
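The quantization side can be illustrated independently of the network: a codebook learned by Lloyd iterations, compact codes as centroid indices, and asymmetric distance computation via a per-query lookup table. Everything below is a generic vector-quantization sketch on synthetic data, not SPDQ's supervised quantizer.

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy shared-subspace representations (what a method like SPDQ would learn).
Z = rng.normal(size=(500, 8))
K = 16
codebook = Z[rng.choice(500, K, replace=False)].copy()

errs = []
for _ in range(15):
    d = np.square(Z[:, None, :] - codebook[None, :, :]).sum(-1)
    code = d.argmin(1)                      # compact code: one index per item
    errs.append(d[np.arange(500), code].mean())
    for j in range(K):                      # Lloyd centroid-update step
        if np.any(code == j):
            codebook[j] = Z[code == j].mean(0)

# Asymmetric distance computation: compare a raw query against the K
# centroids once, then score every database item by a table lookup.
q = rng.normal(size=8)
table = np.square(q - codebook).sum(1)      # K query-to-centroid distances
approx = table[code]                        # approximate distances to all items
```

Lloyd's iterations never increase the mean quantization error, and the lookup-table trick is why quantization-based search stays fast: the per-item cost is one array index, not a vector distance.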
Multimodal diffusion geometry by joint diagonalization of Laplacians
We construct an extension of diffusion geometry to multiple modalities
through joint approximate diagonalization of Laplacian matrices. This naturally
extends classical data analysis tools based on spectral geometry, such as
diffusion maps and spectral clustering. We provide several synthetic and real
examples of manifold learning, retrieval, and clustering demonstrating that the
joint diffusion geometry frequently better captures the inherent structure of
multi-modal data. We also show that many previous attempts to construct
multimodal spectral clustering can be seen as particular cases of joint
approximate diagonalization of the Laplacians.
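A toy illustration of why a jointly computed basis helps, using the crudest possible surrogate: the eigenbasis of the averaged Laplacian. Joint approximate diagonalization as described above optimizes off-diagonal energy more directly; here we only check that a coupled basis nearly diagonalizes both Laplacians while a random orthonormal basis does not. Data and sizes are synthetic.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 40

def laplacian(X):
    # Gaussian-affinity graph Laplacian of a point set.
    sq = np.square(X[:, None, :] - X[None, :, :]).sum(-1)
    A = np.exp(-sq / sq.mean())
    np.fill_diagonal(A, 0.0)
    return np.diag(A.sum(1)) - A

# Two modalities: noisy views of the same underlying 2-D point cloud.
base = rng.normal(size=(n, 2))
L1 = laplacian(base + 0.1 * rng.normal(size=(n, 2)))
L2 = laplacian(base + 0.1 * rng.normal(size=(n, 2)))

def offdiag(M):
    # Energy of the off-diagonal part: zero iff M is diagonal.
    return np.square(M - np.diag(np.diag(M))).sum()

# Coupled surrogate basis: eigenvectors of the averaged Laplacian.
_, Q = np.linalg.eigh(0.5 * (L1 + L2))
J_joint = offdiag(Q.T @ L1 @ Q) + offdiag(Q.T @ L2 @ Q)

# A random orthonormal basis for comparison.
R, _ = np.linalg.qr(rng.normal(size=(n, n)))
J_rand = offdiag(R.T @ L1 @ R) + offdiag(R.T @ L2 @ R)
```

Because the two graphs describe the same structure, one basis can almost diagonalize both, and that shared spectral coordinate system is what joint diffusion maps and joint spectral clustering build on.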
Triplet-Based Deep Hashing Network for Cross-Modal Retrieval
Given the benefits of its low storage requirements and high retrieval
efficiency, hashing has recently received increasing attention. In particular,
cross-modal hashing has been widely and successfully used in multimedia
similarity search applications. However, almost all existing cross-modal
hashing methods fail to produce powerful hash codes because they ignore the
relative similarity between heterogeneous data, which contains richer semantic
information, leading to unsatisfactory retrieval performance.
In this paper, we propose a triplet-based deep hashing (TDH) network for
cross-modal retrieval. First, we utilize triplet labels, which describe
the relative relationships among three instances, as supervision in order to
capture more general semantic correlations between cross-modal instances. We
then establish a loss function from the inter-modal view and the intra-modal
view to boost the discriminative abilities of the hash codes. Finally, graph
regularization is introduced into our proposed TDH method to preserve the
original semantic similarity between hash codes in Hamming space. Experimental
results show that our proposed method outperforms several state-of-the-art
approaches on two popular cross-modal datasets.
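The triplet supervision at the core of such methods reduces to a margin hinge on distances; a minimal sketch of just that term (TDH combines inter-modal and intra-modal versions of it plus graph regularization, none of which is shown here, and the vectors are made up):

```python
import numpy as np

def triplet_loss(a, p, n, margin=1.0):
    # Hinge on squared distances: push the positive at least `margin`
    # closer to the anchor than the negative.
    dp = np.square(a - p).sum(-1)
    dn = np.square(a - n).sum(-1)
    return np.maximum(0.0, dp - dn + margin)

# Anchor from one modality; positive/negative from the other (toy vectors).
a = np.array([0.0, 0.0])
p = np.array([0.1, 0.0])   # same semantics, other modality
n = np.array([2.0, 0.0])   # different semantics
ok = triplet_loss(a, p, n)        # satisfied triplet: hinge is inactive
violated = triplet_loss(a, n, p)  # swapped roles: positive penalty
```

A satisfied triplet contributes zero loss, so gradients come only from triplets that violate the relative ordering, which is how the relative similarity the abstract mentions enters training.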