952 research outputs found
Exploring Auxiliary Context: Discrete Semantic Transfer Hashing for Scalable Image Retrieval
Unsupervised hashing is well suited to scalable content-based image retrieval (SCBIR) thanks to its independence from semantic labels and its memory and search efficiency. However, the learned hash codes carry only limited discriminative semantics because of the intrinsic limitations of image representations. To address this problem, in this paper we propose a novel hashing approach, dubbed \emph{Discrete Semantic Transfer Hashing} (DSTH).
The key idea is to \emph{directly} augment the semantics of discrete image hash
codes by exploring auxiliary contextual modalities. To this end, a unified
hashing framework is formulated to simultaneously preserve visual similarities
of images and perform semantic transfer from contextual modalities. Further, to
guarantee direct semantic transfer and avoid information loss, we explicitly
impose the discrete constraint, the bit-uncorrelation constraint, and the bit-balance constraint on the hash codes. A novel and effective discrete optimization method based on augmented Lagrangian multipliers is developed to solve the optimization problem iteratively. The whole learning process has linear computational complexity and scales well. Experiments on three benchmark datasets demonstrate the superiority of DSTH over several state-of-the-art approaches.
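Illustrative sketch (not from the paper; the random codes and toy sizes below are assumptions): what the discrete, bit-balance, and bit-uncorrelation constraints mean for a hash-code matrix B with entries in {-1, +1}.

import numpy as np

rng = np.random.default_rng(0)
n, r = 1000, 32                                   # n samples, r-bit codes (toy sizes)
B = np.sign(rng.standard_normal((n, r)))          # discrete constraint: entries in {-1, +1}

# Bit-balance constraint: each bit should be +1 and -1 about equally often,
# so the column sums of B should be close to zero.
balance_violation = np.abs(B.sum(axis=0)).max() / n

# Bit-uncorrelation constraint: different bits should be uncorrelated,
# so B^T B should be close to n * I.
uncorrelation_violation = np.abs(B.T @ B / n - np.eye(r)).max()

print("max bit-balance violation:", round(float(balance_violation), 3))
print("max bit-uncorrelation violation:", round(float(uncorrelation_violation), 3))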
A Comprehensive Survey on Cross-modal Retrieval
In recent years, cross-modal retrieval has drawn much attention due to the
rapid growth of multimodal data. It takes one type of data as the query to
retrieve relevant data of another type. For example, a user can use a text to
retrieve relevant pictures or videos. Since the query and its retrieved results
can be of different modalities, how to measure the content similarity between
different modalities of data remains a challenge. Various methods have been
proposed to deal with such a problem. In this paper, we first review a number
of representative methods for cross-modal retrieval and classify them into two
main groups: 1) real-valued representation learning, and 2) binary
representation learning. Real-valued representation learning methods aim to
learn real-valued common representations for different modalities of data. To
speed up the cross-modal retrieval, a number of binary representation learning
methods are proposed to map different modalities of data into a common Hamming
space. Then, we introduce several multimodal datasets in the community, and
show experimental results on two commonly used multimodal datasets. The comparison reveals the characteristics of different kinds of cross-modal retrieval methods and is expected to benefit both practical applications and future research. Finally, we discuss open problems and future research directions. Comment: 20 pages, 11 figures, 9 tables.
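Illustrative sketch (not from the survey; the database sizes and codes below are random placeholders): why mapping both modalities into a common Hamming space speeds up retrieval, since a query code from one modality is compared against database codes from another modality with cheap bit operations.

import numpy as np

rng = np.random.default_rng(1)
n_db, n_bits = 10000, 64

db_codes = rng.integers(0, 2, size=(n_db, n_bits), dtype=np.uint8)   # e.g. image codes
query = rng.integers(0, 2, size=n_bits, dtype=np.uint8)              # e.g. a text query code

# Hamming distance = number of differing bits between the query and each database code.
hamming = np.count_nonzero(db_codes != query, axis=1)
top10 = np.argsort(hamming)[:10]
print("indices of the 10 nearest database items:", top10)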
An Overview of Cross-media Retrieval: Concepts, Methodologies, Benchmarks and Challenges
Multimedia retrieval plays an indispensable role in big data utilization.
Past efforts mainly focused on single-media retrieval. However, the
requirements of users are highly flexible, such as retrieving relevant audio clips with an image query. Thus, challenges stemming from the "media
gap", which means that representations of different media types are
inconsistent, have attracted increasing attention. Cross-media retrieval is
designed for the scenarios where the queries and retrieval results are of
different media types. As a relatively new research topic, its concepts, methodologies, and benchmarks are still not well established in the literature. To address
these issues, we review more than 100 references, give an overview including
the concepts, methodologies, major challenges and open issues, as well as build
up the benchmarks including datasets and experimental results. Researchers can
directly adopt the benchmarks to evaluate their proposed methods promptly, which helps them focus on algorithm design rather than on the time-consuming reproduction of compared methods and results. Notably, we have constructed a new dataset, XMedia, which is the first publicly available dataset with up to five media types (text, image, video, audio, and 3D model). We believe this overview
will attract more researchers to focus on cross-media retrieval and be helpful
to them. Comment: 14 pages, accepted by IEEE Transactions on Circuits and Systems for Video Technology.
Unsupervised Multi-modal Hashing for Cross-modal retrieval
With the advantage of low storage cost and high efficiency, hashing learning
has received much attention in the domain of Big Data. In this paper, we
propose a novel unsupervised hashing method that directly preserves the manifold structure of the data in the learned hash codes. To this end, both the semantic correlation in the textual space and the local geometric structure in the visual space are explored simultaneously in our framework. Besides, the $\ell_{2,1}$-norm constraint is imposed on the projection matrices to learn a discriminative hash function for each modality. Extensive experiments on three publicly available datasets show that our method achieves superior performance over state-of-the-art methods. Comment: 4 pages, 4 figures.
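Illustrative sketch of the $\ell_{2,1}$-norm used above (standard definition; the projection matrix here is a random placeholder): it sums the Euclidean norms of the rows, which pushes whole rows toward zero and thus selects discriminative input features.

import numpy as np

def l21_norm(W):
    """||W||_{2,1} = sum of the l2 norms of the rows of W."""
    return np.linalg.norm(W, axis=1).sum()

rng = np.random.default_rng(2)
W = rng.standard_normal((128, 32))   # toy projection: 128-d features -> 32-bit codes
print("l2,1-norm of W:", round(float(l21_norm(W)), 2))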
Attribute-Guided Network for Cross-Modal Zero-Shot Hashing
Zero-Shot Hashing aims at learning a hashing model that is trained only on instances from seen categories but can generalize well to unseen categories. Typically, this is achieved by utilizing a semantic embedding space to transfer knowledge from the seen domain to the unseen domain. Existing efforts mainly focus on the single-modal retrieval task, especially Image-Based Image Retrieval (IBIR). However, as a prominent research topic in the field of hashing, cross-modal retrieval is more common in real-world applications. To
address the Cross-Modal Zero-Shot Hashing (CMZSH) retrieval task, we propose a
novel Attribute-Guided Network (AgNet), which can perform not only IBIR, but
also Text-Based Image Retrieval (TBIR). In particular, AgNet aligns different
modal data into a semantically rich attribute space, which bridges the gap
caused by modality heterogeneity and zero-shot setting. We also design an
effective strategy that exploits the attribute to guide the generation of hash
codes for image and text within the same network. Extensive experimental
results on three benchmark datasets (AwA, SUN, and ImageNet) demonstrate the
superiority of AgNet on both cross-modal and single-modal zero-shot image
retrieval tasks. Comment: 9 pages, 8 figures.
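Illustrative, heavily simplified sketch of the attribute-guided idea (not AgNet itself; the linear maps and dimensions are placeholders): both modalities are projected into a shared attribute space, and hash codes are taken from that space, so IBIR and TBIR reduce to Hamming search over the same codes.

import numpy as np

rng = np.random.default_rng(3)
d_img, d_txt, d_attr = 512, 300, 85                 # toy feature and attribute dimensions

W_img = rng.standard_normal((d_img, d_attr))        # image branch -> attribute space (stand-in)
W_txt = rng.standard_normal((d_txt, d_attr))        # text branch  -> attribute space (stand-in)

x_img = rng.standard_normal(d_img)                  # an image feature
x_txt = rng.standard_normal(d_txt)                  # a text feature

a_img = x_img @ W_img                               # attribute-space embeddings
a_txt = x_txt @ W_txt

b_img = np.sign(a_img)                              # hash codes generated from the attribute space
b_txt = np.sign(a_txt)
print("Hamming distance between the two codes:", int(np.sum(b_img != b_txt)))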
Shared Predictive Cross-Modal Deep Quantization
With explosive growth of data volume and ever-increasing diversity of data
modalities, cross-modal similarity search, which conducts nearest neighbor
search across different modalities, has been attracting increasing interest.
This paper presents a deep compact code learning solution for efficient
cross-modal similarity search. Many recent studies have shown that quantization-based approaches generally perform better than hashing-based approaches on single-modal similarity search. In this paper, we propose a deep quantization approach, which is among the early attempts to leverage deep neural networks for quantization-based cross-modal similarity search. Our
approach, dubbed shared predictive deep quantization (SPDQ), explicitly
formulates a shared subspace across different modalities and two private subspaces for the individual modalities. Representations in the shared and private subspaces are learned simultaneously by embedding them into a reproducing kernel Hilbert space, where the mean embeddings of different modality distributions can be explicitly compared. In addition, in the shared subspace, a quantizer is learned to produce semantics-preserving compact codes with the help of label alignment. Thanks to this novel network architecture, in cooperation with supervised quantization training, SPDQ can preserve intramodal and intermodal similarities as much as possible and greatly reduce the quantization error. Experiments on two popular benchmarks corroborate that our approach outperforms state-of-the-art methods.
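Illustrative sketch of the quantization step only (not SPDQ; the codebook below is random rather than learned): shared-subspace representations are approximated by their nearest codeword, and the resulting quantization error is what supervised quantization training tries to keep small.

import numpy as np

rng = np.random.default_rng(4)
d, k, n = 64, 256, 1000                    # subspace dim, codebook size, sample count (toy)

codebook = rng.standard_normal((k, d))     # k codewords of dimension d (placeholder)
Z = rng.standard_normal((n, d))            # shared-subspace representations (placeholder)

# Squared distances from each representation to each codeword.
dists = (Z ** 2).sum(1, keepdims=True) - 2 * Z @ codebook.T + (codebook ** 2).sum(1)
codes = dists.argmin(axis=1)               # compact code = index of the nearest codeword

quant_error = np.linalg.norm(Z - codebook[codes], axis=1).mean()
print("mean quantization error:", round(float(quant_error), 3))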
Online Hashing
Although hash function learning algorithms have achieved great success in
recent years, most existing hash models are trained off-line, which makes them unsuitable for processing sequential or online data. To address this problem, this work proposes an online hash model that accommodates data arriving in a stream for online learning. Specifically, a new loss function is proposed to measure the similarity loss between a pair of data samples in Hamming space. Then, a
structured hash model is derived and optimized in a passive-aggressive way.
Theoretical analysis on the upper bound of the cumulative loss for the proposed
online hash model is provided. Furthermore, we extend our online hashing from a single-model to a multi-model scheme that trains and retains multiple diverse hashing models in order to avoid biased updates. The
competitive efficiency and effectiveness of the proposed online hash models are
verified through extensive experiments on several large-scale datasets as
compared to related hashing methods. Comment: To appear in IEEE Transactions on Neural Networks and Learning Systems (DOI: 10.1109/TNNLS.2017.2689242).
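Illustrative sketch of a passive-aggressive style pairwise update (a generic variant, not the paper's exact model; the loss form and data below are assumptions): when a pair of samples with similarity label s is already coded consistently, the weights are left untouched (passive); otherwise they are moved just enough, with a bounded step, to reduce the pairwise loss (aggressive).

import numpy as np

def pa_update(w, x1, x2, s, C=1.0):
    """One passive-aggressive step for a single relaxed hash bit h(x) = w.x."""
    margin = s * (w @ x1) * (w @ x2)            # > 0 when the pair is coded consistently
    loss = max(0.0, 1.0 - margin)               # hinge-style pairwise loss
    if loss == 0.0:
        return w                                # passive: nothing to fix
    g = s * ((w @ x2) * x1 + (w @ x1) * x2)     # ascent direction for the margin
    tau = min(C, loss / (g @ g + 1e-12))        # aggressive but bounded step size
    return w + tau * g

rng = np.random.default_rng(5)
w = rng.standard_normal(16)
x1, x2 = rng.standard_normal(16), rng.standard_normal(16)
w = pa_update(w, x1, x2, s=+1)                  # update on a similar pair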
A New Evaluation Protocol and Benchmarking Results for Extendable Cross-media Retrieval
This paper proposes a new evaluation protocol for cross-media retrieval that better fits real-world applications. Both image-text and text-image
retrieval modes are considered. Traditionally, class labels in the training and
testing sets are identical. That is, it is usually assumed that the query falls
into some pre-defined classes. However, in practice, the content of a query
image/text may vary extensively, and the retrieval system does not necessarily
know in advance the class label of a query. Considering the inconsistency
between the real-world applications and laboratory assumptions, we think that
the existing protocol that works under identical train/test classes can be
modified and improved.
This work is dedicated to addressing this problem by considering the protocol
under an extendable scenario, \ie, the training and testing classes do not
overlap. We provide extensive benchmarking results obtained by the existing
protocol and the proposed new protocol on several commonly used datasets. We
demonstrate a noticeable performance drop when the testing classes are unseen
during training. Additionally, a trivial solution, \ie, directly using the
predicted class label for cross-media retrieval, is tested. We show that the
trivial solution is very competitive in traditional non-extendable retrieval,
but becomes less so under the new settings. The train/test split, evaluation
code, and benchmarking results are publicly available on our website. Comment: 10 pages, 9 figures.
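Illustrative sketch of the extendable split (the label set and sizes are toy assumptions): the class sets used for training and testing are made disjoint, so every test query belongs to a class that was never seen during training.

import numpy as np

rng = np.random.default_rng(6)
all_classes = np.arange(20)                     # toy label set
rng.shuffle(all_classes)

train_classes = set(all_classes[:10].tolist())  # classes available for training
test_classes = set(all_classes[10:].tolist())   # classes reserved for testing
assert train_classes.isdisjoint(test_classes)   # no class overlap by construction

labels = rng.integers(0, 20, size=1000)         # toy per-sample class labels
train_idx = np.where(np.isin(labels, list(train_classes)))[0]
test_idx = np.where(np.isin(labels, list(test_classes)))[0]
print(len(train_idx), "training samples,", len(test_idx), "testing samples")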
Deep Cross-Modal Hashing
Due to its low storage cost and fast query speed, cross-modal hashing (CMH)
has been widely used for similarity search in multimedia retrieval
applications. However, almost all existing CMH methods are based on
hand-crafted features which might not be optimally compatible with the
hash-code learning procedure. As a result, existing CMH methods with hand-crafted features may not achieve satisfactory performance. In this paper, we propose a novel cross-modal hashing method, called deep cross-modal hashing
(DCMH), by integrating feature learning and hash-code learning into the same
framework. DCMH is an end-to-end learning framework with deep neural networks,
one for each modality, to perform feature learning from scratch. Experiments on
two real datasets with text-image modalities show that DCMH outperforms other baselines and achieves state-of-the-art performance in cross-modal retrieval applications. Comment: 12 pages.
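Illustrative, simplified sketch (not the exact DCMH objective; the branch outputs and similarity matrix below are random placeholders): a pairwise likelihood loss of the kind used in such end-to-end frameworks, tying the image-branch outputs F and text-branch outputs G together so that similar image/text pairs receive similar codes.

import numpy as np

rng = np.random.default_rng(7)
n, r = 100, 16
F = rng.standard_normal((n, r))                      # image-branch outputs (relaxed codes)
G = rng.standard_normal((n, r))                      # text-branch outputs (relaxed codes)
S = (rng.random((n, n)) < 0.1).astype(float)         # toy cross-modal similarity labels

theta = 0.5 * F @ G.T                                # pairwise inner products
# Negative log-likelihood of the similarity labels under a sigmoid model.
loss = -(S * theta - np.logaddexp(0.0, theta)).sum()
print("pairwise likelihood loss:", round(float(loss), 2))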
Deep Cross-Modal Correlation Learning for Audio and Lyrics in Music Retrieval
Little research focuses on cross-modal correlation learning where the temporal structures of different data modalities, such as audio and lyrics, are taken into account. Motivated by the inherently temporal structure of music, we aim to learn the deep sequential correlation between audio and lyrics. In this work, we propose a deep cross-modal correlation learning
architecture involving two-branch deep neural networks for audio modality and
text modality (lyrics). Data from the different modalities are converted into the same canonical space, where inter-modal canonical correlation analysis is utilized as an objective function to measure the similarity of temporal structures. This
is the first study on understanding the correlation between language and music
audio through deep architectures for learning the paired temporal correlation
of audio and lyrics. A pre-trained Doc2vec model followed by fully-connected layers (a fully-connected deep neural network) is used to represent lyrics. Two significant contributions are made in the audio branch: i) a pre-trained CNN followed by fully-connected layers is investigated for representing music audio; ii) we further propose an end-to-end architecture
that simultaneously trains convolutional layers and fully-connected layers to
better learn temporal structures of music audio. Particularly, our end-to-end
deep architecture contains two properties: simultaneously implementing feature
learning and cross-modal correlation learning, and learning joint
representation by considering temporal structures. Experimental results, using
audio to retrieve lyrics or using lyrics to retrieve audio, verify the
effectiveness of the proposed deep correlation learning architectures in
cross-modal music retrieval.
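Illustrative sketch of the canonical correlation objective (numpy-only, on toy correlated data; X stands in for audio-branch outputs and Y for lyric-branch outputs): the two branches are trained so that their projected outputs are maximally correlated, and the canonical correlations quantify that alignment.

import numpy as np

rng = np.random.default_rng(8)
n, dx, dy = 500, 32, 32
X = rng.standard_normal((n, dx))                       # toy audio-branch outputs
Y = 0.5 * X + 0.5 * rng.standard_normal((n, dy))       # toy lyric-branch outputs (correlated)

Xc, Yc = X - X.mean(0), Y - Y.mean(0)
reg = 1e-4
Cxx = Xc.T @ Xc / n + reg * np.eye(dx)                 # regularized covariances
Cyy = Yc.T @ Yc / n + reg * np.eye(dy)
Cxy = Xc.T @ Yc / n

def inv_sqrt(C):
    """Inverse matrix square root of a symmetric positive-definite matrix."""
    w, V = np.linalg.eigh(C)
    return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

# Canonical correlations are the singular values of Cxx^{-1/2} Cxy Cyy^{-1/2}.
T = inv_sqrt(Cxx) @ Cxy @ inv_sqrt(Cyy)
corrs = np.linalg.svd(T, compute_uv=False)
print("top canonical correlations:", np.round(corrs[:5], 3))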