A Comprehensive Survey on Cross-modal Retrieval
In recent years, cross-modal retrieval has drawn much attention due to the
rapid growth of multimodal data. It takes one type of data as the query to
retrieve relevant data of another type. For example, a user can use text to
retrieve relevant pictures or videos. Since the query and its retrieved results
can be of different modalities, how to measure the content similarity between
different modalities of data remains a challenge. Various methods have been
proposed to deal with such a problem. In this paper, we first review a number
of representative methods for cross-modal retrieval and classify them into two
main groups: 1) real-valued representation learning, and 2) binary
representation learning. Real-valued representation learning methods aim to
learn real-valued common representations for different modalities of data. To
speed up the cross-modal retrieval, a number of binary representation learning
methods are proposed to map different modalities of data into a common Hamming
space. Then, we introduce several multimodal datasets in the community, and
show the experimental results on two commonly used multimodal datasets. The
comparison reveals the characteristics of different kinds of cross-modal
retrieval methods, which is expected to benefit both practical applications and
future research. Finally, we discuss open problems and future research
directions.
Comment: 20 pages, 11 figures, 9 tables
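As a rough illustration of why a common Hamming space speeds up retrieval, the sketch below ranks items of one modality against a query of another by counting differing bits. It assumes the binary codes have already been learned; the 8-bit codes, the item names, and the `retrieve` helper are hypothetical and are not taken from the survey.

```python
def hamming_distance(a: int, b: int) -> int:
    """Count the bits that differ between two binary codes stored as ints."""
    return bin(a ^ b).count("1")


def retrieve(query_code: int, database: dict, top_k: int = 2) -> list:
    """Return the top_k item names closest to the query in Hamming distance.

    Ties are broken alphabetically by item name so the ranking is deterministic.
    """
    ranked = sorted(
        database,
        key=lambda name: (hamming_distance(query_code, database[name]), name),
    )
    return ranked[:top_k]


# Hypothetical 8-bit codes: one text query and three image codes (assumed values).
text_query = 0b10110010
image_codes = {
    "img_a": 0b10110011,
    "img_b": 0b01001101,
    "img_c": 0b10100010,
}

if __name__ == "__main__":
    print(retrieve(text_query, image_codes))
```

Because the comparison reduces to an XOR and a bit count, it is much cheaper per item than a floating-point distance over real-valued representations, which is the motivation the abstract gives for binary representation learning.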
An Overview of Cross-media Retrieval: Concepts, Methodologies, Benchmarks and Challenges
Multimedia retrieval plays an indispensable role in big data utilization.
Past efforts mainly focused on single-media retrieval. However, the
requirements of users are highly flexible, such as retrieving the relevant
audio clips with a single image query. Challenges stemming from the "media
gap", which means that representations of different media types are
inconsistent, have attracted increasing attention. Cross-media retrieval is
designed for the scenarios where the queries and retrieval results are of
different media types. As a relatively new research topic, its concepts,
methodologies and benchmarks are still not clear in the literature. To address
these issues, we review more than 100 references, give an overview including
the concepts, methodologies, major challenges and open issues, as well as build
up the benchmarks including datasets and experimental results. Researchers can
directly adopt the benchmarks to promptly evaluate their proposed methods. This
will help them focus on algorithm design rather than on time-consuming
reproduction of compared methods and results. We have also constructed a new
dataset XMedia, which is the first publicly available dataset with up to five
media types (text, image, video, audio and 3D model). We believe this overview
will attract more researchers to focus on cross-media retrieval and be helpful
to them.
Comment: 14 pages, accepted by IEEE Transactions on Circuits and Systems for
Video Technology
A review of EO image information mining
We analyze the state of the art of content-based retrieval in Earth
observation image archives focusing on complete systems showing promise for
operational implementation. The different paradigms at the basis of the main
system families are introduced. The approaches taken are analyzed, focusing in
particular on the phases after primitive feature extraction. The solutions
envisaged for the issues related to feature simplification and synthesis,
indexing, and semantic labeling are reviewed. The methodologies for query
specification and execution are analyzed.
Recent Advance in Content-based Image Retrieval: A Literature Survey
The explosive increase and ubiquitous accessibility of visual data on the Web
have led to the prosperity of research activity in image search or retrieval.
Because they ignore visual content as a ranking clue, methods that apply text
search techniques to visual retrieval may suffer from inconsistency between the
text words and visual content. Content-based image retrieval (CBIR), which
makes use of the representation of visual content to identify relevant images,
has attracted sustained attention over the past two decades. The problem is
challenging due to the intention gap and the semantic gap problems. Numerous
techniques have been developed for content-based image retrieval in the last
decade. The purpose of this paper is to categorize and evaluate those
algorithms proposed during the period of 2003 to 2016. We conclude with several
promising directions for future research.
Comment: 22 pages
Recent Advances in Zero-shot Recognition
With the recent renaissance of deep convolution neural networks, encouraging
breakthroughs have been achieved on the supervised recognition tasks, where
each class has sufficient, fully annotated training data.
However, scaling recognition to a large number of classes with few or no
training samples for each class remains an unsolved problem. One approach to
scaling up the recognition is to develop models capable of recognizing unseen
categories without any training instances, i.e., zero-shot recognition/learning.
This article provides a comprehensive review of existing zero-shot recognition
techniques, covering aspects ranging from model representations to datasets
and evaluation settings. We also overview related recognition
tasks including one-shot and open set recognition which can be used as natural
extensions of zero-shot recognition when a limited number of class samples becomes
available or when zero-shot recognition is implemented in a real-world setting.
Importantly, we highlight the limitations of existing approaches and point out
future research directions in this exciting new research area.
Comment: accepted by IEEE Signal Processing Magazine
SCH-GAN: Semi-supervised Cross-modal Hashing by Generative Adversarial Network
Cross-modal hashing aims to map heterogeneous multimedia data into a common
Hamming space, which can realize fast and flexible retrieval across different
modalities. Supervised cross-modal hashing methods have achieved considerable
progress by incorporating semantic side information. However, they have two
main limitations: (1) they rely heavily on large-scale labeled cross-modal
training data, which are labor-intensive and hard to obtain; and (2) they ignore
the rich information contained in the large amount of unlabeled data across
different modalities, especially the margin examples that are easily retrieved
incorrectly, which can help to model the correlations. To address these problems,
in this paper we propose a novel Semi-supervised Cross-Modal Hashing approach
by Generative Adversarial Network (SCH-GAN). We aim to take advantage of GAN's
ability for modeling data distributions to promote cross-modal hashing learning
in an adversarial way. The main contributions can be summarized as follows: (1)
We propose a novel generative adversarial network for cross-modal hashing. In
our proposed SCH-GAN, the generative model tries to select margin examples of
one modality from unlabeled data when given a query of another modality, while
the discriminative model tries to distinguish the selected examples from true
positive examples of the query. These two models play a minimax game so that
the generative model can promote the hashing performance of the discriminative
model. (2) We propose a reinforcement learning based algorithm to drive the
training of the proposed SCH-GAN. The generative model takes the correlation
score predicted by the discriminative model as a reward, and tries to select the
examples close to the margin to promote the discriminative model by maximizing
the margin between positive and negative data. Experiments on 3 widely-used
datasets verify the effectiveness of our proposed approach.
Comment: 12 pages, submitted to IEEE Transactions on Cybernetics
A Survey of Heterogeneous Information Network Analysis
Most real systems consist of a large number of interacting, multi-typed
components, yet most contemporary research models them as homogeneous
networks, without distinguishing different types of objects and links in the
networks. Recently, more and more researchers have begun to consider these
interconnected, multi-typed data as heterogeneous information networks, and
to develop structural analysis approaches by leveraging the rich semantic meaning
of structural types of objects and links in the networks. Compared to the widely
studied homogeneous network, the heterogeneous information network contains
richer structure and semantic information, which provides plenty of
opportunities as well as a lot of challenges for data mining. In this paper, we
provide a survey of heterogeneous information network analysis. We will
introduce basic concepts of heterogeneous information network analysis, examine
its developments on different data mining tasks, discuss some advanced topics,
and point out some future research directions.
Comment: 45 pages, 12 figures
Cross-modal Subspace Learning via Kernel Correlation Maximization and Discriminative Structure Preserving
Measuring the similarity between heterogeneous data is still an open problem.
Many research works have been developed to learn a common subspace where the
similarity between different modalities can be calculated directly. However,
most existing works focus on learning a latent subspace without preserving
semantic structural information well, so these approaches cannot achieve the
desired results. In this paper, we propose a novel framework, termed
Cross-modal subspace learning via Kernel correlation maximization and
Discriminative structure-preserving (CKD), to solve this problem in two
aspects. Firstly, we construct a shared semantic graph so that the data of each
modality preserve semantic neighbor relationships. Secondly, we introduce
the Hilbert-Schmidt Independence Criterion (HSIC) to ensure the consistency
between feature-similarity and semantic-similarity of samples. Our model not
only considers the inter-modality correlation by maximizing the kernel
correlation but also preserves the semantically structural information within
each modality. Extensive experiments are performed to evaluate the proposed
framework on three public datasets. The experimental results demonstrate
that the proposed CKD is competitive compared with the classic subspace
learning methods.
Comment: The paper is under consideration at Multimedia Tools and Applications
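For readers unfamiliar with HSIC, the commonly used biased empirical estimator is reproduced below as background. The symbols are assumptions introduced here for illustration (K and L are n-by-n kernel matrices computed over the same n samples, for example one from features and one from semantic labels); the abstract does not state the exact objective used in CKD.

```latex
\mathrm{HSIC}(K, L) = \frac{1}{(n-1)^{2}} \operatorname{tr}\left( K H L H \right),
\qquad
H = I_{n} - \frac{1}{n} \mathbf{1}_{n} \mathbf{1}_{n}^{\top}
```

Here H is the centering matrix. A larger value indicates stronger statistical dependence between the two kernel views, which is why maximizing it can encourage consistency between feature similarity and semantic similarity.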
Mining Associated Text and Images with Dual-Wing Harmoniums
We propose a multi-wing harmonium model for mining multimedia data that
extends and improves on earlier models based on two-layer random fields, which
capture bidirectional dependencies between hidden topic aspects and observed
inputs. This model can be viewed as an undirected counterpart of the two-layer
directed models such as LDA for similar tasks, but differs significantly
in inference/learning cost tradeoffs, latent topic representations, and topic
mixing mechanisms. In particular, our model facilitates efficient inference and
robust topic mixing, and potentially provides high flexibility in modeling
the latent topic spaces. A contrastive divergence and a variational algorithm
are derived for learning. We specialize our model to a dual-wing harmonium for
captioned images, incorporating a multivariate Poisson for word counts and a
multivariate Gaussian for color histograms. We present empirical results on the
applications of this model to classification, retrieval and image annotation on
news video collections, and we report an extensive comparison with various
extant models.
Comment: Appears in Proceedings of the Twenty-First Conference on Uncertainty
in Artificial Intelligence (UAI2005)
Multi-Label Zero-Shot Learning via Concept Embedding
Zero-shot learning (ZSL) enables a learning model to classify instances of
classes unseen during training. While most research in ZSL focuses on
single-label classification, few studies have addressed multi-label ZSL,
where an instance is associated with a set of labels simultaneously, due to the
difficulty in modeling complex semantics conveyed by a set of labels. In this
paper, we propose a novel approach to multi-label ZSL via concept embedding
learned from collections of public users' annotations of multimedia. Thanks to
concept embedding, multi-label ZSL can be done by efficiently mapping an
instance's input features onto the concept embedding space in a manner similar
to that used in single-label ZSL. Moreover, our semantic learning model is capable of
embedding an out-of-vocabulary label by inferring its meaning from its
co-occurring labels. Thus, our approach allows both seen and unseen labels
during the concept embedding learning to be used in the aforementioned instance
mapping, which makes multi-label ZSL more flexible and suitable for real
applications. Experimental results of multi-label ZSL on images and music
tracks suggest that our approach outperforms a state-of-the-art multi-label ZSL
model and can deal with a scenario involving out-of-vocabulary labels without
re-training the semantics learning model.
Comment: 15 pages. Technical Report 2016-06-01. School of Computer Science.
The University of Manchester. (Submitted to a Journal)
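The idea of predicting several labels by mapping an instance into a label embedding space can be sketched as follows. This is a minimal illustration, not the paper's model: it assumes the instance has already been mapped into the same space as the label vectors, uses cosine similarity as the ranking score, and all vectors, label names, and the `rank_labels` helper are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_labels(instance_vec, label_vecs, top_k=2):
    """Return the top_k labels whose embeddings are most similar to the instance."""
    ranked = sorted(
        label_vecs,
        key=lambda label: cosine(instance_vec, label_vecs[label]),
        reverse=True,
    )
    return ranked[:top_k]


# Hypothetical 3-dimensional label embeddings; "sea" stands in for a label that
# had no training instances but still has an embedding (assumed values).
label_vecs = {
    "beach": [1.0, 0.0, 0.0],
    "sea": [0.9, 0.1, 0.0],
    "guitar": [0.0, 0.0, 1.0],
}
# Hypothetical instance, assumed to be already mapped into the embedding space.
instance_vec = [1.0, 0.05, 0.0]

if __name__ == "__main__":
    print(rank_labels(instance_vec, label_vecs))
```

Since any label with an embedding can be scored this way, unseen labels can be ranked alongside seen ones without retraining the mapping, which is the flexibility the abstract emphasizes.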