An Overview of Cross-media Retrieval: Concepts, Methodologies, Benchmarks and Challenges
Multimedia retrieval plays an indispensable role in big data utilization.
Past efforts mainly focused on single-media retrieval. However, the
requirements of users are highly flexible, such as retrieving relevant
audio clips with an image query. Thus challenges stemming from the "media
gap", which means that representations of different media types are
inconsistent, have attracted increasing attention. Cross-media retrieval is
designed for the scenarios where the queries and retrieval results are of
different media types. As a relatively new research topic, its concepts,
methodologies and benchmarks are still not clear in the literature. To address
these issues, we review more than 100 references, give an overview including
the concepts, methodologies, major challenges and open issues, as well as build
up the benchmarks including datasets and experimental results. Researchers can
directly adopt the benchmarks to promptly evaluate their proposed methods. This
will help them focus on algorithm design rather than on the time-consuming
reproduction of compared methods and results. It is noted that we have constructed a new
dataset XMedia, which is the first publicly available dataset with up to five
media types (text, image, video, audio and 3D model). We believe this overview
will attract more researchers to focus on cross-media retrieval and be helpful
to them.Comment: 14 pages, accepted by IEEE Transactions on Circuits and Systems for
Video Technolog
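The common-representation idea at the heart of most cross-media retrieval methods can be sketched in a few lines: project each media type into one shared space and rank by similarity there. Everything below (feature dimensions, projection matrices, the cosine ranking) is an invented toy illustration, not the setup of any surveyed method:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy features for two media types; the dimensions are illustrative.
image_feats = rng.normal(size=(5, 64))   # 5 images, 64-d visual features
text_feats = rng.normal(size=(5, 32))    # 5 texts, 32-d textual features

# Hypothetical learned projections into a shared 16-d common space.
W_img = rng.normal(size=(64, 16))
W_txt = rng.normal(size=(32, 16))

def to_common(feats, W):
    """Project features into the common space and L2-normalize them."""
    z = feats @ W
    return z / np.linalg.norm(z, axis=1, keepdims=True)

img_common = to_common(image_feats, W_img)
txt_common = to_common(text_feats, W_txt)

# Cross-media retrieval: rank all texts for image query 0 by cosine similarity.
scores = txt_common @ img_common[0]
ranking = np.argsort(-scores)
```

Once both media types live in one space, any query type can retrieve any result type, which is exactly the "media gap" bridging the survey discusses.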
Modality-specific Cross-modal Similarity Measurement with Recurrent Attention Network
Nowadays, cross-modal retrieval plays an indispensable role in flexibly finding
information across different modalities of data. Effectively measuring the
similarity between different modalities of data is the key of cross-modal
retrieval. Different modalities such as image and text have imbalanced and
complementary relationships, which contain unequal amount of information when
describing the same semantics. For example, images often contain more details
that cannot be demonstrated by textual descriptions and vice versa. Existing
works based on Deep Neural Networks (DNN) mostly construct one common space for
different modalities to find the latent alignments between them, but this loses
the exclusive modality-specific characteristics of each modality. Different
from the existing works, we propose a modality-specific cross-modal similarity
measurement (MCSM) approach that constructs an independent semantic space for
each modality and adopts an end-to-end framework to directly generate
modality-specific cross-modal similarity without an explicit common
representation. For each semantic space,
the modality-specific characteristics within one modality are fully exploited
by a recurrent attention network, while the data of the other modality is
projected into this space with attention-based joint embedding, so that the
learned attention weights guide fine-grained cross-modal correlation learning
and capture the imbalanced and complementary relationships between different
modalities. Finally, the complementarity between the semantic
spaces for different modalities is explored by adaptive fusion of the
modality-specific cross-modal similarities to perform cross-modal retrieval.
Experiments on the widely-used Wikipedia and Pascal Sentence datasets as well
as our constructed large-scale XMediaNet dataset verify the effectiveness of
our proposed approach, outperforming 9 state-of-the-art methods.
Comment: 13 pages, submitted to IEEE Transactions on Image Processing
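The final adaptive-fusion step described above can be made concrete with a toy sketch: two modality-specific similarity matrices are mixed by weights that sum to one. The similarity scores and mixing weights here are invented for illustration, not the learned quantities from MCSM:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(6)

# Hypothetical similarity scores from two modality-specific semantic spaces:
# sim_img[i, j] scores text j against image query i in the image space,
# sim_txt[i, j] scores the same pair in the text space.
sim_img = rng.uniform(size=(4, 10))
sim_txt = rng.uniform(size=(4, 10))

# Adaptive fusion: mixing weights over the two spaces (fixed here for
# illustration; in practice they would be learned).
w = softmax(np.array([0.8, 0.2]))
fused = w[0] * sim_img + w[1] * sim_txt

# Retrieval: rank the 10 candidates for each of the 4 queries.
rankings = np.argsort(-fused, axis=1)
```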
CCL: Cross-modal Correlation Learning with Multi-grained Fusion by Hierarchical Network
Cross-modal retrieval has become a highlighted research topic for retrieval
across multimedia data such as image and text. A two-stage learning framework
is widely adopted by most existing methods based on Deep Neural Network (DNN):
The first learning stage is to generate separate representation for each
modality, and the second learning stage is to get the cross-modal common
representation. However, the existing methods have three limitations: (1) In
the first learning stage, they only model intra-modality correlation, but
ignore inter-modality correlation with rich complementary context. (2) In the
second learning stage, they only adopt shallow networks with single-loss
regularization, but ignore the intrinsic relevance of intra-modality and
inter-modality correlation. (3) Only original instances are considered while
the complementary fine-grained clues provided by their patches are ignored. To
address the above problems, this paper proposes a cross-modal correlation
learning (CCL) approach with multi-grained fusion by a hierarchical network, and
the contributions are as follows: (1) In the first learning stage, CCL exploits
multi-level association with joint optimization to preserve the complementary
context from intra-modality and inter-modality correlation simultaneously. (2)
In the second learning stage, a multi-task learning strategy is designed to
adaptively balance the intra-modality semantic category constraints and
inter-modality pairwise similarity constraints. (3) CCL adopts multi-grained
modeling, which fuses the coarse-grained instances and fine-grained patches to
make cross-modal correlation more precise. Compared with 13 state-of-the-art
methods on 6 widely-used cross-modal datasets, the experimental results show
that our CCL approach achieves the best performance.
Comment: 16 pages, accepted by IEEE Transactions on Multimedia
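The multi-task balance in the second learning stage (intra-modality semantic category constraints plus inter-modality pairwise similarity constraints) can be sketched as a weighted sum of two losses. The loss forms and the fixed balance weight below are illustrative assumptions, not CCL's actual formulation:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def category_loss(logits, labels):
    """Intra-modality constraint: cross-entropy on semantic categories."""
    p = softmax(logits)
    return -np.log(p[np.arange(len(labels)), labels]).mean()

def pairwise_loss(img_emb, txt_emb, same_pair, margin=1.0):
    """Inter-modality constraint: pull matched image-text pairs together,
    push mismatched pairs apart up to a margin."""
    d = np.linalg.norm(img_emb - txt_emb, axis=1)
    return np.where(same_pair, d ** 2, np.maximum(0.0, margin - d) ** 2).mean()

# Toy batch of 4 samples, 3 semantic categories, 8-d embeddings.
rng = np.random.default_rng(1)
logits = rng.normal(size=(4, 3))
labels = np.array([0, 1, 2, 0])
img_emb = rng.normal(size=(4, 8))
txt_emb = rng.normal(size=(4, 8))
same = np.array([True, True, False, False])

alpha = 0.5  # balance weight; adaptively learned in a multi-task setting
total = category_loss(logits, labels) + alpha * pairwise_loss(img_emb, txt_emb, same)
```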
Deep Unified Multimodal Embeddings for Understanding both Content and Users in Social Media Networks
There has been an explosion of multimodal content generated on social media
networks in the last few years, which has necessitated a deeper understanding
of social media content and user behavior. We present a novel
content-independent content-user-reaction model for social multimedia content
analysis. Compared to prior works that generally tackle semantic content
understanding and user behavior modeling in isolation, we propose a generalized
solution to these problems within a unified framework. We embed users, images
and text drawn from open social media in a common multimodal geometric space,
using a novel loss function designed to cope with distant and disparate
modalities, and thereby enable seamless three-way retrieval. Our model not only
outperforms unimodal embedding based methods on cross-modal retrieval tasks but
also shows improvements stemming from jointly solving the two tasks on Twitter
data. We also show that the user embeddings learned within our joint multimodal
embedding model are better at predicting user interests compared to those
learned with unimodal content on Instagram data. Our framework thus goes beyond
the prior practice of using explicit leader-follower link information to
establish affiliations by extracting implicit content-centric affiliations from
isolated users. We provide qualitative results to show that the user clusters
emerging from learned embeddings have consistent semantics and the ability of
our model to discover fine-grained semantics from noisy and unstructured data.
Our work reveals that social media content is inherently multimodal and
possesses a consistent structure, because in social networks meaning is created
through interactions between users and content.
Comment: Preprint submitted to IJC
COBRA: Contrastive Bi-Modal Representation Algorithm
There are a wide range of applications that involve multi-modal data, such as
cross-modal retrieval, visual question-answering, and image captioning. Such
applications are primarily dependent on aligned distributions of the different
constituent modalities. Existing approaches generate latent embeddings for each
modality in a joint fashion by representing them in a common manifold. However,
these joint embedding spaces fail to sufficiently reduce the modality gap,
which affects the performance in downstream tasks. We hypothesize that these
embeddings retain the intra-class relationships but are unable to preserve the
inter-class dynamics. In this paper, we present a novel framework COBRA that
aims to train two modalities (image and text) in a joint fashion inspired by
the Contrastive Predictive Coding (CPC) and Noise Contrastive Estimation (NCE)
paradigms which preserve both inter and intra-class relationships. We
empirically show that this framework reduces the modality gap significantly and
generates a robust and task agnostic joint-embedding space. We outperform
existing work on four diverse downstream tasks spanning across seven benchmark
cross-modal datasets.
Comment: 13 Pages, 6 Figures and 10 Tables
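A minimal NumPy sketch of the NCE-style contrastive objective that COBRA builds on: matched image-text pairs in a batch act as positives, while all other pairings serve as negatives. The embeddings, temperature, and exact loss details are assumptions for illustration, not the paper's formulation:

```python
import numpy as np

def info_nce(img_emb, txt_emb, temperature=0.1):
    """NCE-style loss: matched image-text pairs (the diagonal) are positives;
    all other pairs in the batch serve as negatives."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature           # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.diag(log_p).mean()                # average over positive pairs

rng = np.random.default_rng(2)

# Well-aligned modalities (image ~ text plus small noise) vs. random pairing.
aligned = rng.normal(size=(8, 16))
loss_aligned = info_nce(aligned + 0.01 * rng.normal(size=(8, 16)), aligned)
loss_random = info_nce(rng.normal(size=(8, 16)), rng.normal(size=(8, 16)))
```

A smaller loss indicates a smaller modality gap: aligned pairs dominate the softmax over in-batch negatives.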
Exploring Auxiliary Context: Discrete Semantic Transfer Hashing for Scalable Image Retrieval
Unsupervised hashing can desirably support scalable content-based image
retrieval (SCBIR) for its appealing advantages of semantic label independence,
memory and search efficiency. However, the learned hash codes are embedded with
limited discriminative semantics due to the intrinsic limitation of image
representation. To address the problem, in this paper, we propose a novel
hashing approach, dubbed Discrete Semantic Transfer Hashing (DSTH).
The key idea is to directly augment the semantics of discrete image hash
codes by exploring auxiliary contextual modalities. To this end, a unified
hashing framework is formulated to simultaneously preserve visual similarities
of images and perform semantic transfer from contextual modalities. Further, to
guarantee direct semantic transfer and avoid information loss, we explicitly
impose the discrete constraint, bit-uncorrelation constraint and bit-balance
constraint on hash codes. A novel and effective discrete optimization method
based on the augmented Lagrangian multiplier is developed to iteratively solve
the optimization problem. The whole learning process has linear computational
complexity and desirable scalability. Experiments on three benchmark datasets
demonstrate the superiority of DSTH compared with several state-of-the-art
approaches.
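The three constraints on hash codes (discrete, bit-uncorrelation, bit-balance) can be made concrete with a toy binarization sketch. The embeddings and the simple sign-based quantization below are illustrative, not DSTH's actual discrete optimization:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical real-valued embeddings before binarization: 100 images, 16 bits.
Z = rng.normal(size=(100, 16))
Z -= Z.mean(axis=0)            # centering encourages the bit-balance constraint

B = np.sign(Z)                 # discrete constraint: codes lie in {-1, +1}
B[B == 0] = 1

# Bit-balance: each bit should be +1 for about half the items (mean near 0).
bit_balance = np.abs(B.mean(axis=0))

# Bit-uncorrelation: the code correlation matrix should be near identity.
bit_corr = (B.T @ B) / len(B)

def hamming(a, b):
    """Hamming distance between two {-1, +1} codes."""
    return int((a != b).sum())
```

Retrieval then compares codes by Hamming distance, which is why the balance and uncorrelation constraints matter: they spread information evenly across bits.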
Task-adaptive Asymmetric Deep Cross-modal Hashing
Supervised cross-modal hashing aims to embed the semantic correlations of
heterogeneous modality data into the binary hash codes with discriminative
semantic labels. Because of its advantages on retrieval and storage efficiency,
it is widely used for solving efficient cross-modal retrieval. However,
existing research handles the different tasks of cross-modal retrieval equally,
and simply learns the same couple of hash functions in a symmetric way for
them. Under such circumstances, the uniqueness of different cross-modal
retrieval tasks is ignored and sub-optimal performance may result.
Motivated by this, we present a Task-adaptive Asymmetric Deep Cross-modal
Hashing (TA-ADCMH) method in this paper. It can learn task-adaptive hash
functions for two sub-retrieval tasks via simultaneous modality representation
and asymmetric hash learning. Unlike previous cross-modal hashing approaches,
our learning framework jointly optimizes semantic preservation, which
transforms the deep features of multimedia data into binary hash codes, and
semantic regression, which directly regresses the query modality representation
to explicit labels. With our model, the binary codes can effectively preserve semantic
correlations across different modalities, meanwhile, adaptively capture the
query semantics. The superiority of TA-ADCMH is demonstrated from many aspects
on two standard datasets.
Shared Predictive Cross-Modal Deep Quantization
With explosive growth of data volume and ever-increasing diversity of data
modalities, cross-modal similarity search, which conducts nearest neighbor
search across different modalities, has been attracting increasing interest.
This paper presents a deep compact code learning solution for efficient
cross-modal similarity search. Many recent studies have proven that
quantization-based approaches perform generally better than hashing-based
approaches on single-modal similarity search. In this paper, we propose a deep
quantization approach, which is among the early attempts to leverage deep
neural networks for quantization-based cross-modal similarity search. Our
approach, dubbed shared predictive deep quantization (SPDQ), explicitly
formulates a shared subspace across different modalities and two private
subspaces for individual modalities, and representations in the shared subspace
and the private subspaces are learned simultaneously by embedding them to a
reproducing kernel Hilbert space, where the mean embedding of different
modality distributions can be explicitly compared. In addition, in the shared
subspace, a quantizer is learned to produce the semantics preserving compact
codes with the help of label alignment. Thanks to this novel network
architecture in cooperation with supervised quantization training, SPDQ can
preserve intramodal and intermodal similarities as much as possible and greatly
reduce quantization error. Experiments on two popular benchmarks corroborate
that our approach outperforms state-of-the-art methods.
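A toy quantizer conveys the basic idea behind quantization-based search: each vector is represented by the index of its nearest codeword in a learned codebook. The Lloyd-style refinement below stands in for SPDQ's supervised quantization training and is purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(7)

def quantize(X, codebook):
    """Assign each vector to its nearest codeword; the indices are the compact
    codes, and the codewords reconstruct the data approximately."""
    d = ((X[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)

# Toy shared-subspace representations (both modalities already projected).
X = rng.normal(size=(200, 8))

# Hypothetical codebook with 16 codewords: random data points refined by a few
# Lloyd iterations, standing in for the supervised quantizer training.
codebook = X[rng.choice(200, size=16, replace=False)].copy()
for _ in range(5):
    codes = quantize(X, codebook)
    for k in range(16):
        members = X[codes == k]
        if len(members):
            codebook[k] = members.mean(axis=0)

codes = quantize(X, codebook)
quant_error = np.linalg.norm(X - codebook[codes], axis=1).mean()
```

Storing a 4-bit index instead of an 8-d float vector is the source of the storage and search efficiency the abstract refers to; training aims to keep `quant_error` small.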
Co-Learning Feature Fusion Maps from PET-CT Images of Lung Cancer
The analysis of multi-modality positron emission tomography and computed
tomography (PET-CT) images for computer aided diagnosis applications requires
combining the sensitivity of PET to detect abnormal regions with anatomical
localization from CT. Current methods for PET-CT image analysis either process
the modalities separately or fuse information from each modality based on
knowledge about the image analysis task. These methods generally do not
consider the spatially varying visual characteristics that encode different
information across the different modalities, which have different priorities at
different locations. For example, a high abnormal PET uptake in the lungs is
more meaningful for tumor detection than physiological PET uptake in the heart.
Our aim is to improve the fusion of complementary information in multi-modality
PET-CT with a new supervised convolutional neural network (CNN) that learns to
fuse this information for multi-modality medical image analysis. Our
CNN first encodes modality-specific features and then uses them to derive a
spatially varying fusion map that quantifies the relative importance of each
modality's features across different spatial locations. These fusion maps are
then multiplied with the modality-specific feature maps to obtain a
representation of the complementary multi-modality information at different
locations, which can then be used for image analysis. We evaluated the ability
of our CNN to detect and segment multiple regions with different fusion
requirements using a dataset of PET-CT images of lung cancer. We compared our
method to baseline techniques for multi-modality image fusion and segmentation.
Our findings show that our CNN had a significantly higher foreground detection
accuracy (99.29%, p < 0.05) than the fusion baselines and a significantly
higher Dice score (63.85%) than recent PET-CT tumor segmentation methods.
Comment: Source code is available from https://github.com/ashnilkumar/colearn.
The paper has been accepted for publication in IEEE Transactions on Medical
Imaging. The final published version of the manuscript can be accessed from
the IEEE. The paper contains 21 pages (14 main paper, 7 supplementary), 16
images (8 main paper, 8 supplementary), and 3 tables.
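The spatially varying fusion map can be sketched directly: a per-location softmax turns modality scores into weights that sum to one at each pixel, and each modality's features are weighted by its local importance. The toy feature maps and scores below are invented, not outputs of the paper's CNN:

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(4)

# Toy modality-specific feature maps (H x W x C), standing in for the
# modality-specific CNN encoders.
pet_feats = rng.normal(size=(8, 8, 4))
ct_feats = rng.normal(size=(8, 8, 4))

# Hypothetical fusion scores: one score per modality per spatial location.
fusion_scores = rng.normal(size=(8, 8, 2))
fusion_map = softmax(fusion_scores, axis=-1)   # weights sum to 1 at each pixel

# Weight each modality's features by its spatially varying importance and sum.
fused = (fusion_map[..., 0:1] * pet_feats
         + fusion_map[..., 1:2] * ct_feats)
```

A location with high PET importance (e.g. abnormal uptake in the lungs) would get a large PET weight there, while anatomical regions can lean on CT instead.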
Learning to Hash for Indexing Big Data - A Survey
The explosive growth in big data has attracted much attention in designing
efficient indexing and search methods recently. In many critical applications
such as large-scale search and pattern matching, finding the nearest neighbors
to a query is a fundamental research problem. However, the straightforward
solution using exhaustive comparison is infeasible due to the prohibitive
computational complexity and memory requirement. In response, Approximate
Nearest Neighbor (ANN) search based on hashing techniques has become popular
due to its promising performance in both efficiency and accuracy. Prior
randomized hashing methods, e.g., Locality-Sensitive Hashing (LSH), explore
data-independent hash functions with random projections or permutations.
Although having elegant theoretical guarantees on search quality in certain
metric spaces, the performance of randomized hashing has been shown to be
insufficient in many real-world applications. As a remedy, new approaches incorporating
data-driven learning methods in development of advanced hash functions have
emerged. Such learning to hash methods exploit information such as data
distributions or class labels when optimizing the hash codes or functions.
Importantly, the learned hash codes are able to preserve, in the hash code
space, the proximity of neighboring data from the original feature space. The
goal of this paper is to provide readers with a systematic understanding of the
insights, pros and cons of the emerging techniques. We provide a comprehensive
survey of the learning to hash framework and representative techniques of
various types, including unsupervised, semi-supervised, and supervised. In
addition, we also summarize recent hashing approaches utilizing the deep
learning models. Finally, we discuss future directions and trends of
research in this area.
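The data-independent LSH baseline mentioned above can be sketched with random hyperplanes: the sign of each random projection yields one bit of the code, and search ranks items by Hamming distance instead of exhaustive comparison in the original space. All sizes and data here are toy values:

```python
import numpy as np

rng = np.random.default_rng(5)

def lsh_hash(X, projections):
    """Data-independent LSH: the sign of each random projection gives one bit."""
    return (X @ projections > 0).astype(np.uint8)

dim, n_bits = 32, 12
projections = rng.normal(size=(dim, n_bits))    # random hyperplanes

base = rng.normal(size=(1000, dim))
query = base[42] + 0.01 * rng.normal(size=dim)  # near-duplicate of item 42

codes = lsh_hash(base, projections)
q_code = lsh_hash(query[None, :], projections)[0]

# ANN search: rank by Hamming distance in code space. Nearby points in the
# original space tend to fall on the same side of most hyperplanes.
hamming = (codes != q_code).sum(axis=1)
nearest = int(np.argmin(hamming))
```

Learning-to-hash methods replace the random `projections` with data-driven hash functions so that the codes better preserve the neighborhoods of the specific dataset.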