Full-Network Embedding in a Multimodal Embedding Pipeline
The current state-of-the-art for image annotation and image retrieval tasks
is obtained through deep neural networks, which combine an image representation
and a text representation into a shared embedding space. In this paper we
evaluate the impact of using the Full-Network embedding in this setting,
replacing the original image representation in a competitive multimodal
embedding generation scheme. Unlike the one-layer image embeddings typically
used by most approaches, the Full-Network embedding provides a multi-scale
representation of images, which results in richer characterizations. To measure
the influence of the Full-Network embedding, we evaluate its performance on
three different datasets, and compare the results with the original multimodal
embedding generation scheme when using a one-layer image embedding, and with
the rest of the state-of-the-art. Results for image annotation and image
retrieval tasks indicate that the Full-Network embedding is consistently
superior to the one-layer embedding. These results motivate the integration of
the Full-Network embedding on any multimodal embedding generation scheme,
something feasible thanks to the flexibility of the approach.Comment: In 2nd Workshop on Semantic Deep Learning (SemDeep-2) at the 12th
International Conference on Computational Semantics (IWCS) 201
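As a rough illustration of this kind of pipeline, the sketch below projects a stand-in multi-scale image feature vector and a text feature vector into a shared space and trains them with an in-batch ranking loss. The dimensions, the JointEmbedding module, and the loss form are illustrative assumptions, not the paper's exact scheme.

```python
# Minimal sketch of a two-branch multimodal embedding (hypothetical shapes).
# The "full-network" image vector is assumed to be the concatenation of
# activations from several CNN layers, here faked with random features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbedding(nn.Module):
    def __init__(self, img_dim=12416, txt_dim=300, emb_dim=1024):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, emb_dim)   # projects full-network features
        self.txt_proj = nn.Linear(txt_dim, emb_dim)   # projects text features

    def forward(self, img_feat, txt_feat):
        # L2-normalise both sides so cosine similarity is a plain dot product
        return (F.normalize(self.img_proj(img_feat), dim=-1),
                F.normalize(self.txt_proj(txt_feat), dim=-1))

def ranking_loss(img_emb, txt_emb, margin=0.2):
    # Hinge-based ranking loss over in-batch negatives (a common choice here)
    scores = img_emb @ txt_emb.t()                    # (B, B) similarity matrix
    pos = scores.diag().unsqueeze(1)                  # matching pairs on the diagonal
    cost = (margin + scores - pos).clamp(min=0)       # margin violations for negatives
    cost.fill_diagonal_(0)                            # ignore the positive pairs
    return cost.mean()

model = JointEmbedding()
img = torch.randn(8, 12416)                           # stand-in full-network features
txt = torch.randn(8, 300)                             # stand-in sentence features
loss = ranking_loss(*model(img, txt))
```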
CM-GANs: Cross-modal Generative Adversarial Networks for Common Representation Learning
It is known that the inconsistent distributions and representations of different modalities, such as image and text, cause the heterogeneity gap, which makes it challenging to correlate such heterogeneous data. Generative adversarial networks (GANs) have shown a strong ability to model data distributions and learn discriminative representations, but existing GANs-based works mainly focus on the generative problem of creating new data. Our goal is different: we aim to correlate heterogeneous data by utilizing the power of GANs to model the cross-modal joint distribution. Thus, we propose Cross-modal GANs to learn discriminative common representations that bridge the heterogeneity gap.
The main contributions are: (1) A cross-modal GANs architecture is proposed to model the joint distribution over data of different modalities. The inter-modality and intra-modality correlations can be explored simultaneously in the generative and discriminative models, which compete with each other to promote cross-modal correlation learning. (2) Cross-modal convolutional autoencoders with a weight-sharing constraint are proposed to form the generative model. They not only exploit cross-modal correlation for learning the common representation, but also preserve reconstruction information to capture semantic consistency within each modality. (3) A cross-modal adversarial mechanism is proposed, which utilizes two kinds of discriminative models to simultaneously conduct intra-modality and inter-modality discrimination. They mutually boost each other to make the common representation more discriminative through adversarial training. To the best of our knowledge, our proposed CM-GANs approach is the first to utilize GANs to perform cross-modal common representation learning. Experiments are conducted to verify the performance of our proposed approach on the cross-modal retrieval paradigm, compared with 10 methods on 3 cross-modal datasets.
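A minimal sketch of two of the ideas named above, weight-shared encoders forming the common representation and an inter-modality adversarial game; the layer sizes and loss wiring are assumptions for illustration, not the paper's architecture.

```python
# Each modality gets its own encoder, but the last projection layer is
# shared, tying the two common representations together (assumed dims).
import torch
import torch.nn as nn

shared = nn.Linear(512, 256)                       # weight-shared common layer
img_enc = nn.Sequential(nn.Linear(4096, 512), nn.ReLU(), shared)
txt_enc = nn.Sequential(nn.Linear(300, 512), nn.ReLU(), shared)

# Inter-modality discriminator: predicts which modality a common code came
# from; the encoders are trained to fool it so the two distributions align.
disc = nn.Sequential(nn.Linear(256, 64), nn.ReLU(), nn.Linear(64, 1))

img_code = img_enc(torch.randn(8, 4096))
txt_code = txt_enc(torch.randn(8, 300))
bce = nn.BCEWithLogitsLoss()
d_loss = bce(disc(img_code), torch.ones(8, 1)) + bce(disc(txt_code), torch.zeros(8, 1))
g_loss = bce(disc(img_code), torch.zeros(8, 1)) + bce(disc(txt_code), torch.ones(8, 1))
```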
Deep Semantic Ranking Based Hashing for Multi-Label Image Retrieval
With the rapid growth of web images, hashing has received increasing interest in large-scale image retrieval. Research efforts have been devoted to learning compact binary codes that preserve semantic similarity based on labels. However, most of these hashing methods are designed to handle simple binary similarity. The complex multilevel semantic structure of images associated with multiple labels has not yet been well explored. Here we propose a deep semantic ranking based method for learning hash functions that preserve multilevel semantic similarity between multi-label images. In our approach, a deep convolutional neural network is incorporated into the hash functions to jointly learn feature representations and the mappings from them to hash codes, which avoids the limited semantic representation power of hand-crafted features. Meanwhile, a ranking list that encodes the multilevel similarity information is employed to guide the learning of such deep hash functions. An effective scheme based on a surrogate loss is used to solve the intractable optimization problem of the nonsmooth, multivariate ranking measures involved in the learning procedure. Experimental results show the superiority of our proposed approach over several state-of-the-art hashing methods in terms of ranking evaluation metrics when tested on multi-label image datasets.
Comment: CVPR 201
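The surrogate-loss idea can be sketched as a margin hinge on relaxed hash codes; the tensors, margin, and distance relaxation below are illustrative assumptions rather than the paper's exact formulation.

```python
# Multilevel similarity: an image sharing more labels with the query should
# rank above one sharing fewer, enforced with a margin hinge on distances.
import torch
import torch.nn.functional as F

def surrogate_ranking_loss(h_q, h_pos, h_neg, margin=2.0):
    # h_* are relaxed (tanh) hash codes in [-1, 1]; Hamming distance is
    # approximated by squared Euclidean distance on the relaxation.
    d_pos = ((h_q - h_pos) ** 2).sum(dim=1)
    d_neg = ((h_q - h_neg) ** 2).sum(dim=1)
    return F.relu(margin + d_pos - d_neg).mean()    # convex surrogate for the ranking loss

codes = torch.tanh(torch.randn(3, 8, 48))           # query / more-similar / less-similar
loss = surrogate_ranking_loss(codes[0], codes[1], codes[2])
```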
Unsupervised Semantic-based Aggregation of Deep Convolutional Features
In this paper, we propose a simple but effective semantic-based aggregation (SBA) method. The proposed SBA utilizes the discriminative filters of deep convolutional layers as semantic detectors. Moreover, we propose an effective unsupervised strategy for selecting semantic detectors to generate "probabilistic proposals", which highlight certain discriminative patterns of objects and suppress background noise. The final global SBA representation is then acquired by aggregating the regional representations weighted by the selected "probabilistic proposals" corresponding to various semantic content. Our unsupervised SBA is easy to generalize and achieves excellent performance on various tasks. We conduct comprehensive experiments and show that our unsupervised SBA outperforms the state-of-the-art unsupervised and supervised aggregation methods on image retrieval, place recognition and cloud classification.
Comment: 10 pages. arXiv admin note: text overlap with arXiv:1705.0124
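A minimal sketch of the aggregation step under assumed tensor shapes: selected detector channels are normalised into "probabilistic proposals" and used as spatial pooling weights. The channel indices here are hypothetical.

```python
# A few channels of the last conv layer act as "semantic detectors"; their
# activation maps become pooling weights for the regional representations.
import torch
import torch.nn.functional as F

feat = torch.randn(1, 512, 14, 14).abs()            # stand-in conv activations
detector_ids = [3, 87, 250]                          # hypothetical selected channels

regional = []
for c in detector_ids:
    w = feat[:, c:c+1]                               # (1, 1, H, W) detector response
    w = w / w.sum(dim=(2, 3), keepdim=True)          # normalise into a proposal
    regional.append((feat * w).sum(dim=(2, 3)))      # weighted sum-pool -> (1, 512)

sba = F.normalize(torch.cat(regional, dim=1), dim=1) # concatenated global descriptor
```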
Instance-Aware Hashing for Multi-Label Image Retrieval
Similarity-preserving hashing is a commonly used method for nearest neighbour
search in large-scale image retrieval. For image retrieval, deep-networks-based
hashing methods are appealing since they can simultaneously learn effective
image representations and compact hash codes. This paper focuses on
deep-networks-based hashing for multi-label images, each of which may contain
objects of multiple categories. In most existing hashing methods, each image is
represented by one piece of hash code, which is referred to as semantic
hashing. This setting may be suboptimal for multi-label image retrieval. To
solve this problem, we propose a deep architecture that learns
instance-aware image representations for multi-label image data, which
are organized in multiple groups, with each group containing the features for
one category. The instance-aware representations not only bring advantages to
semantic hashing, but also can be used in category-aware hashing, in which an
image is represented by multiple pieces of hash codes and each piece of code
corresponds to a category. Extensive evaluations conducted on several benchmark
datasets demonstrate that, for both semantic hashing and category-aware
hashing, the proposed method shows substantial improvement over the
state-of-the-art supervised and unsupervised hashing methods.
Comment: Has been accepted as a regular paper in the IEEE Transactions on Image Processing, 201
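A minimal sketch of the category-aware code layout described above, with assumed sizes: one feature group and one hash code piece per category.

```python
# The network is assumed to emit one feature group per category; each group
# is hashed independently, so a multi-label image carries one code piece per
# category it contains.
import torch
import torch.nn as nn

num_cats, group_dim, bits = 10, 64, 16
grouped = torch.randn(4, num_cats, group_dim)        # stand-in instance-aware features
hash_layers = nn.ModuleList([nn.Linear(group_dim, bits) for _ in range(num_cats)])

codes = torch.stack([torch.sign(hash_layers[c](grouped[:, c]))
                     for c in range(num_cats)], dim=1)  # (B, num_cats, bits) in {-1, +1}
```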
CCL: Cross-modal Correlation Learning with Multi-grained Fusion by Hierarchical Network
Cross-modal retrieval has become a prominent research topic for retrieval across multimedia data such as image and text. A two-stage learning framework is widely adopted by most existing methods based on Deep Neural Networks (DNN): the first learning stage generates a separate representation for each modality, and the second learning stage obtains the cross-modal common representation. However, the existing methods have three limitations: (1) In
the first learning stage, they only model intra-modality correlation, but
ignore inter-modality correlation with rich complementary context. (2) In the
second learning stage, they only adopt shallow networks with single-loss
regularization, but ignore the intrinsic relevance of intra-modality and
inter-modality correlation. (3) Only original instances are considered while
the complementary fine-grained clues provided by their patches are ignored. To address the above problems, this paper proposes a cross-modal correlation learning (CCL) approach with multi-grained fusion by a hierarchical network, and the contributions are as follows: (1) In the first learning stage, CCL exploits
multi-level association with joint optimization to preserve the complementary
context from intra-modality and inter-modality correlation simultaneously. (2)
In the second learning stage, a multi-task learning strategy is designed to
adaptively balance the intra-modality semantic category constraints and
inter-modality pairwise similarity constraints. (3) CCL adopts multi-grained
modeling, which fuses the coarse-grained instances and fine-grained patches to
make cross-modal correlation more precise. In comparisons with 13 state-of-the-art methods on 6 widely-used cross-modal datasets, our CCL approach achieves the best performance.
Comment: 16 pages, accepted by IEEE Transactions on Multimedia
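One way to sketch the second-stage multi-task balancing, under assumptions that are not CCL's exact scheme: a per-modality semantic classification loss combined with a cross-modal pairwise similarity loss via a tunable weight.

```python
# Intra-modality semantic category constraints plus inter-modality pairwise
# similarity constraints, combined with an assumed trade-off weight.
import torch
import torch.nn.functional as F

img_emb = torch.randn(8, 256, requires_grad=True)
txt_emb = torch.randn(8, 256, requires_grad=True)
logits_i, logits_t = torch.randn(8, 20), torch.randn(8, 20)  # per-modality classifiers
labels = torch.randint(0, 20, (8,))

intra = F.cross_entropy(logits_i, labels) + F.cross_entropy(logits_t, labels)
inter = (1 - F.cosine_similarity(img_emb, txt_emb)).mean()   # pull matched pairs together
loss = intra + 0.5 * inter                                   # 0.5 is an assumed weight
```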
HashGAN: Attention-aware Deep Adversarial Hashing for Cross Modal Retrieval
With the rapid growth of multi-modal data, hashing methods for cross-modal
retrieval have received considerable attention. Deep-networks-based cross-modal
hashing methods are appealing as they can integrate feature learning and hash
coding into end-to-end trainable frameworks. However, it is still challenging
to find content similarities between different modalities of data due to the
heterogeneity gap. To further address this problem, we propose an adversarial
hashing network with attention mechanism to enhance the measurement of content
similarities by selectively focusing on informative parts of multi-modal data.
The proposed new adversarial network, HashGAN, consists of three building
blocks: 1) the feature learning module to obtain feature representations, 2)
the generative attention module to generate an attention mask, which is used to
obtain the attended (foreground) and the unattended (background) feature
representations, 3) the discriminative hash coding module to learn hash
functions that preserve the similarities between different modalities. In our
framework, the generative module and the discriminative module are trained in an adversarial way: the generator is trained so that the discriminator cannot preserve the similarities of multi-modal data w.r.t. the background feature
representations, while the discriminator aims to preserve the similarities of
multi-modal data w.r.t. both the foreground and the background feature
representations. Extensive evaluations on several benchmark datasets
demonstrate that the proposed HashGAN brings substantial improvements over
other state-of-the-art cross-modal hashing methods.
Comment: 10 pages, 8 figures, 3 tables
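A minimal sketch of the attention split under assumed shapes: a small convolutional head emits a spatial mask that divides a feature map into attended (foreground) and unattended (background) representations for the adversarial game.

```python
# Generative attention module as a 1x1 conv + sigmoid over a feature map;
# the mask partitions features into foreground and background streams.
import torch
import torch.nn as nn

feat = torch.randn(2, 256, 14, 14)                    # stand-in image feature map
mask_head = nn.Sequential(nn.Conv2d(256, 1, 1), nn.Sigmoid())

mask = mask_head(feat)                                # (B, 1, H, W) attention mask
fg = (feat * mask).mean(dim=(2, 3))                   # attended, foreground features
bg = (feat * (1 - mask)).mean(dim=(2, 3))             # unattended, background features
```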
A Deep One-Shot Network for Query-based Logo Retrieval
Logo detection in real-world scene images is an important problem with
applications in advertisement and marketing. Existing general-purpose object
detection methods require large training data with annotations for every logo
class. These methods do not satisfy the incremental demand for new logo classes in practical deployment, since it is practically impossible to have such annotated data for every new, unseen logo. In this work, we develop an
easy-to-implement query-based logo detection and localization system by
employing a one-shot learning technique. Given an image of a query logo, our
model searches for it within a given target image and predicts the possible
location of the logo by estimating a binary segmentation mask. The proposed
model consists of a conditional branch and a segmentation branch. The former
gives a conditional latent representation of the given query logo which is
combined with feature maps of the segmentation branch at multiple scales in
order to find the matching position of the query logo in a target image, should
it be present. Feature matching between the latent query representation and the multi-scale feature maps of the segmentation branch, performed by a simple concatenation operation followed by a 1x1 convolution layer, makes our model scale-invariant.
Despite its simplicity, our query-based logo retrieval framework achieved superior performance on the FlickrLogos-32 and TopLogos-10 datasets over different existing baselines.
Comment: Accepted in Pattern Recognition, Elsevier (2019)
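The matching step described above can be sketched directly; the shapes below are assumptions for illustration.

```python
# The query's latent vector is tiled over one scale of the segmentation
# branch's feature map, concatenated channel-wise, and fused by a 1x1 conv.
import torch
import torch.nn as nn

query_vec = torch.randn(1, 128)                       # conditional latent of the query logo
feat_map = torch.randn(1, 256, 32, 32)                # one scale of the segmentation branch

tiled = query_vec[:, :, None, None].expand(-1, -1, 32, 32)
fuse = nn.Conv2d(256 + 128, 256, kernel_size=1)       # 1x1 conv fusion layer
fused = fuse(torch.cat([feat_map, tiled], dim=1))     # matched feature map for the mask head
```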
Modality-specific Cross-modal Similarity Measurement with Recurrent Attention Network
Nowadays, cross-modal retrieval plays an indispensable role in flexibly finding information across different modalities of data. Effectively measuring the similarity between different modalities of data is the key to cross-modal retrieval. Different modalities such as image and text have imbalanced and complementary relationships, which contain unequal amounts of information when
describing the same semantics. For example, images often contain more details
that cannot be demonstrated by textual descriptions and vice versa. Existing
works based on Deep Neural Networks (DNN) mostly construct one common space for different modalities to find the latent alignments between them, which loses their exclusive modality-specific characteristics. Different from the existing works, we propose a modality-specific cross-modal similarity measurement (MCSM) approach that constructs an independent semantic space for each modality and adopts an end-to-end framework to directly generate modality-specific cross-modal similarity without an explicit common representation. For each semantic space,
the modality-specific characteristics within one modality are fully exploited by a recurrent attention network, while the data of the other modality is projected into this space with attention-based joint embedding. The learned attention weights then guide fine-grained cross-modal correlation learning, capturing the imbalanced and complementary relationships between different modalities. Finally, the complementarity between the semantic
spaces for different modalities is explored by adaptive fusion of the
modality-specific cross-modal similarities to perform cross-modal retrieval.
Experiments on the widely-used Wikipedia and Pascal Sentence datasets as well
as our constructed large-scale XMediaNet dataset verify the effectiveness of
our proposed approach, outperforming 9 state-of-the-art methods.
Comment: 13 pages, submitted to IEEE Transactions on Image Processing
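A minimal sketch of the final adaptive fusion step, with placeholder similarity matrices and a stand-in fusion weight (the paper learns these; the values here are hypothetical).

```python
# Two modality-specific cross-modal similarity matrices are fused adaptively
# into the final retrieval scores.
import torch

sim_in_img_space = torch.randn(8, 8)                  # text projected into image space
sim_in_txt_space = torch.randn(8, 8)                  # image projected into text space

alpha = torch.sigmoid(torch.tensor(0.3))              # stand-in learned fusion weight
final_sim = alpha * sim_in_img_space + (1 - alpha) * sim_in_txt_space
ranking = final_sim.argsort(dim=1, descending=True)   # retrieval order per query
```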
Cross-modal Deep Metric Learning with Multi-task Regularization
DNN-based cross-modal retrieval has become a research hotspot, allowing users to search for results across various modalities such as image and text. However,
existing methods mainly focus on the pairwise correlation and reconstruction
error of labeled data. They ignore the semantically similar and dissimilar
constraints between different modalities, and cannot take advantage of
unlabeled data. This paper proposes Cross-modal Deep Metric Learning with
Multi-task Regularization (CDMLMR), which integrates quadruplet ranking loss
and semi-supervised contrastive loss for modeling cross-modal semantic
similarity in a unified multi-task learning architecture. The quadruplet
ranking loss can model the semantically similar and dissimilar constraints to
preserve cross-modal relative similarity ranking information. The
semi-supervised contrastive loss is able to maximize the semantic similarity on
both labeled and unlabeled data. Compared to the existing methods, CDMLMR
exploits not only the similarity ranking information but also unlabeled
cross-modal data, and thus boosts cross-modal retrieval accuracy.
Comment: Revision: Added reference [7]. 6 pages, 1 figure, to appear in the proceedings of the IEEE International Conference on Multimedia and Expo (ICME), Jul 10, 2017 - Jul 14, 2017, Hong Kong, Hong Kong
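A minimal sketch of a quadruplet ranking loss over cross-modal embeddings; the sampling pattern and margins are generic assumptions, not necessarily CDMLMR's exact formulation.

```python
# Anchor/positive/negative/second-negative quadruplets: the positive pair is
# pulled inside two margins relative to the negative pairs.
import torch
import torch.nn.functional as F

def quadruplet_loss(a, p, n1, n2, m1=1.0, m2=0.5):
    d = lambda x, y: ((x - y) ** 2).sum(dim=1)        # squared Euclidean distance
    # first hinge: anchor-positive closer than anchor-negative by margin m1;
    # second hinge: anchor-positive closer than the negatives' own pair by m2
    return (F.relu(m1 + d(a, p) - d(a, n1)) +
            F.relu(m2 + d(a, p) - d(n1, n2))).mean()

a, p, n1, n2 = torch.randn(4, 16, 64).unbind(0)       # toy embedding batches
loss = quadruplet_loss(a, p, n1, n2)
```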