SoDeep: a Sorting Deep net to learn ranking loss surrogates
Several tasks in machine learning are evaluated using non-differentiable
metrics such as mean average precision or Spearman correlation. However, their
non-differentiability prevents their direct use as objective functions in a
learning framework. Surrogate and relaxation methods exist but tend to be
specific to a given metric.
In the present work, we introduce a new method to learn approximations of
such non-differentiable objective functions. Our approach is based on a deep
architecture that approximates the sorting of arbitrary sets of scores. It is
trained virtually for free using synthetic data. This sorting deep (SoDeep) net
can then be combined in a plug-and-play manner with existing deep
architectures. We demonstrate the usefulness of our approach on three different
tasks that require ranking: cross-modal text-image retrieval, multi-label image
classification, and visual memorability ranking. Our approach yields very
competitive results on these three tasks, which validates the merit and the
flexibility of SoDeep as a proxy for the sorting operation in ranking-based
losses.
Comment: Accepted to CVPR 2019
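The "trained virtually for free using synthetic data" claim rests on the fact that sorting supervision can be generated endlessly at no labelling cost. A minimal sketch of that data-generation step, with illustrative function names and shapes that are assumptions rather than the authors' exact setup:

```python
import numpy as np

def make_sorting_batch(batch_size, seq_len, rng):
    """Return (scores, ranks): ranks[i, j] is the rank of scores[i, j]
    within row i (0 = smallest). Such labelled pairs cost nothing to
    produce, so a sorter network can be trained 'virtually for free'."""
    scores = rng.standard_normal((batch_size, seq_len))
    # argsort of argsort converts raw scores into rank positions
    ranks = scores.argsort(axis=1).argsort(axis=1)
    return scores, ranks

scores, ranks = make_sorting_batch(4, 6, np.random.default_rng(0))
```

A differentiable sorter trained on such pairs can then stand in for the hard sorting step inside a ranking-based loss.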
Learning to Learn from Web Data through Deep Semantic Embeddings
In this paper we propose to learn a multimodal image and text embedding from
Web and Social Media data, aiming to leverage the semantic knowledge learnt in
the text domain and transfer it to a visual model for semantic image retrieval.
We demonstrate that the pipeline can learn from images with associated text
without supervision and perform a thorough analysis of five different text
embeddings on three different benchmarks. We show that the embeddings learnt
with Web and Social Media data are competitive with supervised methods on the
text-based image retrieval task, and that we clearly outperform the state of
the art on the MIRFlickr dataset when training on the target data. Further,
we demonstrate how semantic multimodal image retrieval can be performed using
the learnt embeddings, going beyond classical instance-level retrieval
problems. Finally, we present a new dataset, InstaCities1M, composed of
Instagram images and their associated texts, that can be used for fair
comparison of image-text embeddings.
Comment: ECCV MULA Workshop 2018
A New Evaluation Protocol and Benchmarking Results for Extendable Cross-media Retrieval
This paper proposes a new evaluation protocol for cross-media retrieval that
better fits real-world applications. Both image-text and text-image
retrieval modes are considered. Traditionally, class labels in the training and
testing sets are identical. That is, it is usually assumed that the query falls
into some pre-defined classes. However, in practice, the content of a query
image/text may vary extensively, and the retrieval system does not necessarily
know in advance the class label of a query. Considering the inconsistency
between the real-world applications and laboratory assumptions, we think that
the existing protocol that works under identical train/test classes can be
modified and improved.
This work is dedicated to addressing this problem by considering the protocol
under an extendable scenario, i.e., the training and testing classes do not
overlap. We provide extensive benchmarking results obtained by the existing
protocol and the proposed new protocol on several commonly used datasets. We
demonstrate a noticeable performance drop when the testing classes are unseen
during training. Additionally, a trivial solution, i.e., directly using the
predicted class label for cross-media retrieval, is tested. We show that the
trivial solution is very competitive in traditional non-extendable retrieval,
but becomes less so under the new settings. The train/test split, evaluation
code, and benchmarking results are publicly available on our website.
Comment: 10 pages, 9 figures
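The core of the extendable protocol is that the split is made over classes, not samples, so every test-time query belongs to a class unseen during training. A toy sketch of that split; the class names and the 50/50 ratio are illustrative assumptions:

```python
import random

def extendable_split(class_labels, seed=0):
    """Split the set of classes (not the samples) into two disjoint halves:
    one half for training, the other reserved for unseen test queries."""
    classes = sorted(set(class_labels))
    rng = random.Random(seed)
    rng.shuffle(classes)
    half = len(classes) // 2
    return set(classes[:half]), set(classes[half:])

train_classes, test_classes = extendable_split(
    ["cat", "dog", "car", "tree", "boat", "bird"])
```

Because the two class sets are disjoint, a classifier's predicted label can no longer map a query directly onto a test class, which is why the trivial solution loses its edge under this setting.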
Engineering Deep Representations for Modeling Aesthetic Perception
Many aesthetic models in computer vision suffer from two shortcomings: 1) the
low descriptiveness and interpretability of those hand-crafted aesthetic
criteria (i.e., nonindicative of region-level aesthetics), and 2) the
difficulty of engineering aesthetic features adaptively and automatically
toward different image sets. To remedy these problems, we develop a deep
architecture to learn aesthetically-relevant visual attributes from Flickr,
which are localized by multiple textual attributes in a weakly-supervised
setting. More specifically, using a bag-of-words (BoW) representation of the
frequent Flickr image tags, a sparsity-constrained subspace algorithm discovers
a compact set of textual attributes (e.g., landscape and sunset) for each
image. Then, a weakly-supervised learning algorithm projects the textual
attributes at image-level to the highly-responsive image patches at
pixel-level. These patches indicate where humans look at appealing regions with
respect to each textual attribute, which are employed to learn the visual
attributes. Psychological and anatomical studies have shown that humans
perceive visual concepts hierarchically. Hence, we normalize these patches and
feed them into a five-layer convolutional neural network (CNN) to mimic the
hierarchical way humans perceive the visual attributes. We apply the learned
deep features to image retargeting, aesthetics ranking, and retrieval. Both
subjective and objective experimental results thoroughly demonstrate the
competitiveness of our approach.
Natural Disasters Detection in Social Media and Satellite imagery: a survey
The analysis of natural disaster-related multimedia content has received great
attention in recent years. Being one of the most important sources of
information, social media have been crawled over the years to collect and
analyze disaster-related multimedia content. Satellite imagery has also been
widely explored for disasters analysis. In this paper, we survey the existing
literature on disaster detection and analysis of the retrieved information from
social media and satellites. Based on the nature of the content, the literature
on disaster detection and analysis of related multimedia content can be
categorized into three groups, namely (i) disaster detection in text; (ii)
analysis of disaster-related visual content from social media; and (iii)
disaster detection in satellite imagery. We extensively review different
approaches proposed in these three domains. Furthermore, we also review
benchmarking datasets available for the evaluation of disaster detection
frameworks. Moreover, we provide a detailed discussion on the insights obtained
from the literature review, and identify future trends and challenges, which
will provide an important starting point for researchers in the field.
Picture It In Your Mind: Generating High Level Visual Representations From Textual Descriptions
In this paper we tackle the problem of image search when the query is a short
textual description of the image the user is looking for. We choose to
implement the actual search process as a similarity search in a visual feature
space, by learning to translate a textual query into a visual representation.
Searching in the visual feature space has the advantage that any update to the
translation model does not require reprocessing the, typically huge, image
collection on which the search is performed. We propose Text2Vis, a neural
network that generates a visual representation, in the visual feature space of
the fc6-fc7 layers of ImageNet, from a short descriptive text. Text2Vis
optimizes two loss functions, using a stochastic loss-selection method. A
visual-focused loss is aimed at learning the actual text-to-visual feature
mapping, while a text-focused loss is aimed at modeling the higher-level
semantic concepts expressed in language and countering overfitting on
non-relevant visual components of the visual loss. We report preliminary
results on the MS-COCO dataset.
Comment: Neu-IR '16 SIGIR Workshop on Neural Information Retrieval, July 21,
2016, Pisa, Italy
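The stochastic loss-selection idea described above can be sketched in a few lines: at each training step one of the two objectives is sampled and optimized alone. The concrete loss functions and the 0.5 mixing probability below are placeholder assumptions, not Text2Vis's actual formulations:

```python
import random

def visual_loss(pred, target):
    # stand-in for the visual-focused term (text-to-visual feature mapping)
    return sum((p - t) ** 2 for p, t in zip(pred, target))

def text_loss(pred, target):
    # stand-in for the text-focused term (higher-level semantic modeling)
    return sum(abs(p - t) for p, t in zip(pred, target))

def pick_loss(rng, p_visual=0.5):
    """Sample which of the two objectives to optimize at this step."""
    return visual_loss if rng.random() < p_visual else text_loss
```

Alternating stochastically between the two terms lets the text-focused loss act as a regularizer against the visual loss fitting non-relevant visual components.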
Pseudo-positive regularization for deep person re-identification
An intrinsic challenge of person re-identification (re-ID) is the annotation
difficulty. This typically means 1) few training samples per identity and 2) a
resulting lack of diversity among the training samples. Consequently, we face a
high risk of over-fitting when training the convolutional neural network (CNN),
a state-of-the-art method in person re-ID. To reduce the risk of over-fitting,
this paper proposes a Pseudo Positive Regularization (PPR) method to enrich the
diversity of the training data. Specifically, unlabeled data from an
independent pedestrian database is retrieved using the target training data as
query. A small proportion of these retrieved samples are randomly selected as
the Pseudo Positive samples and added to the target training set for the
supervised CNN training. The addition of Pseudo Positive samples is therefore a
data augmentation method to reduce the risk of over-fitting during CNN
training. We implement our idea in the identification CNN models (i.e.,
CaffeNet, VGGNet-16 and ResNet-50). On CUHK03 and Market-1501 datasets,
experimental results demonstrate that the proposed method consistently improves
the baseline and yields competitive performance to the state-of-the-art person
re-ID methods.
Comment: 12 pages, 6 figures
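The augmentation step described in the abstract — retrieve unlabeled pedestrians using each training sample as a query, then randomly keep a small proportion as pseudo positives — can be sketched as follows. The similarity function is a stub, and the top-20 / 10% defaults are illustrative assumptions rather than the paper's values:

```python
import random

def pseudo_positives(train_set, unlabeled_pool, similarity, top_k=20,
                     fraction=0.1, seed=0):
    """Augment train_set with pseudo-positive samples retrieved from an
    independent unlabeled pool, as a regularizer against over-fitting."""
    rng = random.Random(seed)
    augmented = list(train_set)
    for query in train_set:
        # rank the unlabeled pool by similarity to the query
        ranked = sorted(unlabeled_pool, key=lambda x: -similarity(query, x))
        candidates = ranked[:top_k]
        # randomly keep a small proportion as pseudo positives
        n_keep = max(1, int(fraction * len(candidates)))
        augmented.extend(rng.sample(candidates, n_keep))
    return augmented
```

In the paper's setting the pseudo positives inherit the query's identity label for supervised CNN training; here they are simply appended to the training list.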
DimensionRank: Personal Neural Representations for Personalized General Search
Web Search and Social Media have always been two of the most important
applications on the internet. We begin by giving a unified framework, called
general search, of which all search and social media products can be seen as
instances.
DimensionRank is our main contribution. This is an algorithm for personalized
general search, based on neural networks. DimensionRank's bold innovation is to
model and represent each user using their own unique personal neural
representation vector, a learned representation in a real-valued
multidimensional vector space. This is the first internet service we are aware
of to model each user with their own independent representation vector. This is
also the first service we are aware of to attempt personalization for general
web search. In addition, neural representations allow us to present the first
Reddit-style algorithm that is immune to the problem of "brigading". We
believe personalized general search will yield a search product orders of
magnitude better than Google's one-size-fits-all web search algorithm.
Finally, we announce Deep Revelations, a new search and social network
internet application based on DimensionRank.
Modality-dependent Cross-media Retrieval
In this paper, we investigate the cross-media retrieval between images and
text, i.e., using image to search text (I2T) and using text to search images
(T2I). Existing cross-media retrieval methods usually learn one couple of
projections, by which the original features of images and text can be projected
into a common latent space to measure the content similarity. However, using
the same projections for the two different retrieval tasks (I2T and T2I) may
lead to a tradeoff between their respective performances, rather than their
best performances. Different from previous works, we propose a
modality-dependent cross-media retrieval (MDCR) model, where two couples of
projections are learned for different cross-media retrieval tasks instead of
one couple of projections. Specifically, by jointly optimizing the correlation
between images and text and the linear regression from one modal space (image
or text) to the semantic space, two couples of mappings are learned to project
images and text from their original feature spaces into two common latent
subspaces (one for I2T and the other for T2I). Extensive experiments show the
superiority of the proposed MDCR compared with other methods. In particular,
based on the 4,096-dimensional convolutional neural network (CNN) visual
feature and the 100-dimensional LDA textual feature, the proposed method
achieves an mAP of 41.5%, which is a new state-of-the-art performance on the
Wikipedia dataset.
Comment: in ACM Transactions on Intelligent Systems and Technology
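The modality-dependent idea — one couple of projections per retrieval direction instead of a single shared couple — can be illustrated in a few lines. The random matrices below merely stand in for projections that MDCR learns with its joint correlation/regression objective, which is not reproduced here; all dimensions are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
d_img, d_txt, d_common = 8, 5, 3

# one couple of (image, text) projections per retrieval direction
proj = {
    "I2T": (rng.standard_normal((d_img, d_common)),
            rng.standard_normal((d_txt, d_common))),
    "T2I": (rng.standard_normal((d_img, d_common)),
            rng.standard_normal((d_txt, d_common))),
}

def similarity(img_feat, txt_feat, task):
    """Project both modalities with the task-specific couple, then
    compare them with cosine similarity in the common latent space."""
    P_img, P_txt = proj[task]
    u, v = img_feat @ P_img, txt_feat @ P_txt
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
```

Keeping the two couples separate is what avoids the I2T/T2I performance tradeoff the abstract attributes to shared-projection methods.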
Recent Advance in Content-based Image Retrieval: A Literature Survey
The explosive increase and ubiquitous accessibility of visual data on the Web
have led to the prosperity of research activity in image search or retrieval.
Because they ignore visual content as a ranking clue, methods that apply text
search techniques to visual retrieval may suffer from inconsistency between the
text words and the visual content. Content-based image retrieval (CBIR), which
makes use of the representation of visual content to identify relevant images,
has attracted sustained attention over the last two decades. Such a problem is
challenging due to the intention gap and the semantic gap problems. Numerous
techniques have been developed for content-based image retrieval in the last
decade. The purpose of this paper is to categorize and evaluate those
algorithms proposed during the period of 2003 to 2016. We conclude with several
promising directions for future research.Comment: 22 page