Adversarial Learning for Fine-grained Image Search
Fine-grained image search is still a challenging problem due to the
difficulty of capturing subtle differences between fine-grained categories
regardless of object pose variations. In practice, a dynamic inventory with new
fine-grained categories adds another dimension to this challenge. In this work,
we propose an end-to-end network, called FGGAN, that learns discriminative
representations by implicitly learning a geometric transformation from
multi-view images for fine-grained image search. We integrate a generative
adversarial network (GAN) that can automatically handle complex view and pose
variations by converting them to a canonical view without any predefined
transformations. Moreover, in an open-set scenario, our network is able to
better match images from unseen and unknown fine-grained categories. Extensive
experiments on two public datasets and a newly collected dataset have
demonstrated the outstanding and robust performance of the proposed FGGAN in both
closed-set and open-set scenarios, providing as much as 10% relative
improvement compared to baselines.
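The abstract gives no implementation details, so the following is only a minimal PyTorch sketch of the recipe it describes: a generator that maps a pose-varied input toward a canonical view, judged by an adversarial critic, while an embedding head is trained with a metric-learning (triplet) loss for retrieval. All module shapes, the toy critic, and the loss weighting are assumptions, not the authors' FGGAN.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CanonicalViewGenerator(nn.Module):
    """Toy generator mapping a pose-varied image toward a 'canonical view'."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 3, 3, padding=1), nn.Tanh(),
        )

    def forward(self, x):
        return self.net(x)

class EmbeddingHead(nn.Module):
    """Toy encoder producing L2-normalised retrieval embeddings."""
    def __init__(self, dim=128):
        super().__init__()
        self.conv = nn.Sequential(nn.Conv2d(3, 32, 3, stride=2), nn.ReLU(),
                                  nn.AdaptiveAvgPool2d(1))
        self.fc = nn.Linear(32, dim)

    def forward(self, x):
        return F.normalize(self.fc(self.conv(x).flatten(1)), dim=1)

gen, emb = CanonicalViewGenerator(), EmbeddingHead()
critic = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 1))  # toy realism critic

anchor, pos, neg = (torch.randn(4, 3, 64, 64) for _ in range(3))
canon = gen(anchor)                               # pose/view normalisation
adv = F.binary_cross_entropy_with_logits(critic(canon), torch.ones(4, 1))
ret = nn.TripletMarginLoss(margin=0.2)(emb(canon), emb(pos), emb(neg))
(adv + ret).backward()                            # one generator/embedding step
```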
Cross-modal Hallucination for Few-shot Fine-grained Recognition
State-of-the-art deep learning algorithms generally require large amounts of
data for model training. Lack thereof can severely deteriorate the performance,
particularly in scenarios with fine-grained boundaries between categories. To
this end, we propose a multimodal approach that facilitates bridging the
information gap by means of meaningful joint embeddings. Specifically, we
present a benchmark that is multimodal during training (i.e. images and texts)
and single-modal at testing time (i.e. images), with the associated task of
utilizing multimodal data in base classes (with many samples) to learn explicit
visual classifiers for novel classes (with few samples). Next, we propose a
framework built upon the idea of cross-modal data hallucination. In this
regard, we introduce a discriminative text-conditional GAN for sample
generation with a simple self-paced strategy for sample selection. We show the
results of our proposed discriminative hallucinated method for 1-, 2-, and 5-
shot learning on the CUB dataset, where the accuracy is improved by employing
multimodal data. Comment: CVPR 2018 Workshop on Fine-Grained Visual Categorization.
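As a rough illustration of the self-paced sample-selection idea (not the authors' code), one could rank hallucinated samples by the classifier's confidence on their intended label and keep only the easiest fraction, growing that fraction over training. A minimal PyTorch sketch with made-up shapes:

```python
import torch
import torch.nn.functional as F

def self_paced_select(logits, labels, keep_frac):
    """Rank hallucinated samples by the classifier's confidence on their
    intended (conditioning) label and keep the easiest `keep_frac` fraction."""
    conf = F.softmax(logits, dim=1).gather(1, labels.unsqueeze(1)).squeeze(1)
    k = max(1, int(keep_frac * len(conf)))
    return torch.topk(conf, k).indices

# Hypothetical usage: a frozen classifier scores 32 text-conditioned samples
# over 200 classes (e.g. CUB); keep_frac would grow over training.
logits = torch.randn(32, 200)
labels = torch.randint(0, 200, (32,))
selected = self_paced_select(logits, labels, keep_frac=0.25)
```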
Ancient Painting to Natural Image: A New Solution for Painting Processing
Collecting a large-scale and well-annotated dataset for image processing has
become a common practice in computer vision. However, in the ancient painting
area, this task is not practical as the number of paintings is limited and
their style is greatly diverse. We, therefore, propose a novel solution for the
problems that come with ancient painting processing. This is to use domain
transfer to convert ancient paintings to photo-realistic natural images. By
doing so, the ancient painting processing problems become natural image
processing problems and models trained on natural images can be directly
applied to the transferred paintings. Specifically, we focus on Chinese ancient
flower, bird and landscape paintings in this work. A novel Domain Style
Transfer Network (DSTN) is proposed to transfer ancient paintings to natural
images; it employs a compound loss to ensure that the transferred paintings
still maintain the color composition and content of the input paintings. The
experiment results show that the transferred paintings generated by the DSTN
have a better performance in both the human perceptual test and other image
processing tasks than other state-of-the-art methods, indicating the authenticity
of the transferred paintings and the superiority of the proposed method. Comment: 10 pages, 6 figures, published in WACV 2019.
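The abstract only names the ingredients of the compound loss, so the sketch below is an assumption about how such a loss could be assembled in PyTorch: an adversarial realism term, a content term measured in the feature space of any frozen extractor `feat_fn`, and a colour-composition term computed on heavily downsampled images. The weights are placeholders, not the paper's values.

```python
import torch
import torch.nn.functional as F

def compound_loss(fake_logits, fake_img, src_img, feat_fn,
                  w_adv=1.0, w_content=10.0, w_color=5.0):
    """Assumed combination: adversarial realism + content preservation in the
    feature space of a frozen extractor `feat_fn` + a crude colour-composition
    term on heavily downsampled images. Weights are placeholders."""
    adv = F.binary_cross_entropy_with_logits(
        fake_logits, torch.ones_like(fake_logits))
    content = F.l1_loss(feat_fn(fake_img), feat_fn(src_img))
    color = F.l1_loss(F.adaptive_avg_pool2d(fake_img, 8),
                      F.adaptive_avg_pool2d(src_img, 8))
    return w_adv * adv + w_content * content + w_color * color
```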
Thinking Outside the Pool: Active Training Image Creation for Relative Attributes
Current wisdom suggests more labeled image data is always better, and
obtaining labels is the bottleneck. Yet curating a pool of sufficiently diverse
and informative images is itself a challenge. In particular, training image
curation is problematic for fine-grained attributes, where the subtle visual
differences of interest may be rare within traditional image sources. We
propose an active image generation approach to address this issue. The main
idea is to jointly learn the attribute ranking task while also learning to
generate novel realistic image samples that will benefit that task. We
introduce an end-to-end framework that dynamically "imagines" image pairs that
would confuse the current model, presents them to human annotators for
labeling, then improves the predictive model with the new examples. With
results on two datasets, we show that by thinking outside the pool of real
images, our approach gains generalization accuracy for challenging fine-grained
attribute comparisons.
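A hypothetical round of this "imagine, label, retrain" loop might look as follows in PyTorch; `generator`, `ranker`, and `oracle` (the human annotator) are stand-ins, and `latent_dim` is an assumed attribute, so this is a sketch of the control flow rather than the paper's implementation.

```python
import torch

def active_generation_round(generator, ranker, oracle,
                            n_candidates=256, n_query=16):
    """Sample candidate image pairs from the generator, keep the pairs on
    which the current attribute ranker is least decisive, and query the
    oracle (human annotator) for their relative-attribute labels."""
    z1 = torch.randn(n_candidates, generator.latent_dim)
    z2 = torch.randn(n_candidates, generator.latent_dim)
    x1, x2 = generator(z1), generator(z2)
    margin = (ranker(x1) - ranker(x2)).abs().squeeze(-1)  # small = confusing
    idx = torch.topk(-margin, n_query).indices            # most confusing pairs
    return x1[idx], x2[idx], oracle(x1[idx], x2[idx])
```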
Show, Adapt and Tell: Adversarial Training of Cross-domain Image Captioner
Impressive image captioning results are achieved in domains with plenty of
training image and sentence pairs (e.g., MSCOCO). However, transferring to a
target domain with significant domain shifts but no paired training data
(referred to as cross-domain image captioning) remains largely unexplored. We
propose a novel adversarial training procedure to leverage unpaired data in the
target domain. Two critic networks are introduced to guide the captioner,
namely domain critic and multi-modal critic. The domain critic assesses whether
the generated sentences are indistinguishable from sentences in the target
domain. The multi-modal critic assesses whether an image and its generated
sentence are a valid pair. During training, the critics and captioner act as
adversaries: the captioner aims to generate indistinguishable sentences, whereas
the critics aim to distinguish them. The assessment improves the captioner
through policy gradient updates. During inference, we further propose a novel
critic-based planning method to select high-quality sentences without
additional supervision (e.g., tags). To evaluate, we use MSCOCO as the source
domain and four other datasets (CUB-200-2011, Oxford-102, TGIF, and Flickr30k)
as the target domains. Our method consistently performs well on all datasets.
In particular, on CUB-200-2011, we achieve 21.8% CIDEr-D improvement after
adaptation. Utilizing critics during inference further gives another 4.5%
boost. Comment: ICCV 2017.
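A minimal sketch of how the two critics could drive a REINFORCE-style update of the captioner, assuming each sampled caption comes with its summed log-likelihood and a scalar score from each critic; the equal weighting and constant baseline are made up, not taken from the paper.

```python
import torch

def captioner_pg_loss(log_probs, domain_scores, mm_scores, baseline=0.5):
    """REINFORCE-style loss: reward each sampled caption by a mix of the
    domain critic's score and the multi-modal critic's score; `log_probs`
    is the summed log-likelihood of each caption under the captioner."""
    reward = 0.5 * domain_scores + 0.5 * mm_scores
    return -((reward - baseline).detach() * log_probs).mean()

# Hypothetical batch of 8 sampled captions.
loss = captioner_pg_loss(torch.randn(8), torch.rand(8), torch.rand(8))
```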
Fine-grained Visual-textual Representation Learning
Fine-grained visual categorization aims to recognize hundreds of subcategories
belonging to the same basic-level category, which is a highly challenging task
due to the quite subtle and local visual distinctions among similar
subcategories. Most existing methods generally learn part detectors to discover
discriminative regions for better categorization performance. However, not all
parts are beneficial and indispensable for visual categorization, and the
choice of the number of part detectors relies heavily on prior knowledge as well as
experimental validation. When we describe the object in an image via textual
descriptions, we mainly focus on the pivotal characteristics,
and rarely pay attention to common characteristics as well as the background
areas. This is an involuntary transfer from human visual attention to textual
attention, which means that textual attention tells us how many and which parts
are discriminative and significant for categorization. Textual attention could
therefore help us discover visual attention in an image. Inspired by
this, we propose a fine-grained visual-textual representation learning (VTRL)
approach, and its main contributions are: (1) Fine-grained visual-textual
pattern mining is devoted to discovering discriminative visual-textual pairwise
information for boosting categorization performance through jointly modeling
vision and text with generative adversarial networks (GANs), which
automatically and adaptively discovers discriminative parts. (2) Visual-textual
representation learning jointly combines visual and textual information, which
preserves the intra-modality and inter-modality information to generate
complementary fine-grained representation, as well as further improves
categorization performance. Comment: 12 pages, accepted by IEEE Transactions on Circuits and Systems for
Video Technology (TCSVT).
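As a toy illustration (not the authors' VTRL), the joint visual-textual representation could be sketched as two projections into a shared space with an intra-modality classification loss per modality and a contrastive inter-modality matching loss; all dimensions below are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointVT(nn.Module):
    """Toy joint visual-textual embedding: both modalities are projected into
    a shared space; an intra-modality classification loss is kept per modality
    and a contrastive matching loss pulls paired image/text together."""
    def __init__(self, v_dim=512, t_dim=300, dim=256, n_classes=200):
        super().__init__()
        self.v_proj, self.t_proj = nn.Linear(v_dim, dim), nn.Linear(t_dim, dim)
        self.cls = nn.Linear(dim, n_classes)

    def forward(self, v_feat, t_feat, labels):
        v = F.normalize(self.v_proj(v_feat), dim=1)
        t = F.normalize(self.t_proj(t_feat), dim=1)
        intra = F.cross_entropy(self.cls(v), labels) + \
                F.cross_entropy(self.cls(t), labels)
        sim = v @ t.t() / 0.07                       # inter-modality matching
        target = torch.arange(len(v))
        inter = F.cross_entropy(sim, target) + F.cross_entropy(sim.t(), target)
        return intra + inter

model = JointVT()
loss = model(torch.randn(8, 512), torch.randn(8, 300), torch.randint(0, 200, (8,)))
```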
Joint Discriminative and Generative Learning for Person Re-identification
Person re-identification (re-id) remains challenging due to significant
intra-class variations across different cameras. Recently, there has been a
growing interest in using generative models to augment training data and
enhance the invariance to input changes. The generative pipelines in existing
methods, however, stay relatively separate from the discriminative re-id
learning stages. Accordingly, re-id models are often trained in a
straightforward manner on the generated data. In this paper, we seek to improve
learned re-id embeddings by better leveraging the generated data. To this end,
we propose a joint learning framework that couples re-id learning and data
generation end-to-end. Our model involves a generative module that separately
encodes each person into an appearance code and a structure code, and a
discriminative module that shares the appearance encoder with the generative
module. By switching the appearance or structure codes, the generative module
is able to generate high-quality cross-id composed images, which are fed back
online to the appearance encoder and used to improve the discriminative module.
The proposed joint learning framework renders significant improvement over the
baseline without using generated data, leading to the state-of-the-art
performance on several benchmark datasets. Comment: CVPR 2019 (Oral).
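The code-swapping idea can be illustrated with a toy PyTorch module: encode each image into an appearance code and a structure code, then decode the appearance code of one identity with the structure code of another to obtain a cross-id image. All layer sizes here are placeholders (the real model is convolutional); this is a sketch of the mechanism, not the paper's network.

```python
import torch
import torch.nn as nn

class SwapGenerator(nn.Module):
    """Toy code-swapping generator: decode the appearance code of identity A
    with the structure (pose) code of identity B to get a cross-id image."""
    def __init__(self, app_dim=128, str_dim=128):
        super().__init__()
        self.enc_app = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 32, app_dim))
        self.enc_str = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 32, str_dim))
        self.dec = nn.Sequential(nn.Linear(app_dim + str_dim, 3 * 64 * 32), nn.Tanh())

    def forward(self, x_a, x_b):
        code = torch.cat([self.enc_app(x_a), self.enc_str(x_b)], dim=1)
        return self.dec(code).view(-1, 3, 64, 32)

gen = SwapGenerator()
x_a, x_b = torch.randn(4, 3, 64, 32), torch.randn(4, 3, 64, 32)
cross_id = gen(x_a, x_b)  # would be fed back to the shared appearance encoder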
Domain invariant hierarchical embedding for grocery products recognition
Recognizing packaged grocery products based solely on appearance is still an
open issue for modern computer vision systems due to peculiar challenges.
Firstly, the number of different items to be recognized is huge (i.e., on the
order of thousands) and rapidly changes over time. Moreover, there exists a
significant domain shift between the images that should be recognized at test
time, taken in stores by cheap cameras, and those available for training,
usually just one or a few studio-quality images per product. We propose an
end-to-end architecture comprising a GAN to address the domain shift at
training time and a deep CNN trained on the samples generated by the GAN to
learn an embedding of product images that enforces a hierarchy between product
categories. At test time, we perform recognition by means of K-NN search
against a database consisting of just one reference image per product.
Experiments addressing recognition of products present in the training datasets
as well as different ones unseen at training time show that our approach
compares favourably to state-of-the-art methods on the grocery recognition task
and generalizes fairly well to similar ones.
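The test-time recognition step described here is plain K-NN search over embeddings against a gallery holding one reference per product; a minimal sketch (assuming L2-normalised embeddings and made-up sizes) could be:

```python
import torch
import torch.nn.functional as F

def knn_recognize(query_emb, ref_embs, ref_labels, k=1):
    """Return the label(s) of the k closest reference embeddings (cosine
    similarity on L2-normalised vectors) for each query embedding."""
    sims = F.normalize(query_emb, dim=1) @ F.normalize(ref_embs, dim=1).t()
    return ref_labels[sims.topk(k, dim=1).indices]

# Hypothetical gallery: 1000 products, one 128-d reference embedding each.
ref_embs, ref_labels = torch.randn(1000, 128), torch.arange(1000)
pred = knn_recognize(torch.randn(8, 128), ref_embs, ref_labels, k=1)
```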
Interpreting Adversarial Examples with Attributes
The vulnerability of deep computer vision systems to imperceptible and carefully
crafted noise has raised questions regarding the robustness of their
decisions. We take a step back and approach this problem from an orthogonal
direction. We propose to enable black-box neural networks to justify their
reasoning both for clean and for adversarial examples by leveraging attributes,
i.e. visually discriminative properties of objects. We rank attributes based on
their class relevance, i.e. how the classification decision changes when the
input is visually slightly perturbed, as well as image relevance, i.e. how well
the attributes can be localized on both clean and perturbed images. We present
comprehensive experiments for attribute prediction, adversarial example
generation, adversarially robust learning, and their qualitative and
quantitative analysis using predicted attributes on three benchmark datasets.
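One simple (assumed) reading of the relevance ranking is to compare per-attribute predictions on a clean image and its slightly perturbed counterpart and rank attributes by the magnitude of the change; the sketch below only illustrates that ranking step, not the paper's exact criteria.

```python
import torch

def rank_attributes(attr_scores_clean, attr_scores_adv):
    """Rank attributes by how much their predicted strength changes between a
    clean image and its slightly perturbed counterpart (largest change first)."""
    delta = (attr_scores_adv - attr_scores_clean).abs()
    return torch.argsort(delta, descending=True)

ranking = rank_attributes(torch.rand(64), torch.rand(64))  # 64 attributes
```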
Open Logo Detection Challenge
Existing logo detection benchmarks consider artificial deployment scenarios
by assuming that large training data with fine-grained bounding box annotations
for each class are available for model training. Such assumptions are often
invalid in realistic logo detection scenarios where new logo classes come
progressively and need to be detected with little or no budget for
exhaustively labelling fine-grained training data for every new class. Existing
benchmarks are thus unable to evaluate the true performance of a logo detection
method in realistic and open deployments. In this work, we introduce a more
realistic and challenging logo detection setting, called Open Logo Detection.
Specifically, this new setting assumes fine-grained labelling only on a small
proportion of logo classes whilst the remaining classes have no labelled
training data to simulate the open deployment. We further create an open logo
detection benchmark, called OpenLogo, to promote the investigation of this new
challenge. OpenLogo contains 27,083 images from 352 logo classes, built by
aggregating/refining 7 existing datasets and establishing an open logo
detection evaluation protocol. To address this challenge, we propose a Context
Adversarial Learning (CAL) approach to synthesising training data with coherent
logo instance appearance against diverse background context for enabling more
effective optimisation of contemporary deep learning detection models.
Experiments show the performance advantage of CAL over existing
state-of-the-art alternative methods on the more realistic and challenging
OpenLogo benchmark. Comment: Accepted by BMVC 2018. The QMUL-OpenLogo benchmark is publicly
available at: qmul-openlogo.github.io
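The CAL approach synthesises training images with logo instances placed coherently against diverse contexts; as a loose illustration only, the naive compositing step (without the adversarial coherence signal that CAL adds) could look like this, with the placement coordinates hard-coded:

```python
import torch

def paste_logo(background, logo, top, left):
    """Naive compositing of a logo crop onto a background context; in a
    context-adversarial setup a discriminator would score the coherence of
    the result and drive the placement/blending network."""
    out = background.clone()
    _, h, w = logo.shape
    out[:, top:top + h, left:left + w] = logo
    return out

composite = paste_logo(torch.rand(3, 256, 256), torch.rand(3, 48, 48), 100, 80)
```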