Visual Semantic Reasoning for Image-Text Matching
Image-text matching has been a hot research topic bridging the vision and
language areas. It remains challenging because current image representations usually lack the global semantic concepts present in their corresponding text captions. To address this issue, we propose a simple and interpretable reasoning model that generates a visual representation capturing the key objects and semantic concepts of a scene. Specifically, we first build connections between image
regions and perform reasoning with Graph Convolutional Networks to generate
features with semantic relationships. Then, we use a gate and memory mechanism to perform global semantic reasoning on these relationship-enhanced features, selecting discriminative information and gradually generating a representation of the whole scene. Experiments validate that our method achieves a new state of the art for image-text matching on the MS-COCO and Flickr30K datasets. It outperforms the current best method by 6.8%
relatively for image retrieval and 4.8% relatively for caption retrieval on
MS-COCO (Recall@1 using 1K test set). On Flickr30K, our model improves image
retrieval by 12.6% relatively and caption retrieval by 5.8% relatively
(Recall@1). Our code is available at https://github.com/KunpengLi1994/VSRN.
Comment: Accepted to ICCV 2019 (Oral)
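To make the two-stage reasoning concrete, here is a minimal PyTorch-style sketch of the idea the abstract describes: a graph convolution over learned pairwise region affinities, followed by a gated memory (a GRU here) that distills a global scene representation. All module and variable names are illustrative assumptions, not the authors' implementation; see the linked repository for the real code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualReasoningSketch(nn.Module):
    """Hypothetical sketch: GCN over region features + gated global reasoning."""
    def __init__(self, dim=2048):
        super().__init__()
        self.query = nn.Linear(dim, dim)   # projections for pairwise affinities
        self.key = nn.Linear(dim, dim)
        self.gcn = nn.Linear(dim, dim)     # one graph-convolution step
        self.gru = nn.GRU(dim, dim, batch_first=True)  # gate/memory mechanism

    def forward(self, regions):            # regions: (batch, n_regions, dim)
        # Fully connected graph over image regions, edges weighted by affinity.
        affinity = torch.bmm(self.query(regions), self.key(regions).transpose(1, 2))
        adj = F.softmax(affinity, dim=-1)
        # Relationship-enhanced region features (with a residual connection).
        enhanced = F.relu(self.gcn(torch.bmm(adj, regions))) + regions
        # The GRU's gates select discriminative information step by step;
        # its final hidden state serves as the whole-scene representation.
        _, hidden = self.gru(enhanced)
        return hidden.squeeze(0)           # (batch, dim)
```

Matching against a caption would then reduce to a similarity score (e.g., cosine) between this vector and the text encoder's output.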
Rethinking Zero-Shot Learning: A Conditional Visual Classification Perspective
Zero-shot learning (ZSL) aims to recognize instances of unseen classes solely
based on the semantic descriptions of the classes. Existing algorithms usually
formulate it as a semantic-visual correspondence problem, by learning mappings
from one feature space to the other. Although reasonable, these approaches essentially discard the valuable discriminative power of visual features in an implicit way and thus produce undesirable results. We
instead reformulate ZSL as a conditioned visual classification problem, i.e.,
classifying visual features based on the classifiers learned from the semantic
descriptions. With this reformulation, we develop algorithms targeting various
ZSL settings: For the conventional setting, we propose to train a deep neural
network that directly generates visual feature classifiers from the semantic
attributes with an episode-based training scheme; For the generalized setting,
we concatenate the learned highly discriminative classifiers for seen classes
and the generated classifiers for unseen classes to classify visual features of
all classes; For the transductive setting, we exploit unlabeled data to
effectively calibrate the classifier generator using a novel
learning-without-forgetting self-training mechanism and guide the process by a
robust generalized cross-entropy loss. Extensive experiments show that our proposed algorithms outperform state-of-the-art methods by large margins on most benchmark datasets in all ZSL settings. Our code is
available at \url{https://github.com/kailigo/cvcZSL}.
Comment: Accepted to ICCV 2019. First update: added the project link and corrected some typos.
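As a rough illustration of the conditional-visual-classification idea for the conventional setting, the sketch below generates a cosine classifier's weights from class attribute vectors and trains with an episode-style cross-entropy loss. The dimensions, the cosine head, and all names are assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClassifierGenerator(nn.Module):
    """Hypothetical sketch: map semantic attributes to visual-feature classifiers."""
    def __init__(self, attr_dim=85, feat_dim=2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(attr_dim, feat_dim),
            nn.ReLU(),
            nn.Linear(feat_dim, feat_dim),
        )

    def forward(self, attrs):                 # attrs: (n_classes, attr_dim)
        # One L2-normalized weight vector per class (a cosine classifier).
        return F.normalize(self.net(attrs), dim=-1)

def episode_loss(generator, feats, labels, episode_attrs, scale=10.0):
    """One training episode: classify visual features of a sampled class
    subset using classifiers generated from that subset's attributes."""
    weights = generator(episode_attrs)        # (n_way, feat_dim)
    logits = scale * F.normalize(feats, dim=-1) @ weights.t()
    return F.cross_entropy(logits, labels)    # labels index into the episode
```

For the generalized setting described above, the learned seen-class classifiers and the generated unseen-class classifiers would simply be concatenated into one weight matrix before the softmax.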
Self-Supervised Difference Detection for Weakly-Supervised Semantic Segmentation
To minimize the annotation costs associated with the training of semantic
segmentation models, researchers have extensively investigated
weakly-supervised segmentation approaches. Among current weakly-supervised segmentation methods, the most widely adopted approach is based on visualization. However, visualization results are generally not equivalent to semantic segmentation. Therefore, to perform accurate semantic segmentation
under the weakly supervised condition, it is necessary to consider the mapping
functions that convert the visualization results into semantic segmentation.
Typical mapping functions include the conditional random field and iterative re-training using the outputs of a segmentation model. However, these methods do not always guarantee improvements in accuracy; when such mapping functions are applied iteratively, the accuracy eventually stops improving or even decreases.
In this paper, to make the most of such mapping functions, we assume that the results of a mapping function include noise, and we improve accuracy by removing this noise. To this end, we propose a self-supervised difference
detection module, which estimates noise from the results of the mapping
functions by predicting the difference between the segmentation masks before
and after the mapping. We verified the effectiveness of the proposed method by
performing experiments on the PASCAL Visual Object Classes 2012 dataset, and we
achieved 64.9% on the val set and 65.5% on the test set. Both results set a new state of the art under the same weakly-supervised semantic segmentation setting.
Comment: ICCV 2019, source code: https://github.com/shimoda-uec/ssd
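A minimal sketch of the difference-detection idea, under the assumption of a small convolutional head on backbone features: the module learns to predict, per pixel, where a mapping function (e.g., a CRF) will change the mask, and confidently predicted differences are treated as noise. Channel counts and all names are illustrative, not taken from the released code.

```python
import torch
import torch.nn as nn

class DifferenceDetector(nn.Module):
    """Hypothetical sketch: predict where a mapping function will change the mask."""
    def __init__(self, feat_channels=256, n_classes=21):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(feat_channels + n_classes, 256, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(256, 1, 1),    # per-pixel "will change" logit
        )

    def forward(self, feats, mask_before):
        # feats: (B, C, H, W) backbone features; mask_before: one-hot (B, K, H, W).
        x = torch.cat([feats, mask_before], dim=1)
        return self.head(x)          # logits; sigmoid gives change probability

def difference_target(mask_before, mask_after):
    # Self-supervised label: 1 wherever the mapping altered the prediction.
    changed = mask_before.argmax(1) != mask_after.argmax(1)
    return changed.float().unsqueeze(1)   # (B, 1, H, W)

# Training would minimize, e.g., F.binary_cross_entropy_with_logits between
# detector(feats, m0) and difference_target(m0, m1).
```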
Knowledge Distillation and Student-Teacher Learning for Visual Intelligence: A Review and New Outlooks
In recent years, deep neural models have been successful in almost every field, including on extremely complex problems. However, these models are huge, with millions (and even billions) of parameters, and thus demand heavy computational power and are difficult to deploy on edge devices. Moreover, the performance boost is highly dependent on large amounts of labeled data. To achieve faster speeds and to handle the problems caused by the
lack of data, knowledge distillation (KD) has been proposed to transfer
information learned from one model to another. KD is often characterized by the
so-called `Student-Teacher' (S-T) learning framework and has been broadly
applied in model compression and knowledge transfer. This paper surveys KD and S-T learning, which have been actively studied in recent years. First, we aim
to provide explanations of what KD is and how/why it works. Then, we provide a
comprehensive survey on the recent progress of KD methods together with S-T
frameworks, with a focus on vision tasks. In general, we consider the fundamental questions that have been driving this research area and thoroughly summarize the research progress and technical details. Additionally, we systematically
analyze the research status of KD in vision applications. Finally, we discuss
the potential and open challenges of existing methods and outline future directions for KD and S-T learning.
Comment: Accepted to IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 202
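As a reference point for readers new to KD, here is the classic soft-label distillation loss of Hinton et al., the starting point for most of the S-T methods such a survey covers; the temperature and mixing weight below are typical illustrative values, not ones prescribed by this paper.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Classic KD: match the teacher's temperature-softened distribution
    while also fitting the ground-truth hard labels."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                        # T^2 keeps the gradient scale comparable
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard
```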