6 research outputs found
R-FCN-3000 at 30fps: Decoupling Detection and Classification
We present R-FCN-3000, a large-scale real-time object detector in which
objectness detection and classification are decoupled. To obtain the detection
score for an RoI, we multiply the objectness score with the fine-grained
classification score. Our approach is a modification of the R-FCN architecture
in which position-sensitive filters are shared across different object classes
for performing localization. For fine-grained classification, these
position-sensitive filters are not needed. R-FCN-3000 obtains an mAP of 34.9%
on the ImageNet detection dataset and outperforms YOLO-9000 by 18% while
processing 30 images per second. We also show that the objectness learned by
R-FCN-3000 generalizes to novel classes and the performance increases with the
number of training object classes - supporting the hypothesis that it is
possible to learn a universal objectness detector. Code will be made available.
Comment: CVPR 2018 submission
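The score combination described in the abstract above can be illustrated with a minimal sketch. The abstract only states that the objectness score is multiplied with the fine-grained classification score; the function name, the softmax over fine-grained classes, and the example numbers are assumptions made for illustration, not the authors' implementation.

```python
import math

def detection_scores(objectness, class_logits):
    """Combine one RoI's objectness score with its fine-grained class scores.

    objectness: float in [0, 1], probability that the RoI contains any object.
    class_logits: raw fine-grained class scores (hypothetical inputs).
    Returns per-class detection scores = objectness * softmax(class_logits).
    """
    # Numerically stable softmax over the fine-grained classes (an assumption;
    # the abstract does not say how the classification score is normalized).
    m = max(class_logits)
    exps = [math.exp(x - m) for x in class_logits]
    total = sum(exps)
    class_probs = [e / total for e in exps]
    # Decoupling: one class-agnostic objectness value scales every class score.
    return [objectness * p for p in class_probs]

scores = detection_scores(0.8, [2.0, 0.5, -1.0])
best_class = max(range(len(scores)), key=scores.__getitem__)
```

Because the class probabilities sum to one, the per-class detection scores in this sketch sum to the objectness value, so a low-objectness RoI cannot receive a high detection score for any class.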
Semantic-aware Grad-GAN for Virtual-to-Real Urban Scene Adaption
Recent advances in vision tasks (e.g., segmentation) depend heavily on the
availability of large-scale real-world image annotations obtained through
laborious human effort. Moreover, perception performance often drops significantly
in new scenarios, due to the poor generalization capability of models trained
on limited and biased annotations. In this work, we transfer
knowledge from automatically rendered scene annotations in virtual worlds to
facilitate real-world visual tasks. Although virtual-world annotations can be
ideally diverse and unlimited, the discrepancy between virtual and real-world
data distributions makes knowledge transfer challenging. We thus
propose a novel Semantic-aware Grad-GAN (SG-GAN) to perform virtual-to-real
domain adaption with the ability of retaining vital semantic information.
Beyond the simple holistic color/texture transformation achieved by prior
works, SG-GAN personalizes the appearance adaption for each
semantic region in order to preserve its key characteristics for better
recognition. It makes two main contributions over traditional GANs: 1) a soft
gradient-sensitive objective for keeping semantic boundaries; 2) a
semantic-aware discriminator for validating the fidelity of personalized
adaptions with respect to each semantic region. Qualitative and quantitative
experiments demonstrate the superiority of our SG-GAN in scene adaption over
state-of-the-art GANs. Further evaluations on semantic segmentation on
Cityscapes show that using virtual images adapted by SG-GAN dramatically improves
segmentation performance over the original virtual data. We release our code at
https://github.com/Peilun-Li/SG-GAN.
Comment: In proceedings of BMVC 201
Pairwise Similarity Knowledge Transfer for Weakly Supervised Object Localization
Weakly Supervised Object Localization (WSOL) methods only require image level
labels as opposed to expensive bounding box annotations required by fully
supervised algorithms. We study the problem of learning a localization model on
target classes with weakly supervised image labels, helped by a fully annotated
source dataset. Typically, a WSOL model is first trained to predict
class-generic objectness scores on an off-the-shelf fully supervised source dataset
and then it is progressively adapted to learn the objects in the weakly
supervised target dataset. In this work, we argue that learning only an
objectness function is a weak form of knowledge transfer and propose to learn a
classwise pairwise similarity function that directly compares two input
proposals as well. The combined localization model and the estimated object
annotations are jointly learned in an alternating optimization paradigm as is
typically done in standard WSOL methods. In contrast to existing work that
learns pairwise similarities, our approach optimizes a unified objective with
a convergence guarantee and is computationally efficient for large-scale
applications. Experiments on the COCO and ILSVRC 2013 detection datasets show
that the performance of the localization model improves significantly with the
inclusion of the pairwise similarity function. For instance, on the ILSVRC dataset,
the Correct Localization (CorLoc) performance improves from 72.8% to 78.2%,
which is a new state of the art for the WSOL task in the context of knowledge
transfer.
Comment: ECCV 2020. Formerly "In Defense of Graph Inference Algorithms for
Weakly Supervised Object Localization"
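For readers unfamiliar with the CorLoc metric reported above, the following is a minimal sketch of how it is commonly computed: the fraction of images in which the predicted box overlaps a ground-truth box of the target class with IoU of at least 0.5. The threshold, the box format, and the function names are assumptions; the abstract does not spell out the evaluation protocol.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def corloc(predicted, ground_truth, threshold=0.5):
    """Fraction of images whose predicted box matches any ground-truth box.

    predicted: one box per image (the top-scoring localization).
    ground_truth: one list of boxes per image for the target class.
    threshold: IoU needed to count as correct (0.5 is an assumed default).
    """
    hits = sum(
        any(iou(p, g) >= threshold for g in gts)
        for p, gts in zip(predicted, ground_truth)
    )
    return hits / len(predicted)

# Toy data: the first image is localized correctly, the second is missed.
predicted = [(0, 0, 10, 10), (0, 0, 10, 10)]
ground_truth = [[(0, 0, 10, 12)], [(20, 20, 30, 30)]]
value = corloc(predicted, ground_truth)  # 1 of 2 images localized
```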
What leads to generalization of object proposals?
Object proposal generation is often the first step in many detection models.
It is valuable to train a good proposal model that generalizes to unseen
classes. This could help scale detection models to a larger number of classes
with fewer annotations. Motivated by this, we study how a detection model
trained on a small set of source classes can provide proposals that generalize
to unseen classes. We systematically study the properties of the dataset -
visual diversity and label space granularity - required for good
generalization. We show the trade-off between using fine-grained labels and
coarse labels. We introduce the idea of prototypical classes: a set of
sufficient and necessary classes required to train a detection model to obtain
generalized proposals in a more data-efficient way. On the Open Images V4
dataset, we show that only 25% of the classes can be selected to form such a
prototypical set. The resulting proposals from a model trained with these
classes are only 4.3% worse than using all the classes, in terms of average
recall (AR). We also demonstrate that a Faster R-CNN model leads to better
generalization of proposals compared to a single-stage network like RetinaNet.
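As a rough illustration of the average recall (AR) metric mentioned above, the sketch below averages proposal recall over a range of IoU thresholds. It assumes a COCO-style definition (thresholds 0.50 to 0.95 in steps of 0.05) and a single image; the abstract does not state which AR variant is used, and the function names are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def average_recall(proposals, gt_boxes, thresholds=None):
    """Recall of ground-truth boxes by proposals, averaged over IoU thresholds."""
    if thresholds is None:
        # Assumed COCO-style thresholds: 0.50, 0.55, ..., 0.95.
        thresholds = [0.5 + 0.05 * i for i in range(10)]
    # Best overlap that any proposal achieves for each ground-truth box.
    best = [max((iou(g, p) for p in proposals), default=0.0) for g in gt_boxes]
    # A ground-truth box counts as recalled when its best overlap meets t.
    recalls = [sum(b >= t for b in best) / len(best) for t in thresholds]
    return sum(recalls) / len(recalls)

# Toy data: one of two ground-truth boxes is covered exactly by a proposal.
gt_boxes = [(0, 0, 10, 10), (20, 20, 30, 30)]
proposals = [(0, 0, 10, 10)]
ar = average_recall(proposals, gt_boxes)
```

Under this definition, a proposal set evaluated on unseen classes can be compared with another simply by the difference in their AR values.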
Boosting Weakly Supervised Object Detection with Progressive Knowledge Transfer
In this paper, we propose an effective knowledge transfer framework to boost
the weakly supervised object detection accuracy with the help of an external
fully-annotated source dataset, whose categories may not overlap with the
target domain. This setting is of great practical value due to the existence of
many off-the-shelf detection datasets. To more effectively utilize the source
dataset, we propose to iteratively transfer the knowledge from the source
domain by a one-class universal detector and learn the target-domain detector.
The box-level pseudo ground truths mined by the target-domain detector in each
iteration effectively improve the one-class universal detector. Therefore, the
knowledge in the source dataset is more thoroughly exploited and leveraged.
Extensive experiments are conducted with Pascal VOC 2007 as the target
weakly-annotated dataset and COCO/ImageNet as the source fully-annotated
dataset. With the proposed solution, we achieved an mAP of detection
performance on the VOC test set and an mAP of after retraining a fully
supervised Faster RCNN with the mined pseudo ground truths. This is
significantly better than any previously known results in related literature
and sets a new state of the art for weakly supervised object detection under the
knowledge transfer setting. Code:
https://github.com/mikuhatsune/wsod_transfer.
Comment: ECCV 2020.
AMIL: Adversarial Multi Instance Learning for Human Pose Estimation
Human pose estimation has an important impact on a wide range of applications
from human-computer interface to surveillance and content-based video
retrieval. For human pose estimation, joint occlusions and overlapping
human bodies result in deviated pose estimates. To address these problems, by
integrating priors of the structure of human bodies, we present a novel
structure-aware network to discreetly consider such priors during the training
of the network. Typically, learning such constraints is a challenging task.
Instead, we propose generative adversarial networks as our learning model in
which we design two residual multiple instance learning (MIL) models with the
identical architecture, one is used as the generator and the other one is used
as the discriminator. The discriminator's task is to distinguish the actual poses
from the fake ones. If the pose generator generates the results that the
discriminator is not able to distinguish from the real ones, the model has
successfully learnt the priors. In the proposed model, the discriminator
differentiates the ground-truth heatmaps from the generated ones, and later the
adversarial loss back-propagates to the generator. This procedure helps the
generator learn reasonable body configurations and proves
advantageous for improving pose estimation accuracy. Meanwhile, we propose a
novel function for MIL. It is an adjustable structure for both instance
selection and modeling to appropriately pass the information between instances
in a single bag. In the proposed residual MIL neural network, the pooling
action adequately updates the instance contribution to its bag. The proposed
adversarial residual multi-instance neural network that is based on pooling has
been validated on two datasets for the human pose estimation task and
successfully outperforms other state-of-the-art models.