A-Fast-RCNN: Hard Positive Generation via Adversary for Object Detection
How do we learn an object detector that is invariant to occlusions and
deformations? Our current solution is to use a data-driven strategy -- collect
large-scale datasets which have object instances under different conditions.
The hope is that the final classifier can use these examples to learn
invariances. But is it really possible to see all the occlusions in a dataset?
We argue that like categories, occlusions and object deformations also follow a
long-tail. Some occlusions and deformations are so rare that they hardly
happen; yet we want to learn a model invariant to such occurrences. In this
paper, we propose an alternative solution. We propose to learn an adversarial
network that generates examples with occlusions and deformations. The goal of
the adversary is to generate examples that are difficult for the object
detector to classify. In our framework both the original detector and adversary
are learned in a joint manner. Our experimental results indicate a 2.3% mAP
boost on VOC07 and a 2.6% mAP boost on the VOC2012 object detection challenge
compared to the Fast-RCNN pipeline. We also release the code for this paper.
Comment: CVPR 2017 Camera Ready
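A minimal sketch of the adversarial hard-positive idea described above: an adversary predicts an occlusion mask over ROI features to make classification harder, while the detector is trained on the masked features, so the two are learned jointly. The module names, feature sizes and the simple alternating update below are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OcclusionAdversary(nn.Module):
    """Predicts a soft occlusion mask over ROI-pooled features (C x H x W)."""
    def __init__(self, channels: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 1, kernel_size=1),
        )

    def forward(self, roi_feats: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.net(roi_feats))  # values in (0, 1); 1 = drop

def joint_step(detector_head, adversary, roi_feats, labels, opt_det, opt_adv):
    # 1) Adversary step: produce a mask that makes classification harder,
    #    i.e. maximise the detector loss (minimise its negative).
    mask = adversary(roi_feats.detach())
    occluded = roi_feats.detach() * (1.0 - mask)
    adv_loss = -F.cross_entropy(detector_head(occluded), labels)
    opt_adv.zero_grad(); adv_loss.backward(); opt_adv.step()

    # 2) Detector step: learn to classify the adversarially occluded features.
    with torch.no_grad():
        mask = adversary(roi_feats)
    det_loss = F.cross_entropy(detector_head(roi_feats * (1.0 - mask)), labels)
    opt_det.zero_grad(); det_loss.backward(); opt_det.step()
    return det_loss.item()
```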
Spatial Memory for Context Reasoning in Object Detection
Modeling instance-level context and object-object relationships is extremely
challenging. It requires reasoning about bounding boxes of different classes,
locations, etc. Above all, instance-level spatial reasoning inherently requires
modeling conditional distributions on previous detections. Unfortunately, our
current object detection systems do not have any memory to remember what to
condition on! The state-of-the-art object detectors still detect all objects in
parallel, followed by non-maximal suppression (NMS). While memory has been used
for tasks such as captioning, such models mostly use image-level memory cells
without capturing the spatial layout. On the other hand, modeling object-object
relationships requires spatial reasoning -- not only do we need a memory to
store the spatial layout, but also an effective reasoning module to extract
spatial patterns. This paper presents a conceptually simple yet powerful
solution -- Spatial Memory Network (SMN), to model the instance-level context
efficiently and effectively. Our spatial memory essentially assembles object
instances back into a pseudo "image" representation that is easy to feed into
another ConvNet for object-object context reasoning. This leads to a new
sequential reasoning architecture in which image and memory are processed in
parallel to obtain detections that in turn update the memory. We show that the
SMN direction is promising, as it so far provides a 2.2% improvement over the
baseline Faster R-CNN on the COCO dataset.
Comment: Draft submitted to ICCV 2017
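A minimal sketch of the spatial-memory idea: previously detected objects are "painted" back into a 2-D memory tensor at their box locations so that a second ConvNet can reason about object-object context. The memory resolution, feature dimension and simple overwrite rule are assumptions for illustration, not the SMN paper's exact update mechanism.

```python
import torch

def write_to_memory(memory: torch.Tensor, box, feat: torch.Tensor) -> torch.Tensor:
    """memory: (C, H, W); box: (x1, y1, x2, y2) in memory-cell coordinates;
    feat: (C,) vector describing one detected instance."""
    x1, y1, x2, y2 = [int(round(v)) for v in box]
    x2, y2 = max(x2, x1 + 1), max(y2, y1 + 1)      # cover at least one cell
    memory[:, y1:y2, x1:x2] = feat[:, None, None]  # broadcast over the region
    return memory

# Usage: start from an empty memory, add detections one at a time, and feed the
# updated memory (the pseudo "image") to a context-reasoning ConvNet.
C, H, W = 256, 20, 20
memory = torch.zeros(C, H, W)
for box, feat in [((2, 3, 8, 10), torch.randn(C)),
                  ((9, 4, 15, 12), torch.randn(C))]:
    memory = write_to_memory(memory, box, feat)
# context_logits = context_convnet(memory.unsqueeze(0))  # hypothetical ConvNet
```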
Revisiting Unreasonable Effectiveness of Data in Deep Learning Era
The success of deep learning in vision can be attributed to: (a) models with
high capacity; (b) increased computational power; and (c) availability of
large-scale labeled data. Since 2012, there have been significant advances in
representation capabilities of the models and computational capabilities of
GPUs. But the size of the biggest dataset has surprisingly remained constant.
What will happen if we increase the dataset size by 10x or 100x? This paper
takes a step towards clearing the clouds of mystery surrounding the
relationship between `enormous data' and visual deep learning. By exploiting
the JFT-300M dataset which has more than 375M noisy labels for 300M images, we
investigate how the performance of current vision tasks would change if this
data was used for representation learning. Our paper delivers some surprising
(and some expected) findings. First, we find that performance on vision tasks
increases logarithmically with the volume of training data. Second,
we show that representation learning (or pre-training) still holds a lot of
promise. One can improve performance on many vision tasks by just training a
better base model. Finally, as expected, we present new state-of-the-art
results for different vision tasks including image classification, object
detection, semantic segmentation and human pose estimation. Our sincere hope is
that this inspires the vision community not to undervalue data and to develop
collective efforts toward building larger datasets.
Comment: ICCV 2017 camera ready
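A small illustration of the logarithmic trend reported above: fit perf ≈ a·log10(N) + b to a few (dataset size, score) points. The numbers below are hypothetical, invented only to show the fit; the functional form is the only thing taken from the abstract.

```python
import numpy as np

sizes = np.array([1e6, 1e7, 1e8, 3e8])       # hypothetical numbers of images
scores = np.array([62.0, 67.5, 73.0, 75.5])  # hypothetical task scores

# Least-squares fit of score against log10(dataset size).
a, b = np.polyfit(np.log10(sizes), scores, deg=1)
print(f"perf ~ {a:.2f} * log10(N) + {b:.2f}")
print("extrapolated score at 10x more data:", a * np.log10(3e9) + b)
```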
Unsupervised Learning of Visual Representations using Videos
Is strong supervision necessary for learning a good visual representation? Do
we really need millions of semantically-labeled images to train a Convolutional
Neural Network (CNN)? In this paper, we present a simple yet surprisingly
powerful approach for unsupervised learning of CNNs. Specifically, we use
hundreds of thousands of unlabeled videos from the web to learn visual
representations. Our key idea is that visual tracking provides the supervision.
That is, two patches connected by a track should have similar visual
representation in deep feature space since they probably belong to the same
object or object part. We design a Siamese-triplet network with a ranking loss
function to train this CNN representation. Without using a single image from
ImageNet, just using 100K unlabeled videos and the VOC 2012 dataset, we train
an ensemble of unsupervised networks that achieves 52% mAP (no bounding box
regression). This performance comes tantalizingly close to its
ImageNet-supervised counterpart, an ensemble which achieves a mAP of 54.4%. We
also show that our unsupervised network can perform competitively in other
tasks such as surface-normal estimation.
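A minimal sketch of the Siamese-triplet ranking loss described above: two patches connected by a track (anchor and positive) should be closer in feature space than the anchor and a random patch from another video (negative). The cosine distance, margin and shared-backbone usage are assumptions for illustration rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def triplet_ranking_loss(f_anchor, f_positive, f_negative, margin: float = 0.5):
    """All inputs are (B, D) embeddings produced by the same (Siamese) CNN."""
    d_pos = 1.0 - F.cosine_similarity(f_anchor, f_positive)  # small if similar
    d_neg = 1.0 - F.cosine_similarity(f_anchor, f_negative)
    return F.relu(d_pos - d_neg + margin).mean()              # hinge ranking loss

# Usage with a shared backbone `cnn` mapping image patches to embeddings:
# loss = triplet_ranking_loss(cnn(anchor_patches), cnn(tracked_patches),
#                             cnn(random_patches))
```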
