2,509 research outputs found
Object localization in ImageNet by looking out of the window
We propose a method for annotating the location of objects in ImageNet.
Traditionally, this is cast as an image window classification problem, where
each window is considered independently and scored based on its appearance
alone. Instead, we propose a method which scores each candidate window in the
context of all other windows in the image, taking into account their similarity
in appearance space as well as their spatial relations in the image plane. We
devise a fast and exact procedure to optimize our scoring function over all
candidate windows in an image, and we learn its parameters using structured
output regression. We demonstrate on 92000 images from ImageNet that this
significantly improves localization over recent techniques that score windows
in isolation.Comment: in BMVC 201
ImageNet Large Scale Visual Recognition Challenge
The ImageNet Large Scale Visual Recognition Challenge is a benchmark in
object category classification and detection on hundreds of object categories
and millions of images. The challenge has been run annually from 2010 to
present, attracting participation from more than fifty institutions.
This paper describes the creation of this benchmark dataset and the advances
in object recognition that have been possible as a result. We discuss the
challenges of collecting large-scale ground truth annotation, highlight key
breakthroughs in categorical object recognition, provide a detailed analysis of
the current state of the field of large-scale image classification and object
detection, and compare the state-of-the-art computer vision accuracy with human
accuracy. We conclude with lessons learned in the five years of the challenge,
and propose future directions and improvements.Comment: 43 pages, 16 figures. v3 includes additional comparisons with PASCAL
VOC (per-category comparisons in Table 3, distribution of localization
difficulty in Fig 16), a list of queries used for obtaining object detection
images (Appendix C), and some additional reference
Taking a Deeper Look at Pedestrians
In this paper we study the use of convolutional neural networks (convnets)
for the task of pedestrian detection. Despite their recent diverse successes,
convnets historically underperform compared to other pedestrian detectors. We
deliberately omit explicitly modelling the problem into the network (e.g. parts
or occlusion modelling) and show that we can reach competitive performance
without bells and whistles. In a wide range of experiments we analyse small and
big convnets, their architectural choices, parameters, and the influence of
different training data, including pre-training on surrogate tasks.
We present the best convnet detectors on the Caltech and KITTI dataset. On
Caltech our convnets reach top performance both for the Caltech1x and
Caltech10x training setup. Using additional data at training time our strongest
convnet model is competitive even to detectors that use additional data
(optical flow) at test time
How good are detection proposals, really?
Current top performing Pascal VOC object detectors employ detection proposals
to guide the search for objects thereby avoiding exhaustive sliding window
search across images. Despite the popularity of detection proposals, it is
unclear which trade-offs are made when using them during object detection. We
provide an in depth analysis of ten object proposal methods along with four
baselines regarding ground truth annotation recall (on Pascal VOC 2007 and
ImageNet 2013), repeatability, and impact on DPM detector performance. Our
findings show common weaknesses of existing methods, and provide insights to
choose the most adequate method for different settings
Weakly Supervised Object Localization with Multi-fold Multiple Instance Learning
Object category localization is a challenging problem in computer vision.
Standard supervised training requires bounding box annotations of object
instances. This time-consuming annotation process is sidestepped in weakly
supervised learning. In this case, the supervised information is restricted to
binary labels that indicate the absence/presence of object instances in the
image, without their locations. We follow a multiple-instance learning approach
that iteratively trains the detector and infers the object locations in the
positive training images. Our main contribution is a multi-fold multiple
instance learning procedure, which prevents training from prematurely locking
onto erroneous object locations. This procedure is particularly important when
using high-dimensional representations, such as Fisher vectors and
convolutional neural network features. We also propose a window refinement
method, which improves the localization accuracy by incorporating an objectness
prior. We present a detailed experimental evaluation using the PASCAL VOC 2007
dataset, which verifies the effectiveness of our approach.Comment: To appear in IEEE Transactions on Pattern Analysis and Machine
Intelligence (TPAMI
Scalable Object Detection using Deep Neural Networks
Deep convolutional neural networks have recently achieved state-of-the-art
performance on a number of image recognition benchmarks, including the ImageNet
Large-Scale Visual Recognition Challenge (ILSVRC-2012). The winning model on
the localization sub-task was a network that predicts a single bounding box and
a confidence score for each object category in the image. Such a model captures
the whole-image context around the objects but cannot handle multiple instances
of the same object in the image without naively replicating the number of
outputs for each instance. In this work, we propose a saliency-inspired neural
network model for detection, which predicts a set of class-agnostic bounding
boxes along with a single score for each box, corresponding to its likelihood
of containing any object of interest. The model naturally handles a variable
number of instances for each class and allows for cross-class generalization at
the highest levels of the network. We are able to obtain competitive
recognition performance on VOC2007 and ILSVRC2012, while using only the top few
predicted locations in each image and a small number of neural network
evaluations
- …