Convolutional Networks for Object Category and 3D Pose Estimation from 2D Images
Current CNN-based algorithms for recovering the 3D pose of an object in an
image assume knowledge about both the object category and its 2D localization
in the image. In this paper, we relax one of these constraints and propose to
solve the task of joint object category and 3D pose estimation from an image
assuming known 2D localization. We design a new architecture for this task
composed of a feature network that is shared between subtasks, an object
categorization network built on top of the feature network, and a collection of
category dependent pose regression networks. We also introduce suitable loss
functions and a training method for the new architecture. Experiments on the
challenging PASCAL3D+ dataset show state-of-the-art performance in the joint
categorization and pose estimation task. Moreover, our performance on the joint
task is comparable to that of state-of-the-art methods on the simpler task of
3D pose estimation with a known object category.
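As a hedged illustration of the described head arrangement, the sketch below wires a shared feature network into a categorization head and a collection of per-category pose regressors; the ResNet-18 backbone, feature dimension, and 3-D pose output are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class JointCategoryPoseNet(nn.Module):
    """Shared feature network with a categorization head and one
    pose-regression head per category (a hedged sketch, not the
    authors' exact architecture)."""
    def __init__(self, num_categories: int, pose_dim: int = 3):
        super().__init__()
        backbone = models.resnet18(weights=None)   # assumed backbone
        self.features = nn.Sequential(*list(backbone.children())[:-1])
        feat_dim = backbone.fc.in_features
        self.category_head = nn.Linear(feat_dim, num_categories)
        # one small pose regressor per category
        self.pose_heads = nn.ModuleList(
            [nn.Linear(feat_dim, pose_dim) for _ in range(num_categories)]
        )

    def forward(self, x):
        f = self.features(x).flatten(1)
        logits = self.category_head(f)
        poses = torch.stack([head(f) for head in self.pose_heads], dim=1)
        return logits, poses  # poses: (batch, num_categories, pose_dim)

# at test time, the pose is read off the predicted category's head
logits, poses = JointCategoryPoseNet(12)(torch.randn(2, 3, 224, 224))
pred_cat = logits.argmax(dim=1)
pred_pose = poses[torch.arange(2), pred_cat]
```

Reading the pose from the head selected by the predicted category is exactly the coupling between the two subtasks that the joint loss functions have to handle.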
S-OHEM: Stratified Online Hard Example Mining for Object Detection
One of the major challenges in object detection is to propose detectors with
highly accurate localization of objects. The online sampling of high-loss
region proposals (hard examples) uses the multitask loss with equal weight
settings across all loss types (e.g., classification and localization; rigid
and non-rigid categories) and ignores the influence of the different loss
distributions throughout the training process, which we find essential to
training efficacy. In this paper, we present the Stratified Online Hard Example Mining
(S-OHEM) algorithm for training higher efficiency and accuracy detectors.
S-OHEM exploits OHEM with stratified sampling, a widely-adopted sampling
technique, to choose the training examples according to this influence during
hard example mining, and thus enhance the performance of object detectors. We
show through systematic experiments that S-OHEM yields an average precision
(AP) improvement of 0.5% on the rigid categories of PASCAL VOC 2007 at both
IoU thresholds of 0.6 and 0.7, and of 1.6% on KITTI 2012 at the same
thresholds. Regarding mean average precision (mAP), a relative increase of
0.3% and 0.5% (1% and 0.5%) is observed for VOC07 (KITTI12) at the same IoU
thresholds. Also, S-OHEM is easy to integrate with existing region-based
detectors and can work with post-recognition-level regressors.
Comment: 9 pages, 3 figures, accepted by CCCV 201
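The core sampling idea can be sketched in a few lines: rather than ranking proposals by a single summed loss as in plain OHEM, keep per-loss-type strata and fill a quota from each, hardest first. The fixed 50/50 quota below is an illustrative assumption; S-OHEM adapts the sampling ratio to the loss distributions as training progresses.

```python
import numpy as np

def s_ohem_sample(cls_loss, loc_loss, batch_size, p_cls=0.5):
    """Toy stratified hard-example mining: keep classification and
    localization strata and fill a quota from each, hardest first,
    instead of ranking by one summed loss (plain OHEM)."""
    n_cls = int(batch_size * p_cls)       # illustrative fixed split
    cls_rank = np.argsort(-cls_loss)      # hardest (highest loss) first
    loc_rank = np.argsort(-loc_loss)
    chosen = list(cls_rank[:n_cls])
    for i in loc_rank:                    # fill the rest, avoiding repeats
        if len(chosen) >= batch_size:
            break
        if i not in chosen:
            chosen.append(i)
    return np.array(chosen)

rng = np.random.default_rng(0)
idx = s_ohem_sample(rng.random(256), rng.random(256), batch_size=64)
```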
Efficient On-the-fly Category Retrieval using ConvNets and GPUs
We investigate the gains in precision and speed that can be obtained by
using Convolutional Networks (ConvNets) for on-the-fly retrieval, where
classifiers are learnt at run time for a textual query from downloaded images
and used to rank large image or video datasets.
We make three contributions: (i) we present an evaluation of state-of-the-art
image representations for object category retrieval over standard benchmark
datasets containing 1M+ images; (ii) we show that ConvNets can be used to
obtain features that are highly performant yet much lower dimensional
than previous state-of-the-art image representations, and that their
dimensionality can be reduced further without loss in performance by
compression using product quantization or binarization. Consequently, features
with the state-of-the-art performance on large-scale datasets of millions of
images can fit in the memory of even a commodity GPU card; (iii) we show that
an SVM classifier can be learnt within a ConvNet framework on a GPU in parallel
with downloading the new training images, allowing for a continuous refinement
of the model as more images become available, and simultaneous training and
ranking. The outcome is an on-the-fly system that significantly outperforms its
predecessors in terms of: precision of retrieval, memory requirements, and
speed, facilitating accurate on-the-fly learning and ranking in under a second
on a single GPU.
Comment: Published in proceedings of ACCV 201
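To make contribution (ii) concrete, here is a toy sketch of the binarization route (sign quantization plus Hamming-distance ranking); the 128-D features and dataset size are illustrative, and the paper also evaluates product quantization.

```python
import numpy as np

def binarize(features):
    """Sign-binarize ConvNet features so a million 128-D descriptors
    fit in a few MB; packbits stores 8 dimensions per byte."""
    return np.packbits(features > 0, axis=1)

def hamming_rank(query_bits, db_bits):
    """Rank database items by Hamming distance to the query:
    XOR the packed bits, then count set bits per row."""
    popcount = np.unpackbits(query_bits ^ db_bits, axis=1).sum(axis=1)
    return np.argsort(popcount)

rng = np.random.default_rng(0)
db = binarize(rng.standard_normal((10000, 128)).astype(np.float32))
q = binarize(rng.standard_normal((1, 128)).astype(np.float32))
order = hamming_rank(q, db)   # nearest items first
```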
Semantically Guided Depth Upsampling
We present a novel method for accurate and efficient upsampling of sparse
depth data, guided by high-resolution imagery. Our approach goes beyond the use
of intensity cues only and additionally exploits object boundary cues through
structured edge detection and semantic scene labeling for guidance. Both cues
are combined within a geodesic distance measure that allows for
boundary-preserving depth interpolation while utilizing local context. We
model the observed scene structure by locally planar elements and formulate the
upsampling task as a global energy minimization problem. Our method determines
globally consistent solutions and preserves fine details and sharp depth
boundaries. In our experiments on several public datasets at different levels
of application, we demonstrate superior performance of our approach over the
state-of-the-art, even for very sparse measurements.
Comment: German Conference on Pattern Recognition 2016 (Oral)
Deep Bilevel Learning
We present a novel regularization approach to train neural networks that
enjoys better generalization and test error than standard stochastic gradient
descent. Our approach is based on the principles of cross-validation, where a
validation set is used to limit the model overfitting. We formulate such
principles as a bilevel optimization problem. This formulation allows us to
define the optimization of a cost on the validation set subject to another
optimization on the training set. The overfitting is controlled by introducing
weights on each mini-batch in the training set and by choosing their values so
that they minimize the error on the validation set. In practice, these weights
define mini-batch learning rates in a gradient descent update equation that
favor gradients with better generalization capabilities. Because of its
simplicity, this approach can be integrated with other regularization methods
and training schemes. We extensively evaluate our proposed algorithm on several
neural network architectures and datasets, and find that it consistently
improves the generalization of the model, especially when labels are noisy.
Comment: ECCV 201
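One simplified reading of this weighting scheme is sketched below: a mini-batch's learning rate is scaled by how well its gradient aligns with the gradient on a held-out validation batch, clipped at zero. The alignment measure and clipping are illustrative assumptions, not the paper's exact bilevel update.

```python
import torch

def batch_weight(model, loss_fn, train_batch, val_batch):
    """Toy bilevel weighting step: return a scalar in [0, 1] measuring
    how well this mini-batch's gradient aligns with the validation
    gradient; batches whose gradients point away from the validation
    descent direction get weight zero."""
    xt, yt = train_batch
    xv, yv = val_batch
    g_tr = torch.autograd.grad(loss_fn(model(xt), yt), model.parameters())
    g_va = torch.autograd.grad(loss_fn(model(xv), yv), model.parameters())
    dot = sum((a * b).sum() for a, b in zip(g_tr, g_va))
    norm = sum((a * a).sum() for a in g_tr).sqrt() * \
           sum((b * b).sum() for b in g_va).sqrt()
    return torch.clamp(dot / (norm + 1e-12), min=0.0)

model = torch.nn.Linear(10, 2)
loss_fn = torch.nn.CrossEntropyLoss()
tb = (torch.randn(8, 10), torch.randint(0, 2, (8,)))
vb = (torch.randn(8, 10), torch.randint(0, 2, (8,)))
w = batch_weight(model, loss_fn, tb, vb)  # use w * lr for this batch
```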
Localization Recall Precision (LRP): A New Performance Metric for Object Detection
Average precision (AP), the area under the recall-precision (RP) curve, is
the standard performance measure for object detection. Despite its wide
acceptance, it has a number of shortcomings, the most important of which are
(i) the inability to distinguish very different RP curves, and (ii) the lack of
directly measuring bounding box localization accuracy. In this paper, we
propose 'Localization Recall Precision (LRP) Error', a new metric which we
specifically designed for object detection. LRP Error is composed of three
components related to localization, false negative (FN) rate and false positive
(FP) rate. Based on LRP, we introduce the 'Optimal LRP', the minimum achievable
LRP error representing the best achievable configuration of the detector in
terms of recall-precision and the tightness of the boxes. In contrast to AP,
which considers precisions over the entire recall domain, Optimal LRP
determines the 'best' confidence score threshold for a class, which balances
the trade-off between localization and recall-precision. In our experiments, we
show that, for state-of-the-art (SOTA) detectors, Optimal LRP provides
richer and more discriminative information than AP. We also demonstrate that
the best confidence score thresholds vary significantly among classes and
detectors. Moreover, we present LRP results of a simple online video object
detector which uses a SOTA still image object detector and show that the
class-specific optimized thresholds increase the accuracy against the common
approach of using a general threshold for all classes. At
https://github.com/cancam/LRP we provide the source code that can compute LRP
for the PASCAL VOC and MSCOCO datasets. Our source code can easily be adapted
to other datasets as well.
Comment: to appear in ECCV 201
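As a hedged sketch of the metric's three-component structure, the function below combines a localization term over true positives (IoUs between the threshold tau and 1) with the FP and FN counts, normalized by the total number of boxes involved; consult the authors' code linked above for the exact definition.

```python
def lrp_error(tp_ious, n_fp, n_fn, tau=0.5):
    """Hedged sketch of LRP error: a localization penalty for loose
    true-positive boxes, plus false-positive and false-negative
    counts, normalized by the total number of boxes. Zero means a
    perfect detector; higher is worse."""
    n_tp = len(tp_ious)
    if n_tp + n_fp + n_fn == 0:
        return 0.0
    loc = sum((1.0 - iou) / (1.0 - tau) for iou in tp_ious)
    return (loc + n_fp + n_fn) / (n_tp + n_fp + n_fn)

print(lrp_error([1.0, 1.0], n_fp=0, n_fn=0))   # perfect boxes -> 0.0
print(lrp_error([0.6, 0.75], n_fp=1, n_fn=2))  # loose boxes and misses
```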
A Review of Object Detection Models based on Convolutional Neural Network
Convolutional Neural Networks (CNNs) have become the state of the art for
object detection in images. In this chapter, we explain different
state-of-the-art CNN-based object detection models. We categorize these
detection models according to two different approaches: the two-stage approach
and the one-stage approach. This chapter traces the advancements in object
detection models from R-CNN to the latest RefineDet, and discusses the model
description and training details of each model. We also draw a comparison
among those models.
Comment: 17 pages, 11 figures, 1 table
Towards Multi-class Object Detection in Unconstrained Remote Sensing Imagery
Automatic multi-class object detection in remote sensing images in
unconstrained scenarios is of high interest for several applications including
traffic monitoring and disaster management. The huge variation in object scale,
orientation, category, and complex backgrounds, as well as the different camera
sensors pose great challenges for current algorithms. In this work, we propose
a new method consisting of a novel joint image cascade and feature pyramid
network with multi-size convolution kernels to extract multi-scale strong and
weak semantic features. These features are fed into rotation-based region
proposal and region of interest networks to produce object detections. Finally,
rotational non-maximum suppression is applied to remove redundant detections.
During training, we minimize joint horizontal and oriented bounding box loss
functions, as well as a novel loss that enforces oriented boxes to be
rectangular. Our method achieves 68.16% mAP on horizontal and 72.45% mAP on
oriented bounding box detection tasks on the challenging DOTA dataset,
outperforming all published methods by a large margin (+6% and +12% absolute
improvement, respectively). Furthermore, it generalizes to two other datasets,
NWPU VHR-10 and UCAS-AOD, and achieves competitive results with the baselines
even when trained on DOTA. Our method can be deployed in multi-class object
detection applications, regardless of the image and object scales and
orientations, making it a great choice for unconstrained aerial and satellite
imagery.
Comment: ACCV 201
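The rotational NMS step can be illustrated with a short polygon-based sketch (requires the shapely package); the greedy suppression rule is standard, and the polygon IoU here is an illustrative choice rather than necessarily the authors' implementation.

```python
import numpy as np
from shapely.geometry import Polygon

def rotated_nms(polys, scores, iou_thr=0.5):
    """Toy rotational NMS: greedily keep the highest-scoring oriented
    box and suppress any remaining box whose polygon IoU with it
    exceeds the threshold. `polys` is a list of 4x2 corner arrays."""
    order = np.argsort(-np.asarray(scores))
    shapes = [Polygon(p) for p in polys]
    suppressed = np.zeros(len(polys), dtype=bool)
    keep = []
    for i in order:
        if suppressed[i]:
            continue
        keep.append(i)
        for j in order:
            if j == i or suppressed[j]:
                continue
            inter = shapes[i].intersection(shapes[j]).area
            union = shapes[i].area + shapes[j].area - inter
            if union > 0 and inter / union > iou_thr:
                suppressed[j] = True
    return keep

box_a = np.array([[0, 0], [2, 0], [2, 1], [0, 1]])
box_b = box_a + 0.1            # heavily overlapping duplicate
kept = rotated_nms([box_a, box_b], scores=[0.9, 0.8])  # keeps only box_a
```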
Image Co-localization by Mimicking a Good Detector's Confidence Score Distribution
Given a set of images containing objects from the same category, the task of
image co-localization is to identify and localize each instance. This paper
shows that this problem can be solved by a simple but intriguing idea, that is,
a common object detector can be learnt by making its detection confidence
scores distributed like those of a strongly supervised detector. More
specifically, we observe that given a set of object proposals extracted from an
image that contains the object of interest, an accurate strongly supervised
object detector should give high scores to only a small minority of proposals,
and low scores to most of them. Thus, we devise an entropy-based objective
function to enforce the above property when learning the common object
detector. Once the detector is learnt, we resort to a segmentation approach to
refine the localization. We show that despite its simplicity, our approach
outperforms state-of-the-art methods.
Comment: Accepted to Proc. European Conf. Computer Vision 201
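The entropy-based objective can be sketched directly: softmax the detector's scores over an image's proposals and penalize high entropy, so training concentrates confidence on a few proposals, mimicking a strongly supervised detector. The exact normalization and weighting in the paper may differ from this sketch.

```python
import torch

def score_entropy(proposal_scores):
    """Sketch of an entropy objective over one image's proposal
    scores: low entropy means confidence is concentrated on a few
    proposals, as an accurate supervised detector would behave."""
    p = torch.softmax(proposal_scores, dim=0)
    return -(p * torch.log(p + 1e-12)).sum()

# a peaked score distribution has much lower entropy than a flat one
flat = torch.zeros(100)
peaked = torch.zeros(100)
peaked[0] = 10.0
print(score_entropy(flat).item(), score_entropy(peaked).item())
```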