11,704 research outputs found
Accurate Single Stage Detector Using Recurrent Rolling Convolution
Most of the recent successful methods in accurate object detection and
localization used some variants of R-CNN style two stage Convolutional Neural
Networks (CNN) where plausible regions were proposed in the first stage then
followed by a second stage for decision refinement. Despite the simplicity of
training and the efficiency in deployment, the single stage detection methods
have not been as competitive when evaluated in benchmarks consider mAP for high
IoU thresholds. In this paper, we proposed a novel single stage end-to-end
trainable object detection network to overcome this limitation. We achieved
this by introducing Recurrent Rolling Convolution (RRC) architecture over
multi-scale feature maps to construct object classifiers and bounding box
regressors which are "deep in context". We evaluated our method in the
challenging KITTI dataset which measures methods under IoU threshold of 0.7. We
showed that with RRC, a single reduced VGG-16 based model already significantly
outperformed all the previously published results. At the time this paper was
written our models ranked the first in KITTI car detection (the hard level),
the first in cyclist detection and the second in pedestrian detection. These
results were not reached by the previous single stage methods. The code is
publicly available.Comment: CVPR 201
Harvesting Discriminative Meta Objects with Deep CNN Features for Scene Classification
Recent work on scene classification still makes use of generic CNN features
in a rudimentary manner. In this ICCV 2015 paper, we present a novel pipeline
built upon deep CNN features to harvest discriminative visual objects and parts
for scene classification. We first use a region proposal technique to generate
a set of high-quality patches potentially containing objects, and apply a
pre-trained CNN to extract generic deep features from these patches. Then we
perform both unsupervised and weakly supervised learning to screen these
patches and discover discriminative ones representing category-specific objects
and parts. We further apply discriminative clustering enhanced with local CNN
fine-tuning to aggregate similar objects and parts into groups, called meta
objects. A scene image representation is constructed by pooling the feature
response maps of all the learned meta objects at multiple spatial scales. We
have confirmed that the scene image representation obtained using this new
pipeline is capable of delivering state-of-the-art performance on two popular
scene benchmark datasets, MIT Indoor 67~\cite{MITIndoor67} and
Sun397~\cite{Sun397}Comment: To Appear in ICCV 201
Deep Descriptor Transforming for Image Co-Localization
Reusable model design becomes desirable with the rapid expansion of machine
learning applications. In this paper, we focus on the reusability of
pre-trained deep convolutional models. Specifically, different from treating
pre-trained models as feature extractors, we reveal more treasures beneath
convolutional layers, i.e., the convolutional activations could act as a
detector for the common object in the image co-localization problem. We propose
a simple but effective method, named Deep Descriptor Transforming (DDT), for
evaluating the correlations of descriptors and then obtaining the
category-consistent regions, which can accurately locate the common object in a
set of images. Empirical studies validate the effectiveness of the proposed DDT
method. On benchmark image co-localization datasets, DDT consistently
outperforms existing state-of-the-art methods by a large margin. Moreover, DDT
also demonstrates good generalization ability for unseen categories and
robustness for dealing with noisy data.Comment: Accepted by IJCAI 201
Frustum PointNets for 3D Object Detection from RGB-D Data
In this work, we study 3D object detection from RGB-D data in both indoor and
outdoor scenes. While previous methods focus on images or 3D voxels, often
obscuring natural 3D patterns and invariances of 3D data, we directly operate
on raw point clouds by popping up RGB-D scans. However, a key challenge of this
approach is how to efficiently localize objects in point clouds of large-scale
scenes (region proposal). Instead of solely relying on 3D proposals, our method
leverages both mature 2D object detectors and advanced 3D deep learning for
object localization, achieving efficiency as well as high recall for even small
objects. Benefited from learning directly in raw point clouds, our method is
also able to precisely estimate 3D bounding boxes even under strong occlusion
or with very sparse points. Evaluated on KITTI and SUN RGB-D 3D detection
benchmarks, our method outperforms the state of the art by remarkable margins
while having real-time capability.Comment: 15 pages, 12 figures, 14 table
- …