Attend Refine Repeat: Active Box Proposal Generation via In-Out Localization
Computing category-agnostic bounding box proposals is a core component of many
computer vision tasks and has therefore attracted a lot of attention lately. In
this work we propose a new approach to tackle this
problem that is based on an active strategy for generating box proposals that
starts from a set of seed boxes, which are uniformly distributed on the image,
and then progressively moves its attention to the promising image areas where
it is more likely to discover well localized bounding box proposals. We call
our approach AttractioNet and a core component of it is a CNN-based category
agnostic object location refinement module that is capable of yielding accurate
and robust bounding box predictions regardless of the object category.
We extensively evaluate our AttractioNet approach on several image datasets
(i.e. COCO, PASCAL, ImageNet detection and NYU-Depth V2 datasets) reporting on
all of them state-of-the-art results that surpass the previous work in the
field by a significant margin and also providing strong empirical evidence that
our approach is capable of generalizing to unseen categories. Furthermore, we
evaluate our AttractioNet proposals in the context of the object detection task
using a VGG16-Net based detector and the achieved detection performance on COCO
manages to significantly surpass all other VGG16-Net based detectors while even
being competitive with a heavily tuned ResNet-101 based detector. Code as well
as box proposals computed for several datasets are available at:
https://github.com/gidariss/AttractioNet.
Comment: Technical report.
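The active strategy described in this abstract (uniformly distributed seed boxes that are repeatedly refined) can be illustrated with a minimal sketch. This is not the AttractioNet code: the names `seed_boxes`, `attend_refine_repeat`, and `toy_refine` are hypothetical, the grid and box sizes are arbitrary, and `toy_refine` is only a stand-in for the CNN-based refinement module, which is assumed here to map a box to a better localized box.

```python
from typing import Callable, List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in pixels


def seed_boxes(width: float, height: float, grid: int = 3,
               size_frac: float = 0.5) -> List[Box]:
    """Place grid x grid equally sized boxes with uniformly spaced centers."""
    bw, bh = width * size_frac, height * size_frac
    boxes = []
    for i in range(grid):
        for j in range(grid):
            cx = (j + 0.5) * width / grid
            cy = (i + 0.5) * height / grid
            # Clip each box to the image borders.
            boxes.append((max(0.0, cx - bw / 2), max(0.0, cy - bh / 2),
                          min(float(width), cx + bw / 2),
                          min(float(height), cy + bh / 2)))
    return boxes


def attend_refine_repeat(seeds: List[Box], refine: Callable[[Box], Box],
                         steps: int = 3) -> List[Box]:
    """Refine every box `steps` times and keep all intermediate candidates."""
    candidates: List[Box] = []
    current = list(seeds)
    for _ in range(steps):
        current = [refine(box) for box in current]
        candidates.extend(current)
    return candidates


def toy_refine(box: Box, target: Box = (40.0, 30.0, 80.0, 90.0)) -> Box:
    """Hypothetical refinement: move each coordinate halfway to a fixed target."""
    return tuple(b + 0.5 * (t - b) for b, t in zip(box, target))


seeds = seed_boxes(120, 120, grid=3)
proposals = attend_refine_repeat(seeds, toy_refine, steps=3)
```

In a real system the candidates would still need to be scored and de-duplicated; that step is left out because the abstract does not describe it.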
A robust and efficient video representation for action recognition
This paper introduces a state-of-the-art video representation and applies it
to efficient action recognition and detection. We first propose to improve the
popular dense trajectory features by explicit camera motion estimation. More
specifically, we extract feature point matches between frames using SURF
descriptors and dense optical flow. The matches are used to estimate a
homography with RANSAC. To improve the robustness of homography estimation, a
human detector is employed to remove outlier matches from the human body as
human motion is not constrained by the camera. Trajectories consistent with the
homography are attributed to camera motion and thus removed. We also
use the homography to cancel out camera motion from the optical flow. This
results in a significant improvement for the motion-based HOF and MBH descriptors. We
further explore the recent Fisher vector as an alternative feature encoding
approach to the standard bag-of-words histogram, and consider different ways to
include spatial layout information in these encodings. We present a large and
varied set of evaluations, considering (i) classification of short basic
actions on six datasets, (ii) localization of such actions in feature-length
movies, and (iii) large-scale recognition of complex events. We find that our
improved trajectory features significantly outperform previous dense
trajectories, and that Fisher vectors are superior to bag-of-words encodings
for video recognition tasks. In all three tasks, we show substantial
improvements over the state-of-the-art results.
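The camera-motion handling described in this abstract can be sketched as follows, assuming a homography has already been estimated (the RANSAC step is omitted). This is a simplification, not the authors' implementation: it subtracts the homography-induced displacement from a flow vector, uses a single homography for all frame pairs of a track, and the function names and the one-pixel `threshold` are hypothetical.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]
Matrix3 = Sequence[Sequence[float]]  # 3x3 homography, row-major


def apply_homography(H: Matrix3, p: Point) -> Point:
    """Map a point with H using homogeneous coordinates."""
    x, y = p
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return ((H[0][0] * x + H[0][1] * y + H[0][2]) / w,
            (H[1][0] * x + H[1][1] * y + H[1][2]) / w)


def compensate_flow(H: Matrix3, p: Point, flow: Point) -> Point:
    """Subtract the displacement that camera motion alone would cause at p."""
    qx, qy = apply_homography(H, p)
    return (flow[0] - (qx - p[0]), flow[1] - (qy - p[1]))


def is_camera_trajectory(H: Matrix3, track: Sequence[Point],
                         threshold: float = 1.0) -> bool:
    """True if every step of the track is explained by H within `threshold` pixels."""
    if len(track) < 2:
        raise ValueError("a track needs at least two points")
    residuals = []
    for p, q in zip(track, track[1:]):
        rx, ry = compensate_flow(H, p, (q[0] - p[0], q[1] - p[1]))
        residuals.append(math.hypot(rx, ry))
    return max(residuals) < threshold


# Assumed camera motion: a pure shift of 2 pixels to the right per frame.
H_shift = [[1, 0, 2], [0, 1, 0], [0, 0, 1]]
background_track = [(0, 0), (2, 0), (4, 0)]   # moves exactly with the camera
actor_track = [(0, 0), (5, 1), (10, 2)]       # has additional motion of its own
```

Under these assumptions the background track would be discarded as camera motion, while the actor track is kept and its compensated flow feeds the motion descriptors.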