950 research outputs found
Object Detection in 20 Years: A Survey
Object detection, as of one the most fundamental and challenging problems in
computer vision, has received great attention in recent years. Its development
in the past two decades can be regarded as an epitome of computer vision
history. If we think of today's object detection as a technical aesthetics
under the power of deep learning, then turning back the clock 20 years we would
witness the wisdom of cold weapon era. This paper extensively reviews 400+
papers of object detection in the light of its technical evolution, spanning
over a quarter-century's time (from the 1990s to 2019). A number of topics have
been covered in this paper, including the milestone detectors in history,
detection datasets, metrics, fundamental building blocks of the detection
system, speed up techniques, and the recent state of the art detection methods.
This paper also reviews some important detection applications, such as
pedestrian detection, face detection, text detection, etc, and makes an in-deep
analysis of their challenges as well as technical improvements in recent years.Comment: This work has been submitted to the IEEE TPAMI for possible
publicatio
MonoPerfCap: Human Performance Capture from Monocular Video
We present the first marker-less approach for temporally coherent 3D
performance capture of a human with general clothing from monocular video. Our
approach reconstructs articulated human skeleton motion as well as medium-scale
non-rigid surface deformations in general scenes. Human performance capture is
a challenging problem due to the large range of articulation, potentially fast
motion, and considerable non-rigid deformations, even from multi-view data.
Reconstruction from monocular video alone is drastically more challenging,
since strong occlusions and the inherent depth ambiguity lead to a highly
ill-posed reconstruction problem. We tackle these challenges by a novel
approach that employs sparse 2D and 3D human pose detections from a
convolutional neural network using a batch-based pose estimation strategy.
Joint recovery of per-batch motion allows to resolve the ambiguities of the
monocular reconstruction problem based on a low dimensional trajectory
subspace. In addition, we propose refinement of the surface geometry based on
fully automatically extracted silhouettes to enable medium-scale non-rigid
alignment. We demonstrate state-of-the-art performance capture results that
enable exciting applications such as video editing and free viewpoint video,
previously infeasible from monocular video. Our qualitative and quantitative
evaluation demonstrates that our approach significantly outperforms previous
monocular methods in terms of accuracy, robustness and scene complexity that
can be handled.Comment: Accepted to ACM TOG 2018, to be presented on SIGGRAPH 201
Recovering 6D Object Pose and Predicting Next-Best-View in the Crowd
Object detection and 6D pose estimation in the crowd (scenes with multiple
object instances, severe foreground occlusions and background distractors), has
become an important problem in many rapidly evolving technological areas such
as robotics and augmented reality. Single shot-based 6D pose estimators with
manually designed features are still unable to tackle the above challenges,
motivating the research towards unsupervised feature learning and
next-best-view estimation. In this work, we present a complete framework for
both single shot-based 6D object pose estimation and next-best-view prediction
based on Hough Forests, the state of the art object pose estimator that
performs classification and regression jointly. Rather than using manually
designed features we a) propose an unsupervised feature learnt from
depth-invariant patches using a Sparse Autoencoder and b) offer an extensive
evaluation of various state of the art features. Furthermore, taking advantage
of the clustering performed in the leaf nodes of Hough Forests, we learn to
estimate the reduction of uncertainty in other views, formulating the problem
of selecting the next-best-view. To further improve pose estimation, we propose
an improved joint registration and hypotheses verification module as a final
refinement step to reject false detections. We provide two additional
challenging datasets inspired from realistic scenarios to extensively evaluate
the state of the art and our framework. One is related to domestic environments
and the other depicts a bin-picking scenario mostly found in industrial
settings. We show that our framework significantly outperforms state of the art
both on public and on our datasets.Comment: CVPR 2016 accepted paper, project page:
http://www.iis.ee.ic.ac.uk/rkouskou/6D_NBV.htm
Maximized Posteriori Attributes Selection from Facial Salient Landmarks for Face Recognition
This paper presents a robust and dynamic face recognition technique based on
the extraction and matching of devised probabilistic graphs drawn on SIFT
features related to independent face areas. The face matching strategy is based
on matching individual salient facial graph characterized by SIFT features as
connected to facial landmarks such as the eyes and the mouth. In order to
reduce the face matching errors, the Dempster-Shafer decision theory is applied
to fuse the individual matching scores obtained from each pair of salient
facial features. The proposed algorithm is evaluated with the ORL and the IITK
face databases. The experimental results demonstrate the effectiveness and
potential of the proposed face recognition technique also in case of partially
occluded faces.Comment: 8 pages, 2 figure
OccupancyDETR: Making Semantic Scene Completion as Straightforward as Object Detection
Visual-based 3D semantic occupancy perception (also known as 3D semantic
scene completion) is a new perception paradigm for robotic applications like
autonomous driving. Compared with Bird's Eye View (BEV) perception, it extends
the vertical dimension, significantly enhancing the ability of robots to
understand their surroundings. However, due to this very reason, the
computational demand for current 3D semantic occupancy perception methods
generally surpasses that of BEV perception methods and 2D perception methods.
We propose a novel 3D semantic occupancy perception method, OccupancyDETR,
which consists of a DETR-like object detection module and a 3D occupancy
decoder module. The integration of object detection simplifies our method
structurally - instead of predicting the semantics of each voxels, it
identifies objects in the scene and their respective 3D occupancy grids. This
speeds up our method, reduces required resources, and leverages object
detection algorithm, giving our approach notable performance on small objects.
We demonstrate the effectiveness of our proposed method on the SemanticKITTI
dataset, showcasing an mIoU of 23 and a processing speed of 6 frames per
second, thereby presenting a promising solution for real-time 3D semantic scene
completion
- …