Parsing Occluded People by Flexible Compositions
This paper presents an approach to parsing humans when there is significant
occlusion. We model humans using a graphical model which has a tree structure
building on recent work [32, 6] and exploit the connectivity prior that, even
in presence of occlusion, the visible nodes form a connected subtree of the
graphical model. We call each connected subtree a flexible composition of
object parts. This involves a novel method for learning occlusion cues. During
inference we need to search over a mixture of different flexible models. By
exploiting part sharing, we show that this inference can be done extremely
efficiently requiring only twice as many computations as searching for the
entire object (i.e., not modeling occlusion). We evaluate our model on the
standard "We Are Family" Stickmen benchmark dataset and obtain significant
performance improvements over the best alternative algorithms.
Comment: CVPR 15 Camera Ready
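The flexible-composition idea above can be illustrated with a small sketch: in a tree-structured part model, dynamic programming scores each subtree, and allowing any child subtree to be dropped at an occlusion cost means the visible parts always form a connected subtree. The function, scores, and cost below are illustrative assumptions, not the paper's actual model or learned parameters.

```python
# Hedged sketch: tree DP in which each child subtree may be "pruned"
# (declared occluded) at a fixed cost, so the retained nodes form a
# connected subtree -- a flexible composition. All names and numbers
# are hypothetical.

def best_score(node, part_score, children, occlusion_cost):
    """Best score of the flexible composition rooted at `node`.

    part_score[node] -- unary appearance score of the part
    children[node]   -- child nodes in the tree model
    occlusion_cost   -- penalty for marking a child subtree occluded
    """
    total = part_score[node]
    for child in children.get(node, []):
        keep = best_score(child, part_score, children, occlusion_cost)
        # Either include the child's whole subtree or pay to occlude it.
        total += max(keep, -occlusion_cost)
    return total

part_score = {"torso": 2.0, "head": 1.5, "arm": -3.0}  # arm poorly supported
children = {"torso": ["head", "arm"]}
score = best_score("torso", part_score, children, occlusion_cost=0.5)
# The weak "arm" branch is dropped (paying 0.5) rather than kept (-3.0).
```

Because every subtree score is computed once and reused by its parent, the same pass that scores the full model also scores every composition, which is the intuition behind the paper's claim of roughly twice the cost of full-object search.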
3D hand pose estimation using convolutional neural networks
3D hand pose estimation plays a fundamental role in natural human computer interactions. The problem is challenging due to complicated variations caused by complex articulations, multiple viewpoints, self-similar parts, severe self-occlusions, different shapes and sizes.
To handle these challenges, the thesis makes the following contributions. First, the problem of multiple viewpoints and complex articulations in hand pose estimation is tackled by decomposing and transforming the input and output spaces with spatial transformations that follow the hand structure. This transformation reduces the variation of both the input and output spaces, which makes learning easier.
The second contribution is a probabilistic framework integrating all the hierarchical regressions. Variants with and without sampling, using different regressors and optimization methods, are constructed and compared to provide insight into the components of this framework.
The third contribution is based on the observation that for images with occlusions, there exist multiple plausible configurations for the occluded parts.
A hierarchical mixture density network is proposed to handle the multi-modality of the locations of occluded hand joints. It leverages state-of-the-art hand pose estimators based on Convolutional Neural Networks to facilitate feature learning, while modeling the multiple modes in a two-level hierarchy to reconcile the single-valued (for visible joints) and multi-valued (for occluded joints) mappings in its output.
In addition, a completely labeled real hand dataset is collected by a tracking system with six 6D magnetic sensors and inverse kinematics to automatically obtain 21-joint hand pose annotations of depth maps.
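The mixture-density idea for occluded joints can be sketched numerically: the network head (omitted here) would emit mixture weights, means, and scales per joint; a visible joint collapses to one dominant mode, while an occluded joint keeps several plausible modes. All numbers below are made up for illustration and are not the thesis's architecture.

```python
import numpy as np

# Hedged sketch of evaluating a Gaussian mixture density over a joint's
# 3D location, as a mixture density network's output layer would.

def mixture_density(y, weights, means, sigmas):
    """Likelihood of 3D location y under an isotropic Gaussian mixture."""
    weights = np.exp(weights - weights.max())
    weights /= weights.sum()                    # softmax over mixture logits
    d = y - means                               # (K, 3) offsets to each mode
    sq = (d ** 2).sum(axis=1)
    norm = (2 * np.pi * sigmas ** 2) ** (-1.5)  # 3D isotropic normaliser
    return float((weights * norm * np.exp(-sq / (2 * sigmas ** 2))).sum())

y = np.array([0.1, 0.0, 0.0])
means = np.array([[0.0, 0.0, 0.0],              # two plausible modes for an
                  [1.0, 1.0, 1.0]])             # occluded joint
p = mixture_density(y, weights=np.array([0.0, 0.0]),
                    means=means, sigmas=np.array([0.2, 0.2]))
```

Keeping several modes lets the estimator report genuinely ambiguous hypotheses for occluded joints instead of averaging them into an implausible single point.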
Fine-Grained Classification of Pedestrians in Video: Benchmark and State of the Art
A video dataset designed to study fine-grained categorisation of pedestrians is introduced. Pedestrians were recorded “in-the-wild” from a moving vehicle. Annotations include bounding boxes, tracks, 14 keypoints with occlusion information and the fine-grained categories of age (5 classes), sex (2 classes), weight (3 classes) and clothing style (4 classes). There are a total of 27,454 bounding box and pose labels across 4222 tracks. This dataset is designed to train and test algorithms for fine-grained categorisation of people; it is also useful for benchmarking tracking, detection and pose estimation of pedestrians. State-of-the-art algorithms for fine-grained classification and pose estimation were tested using the dataset, and the results are reported as a useful performance baseline.
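The annotation layout described above can be mirrored in a small illustrative record: one entry per pedestrian per frame, carrying the bounding box, track identity, 14 keypoints with occlusion flags, and the four category labels. The field names and file format below are assumptions for illustration, not the dataset's actual schema.

```python
# Hypothetical record mirroring the annotations described in the abstract;
# the dataset's real storage format is not specified here.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class PedestrianAnnotation:
    track_id: int
    bbox: Tuple[float, float, float, float]     # x, y, width, height
    keypoints: List[Tuple[float, float, bool]]  # 14 x (x, y, occluded)
    age: int        # one of 5 age classes
    sex: int        # one of 2 sex classes
    weight: int     # one of 3 weight classes
    clothing: int   # one of 4 clothing-style classes

ann = PedestrianAnnotation(
    track_id=17,
    bbox=(120.0, 40.0, 60.0, 180.0),
    keypoints=[(150.0, 60.0, False)] * 14,
    age=2, sex=1, weight=1, clothing=3,
)
```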
Towards Scene Understanding with Detailed 3D Object Representations
Current approaches to semantic image and scene understanding typically employ
rather simple object representations such as 2D or 3D bounding boxes. While
such coarse models are robust and allow for reliable object detection, they
discard much of the information about objects' 3D shape and pose, and thus do
not lend themselves well to higher-level reasoning. Here, we propose to base
scene understanding on a high-resolution object representation. An object class
- in our case cars - is modeled as a deformable 3D wireframe, which enables
fine-grained modeling at the level of individual vertices and faces. We augment
that model to explicitly include vertex-level occlusion, and embed all
instances in a common coordinate frame, in order to infer and exploit
object-object interactions. Specifically, from a single view we jointly
estimate the shapes and poses of multiple objects in a common 3D frame. A
ground plane in that frame is estimated by consensus among different objects,
which significantly stabilizes monocular 3D pose estimation. The fine-grained
model, in conjunction with the explicit 3D scene model, further allows one to
infer part-level occlusions between the modeled objects, as well as occlusions
by other, unmodeled scene elements. To demonstrate the benefits of such
detailed object class models in the context of scene understanding we
systematically evaluate our approach on the challenging KITTI street scene
dataset. The experiments show that the model's ability to utilize image
evidence at the level of individual parts improves monocular 3D pose estimation
w.r.t. both location and (continuous) viewpoint.
Comment: International Journal of Computer Vision (appeared online on 4 November 2014). Online version: http://link.springer.com/article/10.1007/s11263-014-0780-
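The ground-plane consensus idea above can be sketched as follows: each object instance embedded in the common 3D frame votes for a ground height with the bottom of its wireframe, and a robust aggregate suppresses votes from badly fit instances. The real method's plane parameterisation is richer than this scalar-height toy; the median here is only an illustrative robust estimator.

```python
import numpy as np

# Hedged sketch of consensus ground estimation: robustly aggregate the
# lowest-vertex heights of all modeled instances in the shared frame.

def consensus_ground_height(object_bottoms):
    """object_bottoms: per-instance lowest-vertex heights (shared frame)."""
    return float(np.median(object_bottoms))

votes = [0.02, -0.01, 0.00, 0.85]   # last instance is a bad fit (outlier)
h = consensus_ground_height(votes)  # the outlier barely moves the estimate
```

Tying every instance to one shared plane is what stabilizes monocular pose: a single car's depth is ambiguous, but several cars constrained to stand on the same ground are not.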
Facial Landmark Detection Evaluation on MOBIO Database
MOBIO is a bi-modal database that was captured almost exclusively on mobile
phones. It aims to improve research into deploying biometric techniques to
mobile devices. Research has shown that face and speaker recognition can
be performed in a mobile environment. Facial landmark localization aims at
finding the coordinates of a set of pre-defined key points for 2D face images.
A facial landmark usually has specific semantic meaning, e.g. nose tip or eye
centre, which provides rich geometric information for other face analysis tasks
such as face recognition, emotion estimation and 3D face reconstruction. Most
facial landmark detection methods adopt still face databases, such as
300W, AFW, AFLW, or COFW, for evaluation, but seldom use mobile data. Our
work is the first to perform facial landmark detection evaluation on mobile
still data, i.e., face images from the MOBIO database. About 20,600 face images
have been extracted from this audio-visual database and manually labeled with
22 landmarks as the groundtruth. Several state-of-the-art facial landmark
detection methods are adopted to evaluate their performance on these data. The
results show that the data from the MOBIO database are highly challenging, and that
this database can serve as a new, challenging benchmark for facial landmark detection evaluation.
Comment: 13 pages, 10 figures
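A landmark evaluation like the one described is commonly scored with the normalised mean error (NME): the mean per-landmark distance divided by a normalising length, typically the inter-ocular distance. The sketch below shows that standard metric; the landmark indices are placeholders and do not correspond to the MOBIO 22-point labeling.

```python
import numpy as np

# Hedged sketch of the normalised mean error used to score facial
# landmark detectors; eye indices here are illustrative placeholders.

def nme(pred, gt, left_eye_idx, right_eye_idx):
    """Mean landmark error divided by the inter-ocular distance."""
    errors = np.linalg.norm(pred - gt, axis=1)      # per-landmark distances
    interocular = np.linalg.norm(gt[left_eye_idx] - gt[right_eye_idx])
    return float(errors.mean() / interocular)

gt = np.array([[0.0, 0.0], [10.0, 0.0], [5.0, 8.0]])   # two eyes + one point
pred = gt + np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]])
score = nme(pred, gt, left_eye_idx=0, right_eye_idx=1)  # 1.0 / 10.0
```

Normalising by inter-ocular distance makes the score comparable across face sizes, which matters for mobile captures with widely varying camera distance.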