Hand Keypoint Detection in Single Images using Multiview Bootstrapping
We present an approach that uses a multi-camera system to train fine-grained
detectors for keypoints that are prone to occlusion, such as the joints of a
hand. We call this procedure multiview bootstrapping: first, an initial
keypoint detector is used to produce noisy labels in multiple views of the
hand. The noisy detections are then triangulated in 3D using multiview geometry
or marked as outliers. Finally, the reprojected triangulations are used as new
labeled training data to improve the detector. We repeat this process,
generating more labeled data in each iteration. We derive a result analytically
relating the minimum number of views to achieve target true and false positive
rates for a given detector. The method is used to train a hand keypoint
detector for single images. The resulting keypoint detector runs in realtime on
RGB images and has accuracy comparable to methods that use depth sensors. The
single view detector, triangulated over multiple views, enables 3D markerless
hand motion capture with complex object interactions.
Comment: CVPR 201
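The triangulate, reject outliers, and reproject step described in the abstract above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses plain linear (DLT) triangulation and a simple maximum reprojection error test in place of whatever robust estimation the paper uses, and the function names and the `max_error` threshold are hypothetical.

```python
import numpy as np

def triangulate_dlt(projections, points_2d):
    """Linear (DLT) triangulation of one keypoint seen in two or more views.

    projections: list of 3x4 camera projection matrices.
    points_2d:   list of (u, v) detections, one per view.
    """
    rows = []
    for P, (u, v) in zip(projections, points_2d):
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    A = np.stack(rows)
    # The homogeneous 3D point is the right singular vector of A
    # associated with the smallest singular value.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]

def reproject(P, X):
    """Project a 3D point X into the view with projection matrix P."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

def bootstrap_label(projections, detections, max_error=4.0):
    """Return reprojected 2D labels for every view, or None if the
    detections are inconsistent (hypothetical threshold, same units
    as the detections)."""
    X = triangulate_dlt(projections, detections)
    errors = [np.linalg.norm(reproject(P, X) - np.asarray(d))
              for P, d in zip(projections, detections)]
    if max(errors) > max_error:
        return None  # treat this keypoint as an outlier
    return [reproject(P, X) for P in projections]
```

In a full loop, the labels returned by `bootstrap_label` would be added to the training set, the detector retrained, and the process repeated; those steps are omitted here.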
V2V-PoseNet: Voxel-to-Voxel Prediction Network for Accurate 3D Hand and Human Pose Estimation from a Single Depth Map
Most of the existing deep learning-based methods for 3D hand and human pose
estimation from a single depth map are based on a common framework that takes a
2D depth map and directly regresses the 3D coordinates of keypoints, such as
hand or human body joints, via 2D convolutional neural networks (CNNs). The
first weakness of this approach is the presence of perspective distortion in
the 2D depth map. While the depth map is intrinsically 3D data, many previous
methods treat depth maps as 2D images that can distort the shape of the actual
object through projection from 3D to 2D space. This compels the network to
perform perspective distortion-invariant estimation. The second weakness of the
conventional approach is that directly regressing 3D coordinates from a 2D
image is a highly non-linear mapping, which causes difficulty in the learning
procedure. To overcome these weaknesses, we first cast the 3D hand and human
pose estimation problem from a single depth map into a voxel-to-voxel
prediction that uses a 3D voxelized grid and estimates the per-voxel likelihood
for each keypoint. We design our model as a 3D CNN that provides accurate
estimates while running in real-time. Our system outperforms previous methods
in almost all publicly available 3D hand and human pose estimation datasets and
placed first in the HANDS 2017 frame-based 3D hand pose estimation challenge.
The code is available at https://github.com/mks0601/V2V-PoseNet_RELEASE.
Comment: HANDS 2017 Challenge Frame-based 3D Hand Pose Estimation Winner (ICCV
2017), Published at CVPR 201
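The voxel-to-voxel formulation in the abstract above depends on turning a depth map into a 3D occupancy grid and reading a keypoint back from a per-voxel likelihood volume. The sketch below illustrates only those two conversions, not the 3D CNN itself. It assumes a pinhole camera model; the intrinsics (`fx`, `fy`, `cx`, `cy`), the grid resolution, the cube size, and all function names are hypothetical choices for illustration and are not taken from the paper or its released code.

```python
import numpy as np

def depth_to_points(depth, fx, fy, cx, cy):
    """Back-project valid depth pixels to 3D camera coordinates
    (pinhole model, assumed here)."""
    v, u = np.nonzero(depth > 0)
    z = depth[v, u]
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)

def voxelize(points, center, cube_size, grid=32):
    """Binary occupancy grid over a cube of side cube_size around center."""
    idx = np.floor((points - center) / cube_size * grid + grid / 2).astype(int)
    inside = np.all((idx >= 0) & (idx < grid), axis=1)
    volume = np.zeros((grid, grid, grid), dtype=np.float32)
    i = idx[inside]
    volume[i[:, 0], i[:, 1], i[:, 2]] = 1.0
    return volume

def heatmap_to_point(heatmap, center, cube_size):
    """Map the most likely voxel of a per-voxel likelihood volume
    back to the center of that voxel in camera coordinates."""
    grid = heatmap.shape[0]
    idx = np.array(np.unravel_index(np.argmax(heatmap), heatmap.shape))
    return (idx + 0.5 - grid / 2) / grid * cube_size + center
```

A network in this style would take the output of `voxelize` as input and produce one likelihood volume per keypoint, each of which `heatmap_to_point` converts to a 3D coordinate.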
RGB-D-based Action Recognition Datasets: A Survey
Human action recognition from RGB-D (Red, Green, Blue and Depth) data has
attracted increasing attention since the first work reported in 2010. Over this
period, many benchmark datasets have been created to facilitate the development
and evaluation of new algorithms. This raises the question of which dataset to
select and how to use it in providing a fair and objective comparative
evaluation against state-of-the-art methods. To address this issue, this paper
provides a comprehensive review of the most commonly used action recognition
related RGB-D video datasets, including 27 single-view datasets, 10 multi-view
datasets, and 7 multi-person datasets. The detailed information and analysis of
these datasets is a useful resource in guiding insightful selection of datasets
for future research. In addition, the issues with current algorithm evaluation
vis-à-vis limitations of the available datasets and evaluation protocols
are also highlighted, resulting in a number of recommendations for collection
of new datasets and use of evaluation protocols.
Healthcare Event and Activity Logging.
The health of patients in the intensive care unit (ICU) can change frequently and inexplicably. Crucial events and activities responsible for these changes often go unnoticed. This paper introduces healthcare event and action logging (HEAL), which automatically and unobtrusively monitors and reports on events and activities that occur in a medical ICU room. HEAL uses a multimodal distributed camera network to monitor and identify ICU activities and estimate sanitation-event qualifiers. At the core is a novel approach to infer person roles based on semantic interactions, a critical requirement in many healthcare settings where individuals' identities must not be revealed. The proposed approach for activity representation identifies a basis of contextual aspects and estimates aspect weights for proper action representation and reconstruction. The flexibility of the proposed algorithms enables the identification of people's roles by associating them with inferred interactions and detected activities. A fully working prototype system is developed, tested in a mock ICU room, and then deployed in two ICU rooms at a community hospital, thus offering unique capabilities for data gathering and analytics. The proposed method achieves a role identification accuracy of 84% and a backtracking role identification accuracy of 79% for obscured roles using interaction and appearance features on real ICU data. Detailed experimental results are provided in the context of four event-sanitation qualifiers: clean, transmission, contamination, and unclean.