127,292 research outputs found

    V2V-PoseNet: Voxel-to-Voxel Prediction Network for Accurate 3D Hand and Human Pose Estimation from a Single Depth Map

    Most existing deep learning-based methods for 3D hand and human pose estimation from a single depth map share a common framework: they take a 2D depth map and directly regress the 3D coordinates of keypoints, such as hand or human body joints, via 2D convolutional neural networks (CNNs). The first weakness of this approach is perspective distortion in the 2D depth map. Although a depth map is intrinsically 3D data, many previous methods treat it as a 2D image, which can distort the shape of the actual object through the projection from 3D to 2D space and compels the network to perform perspective distortion-invariant estimation. The second weakness is that directly regressing 3D coordinates from a 2D image is a highly non-linear mapping, which makes learning difficult. To overcome these weaknesses, we cast the 3D hand and human pose estimation problem from a single depth map as a voxel-to-voxel prediction that takes a 3D voxelized grid and estimates the per-voxel likelihood for each keypoint. We design our model as a 3D CNN that provides accurate estimates while running in real time. Our system outperforms previous methods on almost all publicly available 3D hand and human pose estimation datasets and placed first in the HANDS 2017 frame-based 3D hand pose estimation challenge. The code is available at https://github.com/mks0601/V2V-PoseNet_RELEASE.
    Comment: HANDS 2017 Challenge frame-based 3D hand pose estimation winner (ICCV 2017); published at CVPR 2018
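    A minimal sketch of the voxel-to-voxel idea described in the abstract: a 3D CNN maps an occupancy grid to one per-voxel likelihood volume per keypoint, from which 3D coordinates are read out. The grid size, layer widths, and the differentiable soft-argmax readout below are illustrative assumptions, not the paper's exact architecture (their released code is at the link above).

```python
import torch
import torch.nn as nn

class VoxelToVoxelNet(nn.Module):
    """Toy voxel-to-voxel network: occupancy grid in, per-keypoint likelihood volumes out."""
    def __init__(self, num_keypoints: int = 21, width: int = 32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv3d(1, width, 3, padding=1), nn.ReLU(),
            nn.Conv3d(width, width, 3, padding=1), nn.ReLU(),
        )
        self.head = nn.Conv3d(width, num_keypoints, 1)  # one volume per keypoint

    def forward(self, voxels: torch.Tensor) -> torch.Tensor:
        # voxels: (B, 1, D, H, W) binary occupancy grid built from the depth map
        return self.head(self.encoder(voxels))  # (B, K, D, H, W) likelihoods

def soft_argmax_3d(likelihood: torch.Tensor) -> torch.Tensor:
    """Differentiable per-keypoint 3D coordinates from likelihood volumes."""
    b, k, d, h, w = likelihood.shape
    probs = likelihood.reshape(b, k, -1).softmax(dim=-1).reshape(b, k, d, h, w)
    zs = torch.arange(d, dtype=probs.dtype)
    ys = torch.arange(h, dtype=probs.dtype)
    xs = torch.arange(w, dtype=probs.dtype)
    z = (probs.sum(dim=(3, 4)) * zs).sum(-1)  # expected voxel index per axis
    y = (probs.sum(dim=(2, 4)) * ys).sum(-1)
    x = (probs.sum(dim=(2, 3)) * xs).sum(-1)
    return torch.stack([x, y, z], dim=-1)  # (B, K, 3) in voxel coordinates

grid = torch.zeros(1, 1, 32, 32, 32)  # toy occupancy grid
coords = soft_argmax_3d(VoxelToVoxelNet()(grid))
```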

    Simultaneous Hand Pose and Skeleton Bone-Lengths Estimation from a Single Depth Image

    Articulated hand pose estimation is a challenging task for human-computer interaction. State-of-the-art hand pose estimation algorithms work only with the one or few subjects for which they have been calibrated or trained. In particular, hybrid methods based on learning followed by model fitting, or on model-based deep learning, do not explicitly consider varying hand shapes and sizes. In this work, we introduce a novel hybrid algorithm that estimates the 3D hand pose and the bone lengths of the hand skeleton at the same time, from a single depth image. The proposed CNN architecture learns hand pose parameters and scale parameters associated with the bone lengths simultaneously. Subsequently, a new hybrid forward kinematics layer employs both sets of parameters to estimate the 3D joint positions of the hand. For end-to-end training, we combine three public datasets, NYU, ICVL, and MSRA-2015, in one unified format to achieve large variation in hand shapes and sizes. Among hybrid methods, ours shows improved accuracy over the state of the art on the combined dataset and on the ICVL dataset, which contain multiple subjects. Our algorithm is also demonstrated to work well on unseen images.
    Comment: This paper was accepted and presented at the 3DV 2017 conference held in Qingdao, China. http://irc.cs.sdu.edu.cn/3dv
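    A hedged sketch of what a "hybrid forward kinematics layer" can look like: predicted pose (rotation) parameters and per-bone length scales are composed along a kinematic chain to produce 3D joint positions. The planar single-finger chain, template bone lengths, and angle parameterization here are illustrative assumptions, not the paper's formulation.

```python
import torch

def fk_chain(angles: torch.Tensor, scales: torch.Tensor,
             template_lengths: torch.Tensor) -> torch.Tensor:
    """Planar kinematic chain: angles (N,), scales (N,), lengths (N,) -> joints (N+1, 3)."""
    joints = [torch.zeros(3)]
    heading = torch.zeros(())  # accumulated in-plane rotation
    for angle, scale, length in zip(angles, scales, template_lengths):
        heading = heading + angle          # each joint rotates relative to its parent
        bone = scale * length              # learned scale adapts the template bone
        step = torch.stack([bone * torch.cos(heading),
                            bone * torch.sin(heading),
                            torch.zeros(())])
        joints.append(joints[-1] + step)
    return torch.stack(joints)

# Toy example: a 3-bone finger with slightly scaled template bone lengths.
angles = torch.tensor([0.3, 0.4, 0.2])       # predicted pose parameters (rad)
scales = torch.tensor([1.05, 0.97, 1.02])    # predicted bone-length scales
template = torch.tensor([40.0, 25.0, 20.0])  # template bone lengths (mm)
print(fk_chain(angles, scales, template))    # (4, 3) joint positions
```

    Because every operation is differentiable, a joint-position loss can be backpropagated through such a layer into both the pose and scale branches, which is what makes end-to-end training of the two parameter sets possible.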

    Computational Learning for Hand Pose Estimation

    Rapid advances in human–computer interaction interfaces have promised a realistic environment for gaming and entertainment in recent years. However, traditional input devices such as trackballs, keyboards, and joysticks remain a bottleneck for natural interaction between human and computer, as the two degrees of freedom these devices offer cannot suitably emulate interactions in three-dimensional space. Consequently, comprehensive hand tracking is expected to become a smart and intuitive alternative to these input devices and to enhance virtual and augmented reality experiences. In addition, the recent emergence of low-cost depth-sensing cameras has led to the broad use of RGB-D data in computer vision, raising expectations of a full 3D interpretation of hand movements for human–computer interaction interfaces.
    Although hand gestures and hand postures have become essential for a wide range of applications in computer games and augmented/virtual reality, 3D hand pose estimation remains an open and challenging problem for the following reasons: (i) the hand pose lives in a high-dimensional space because each finger and the palm are associated with several degrees of freedom; (ii) the fingers exhibit self-similarity and often occlude each other; (iii) global 3D rotations make pose estimation more difficult; and (iv) hands occupy only a few pixels of an image, and the noise in the acquired data, coupled with fast finger movement, confounds continuous hand tracking. The success of hand tracking naturally depends on synthesizing our knowledge of the hand (i.e., geometric shape and constraints on pose configurations) with latent features about hand poses from the RGB-D data stream (i.e., region of interest, key feature points such as fingertips and joints, and temporal continuity).
    In this thesis, we propose novel methods that leverage the paradigm of analysis by synthesis and create a prediction model using a population of realistic 3D hand poses. The overall goal of this work is to design a concrete framework in which computers can learn and understand perceptual attributes of human hands (i.e., self-occlusions or self-similarities of the fingers) and to develop a pragmatic solution to the real-time hand pose estimation problem implementable on a standard computer. The thesis can be broadly divided into four parts: learning the hand (i) from recommendations of similar hand poses, (ii) from low-dimensional visual representations, (iii) by hallucinating geometric representations, and (iv) from a manipulated object. Each part covers our algorithmic contributions to solving the 3D hand pose estimation problem. Additionally, the appendix proposes a pragmatic technique for applying our ideas to mobile devices with low computational power.
    Following this structure, we first review the most relevant works on depth sensor-based 3D hand pose estimation, both with and without a manipulated object. The two approaches prevalent for categorizing hand pose estimation, model-based methods and appearance-based methods, are discussed in detail. In this chapter, we also introduce work related to deep learning and attempts to achieve efficient compression of network structures. Next, we describe a synthetic 3D hand model and its motion constraints for simulating realistic human hand movements. The primary research work starts in the following chapter.
    We discuss our attempts to produce a better estimation model for 3D hand pose estimation by learning hand articulations from recommendations of similar poses; a toy sketch of this idea follows this abstract. Specifically, the unknown pose parameters for input depth data are estimated by collaboratively learning from the known parameters of all neighborhood poses. Subsequently, we discuss deep-learned, discriminative, low-dimensional features and a hierarchical solution to the stated problem based on the matrix completion framework. This work is further extended by incorporating a function of geometric properties on the surface of the hand, described by heat diffusion, which robustly captures both the local geometry of the hand and global structural representations. The problem of the hand's interaction with a physical object is considered in the following chapter. The main insight is that the interacting object can be a source of constraints on hand poses. In this view, we exploit the dependency of the pose on the shape of the object to learn discriminative features of the hand-object interaction, rather than losing hand information to partial or full object occlusion. Subsequently, we present a compressive learning technique in the appendix; our approach is flexible, enabling us to add more layers and go deeper in the deep learning architecture while keeping the number of parameters unchanged. Finally, we conclude the thesis by summarizing the presented approaches to hand pose estimation and propose future directions for further performance improvements through (i) realistically rendered synthetic hand images, (ii) incorporating RGB images as input, (iii) hand personalization, (iv) use of unstructured point clouds, and (v) embedding sensing techniques.
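    A loose, hypothetical reading of "estimating unknown pose parameters collaboratively from neighborhood poses": the thesis develops this via matrix completion, but the intuition can be shown with plain distance-weighted nearest-neighbor regression over a pose library. The feature extractor, library, and weighting scheme below are all assumptions for illustration.

```python
import numpy as np

def estimate_pose_from_neighbors(query_feat: np.ndarray,
                                 library_feats: np.ndarray,
                                 library_poses: np.ndarray,
                                 k: int = 5) -> np.ndarray:
    """Distance-weighted average of the k nearest neighbors' pose parameters."""
    dists = np.linalg.norm(library_feats - query_feat, axis=1)
    idx = np.argsort(dists)[:k]              # the k most similar library poses
    weights = 1.0 / (dists[idx] + 1e-8)      # closer neighbors recommend more strongly
    weights /= weights.sum()
    return weights @ library_poses[idx]      # (pose_dim,)

rng = np.random.default_rng(0)
feats = rng.normal(size=(1000, 64))   # low-dimensional features of library depth images
poses = rng.normal(size=(1000, 63))   # e.g., 21 joints x 3 coordinates per entry
print(estimate_pose_from_neighbors(feats[0] + 0.01, feats, poses).shape)
```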

    Rule Of Thumb: Deep derotation for improved fingertip detection

    We investigate a novel global orientation regression approach for articulated objects using a deep convolutional neural network. This is integrated with an in-plane image derotation scheme, DeROT, to tackle the problem of per-frame fingertip detection in depth images. The method reduces the complexity of learning in the space of articulated poses, which we demonstrate by applying two distinct state-of-the-art learning-based hand pose estimation methods to fingertip detection. Significant classification improvements are shown over the baseline implementation. Our framework involves no tracking, kinematic constraints, or explicit prior model of the articulated object in hand. To support our approach, we also describe a new pipeline for high-accuracy magnetic annotation and labeling of objects imaged by a depth camera.
    Comment: To be published in the proceedings of BMVC 2015
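    A sketch of the derotation geometry only, under stated assumptions: a regressor (a stand-in here, not the paper's CNN) predicts the global in-plane hand orientation, the depth image is rotated to a canonical orientation before fingertip detection, and detections are mapped back to the original frame.

```python
import numpy as np
from scipy.ndimage import rotate

def derotate(depth: np.ndarray, angle_deg: float) -> np.ndarray:
    """Rotate the depth image so the hand sits in a canonical orientation."""
    return rotate(depth, -angle_deg, reshape=False, order=1, mode="nearest")

def rotate_points_back(points: np.ndarray, angle_deg: float,
                       center: np.ndarray) -> np.ndarray:
    """Map detections found in the derotated frame back to the original image."""
    theta = np.deg2rad(angle_deg)
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    return (points - center) @ rot.T + center

depth = np.random.rand(128, 128).astype(np.float32)  # stand-in depth image
angle = 37.0                      # would come from the orientation regressor
canonical = derotate(depth, angle)
tips = np.array([[40.0, 55.0]])   # fingertip detections in the canonical frame
original_frame_tips = rotate_points_back(tips, angle, np.array([64.0, 64.0]))
```

    The payoff is that the downstream detector only ever sees hands in a narrow band of orientations, which is exactly the reduction in pose-space complexity the abstract describes.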

    Hand Pose Estimation with MEMS-Ultrasonic Sensors

    Hand tracking is an important aspect of human-computer interaction and has a wide range of applications in extended reality devices. However, current hand motion capture methods suffer from various limitations. For instance, vision-based hand pose estimation is susceptible to self-occlusion and changes in lighting conditions, while IMU-based tracking gloves experience significant drift and are not resistant to external magnetic field interference. To address these issues, we propose a novel, low-cost hand-tracking glove that utilizes several MEMS-ultrasonic sensors attached to the fingers to measure the distance matrix among the sensors. A lightweight deep network then reconstructs the hand pose from this distance matrix. Our experimental results demonstrate that the approach is accurate, size-agnostic, and robust to external interference. We also present the design logic for the sensor selection, sensor configuration, circuit diagram, and model architecture.
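    A minimal sketch of the distance-matrix-to-pose mapping: pairwise distances measured between on-finger sensors are flattened (only the upper triangle of the symmetric matrix carries information) and regressed to 3D joint positions by a small network. The sensor count, network width, and joint count are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

NUM_SENSORS = 10   # e.g., two sensors per finger (assumed)
NUM_JOINTS = 21

class DistanceMatrixPoseNet(nn.Module):
    """Toy regressor from a sensor distance matrix to 3D hand joint positions."""
    def __init__(self):
        super().__init__()
        n_pairs = NUM_SENSORS * (NUM_SENSORS - 1) // 2  # unique sensor pairs
        self.net = nn.Sequential(
            nn.Linear(n_pairs, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, NUM_JOINTS * 3),
        )

    def forward(self, dist_matrix: torch.Tensor) -> torch.Tensor:
        # dist_matrix: (B, NUM_SENSORS, NUM_SENSORS) measured pairwise distances
        iu = torch.triu_indices(NUM_SENSORS, NUM_SENSORS, offset=1)
        pairs = dist_matrix[:, iu[0], iu[1]]            # (B, n_pairs)
        return self.net(pairs).view(-1, NUM_JOINTS, 3)  # (B, 21, 3)

dists = torch.rand(1, NUM_SENSORS, NUM_SENSORS)
joints = DistanceMatrixPoseNet()(dists)
```

    Feeding distances rather than absolute positions is what plausibly makes the approach size-agnostic: a uniformly scaled hand changes the distance matrix by a scalar factor the network can learn to normalize away.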

    Relation-Based Associative Joint Location for Human Pose Estimation in Videos

    Video-based human pose estimation (HPE) is a vital yet challenging task. While deep learning methods have made significant progress on HPE, most approaches detect each joint independently, which damages the structural information of the pose. In this paper, unlike prior methods, we propose a Relation-based Pose Semantics Transfer Network (RPSTN) to locate joints associatively. Specifically, we design a lightweight joint relation extractor (JRE) that models pose structural features and generates joint heatmaps associatively, by modeling the relation between any two joints rather than building each joint heatmap independently. In effect, the JRE module models the spatial configuration of human poses through the relationships between pairs of joints. Moreover, considering the temporal semantic continuity of videos, the pose semantic information in the current frame is beneficial for guiding the location of joints in the next frame, so we use the idea of knowledge reuse to propagate pose semantic information between consecutive frames. In this way, the proposed RPSTN captures the temporal dynamics of poses. Spatially, the JRE module can infer invisible joints from their relationships to visible joints; temporally, the model can transfer pose semantic features from non-occluded frames to occluded frames to locate occluded joints. Our method is therefore robust to occlusion and achieves state-of-the-art results on two challenging datasets, demonstrating its effectiveness for video-based human pose estimation. We will release the code and models publicly.
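    A toy stand-in for the relation idea, not the paper's actual JRE design: joint heatmaps are refined by a learned joint-to-joint mixing applied per spatial location, so an occluded joint's heatmap can borrow evidence from visible joints; temporal reuse is mimicked by blending in the previous frame's heatmaps. Joint count and blend weights are assumptions.

```python
import torch
import torch.nn as nn

class PairwiseJointRelation(nn.Module):
    """Refine joint heatmaps with a learned mixing across the joint channel."""
    def __init__(self, num_joints: int = 17):
        super().__init__()
        # 1x1 conv over the joint channel = per-location joint-to-joint relation weights.
        self.relation = nn.Conv2d(num_joints, num_joints, kernel_size=1)

    def forward(self, heatmaps: torch.Tensor) -> torch.Tensor:
        # heatmaps: (B, J, H, W) independently predicted joint heatmaps
        return heatmaps + self.relation(heatmaps)  # residual refinement

# Temporal knowledge reuse in the same spirit: warm-start the current frame's
# heatmaps with the previous frame's refined ones before relation refinement.
prev = torch.rand(1, 17, 64, 48)
curr = torch.rand(1, 17, 64, 48)
jre = PairwiseJointRelation()
refined_curr = jre(0.5 * curr + 0.5 * prev)  # (1, 17, 64, 48)
```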