
    Cascaded 3D Full-body Pose Regression from Single Depth Image at 100 FPS

    Real-time live applications in virtual reality increasingly depend on capturing and retargeting 3D human pose, but it is still challenging to estimate accurate 3D pose from consumer imaging devices such as depth cameras. This paper presents a novel cascaded 3D full-body pose regression method that estimates accurate pose from a single depth image at 100 fps. The key idea is to train cascaded regressors based on the Gradient Boosting algorithm on a pre-recorded human motion capture database. By incorporating a hierarchical kinematic model of the human pose into the learning procedure, we can directly estimate accurate 3D joint angles instead of joint positions. The main advantage of this model is that bone lengths are preserved throughout the 3D pose estimation procedure, which leads to more effective features and higher pose estimation accuracy. Our method can also serve as an initialization procedure when combined with tracking methods. We demonstrate the power of our method on a wide range of synthesized human motion data from the CMU mocap database, the Human3.6M dataset, and real human movement data captured in real time. In comparison against previous 3D pose estimation methods and commercial systems such as Kinect 2017, we achieve state-of-the-art accuracy.
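    As a rough illustration of the cascaded, kinematics-aware idea described above (not the paper's implementation), the sketch below shows how joint angles can be refined stage by stage while bone lengths stay fixed by construction. The forward-kinematics convention, the feature callback depth_feats_fn, and the per-stage regressors are hypothetical placeholders.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def forward_kinematics(angles, bone_lengths, parents):
    # angles: (J, 3) Euler angles in radians; bone_lengths: (J,); parents: parent index per joint (-1 for root).
    # Joint positions are derived from fixed bone lengths, so bone lengths are preserved by construction.
    n = len(parents)
    pos = np.zeros((n, 3))
    rots = [np.eye(3)] * n
    for j in range(n):
        R = Rotation.from_euler("xyz", angles[j]).as_matrix()
        p = parents[j]
        if p < 0:
            rots[j] = R
            continue
        rots[j] = rots[p] @ R
        # place joint j at the end of a fixed-length bone along the parent's local Y axis (illustrative convention)
        pos[j] = pos[p] + rots[p] @ np.array([0.0, bone_lengths[j], 0.0])
    return pos

def cascaded_regression(stages, depth_feats_fn, init_angles, bone_lengths, parents):
    # stages: trained regressors (e.g. gradient-boosted tree ensembles), each mapping pose-indexed
    # depth features to an angle update; depth_feats_fn: caller-supplied feature extractor.
    angles = init_angles.copy()
    for stage in stages:
        joints = forward_kinematics(angles, bone_lengths, parents)
        feats = depth_feats_fn(joints)                       # features computed at the current pose estimate
        update = stage.predict(feats.reshape(1, -1))[0]
        angles = angles + np.asarray(update).reshape(angles.shape)
    return angles, forward_kinematics(angles, bone_lengths, parents)
```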

    V2V-PoseNet: Voxel-to-Voxel Prediction Network for Accurate 3D Hand and Human Pose Estimation from a Single Depth Map

    Most existing deep learning-based methods for 3D hand and human pose estimation from a single depth map follow a common framework: take a 2D depth map and directly regress the 3D coordinates of keypoints, such as hand or human body joints, via 2D convolutional neural networks (CNNs). The first weakness of this approach is the perspective distortion in the 2D depth map: although a depth map is intrinsically 3D data, many previous methods treat it as a 2D image, which can distort the shape of the actual object through the projection from 3D to 2D space. This compels the network to perform perspective-distortion-invariant estimation. The second weakness is that directly regressing 3D coordinates from a 2D image is a highly non-linear mapping, which makes the learning procedure difficult. To overcome these weaknesses, we cast the 3D hand and human pose estimation problem from a single depth map into a voxel-to-voxel prediction that uses a 3D voxelized grid and estimates the per-voxel likelihood of each keypoint. We design our model as a 3D CNN that provides accurate estimates while running in real time. Our system outperforms previous methods on almost all publicly available 3D hand and human pose estimation datasets and placed first in the HANDS 2017 frame-based 3D hand pose estimation challenge. The code is available at https://github.com/mks0601/V2V-PoseNet_RELEASE.
    Comment: HANDS 2017 Challenge Frame-based 3D Hand Pose Estimation Winner (ICCV 2017); published at CVPR 2018.
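    To make the voxel-to-voxel formulation concrete, here is a minimal, hypothetical sketch of the two bookkeeping steps around such a network: binning a depth map into an occupancy grid as input, and reading keypoints back out of per-voxel likelihood volumes. Grid resolution, cube size, and function names are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

def depth_to_voxels(depth, fx, fy, cx, cy, cube_center, cube_size=250.0, res=88):
    # Back-project a depth map (in mm) to 3D points and bin them into a binary occupancy grid
    # centred on the target (hand or body). res and cube_size are illustrative values.
    v, u = np.nonzero(depth > 0)
    z = depth[v, u].astype(np.float64)
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    pts = np.stack([x, y, z], axis=1) - np.asarray(cube_center)
    idx = ((pts / cube_size + 0.5) * res).astype(int)        # map [-cube/2, cube/2] to [0, res)
    keep = np.all((idx >= 0) & (idx < res), axis=1)
    grid = np.zeros((res, res, res), dtype=np.float32)
    grid[idx[keep, 0], idx[keep, 1], idx[keep, 2]] = 1.0
    return grid

def heatmaps_to_keypoints(heatmaps, cube_center, cube_size=250.0):
    # heatmaps: (K, res, res, res) per-voxel likelihoods from the 3D CNN; take the argmax voxel
    # of each volume and map it back to metric coordinates.
    K, res = heatmaps.shape[0], heatmaps.shape[1]
    flat_idx = heatmaps.reshape(K, -1).argmax(axis=1)
    coords = np.stack(np.unravel_index(flat_idx, (res, res, res)), axis=1).astype(np.float64)
    return (coords / res - 0.5) * cube_size + np.asarray(cube_center)
```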

    Scene-aware Egocentric 3D Human Pose Estimation

    Egocentric 3D human pose estimation with a single head-mounted fisheye camera has recently attracted attention due to its numerous applications in virtual and augmented reality. Existing methods still struggle with challenging poses in which the human body is heavily occluded or closely interacting with the scene. To address this issue, we propose a scene-aware egocentric pose estimation method that guides the prediction of the egocentric pose with scene constraints. To this end, we propose an egocentric depth estimation network that predicts the scene depth map from a wide-view egocentric fisheye camera while mitigating occlusion of the human body with a depth-inpainting network. Next, we propose a scene-aware pose estimation network that projects the 2D image features and the estimated depth map of the scene into a voxel space and regresses the 3D pose with a V2V network. The voxel-based feature representation provides a direct geometric connection between 2D image features and scene geometry, and further helps the V2V network constrain the predicted pose based on the estimated scene geometry. To enable training of these networks, we also generate a synthetic dataset, called EgoGTA, and an in-the-wild dataset based on EgoPW, called EgoPW-Scene. Experimental results on our new evaluation sequences show that the predicted 3D egocentric poses are accurate and physically plausible in terms of human-scene interaction, demonstrating that our method outperforms state-of-the-art methods both quantitatively and qualitatively.
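    The "project 2D image features into a voxel space" step can be sketched as below. This is an assumption-laden illustration in PyTorch, with a generic project_fn standing in for the fisheye camera model; it is not the authors' code.

```python
import torch
import torch.nn.functional as F

def lift_features_to_voxels(feat2d, voxel_centers, project_fn):
    # feat2d: (C, H, W) 2D image features; voxel_centers: (D, D, D, 3) voxel centres in camera space;
    # project_fn: camera model mapping (N, 3) points to (N, 2) pixel coordinates (stand-in for the
    # fisheye projection). Returns a (C, D, D, D) feature volume suitable for a V2V-style 3D CNN.
    C, H, W = feat2d.shape
    D = voxel_centers.shape[0]
    uv = project_fn(voxel_centers.reshape(-1, 3))             # (D^3, 2) pixel coordinates
    grid = torch.stack([uv[:, 0] / (W - 1) * 2 - 1,           # normalise to [-1, 1] for grid_sample
                        uv[:, 1] / (H - 1) * 2 - 1], dim=-1)
    grid = grid.view(1, 1, -1, 2)
    sampled = F.grid_sample(feat2d.unsqueeze(0), grid, align_corners=True)  # (1, C, 1, D^3)
    return sampled.view(C, D, D, D)
```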

    Hybrid One-Shot 3D Hand Pose Estimation by Exploiting Uncertainties

    Model-based approaches to 3D hand tracking have been shown to perform well in a wide range of scenarios. However, they require initialisation and cannot recover easily from tracking failures that occur due to fast hand motions. Data-driven approaches, on the other hand, can quickly deliver a solution, but the results often suffer from lower accuracy or missing anatomical validity compared to those obtained from model-based approaches. In this work we propose a hybrid approach for hand pose estimation from a single depth image. First, a learned regressor is employed to deliver multiple initial hypotheses for the 3D position of each hand joint. Subsequently, the kinematic parameters of a 3D hand model are found by deliberately exploiting the inherent uncertainty of the inferred joint proposals. This way, the method provides anatomically valid and accurate solutions without requiring manual initialisation or suffering from track losses. Quantitative results on several standard datasets demonstrate that the proposed method outperforms state-of-the-art representatives of the model-based, data-driven and hybrid paradigms.
    Comment: BMVC 2015 (oral); see also http://lrs.icg.tugraz.at/research/hybridhape
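    One simple way to phrase "fit a kinematic model while exploiting the uncertainty of multiple joint proposals" is a robust model-fitting objective over the hypotheses. The sketch below is a hypothetical simplification of that idea, not the paper's exact energy; forward_kinematics and the confidence weighting are placeholders.

```python
import numpy as np
from scipy.optimize import minimize

def fit_hand_model(proposals, confidences, forward_kinematics, init_params):
    # proposals:   (J, K, 3) K candidate 3D positions per joint from the learned regressor
    # confidences: (J, K)    confidence of each candidate
    # forward_kinematics: params -> (J, 3) joint positions of the kinematic hand model
    def objective(params):
        joints = forward_kinematics(params)                              # (J, 3)
        d = np.linalg.norm(proposals - joints[:, None, :], axis=-1)      # (J, K)
        # each joint attaches to whichever hypothesis it can explain most cheaply;
        # dividing by the confidence makes trusted proposals cheaper to match
        return np.sum(np.min(d / (confidences + 1e-6), axis=1))
    res = minimize(objective, init_params, method="Powell")              # derivative-free search
    return res.x
```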

    An investigation into image-based indoor localization using deep learning

    Localization is a fundamental technology for many applications such as location-based services (LBS), robotics, virtual reality (VR), autonomous driving, and pedestrian navigation. Traditional methods based on wireless signals and inertial measurement units (IMUs) have inherent disadvantages that limit their applications. Image-based localization methods are promising supplements, but their application in indoor scenarios faces many challenges. Compared to outdoor environments, indoor scenes are more dynamic, which complicates map construction, and they tend to resemble one another, which makes it difficult to distinguish places with similar appearance. In addition, how to exploit widely available 3D indoor structures to enhance localization performance remains largely unexplored. Deep learning techniques have achieved significant progress in many computer vision tasks such as image classification, object detection, and monocular depth prediction, but their application to indoor image-based localization has not yet been well studied. In this thesis, we investigate image-based indoor localization through deep learning techniques. We study the problem from two perspectives: topological localization, which obtains a coarse location, and metric localization, which provides an accurate pose comprising both position and orientation. We also study indoor image localization with the assistance of 3D maps, taking advantage of the many 3D maps available for indoor scenes. We make the following contributions.
    Our first contribution is an indoor topological localization framework inspired by the human self-localization strategy. In this framework, we propose a novel topological map representation that is robust to environmental changes. Unlike previous topological maps, which are constructed by dividing the indoor scene geometrically and represent each region by features aggregated over the whole region, our topological map is constructed from fixed indoor elements, and each node is represented by its semantic attributes. We also devise an effective landmark detector to extract semantic information about objects of interest from smart-phone video, and present a new localization algorithm that matches the detected semantic landmark sequence against the proposed semantic topological map using semantic and contextual information. Experiments on two test sites show that the landmark detector detects landmarks accurately and that the localization algorithm localizes accurately.
    The second contribution is a direct learning-based method using convolutional neural networks (CNNs) to exploit relative geometry constraints between images for image-based metric localization. We develop a new convolutional neural network that predicts the global poses of two images and their relative pose simultaneously. This multi-task learning strategy allows mutual regularization between global pose regression and relative pose regression. Furthermore, we design a new loss function that embeds relative pose information to distinguish the poses of similar images taken at different locations. Extensive experiments on two image localization benchmarks show that the proposed method achieves state-of-the-art performance compared with other learning-based methods.
    Our third contribution is a single-image localization framework in a 3D map; to the best of our knowledge, it is the first approach to localize a single image in a 3D map. The framework includes four main steps: pose initialization, depth inference, local map extraction, and pose correction. The pose initialization step estimates a coarse pose with a learning-based pose regression approach. The depth inference step predicts a dense depth map from the single image. The local map extraction step extracts a local map from the global 3D map to increase efficiency. Given the local map and the generated point cloud, the Iterative Closest Point (ICP) algorithm aligns the point cloud to the local map and computes a correction to the coarse pose. Since the key to the method is accurately predicting depth from images, we propose a novel 3D-map-guided single-image depth prediction approach that uses the RGB image to estimate a dense depth map and employs the 3D map to guide the estimation. We show that the new method significantly outperforms current RGB-image-based depth estimation methods on both indoor and outdoor datasets, and that using its predicted depth maps for single indoor image localization improves both position and orientation accuracy over state-of-the-art methods.
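    The four-step pipeline of the third contribution (coarse pose, depth inference, local map extraction, ICP correction) can be outlined roughly as below. This is a hedged sketch using Open3D's point-to-point ICP; the function name, thresholds, and data layout are illustrative assumptions rather than the thesis implementation.

```python
import numpy as np
import open3d as o3d

def refine_pose_with_map(depth_pred, K, coarse_pose, map_points, radius=5.0):
    # depth_pred:  (H, W) predicted depth in metres; K: (3, 3) camera intrinsics;
    # coarse_pose: (4, 4) camera-to-world pose from the learned pose regressor;
    # map_points:  (N, 3) points of the global 3D map in world coordinates.

    # 1. back-project the predicted depth map into a camera-frame point cloud
    H, W = depth_pred.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    z = depth_pred.reshape(-1)
    valid = z > 0
    x = (u.reshape(-1) - K[0, 2]) * z / K[0, 0]
    y = (v.reshape(-1) - K[1, 2]) * z / K[1, 1]
    cloud = np.stack([x, y, z], axis=1)[valid]

    # 2. extract a local map around the coarse camera position (efficiency)
    center = coarse_pose[:3, 3]
    local = map_points[np.linalg.norm(map_points - center, axis=1) < radius]

    # 3. ICP: align the point cloud to the local map, starting from the coarse pose
    src = o3d.geometry.PointCloud(); src.points = o3d.utility.Vector3dVector(cloud)
    tgt = o3d.geometry.PointCloud(); tgt.points = o3d.utility.Vector3dVector(local)
    result = o3d.pipelines.registration.registration_icp(
        src, tgt, max_correspondence_distance=0.1, init=coarse_pose,
        estimation_method=o3d.pipelines.registration.TransformationEstimationPointToPoint())

    # 4. the refined transformation is the corrected camera-to-world pose
    return result.transformation
```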