Learning Monocular 3D Human Pose Estimation from Multi-view Images
Accurate 3D human pose estimation from single images is possible with
sophisticated deep-net architectures that have been trained on very large
datasets. However, this still leaves open the problem of capturing motions for
which no such database exists. Manual annotation is tedious, slow, and
error-prone. In this paper, we propose to replace most of the annotations by
the use of multiple views, at training time only. Specifically, we train the
system to predict the same pose in all views. Such a consistency constraint is
necessary but not sufficient to predict accurate poses. We therefore complement
it with a supervised loss aiming to predict the correct pose in a small set of
labeled images, and with a regularization term that penalizes drift from
initial predictions. Furthermore, we propose a method to estimate camera pose
jointly with human pose, which lets us utilize multi-view footage where
calibration is difficult, e.g., for pan-tilt or moving handheld cameras. We
demonstrate the effectiveness of our approach on established benchmarks, as
well as on a new Ski dataset with rotating cameras and expert ski motion, for
which annotations are truly hard to obtain.
Comment: CVPR 2018, Ski-Pose PTZ-Camera Dataset availabl
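The three training terms named in this abstract (cross-view consistency, supervision on a small labeled set, and a penalty on drift from initial predictions) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the weights `w_sup` and `w_reg`, the use of squared distances, and the assumption that all per-view poses are already expressed in one shared coordinate frame are hypothetical choices made for illustration.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two 3D points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def multiview_training_loss(pred, init, label=None, w_sup=1.0, w_reg=0.1):
    """Combine consistency, drift, and (optional) supervised terms.

    pred, init: nested lists indexed [view][joint] -> (x, y, z), assumed to be
                in a shared coordinate frame (an assumption of this sketch).
    label:      optional list indexed [joint] -> (x, y, z) for labeled frames.
    """
    num_views, num_joints = len(pred), len(pred[0])
    consistency = drift = supervised = 0.0
    for j in range(num_joints):
        # Mean position of joint j across all views.
        mean_joint = [
            sum(pred[v][j][d] for v in range(num_views)) / num_views
            for d in range(3)
        ]
        for v in range(num_views):
            # Consistency: each view should agree with the cross-view mean.
            consistency += squared_distance(pred[v][j], mean_joint)
            # Regularization: penalize drift from the initial predictions.
            drift += squared_distance(pred[v][j], init[v][j])
            # Supervision: only available for the small labeled subset.
            if label is not None:
                supervised += squared_distance(pred[v][j], label[j])
    total = consistency + w_reg * drift + w_sup * supervised
    return total / (num_views * num_joints)
```

With identical predictions in every view and no labels, the loss is zero regardless of whether the pose is correct, which is one way to see why the abstract calls consistency necessary but not sufficient.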
3D Human Pose and Shape Estimation Based on Parametric Model and Deep Learning
3D human body reconstruction from monocular images has wide applications, such as movies, animation, virtual/augmented reality, and medical research. Because the human body has many degrees of freedom in real scenes and inferring 3D objects from 2D images is ambiguous, accurately recovering 3D human body models from images is a challenging task. In this thesis, we explore methods for estimating 3D human body models from images based on a parametric model and deep learning. In the first part, coarse 3D human body models are estimated automatically from multi-view images using a parametric human body model called SMPL. Two routes are explored for estimating the pose and shape parameters of the SMPL model: (1) optimization-based methods and (2) deep-learning-based methods. For the optimization-based methods, we propose novel energy functions based on prior information, including 2D joint points and silhouettes. Minimizing these energy functions fits the SMPL model to the prior information and yields the coarse 3D human body. In addition to the traditional optimization-based methods, a deep-learning-based method is proposed to regress the pose and shape parameters of the SMPL model. A novel architecture places the optimization inside the training loop of a convolutional neural network (CNN), forming a self-supervised structure based on the multi-view images. The proposed methods are evaluated on both synthetic and real datasets, demonstrating that they estimate the pose and shape of the 3D human body better than previous approaches. In the second part, the focus shifts to detailed 3D human body reconstruction from multi-view images. Instead of the SMPL model, an implicit function is used to represent 3D models, because an implicit representation can generate a continuous surface and is more flexible with respect to arbitrary topology.
First, a method based on multi-scale features is proposed to learn the implicit representation of 3D models from multi-view images through multi-stage hourglass networks. Furthermore, a coarse-to-fine method is proposed to refine the 3D models by learning voxel super-resolution. In this method, the coarse 3D models are first estimated by the learned implicit function based on multi-scale features from multi-view images. Then, after the coarse 3D models are voxelized into low-resolution voxel grids, voxel super-resolution is learned with a multi-stage 3D CNN that extracts features from the low-resolution voxel grids and a fully connected neural network that predicts the implicit function. Voxel super-resolution is able to remove false reconstructions and preserve surface details. The proposed methods are evaluated on both real and synthetic datasets, on which our method estimates 3D models with higher accuracy and better surface quality than some previous methods.
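The voxelization step mentioned in this abstract, turning a shape described by an implicit function into a low-resolution voxel grid, can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, a simple analytic sphere stands in for the learned implicit function, and the sign convention (values at or below zero count as inside) is chosen here and is not taken from the thesis.

```python
import math


def voxelize(implicit_fn, resolution, bound=1.0):
    """Sample an implicit function at voxel centers in [-bound, bound]^3.

    Returns a nested list grid[i][j][k] of booleans, True where the
    function value is <= 0 (treated as "inside" in this sketch).
    """
    step = 2.0 * bound / resolution
    centers = [-bound + (i + 0.5) * step for i in range(resolution)]
    return [
        [[implicit_fn((x, y, z)) <= 0.0 for z in centers] for y in centers]
        for x in centers
    ]


def sphere_sdf(point, radius=0.5):
    """Stand-in implicit function: signed distance to a sphere at the origin."""
    return math.sqrt(sum(c * c for c in point)) - radius


# An 8x8x8 occupancy grid of the stand-in sphere.
coarse_grid = voxelize(sphere_sdf, resolution=8)
```

A super-resolution stage as described in the abstract would take a grid like `coarse_grid` as input; that learned part is not sketched here.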
Deep Reinforcement Learning for Active Human Pose Estimation
Most 3D human pose estimation methods assume that input -- be it images of a
scene collected from one or several viewpoints, or from a video -- is given.
Consequently, they focus on estimates leveraging prior knowledge and
measurement by fusing information spatially and/or temporally, whenever
available. In this paper we address the problem of an active observer with
freedom to move and explore the scene spatially -- in 'time-freeze' mode --
and/or temporally, by selecting informative viewpoints that improve its
estimation accuracy. Towards this end, we introduce Pose-DRL, a fully trainable
deep reinforcement learning-based active pose estimation architecture which
learns to select appropriate views, in space and time, to feed an underlying
monocular pose estimator. We evaluate our model using single- and multi-target
estimators with strong results in both settings. Our system further learns
automatic stopping conditions in time and transition functions to the next
temporal processing step in videos. In extensive experiments with the Panoptic
multi-view setup, and for complex scenes containing multiple people, we show
that our model learns to select viewpoints that yield significantly more
accurate pose estimates compared to strong multi-view baselines.
Comment: Accepted to The Thirty-Fourth AAAI Conference on Artificial
Intelligence (AAAI-20). Submission updated to include supplementary materia
In the Wild Human Pose Estimation Using Explicit 2D Features and Intermediate 3D Representations
Convolutional Neural Network based approaches for monocular 3D human pose
estimation usually require a large amount of training images with 3D pose
annotations. While it is feasible to provide 2D joint annotations for large
corpora of in-the-wild images with humans, providing accurate 3D annotations to
such in-the-wild corpora is hardly feasible in practice. Most existing 3D
labelled data sets are either synthetically created or feature in-studio
images. 3D pose estimation algorithms trained on such data often have limited
ability to generalize to real world scene diversity. We therefore propose a new
deep learning based method for monocular 3D human pose estimation that shows
high accuracy and generalizes better to in-the-wild scenes. It has a network
architecture that comprises a new disentangled hidden space encoding of
explicit 2D and 3D features, and uses supervision by a new learned projection
model from predicted 3D pose. Our algorithm can be jointly trained on image
data with 3D labels and image data with only 2D labels. It achieves
state-of-the-art accuracy on challenging in-the-wild data.
Comment: Accepted to CVPR 201
Single-Shot Multi-Person 3D Pose Estimation From Monocular RGB
We propose a new single-shot method for multi-person 3D pose estimation in
general scenes from a monocular RGB camera. Our approach uses novel
occlusion-robust pose-maps (ORPM) which enable full body pose inference even
under strong partial occlusions by other people and objects in the scene. ORPM
outputs a fixed number of maps which encode the 3D joint locations of all
people in the scene. Body part associations allow us to infer 3D pose for an
arbitrary number of people without explicit bounding box prediction. To train
our approach we introduce MuCo-3DHP, the first large scale training data set
showing real images of sophisticated multi-person interactions and occlusions.
We synthesize a large corpus of multi-person images by compositing images of
individual people (with ground truth from multi-view performance capture). We
evaluate our method on our new challenging 3D annotated multi-person test set
MuPoTs-3D where we achieve state-of-the-art performance. To further stimulate
research in multi-person 3D pose estimation, we will make our new datasets and
associated code publicly available for research purposes.
Comment: International Conference on 3D Vision (3DV), 201
MonoPerfCap: Human Performance Capture from Monocular Video
We present the first marker-less approach for temporally coherent 3D
performance capture of a human with general clothing from monocular video. Our
approach reconstructs articulated human skeleton motion as well as medium-scale
non-rigid surface deformations in general scenes. Human performance capture is
a challenging problem due to the large range of articulation, potentially fast
motion, and considerable non-rigid deformations, even from multi-view data.
Reconstruction from monocular video alone is drastically more challenging,
since strong occlusions and the inherent depth ambiguity lead to a highly
ill-posed reconstruction problem. We tackle these challenges by a novel
approach that employs sparse 2D and 3D human pose detections from a
convolutional neural network using a batch-based pose estimation strategy.
Joint recovery of per-batch motion allows us to resolve the ambiguities of the
monocular reconstruction problem based on a low dimensional trajectory
subspace. In addition, we propose refinement of the surface geometry based on
fully automatically extracted silhouettes to enable medium-scale non-rigid
alignment. We demonstrate state-of-the-art performance capture results that
enable exciting applications such as video editing and free viewpoint video,
previously infeasible from monocular video. Our qualitative and quantitative
evaluation demonstrates that our approach significantly outperforms previous
monocular methods in terms of accuracy, robustness and scene complexity that
can be handled.
Comment: Accepted to ACM TOG 2018, to be presented at SIGGRAPH 201
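The idea of constraining per-batch motion to a low dimensional trajectory subspace can be illustrated with a small sketch. The abstract does not say which basis is used; the discrete cosine transform (DCT) basis below is an assumption of this sketch, as are the function names. Each coordinate of each joint over a batch of frames is treated as a 1D trajectory and projected onto the first few basis vectors, which keeps only smooth motion.

```python
import math


def dct_basis(num_frames, num_coeffs):
    """Return the first num_coeffs orthonormal DCT-II basis vectors.

    basis[k][t] is the value of basis vector k at frame t.
    """
    basis = []
    for k in range(num_coeffs):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / num_frames)
        basis.append([
            scale * math.cos(math.pi * (t + 0.5) * k / num_frames)
            for t in range(num_frames)
        ])
    return basis


def project_trajectory(traj, num_coeffs):
    """Project a 1D trajectory (one value per frame) onto the subspace."""
    basis = dct_basis(len(traj), num_coeffs)
    # Subspace coefficients: inner product with each basis vector.
    coeffs = [sum(b * x for b, x in zip(row, traj)) for row in basis]
    # Reconstruct the trajectory from the retained coefficients only.
    return [
        sum(c * row[t] for c, row in zip(coeffs, basis))
        for t in range(len(traj))
    ]
```

Using fewer coefficients than frames reduces the number of unknowns per batch, which is the sense in which a subspace can help with an ill-posed monocular problem; using as many coefficients as frames reproduces the input trajectory.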
Recurrent 3D Pose Sequence Machines
3D human articulated pose recovery from monocular image sequences is very
challenging due to diverse appearances, viewpoints, and occlusions, and because
human 3D pose is inherently ambiguous in monocular imagery. It is
thus critical to exploit rich spatial and temporal long-range dependencies
among body joints for accurate 3D pose sequence prediction. Existing approaches
usually manually design some elaborate prior terms and human body kinematic
constraints for capturing structures, which are often insufficient to exploit
all intrinsic structures and not scalable for all scenarios. In contrast, this
paper presents a Recurrent 3D Pose Sequence Machine (RPSM) to automatically
learn the image-dependent structural constraint and sequence-dependent temporal
context by using a multi-stage sequential refinement. At each stage, our RPSM
is composed of three modules to predict the 3D pose sequences based on the
previously learned 2D pose representations and 3D poses: (i) a 2D pose module
extracting the image-dependent pose representations, (ii) a 3D pose recurrent
module regressing 3D poses, and (iii) a feature adaption module serving as a
bridge between modules (i) and (ii) to enable the representation transformation
from the 2D to the 3D domain. These three modules are then assembled into a sequential
prediction framework to refine the predicted poses with multiple recurrent
stages. Extensive evaluations on the Human3.6M dataset and HumanEva-I dataset
show that our RPSM outperforms all state-of-the-art approaches for 3D pose
estimation.
Comment: Published in CVPR 201