137 research outputs found
Learning to Estimate 3D Human Pose and Shape from a Single Color Image
This work addresses the problem of estimating the full body 3D human pose and
shape from a single color image. This is a task where iterative
optimization-based solutions have typically prevailed, while Convolutional
Networks (ConvNets) have suffered because of the lack of training data and
their low resolution 3D predictions. Our work aims to bridge this gap and
proposes an efficient and effective direct prediction method based on ConvNets.
Central to our approach is the incorporation of a parametric statistical
body shape model (SMPL) within our end-to-end framework. This allows us to
obtain very detailed 3D mesh results while requiring the estimation of only a
small number of parameters, making it amenable to direct network prediction.
Interestingly, we demonstrate that these parameters can be predicted reliably
only from 2D keypoints and masks. These are typical outputs of generic 2D human
analysis ConvNets, allowing us to relax the massive requirement that images
with 3D shape ground truth are available for training. Simultaneously, by
maintaining differentiability, at training time we generate the 3D mesh from
the estimated parameters and optimize explicitly for the surface using a 3D
per-vertex loss. Finally, a differentiable renderer is employed to project the
3D mesh to the image, which enables further refinement of the network, by
optimizing for the consistency of the projection with 2D annotations (i.e., 2D
keypoints or masks). The proposed approach outperforms previous baselines on
this task and offers an attractive solution for direct prediction of 3D shape
from a single color image.
Comment: CVPR 2018 Camera Ready
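The two supervision signals described in this abstract, a 3D per-vertex loss on the generated mesh and a 2D reprojection consistency term, can be illustrated with a minimal numpy sketch. The weak-perspective camera and all function names here are illustrative assumptions, not the paper's actual code:

```python
import numpy as np

def weak_perspective_project(vertices, scale, trans):
    """Weak-perspective projection of 3D points onto the image plane
    (an assumed camera model, common in this line of work)."""
    return scale * vertices[:, :2] + trans

def per_vertex_loss(pred_vertices, gt_vertices):
    """3D per-vertex loss: mean Euclidean distance between predicted
    and ground-truth mesh vertices."""
    return np.linalg.norm(pred_vertices - gt_vertices, axis=1).mean()

def reprojection_loss(pred_points_3d, gt_points_2d, scale, trans):
    """2D consistency term: distance between the projected 3D points
    and 2D annotations (e.g., keypoints)."""
    proj = weak_perspective_project(pred_points_3d, scale, trans)
    return np.linalg.norm(proj - gt_points_2d, axis=1).mean()
```

In the actual framework these losses are back-propagated through the differentiable SMPL layer and renderer; here they are plain functions to show the quantities being minimized.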
Monocular Human Pose and Shape Reconstruction using Part Differentiable Rendering
Superior human pose and shape reconstruction from monocular images depends on
removing the ambiguities caused by occlusions and shape variance. Recent
regression-based methods succeed by estimating parametric models directly
through a deep neural network supervised by 3D ground truth. However, 3D ground
truth is neither abundant nor efficient to obtain. In this paper,
we introduce body part segmentation as critical supervision. Part segmentation
not only indicates the shape of each body part but helps to infer the
occlusions among parts as well. To improve the reconstruction with part
segmentation, we propose a part-level differentiable renderer that enables
part-based models to be supervised by part segmentation in neural networks or
optimization loops. We also introduce a general parametric model engaged in the
rendering pipeline as an intermediate representation between skeletons and
detailed shapes, which consists of primitive geometries for better
interpretability. The proposed approach combines parameter regression, body
model optimization, and detailed model registration altogether. Experimental
results demonstrate that the proposed method achieves balanced evaluation on
pose and shape, and outperforms the state-of-the-art approaches on Human3.6M,
UP-3D and LSP datasets.
Comment: Accepted by Pacific Graphics 2020
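Supervising a model with part segmentation, as this abstract proposes, typically amounts to comparing rendered per-part masks against ground-truth part labels. A minimal numpy sketch of one plausible such objective, a soft-IoU loss averaged over body parts (an assumed loss form, not necessarily the paper's exact formulation):

```python
import numpy as np

def part_iou_loss(pred_masks, gt_masks, eps=1e-6):
    """Soft-IoU loss averaged over body parts.
    pred_masks, gt_masks: (P, H, W) arrays of per-part masks in [0, 1];
    P is the number of body parts. Lower is better, 0 = perfect overlap."""
    inter = (pred_masks * gt_masks).sum(axis=(1, 2))
    union = (pred_masks + gt_masks - pred_masks * gt_masks).sum(axis=(1, 2))
    iou = inter / (union + eps)
    return 1.0 - iou.mean()
```

Because each part contributes its own IoU term, the loss penalizes a part rendered in the wrong place even when the union silhouette looks correct, which is how part masks help disambiguate occlusions among parts.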
Analytical Derivatives for Differentiable Renderer: 3D Pose Estimation by Silhouette Consistency
Differentiable rendering is widely used in optimization-based 3D
reconstruction, which requires gradients from differentiable operations for
gradient-based optimization. Existing differentiable renderers obtain the
gradients of rendering via numerical techniques, which are of low accuracy and
efficiency.
Motivated by this fact, a differentiable mesh renderer with analytical
gradients is proposed. The main obstacle of rasterization based rendering being
differentiable is the discrete sampling operation. To make the rasterization
differentiable, the pixel intensity is defined as a double integral over the
pixel area and the integral is approximated by anti-aliasing with an average
filter. Then the analytical gradients with respect to the vertex coordinates
can be derived from the continuous definition of pixel intensity. To
demonstrate the effectiveness and efficiency of the proposed differentiable
renderer, experiments of 3D pose estimation by only multi-viewpoint silhouettes
were conducted. The experimental results show that 3D pose estimation without
3D or 2D joint supervision can produce competitive results both qualitatively
and quantitatively. The results also show that the proposed differentiable
renderer achieves higher accuracy and efficiency than previous differentiable
renderers.
Comment: 19 pages, 8 figures
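The averaging-filter idea in this abstract can be shown in one dimension: if a pixel's intensity is defined as its area covered by a half-plane whose boundary sits at `edge_pos`, the coverage is a continuous function of the edge position and its derivative is available in closed form. A minimal 1D sketch (illustrative, not the paper's renderer):

```python
import numpy as np

def pixel_intensity(edge_pos, px):
    """Coverage of pixel [px, px+1] by a region filled to the left of
    edge_pos, averaged over the pixel (a 1D box anti-aliasing filter).
    Continuous in edge_pos, unlike point sampling at the pixel center."""
    return float(np.clip(edge_pos - px, 0.0, 1.0))

def pixel_intensity_grad(edge_pos, px):
    """Analytical derivative of the coverage w.r.t. the edge position:
    1 while the edge crosses the pixel, 0 otherwise."""
    return 1.0 if px < edge_pos < px + 1 else 0.0
```

The full method extends this to a double integral over the 2D pixel area, giving analytical gradients with respect to vertex coordinates rather than the numerical finite differences used by earlier renderers.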
Single Image 3D Hand Reconstruction with Mesh Convolutions
Monocular 3D reconstruction of deformable objects, such as human body parts,
has been typically approached by predicting parameters of heavyweight linear
models. In this paper, we demonstrate an alternative solution that is based on
the idea of encoding images into a latent non-linear representation of meshes.
The prior on 3D hand shapes is learned by training an autoencoder with
intrinsic graph convolutions performed in the spectral domain. The pre-trained
decoder acts as a non-linear statistical deformable model. The latent
parameters that reconstruct the shape and articulated pose of hands in the
image are predicted using an image encoder. We show that our system
reconstructs plausible meshes and operates in real-time. We evaluate the
quality of the mesh reconstructions produced by the decoder on a new dataset
and show latent space interpolation results. Our code, data, and models will be
made publicly available.
Comment: Proceedings of the British Machine Vision Conference (BMVC 2019)
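The spectral-domain graph convolution underlying this autoencoder filters a per-vertex signal in the eigenbasis of the graph Laplacian. A minimal numpy sketch of that core operation (the filter parameterization is an illustrative assumption; the paper's network learns such filters end to end):

```python
import numpy as np

def graph_laplacian(adj):
    """Combinatorial graph Laplacian L = D - A for an adjacency matrix."""
    return np.diag(adj.sum(axis=1)) - adj

def spectral_conv(signal, adj, filter_coeffs):
    """Spectral graph convolution: transform the per-vertex signal into
    the Laplacian eigenbasis, scale each frequency component, transform
    back: x_out = U diag(g) U^T x.
    signal: (N, C); filter_coeffs: (N,) spectral filter."""
    eigvals, U = np.linalg.eigh(graph_laplacian(adj))
    return U @ (filter_coeffs[:, None] * (U.T @ signal))
```

With all filter coefficients equal to one the operation is the identity, since the eigenvectors form an orthonormal basis; learned coefficients instead smooth or sharpen the signal along the mesh graph.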
3D Human Mesh Regression with Dense Correspondence
Estimating a 3D mesh of the human body from a single 2D image is an important
task with many applications, such as augmented reality and human-robot
interaction. However, prior works reconstructed the 3D mesh from a global image
feature extracted by a convolutional neural network (CNN), where the dense
correspondences between the mesh surface and the image pixels are missing,
leading to suboptimal solutions. This paper proposes a model-free 3D human mesh
estimation framework, named DecoMR, which explicitly establishes the dense
correspondence between the mesh and the local image features in the UV space
(i.e., a 2D space used for texture mapping of 3D meshes). DecoMR first predicts
a pixel-to-surface dense correspondence map (i.e., an IUV image), with which we
transfer local features from the image space to the UV space. Then the
transferred local image features are processed in the UV space to regress a
location map, which is well aligned with transferred features. Finally we
reconstruct 3D human mesh from the regressed location map with a predefined
mapping function. We also observe that the existing discontinuous UV map is
unfriendly to network learning. Therefore, we propose a novel UV map that
maintains most of the neighboring relations of the original mesh surface.
Experiments demonstrate that our proposed local feature alignment and
continuous UV map outperform existing 3D mesh-based methods on multiple public
benchmarks. Code will be made available at
https://github.com/zengwang430521/DecoMR
Comment: To appear at CVPR 2020
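The feature-transfer step this abstract describes, moving local image features into UV space via the predicted correspondence map, can be sketched as a scatter operation. A minimal numpy version (the loop and averaging scheme are illustrative simplifications of what a real pipeline would do with a differentiable scatter):

```python
import numpy as np

def transfer_to_uv(image_feats, iuv, uv_size):
    """Scatter per-pixel image features into UV space using a dense
    pixel-to-surface correspondence map, averaging pixels that land in
    the same texel.
    image_feats: (H, W, C); iuv: (H, W, 2) with u, v in [0, 1)."""
    H, W, C = image_feats.shape
    uv_feats = np.zeros((uv_size, uv_size, C))
    counts = np.zeros((uv_size, uv_size, 1))
    us = np.clip((iuv[..., 0] * uv_size).astype(int), 0, uv_size - 1)
    vs = np.clip((iuv[..., 1] * uv_size).astype(int), 0, uv_size - 1)
    for y in range(H):
        for x in range(W):
            uv_feats[vs[y, x], us[y, x]] += image_feats[y, x]
            counts[vs[y, x], us[y, x]] += 1
    return uv_feats / np.maximum(counts, 1)
```

Once features live in UV space, every texel corresponds to a fixed point on the body surface, so a subsequent network can regress a per-texel 3D location map that is aligned with the transferred features by construction.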
A Deep Learning Approach for Multi-View Engagement Estimation of Children in a Child-Robot Joint Attention task
In this work we tackle the problem of child engagement estimation while
children freely interact with a robot in their room. We propose a deep
learning-based multi-view solution that takes advantage of recent developments
in human pose detection. We extract the child's pose from different RGB-D
cameras placed around the room, fuse the results, and feed them to a deep
neural network
trained for classifying engagement levels. The deep network contains a
recurrent layer, in order to exploit the rich temporal information contained in
the pose data. The resulting method outperforms a number of baseline
classifiers, and provides a promising tool for better automatic understanding
of a child's attitude, interest and attention while cooperating with a robot.
The goal is to integrate this model in next generation social robots as an
attention monitoring tool during various CRI tasks both for Typically Developed
(TD) children and children affected by autism (ASD).
Comment: 7 pages, 6 figures
Skeleton Transformer Networks: 3D Human Pose and Skinned Mesh from Single RGB Image
In this paper, we present Skeleton Transformer Networks (SkeletonNet), an
end-to-end framework that can predict not only 3D joint positions but also 3D
angular pose (bone rotations) of a human skeleton from a single color image.
This in turn allows us to generate skinned mesh animations. Here, we propose a
two-step regression approach. The first step regresses bone rotations in order
to obtain an initial solution by considering the skeleton structure. The second
step performs refinement based on a heatmap regressor using a 3D pose
representation called the cross heatmap, which stacks heatmaps of the xy and zy
coordinates. By training the network on the proposed 3D human pose dataset,
which comprises images annotated with 3D skeletal angular poses, we show
that SkeletonNet can predict a full 3D human pose (joint positions and bone
rotations) from a single in-the-wild image.
Comment: ACCV conference
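The cross-heatmap decoding described above can be sketched as reading a peak from each of the two planes and sharing the common y axis. A minimal numpy illustration (the argmax decoding and axis conventions are assumptions; real systems typically use a soft-argmax):

```python
import numpy as np

def joint_from_cross_heatmap(hm_xy, hm_zy):
    """Recover a 3D joint position from a pair of 2D heatmaps over the
    xy and zy planes (a 'cross heatmap'): take each plane's peak and
    average the y coordinate the two planes share.
    hm_xy is indexed [y, x]; hm_zy is indexed [y, z]."""
    y1, x = np.unravel_index(np.argmax(hm_xy), hm_xy.shape)
    y2, z = np.unravel_index(np.argmax(hm_zy), hm_zy.shape)
    y = (y1 + y2) / 2.0  # the two planes share the y axis
    return np.array([x, y, z], dtype=float)
```

Stacking two orthogonal planes in this way keeps the representation 2D (cheap for convolutional regressors) while still pinning down all three coordinates of each joint.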
Detailed Human Shape Estimation from a Single Image by Hierarchical Mesh Deformation
This paper presents a novel framework to recover detailed human body shapes
from a single image. It is a challenging task due to factors such as variations
in human shapes, body poses, and viewpoints. Prior methods typically attempt to
recover the human body shape using a parametric template that lacks surface
details; as such, the resulting body shape appears to be without clothing. In
this paper, we propose a novel learning-based framework that combines the
robustness of a parametric model with the flexibility of free-form 3D
deformation. We use deep neural networks to refine the 3D shape in a
Hierarchical Mesh Deformation (HMD) framework, utilizing the constraints from
body joints, silhouettes, and per-pixel shading information. We are able to
restore detailed human body shapes beyond skinned models. Experiments
demonstrate that our method has outperformed previous state-of-the-art
approaches, achieving better accuracy in terms of both 2D IoU and 3D metric
distance. The code is available at https://github.com/zhuhao-nju/hmd.git
Comment: CVPR 2019 Oral
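Of the three constraints this abstract lists, the per-pixel shading term is the least standard; under a Lambertian assumption it compares observed intensity against shading rendered from the refined surface normals. A minimal numpy sketch (the Lambertian model, single directional light, and uniform albedo are assumptions for illustration, not the paper's exact formulation):

```python
import numpy as np

def lambertian_shading(normals, light_dir, albedo):
    """Per-pixel Lambertian shading: albedo * max(n . l, 0).
    normals: (N, 3) unit surface normals; light_dir: (3,) light direction."""
    l = light_dir / np.linalg.norm(light_dir)
    return albedo * np.clip(normals @ l, 0.0, None)

def shading_loss(image_intensity, normals, light_dir, albedo):
    """Shading constraint: discrepancy between observed pixel intensity
    and the shading predicted from the current surface normals."""
    return np.mean(np.abs(image_intensity - lambertian_shading(normals, light_dir, albedo)))
```

Because shading responds to high-frequency normal changes that silhouettes and joints cannot see, this term is what lets the hierarchical deformation recover fine surface detail such as clothing wrinkles.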
Self-Supervised Human Depth Estimation from Monocular Videos
Previous methods on estimating detailed human depth often require supervised
training with `ground truth' depth data. This paper presents a self-supervised
method that can be trained on YouTube videos without known depth, which makes
training data collection simple and improves the generalization of the learned
network. The self-supervised learning is achieved by minimizing a
photo-consistency loss, which is evaluated between a video frame and its
neighboring frames warped according to the estimated depth and the 3D non-rigid
motion of the human body. To obtain this non-rigid motion, we first estimate a
rough SMPL model at each video frame and compute the non-rigid body motion
accordingly, which enables self-supervised learning on estimating the shape
details. Experiments demonstrate that our method enjoys better generalization
and performs much better on data in the wild.
Comment: Accepted by IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020
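The photo-consistency idea in this abstract, warping a neighboring frame into the reference view using the estimated depth and penalizing the difference, can be sketched for the rigid case. A minimal numpy version assuming a purely horizontal camera translation (the paper additionally models the body's non-rigid motion via SMPL, which this sketch omits):

```python
import numpy as np

def warp_by_depth(neighbor, depth, baseline, focal):
    """Warp a neighboring frame into the reference view using estimated
    depth: per-pixel disparity = focal * baseline / depth (rigid,
    horizontal-translation-only case, nearest-neighbor sampling).
    neighbor, depth: (H, W) arrays."""
    H, W = neighbor.shape
    xs = np.arange(W)
    warped = np.zeros_like(neighbor)
    for y in range(H):
        src = np.clip(np.round(xs + focal * baseline / depth[y]).astype(int), 0, W - 1)
        warped[y] = neighbor[y, src]
    return warped

def photo_consistency_loss(frame, neighbor, depth, baseline, focal):
    """Self-supervised loss: the warped neighbor should match the frame
    wherever the depth (and motion) estimates are correct."""
    return np.mean(np.abs(frame - warp_by_depth(neighbor, depth, baseline, focal)))
```

The key property is that no depth ground truth appears anywhere: only raw video frames and the current depth estimate enter the loss, which is what makes training on unlabeled YouTube videos possible.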
Convolutional Mesh Regression for Single-Image Human Shape Reconstruction
This paper addresses the problem of 3D human pose and shape estimation from a
single image. Previous approaches consider a parametric model of the human
body, SMPL, and attempt to regress the model parameters that give rise to a
mesh consistent with image evidence. This parameter regression has been a very
challenging task, with model-based approaches underperforming compared to
nonparametric solutions in terms of pose estimation. In our work, we propose to
relax this heavy reliance on the model's parameter space. We still retain the
topology of the SMPL template mesh, but instead of predicting model parameters,
we directly regress the 3D location of the mesh vertices. This is a heavy task
for a typical network, but our key insight is that the regression becomes
significantly easier using a Graph-CNN. This architecture allows us to
explicitly encode the template mesh structure within the network and leverage
the spatial locality the mesh has to offer. Image-based features are attached
to the mesh vertices and the Graph-CNN is responsible for processing them on the
mesh structure, while the regression target for each vertex is its 3D location.
Having recovered the complete 3D geometry of the mesh, if we still require a
specific model parametrization, this can be reliably regressed from the
vertex locations. We demonstrate the flexibility and effectiveness of our
proposed graph-based mesh regression by attaching different types of features
on the mesh vertices. In all cases, we outperform the comparable baselines
relying on model parameter regression, while we also achieve state-of-the-art
results among model-based pose estimation approaches.
Comment: To appear at CVPR 2019 (Oral Presentation). Project page: https://www.seas.upenn.edu/~nkolot/projects/cmr
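The per-vertex regression scheme this abstract describes, attaching image features to the template vertices and letting a Graph-CNN process them over the mesh structure, can be sketched with one normalized-adjacency graph convolution. A minimal numpy version (the tiny two-layer network, weight shapes, and offset-from-template parameterization are illustrative assumptions, not the paper's architecture):

```python
import numpy as np

def normalize_adjacency(adj):
    """Symmetric normalization D^{-1/2} (A + I) D^{-1/2}, the standard
    propagation matrix for graph convolutions."""
    a = adj + np.eye(adj.shape[0])
    d = 1.0 / np.sqrt(a.sum(axis=1))
    return a * d[:, None] * d[None, :]

def graph_conv(x, adj_norm, weight):
    """One graph-convolution layer: aggregate features over mesh
    neighbors, then apply a shared linear map and a ReLU."""
    return np.maximum(adj_norm @ x @ weight, 0.0)

def regress_vertices(image_feat, template_xyz, adj, w1, w_out):
    """Attach a global image feature to every template vertex and regress
    per-vertex 3D offsets with a tiny two-layer Graph-CNN (illustrative).
    template_xyz: (N, 3) template mesh; adj: (N, N) mesh adjacency."""
    n = template_xyz.shape[0]
    x = np.concatenate([template_xyz, np.repeat(image_feat[None], n, 0)], axis=1)
    a = normalize_adjacency(adj)
    h = graph_conv(x, a, w1)
    return template_xyz + a @ h @ w_out  # predicted vertex locations
```

Encoding the fixed mesh adjacency directly in the propagation matrix is what makes regressing thousands of vertex coordinates tractable: each vertex only has to refine its position relative to its neighbors rather than solve the whole regression independently.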
- …