It's all Relative: Monocular 3D Human Pose Estimation from Weakly Supervised Data
We address the problem of 3D human pose estimation from 2D input images using
only weakly supervised training data. Although supervised machine learning has
shown considerable success for 2D pose estimation, its application to 3D pose
estimation in real-world images is currently hampered by the lack of varied
training images with corresponding 3D poses. Most existing 3D pose estimation
algorithms train on data that has either been collected in carefully controlled
studio settings or has been generated synthetically. Instead, we take a
different approach, and propose a 3D human pose estimation algorithm that only
requires relative estimates of depth at training time. Such a training signal,
although noisy, can be collected easily from crowd annotators and is of
sufficient quality to enable successful training and evaluation of 3D pose
algorithms. Our results are competitive with fully supervised regression-based
approaches on the Human3.6M dataset, despite using significantly weaker
training data. Our proposed algorithm opens the door to using existing
widespread 2D datasets for 3D pose estimation by allowing fine-tuning with
noisy relative constraints, resulting in more accurate 3D poses.
Comment: BMVC 2018. Project page available at
http://www.vision.caltech.edu/~mronchi/projects/RelativePose
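Such ordinal supervision can be turned into a differentiable loss. Below is a minimal PyTorch sketch (not the authors' code) of a pairwise ranking loss over predicted joint depths; the label convention and the margin are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def relative_depth_loss(z, pairs, labels, margin=0.0):
    """Ordinal loss over crowd-annotated joint pairs (illustrative).

    z:      (B, J) predicted per-joint depths, any scale.
    pairs:  (P, 2) long tensor of joint index pairs (i, j).
    labels: (P,) in {+1, -1, 0}; +1 means joint i is annotated as
            closer than joint j, -1 the opposite, 0 roughly equal.
    """
    diff = z[:, pairs[:, 0]] - z[:, pairs[:, 1]]        # (B, P)
    lab = labels.to(diff.dtype).unsqueeze(0)            # (1, P)
    # Ranking term: penalize when the predicted ordering disagrees
    # with the annotation (the closer joint should have smaller depth).
    rank = F.softplus(lab * diff + margin)
    # Tie term: pull the two depths together when labelled "same".
    tie = diff.pow(2)
    return torch.where(lab != 0, rank, tie).mean()

# Example: 17 joints, one annotated pair saying joint 0 is closer than 5.
z = torch.randn(4, 17, requires_grad=True)
relative_depth_loss(z, torch.tensor([[0, 5]]), torch.tensor([1])).backward()
```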
In the Wild Human Pose Estimation Using Explicit 2D Features and Intermediate 3D Representations
Convolutional Neural Network-based approaches for monocular 3D human pose
estimation usually require a large number of training images with 3D pose
annotations. While it is feasible to provide 2D joint annotations for large
corpora of in-the-wild images with humans, providing accurate 3D annotations to
such in-the-wild corpora is hardly feasible in practice. Most existing 3D
labelled data sets are either synthetically created or feature in-studio
images. 3D pose estimation algorithms trained on such data often have limited
ability to generalize to real world scene diversity. We therefore propose a new
deep learning based method for monocular 3D human pose estimation that shows
high accuracy and generalizes better to in-the-wild scenes. Its network
architecture comprises a new disentangled hidden space encoding of explicit 2D
and 3D features, and it is supervised by a new learned projection model applied
to the predicted 3D pose. Our algorithm can be jointly trained on image data
with 3D labels and image data with only 2D labels. It achieves
state-of-the-art accuracy on challenging in-the-wild data.
Comment: Accepted to CVPR 2019
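The mixed 2D/3D training regime described above can be sketched as follows; the module names (`backbone`, `lifter`, `projector`) and the L1 losses are placeholders rather than the paper's architecture, with `projector` standing in for the learned projection model.

```python
import torch
import torch.nn as nn

class MixedSupervision(nn.Module):
    """Sketch: joint training on 3D-labelled and 2D-only images."""

    def __init__(self, backbone, lifter, projector):
        super().__init__()
        self.backbone = backbone      # image -> features
        self.lifter = lifter          # features -> 3D pose (B, J, 3)
        self.projector = projector    # learned projection: 3D -> 2D (B, J, 2)

    def forward(self, images, pose2d_gt, pose3d_gt=None):
        pose3d = self.lifter(self.backbone(images))
        pose2d = self.projector(pose3d)
        # The 2D reprojection loss is available for every image.
        loss = (pose2d - pose2d_gt).abs().mean()
        # The 3D loss applies only to images carrying 3D labels
        # (studio or synthetic data).
        if pose3d_gt is not None:
            loss = loss + (pose3d - pose3d_gt).abs().mean()
        return loss
```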
Adversarial 3D Human Pose Estimation via Multimodal Depth Supervision
In this paper, a novel deep-learning-based framework is proposed to infer 3D
human poses from a single image. Specifically, a two-phase approach is
developed. We first utilize a generator with two branches to extract explicit
and implicit depth information, respectively. During training, an adversarial
scheme is also employed to further improve performance. In the second phase,
the implicit and explicit depth information, together with the 2D joints
produced by a widely used estimator, are fed into a deep 3D pose regressor to
generate the final pose. Our method achieves an MPJPE of 58.68 mm on the
ECCV 2018 3D Human Pose Estimation Challenge.
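A rough sketch of that second phase under assumed tensor shapes (the abstract does not specify the branch output dimensions, so `D_EXPLICIT` and `D_IMPLICIT` below are hypothetical):

```python
import torch
import torch.nn as nn

J, D_EXPLICIT, D_IMPLICIT = 17, 17, 128   # assumed dimensions

# Stand-in for the deep 3D pose regressor of the second phase.
regressor = nn.Sequential(
    nn.Linear(2 * J + D_EXPLICIT + D_IMPLICIT, 1024),
    nn.ReLU(),
    nn.Linear(1024, 3 * J),
)

def second_phase(joints2d, depth_explicit, depth_implicit):
    """Fuse the 2D joints with both depth cues and regress 3D pose."""
    x = torch.cat([joints2d.flatten(1), depth_explicit, depth_implicit], dim=1)
    return regressor(x).view(-1, J, 3)

pose3d = second_phase(torch.randn(8, J, 2),
                      torch.randn(8, D_EXPLICIT),
                      torch.randn(8, D_IMPLICIT))   # (8, 17, 3)
```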
DenseBody: Directly Regressing Dense 3D Human Pose and Shape From a Single Color Image
Recovering 3D human body shape and pose from 2D images is a challenging task
due to the high complexity and flexibility of the human body and the relative
scarcity of 3D labeled data. Previous methods addressing these issues typically
rely on predicting intermediate results, such as body part segmentation, 2D/3D
joints, or silhouette masks, to decompose the problem into multiple sub-tasks
and thereby exploit more 2D labels. Most previous works incorporate a
parametric body shape model and predict its parameters in a low-dimensional
space to represent the human body. In this paper, we propose to directly
regress the 3D human mesh from a single color image using a Convolutional
Neural Network (CNN). We use an efficient representation of 3D human shape and
pose that can be predicted by an encoder-decoder neural network. The proposed
method achieves state-of-the-art performance on several 3D human body datasets,
including Human3.6M, SURREAL and UP-3D, with even faster running speed.
Comment: 10 pages, 6 figures
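A minimal encoder-decoder sketch of this kind of direct dense regression, where each output pixel stores a 3D body-surface coordinate (DenseBody's representation is a UV position map); the layer sizes below are placeholders, not the paper's architecture:

```python
import torch
import torch.nn as nn

class DenseRegressor(nn.Module):
    """Illustrative encoder-decoder regressing a dense position map:
    each output pixel stores a 3D body-surface coordinate (x, y, z)."""

    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1),
        )

    def forward(self, image):                      # image: (B, 3, H, W)
        return self.decoder(self.encoder(image))   # (B, 3, H, W) position map

pos_map = DenseRegressor()(torch.randn(1, 3, 256, 256))  # (1, 3, 256, 256)
```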
Not All Parts Are Created Equal: 3D Pose Estimation by Modelling Bi-directional Dependencies of Body Parts
Not all human body parts have the same degree of freedom (DOF), owing to
physiological structure. For example, the limbs move more flexibly and freely
than the torso does. Most existing 3D pose estimation methods, despite the very
promising results achieved, treat all body joints equally and consequently
often incur larger reconstruction errors on the limbs. In this paper, we
propose a progressive approach that explicitly accounts for the distinct DOFs
of the body parts. We model parts with higher DOFs, like the elbows, as
dependent components of the corresponding parts with lower DOFs, like the
torso, whose 3D locations can be estimated more reliably. Meanwhile, the
high-DOF parts may, in turn, impose constraints on where the low-DOF ones lie.
As a result, parts with different DOFs supervise one another, yielding
physically constrained and plausible pose estimates. To further facilitate the
prediction of the high-DOF parts, we introduce pose-attribute estimation, in
which the relative location of a limb joint with respect to the torso, the
part with the fewest DOFs, is explicitly estimated and fed to the
joint-estimation module. The proposed approach achieves very promising
results, outperforming the state of the art on several benchmarks.
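One way to picture this progressive, torso-first scheme is the sketch below; the joint split, feature dimension, and linear heads are all illustrative assumptions rather than the paper's network.

```python
import torch
import torch.nn as nn

TORSO, LIMBS = 6, 11   # illustrative joint split, not the paper's grouping

class ProgressiveLifter(nn.Module):
    """Sketch: estimate the reliable low-DOF torso joints first, then
    condition the high-DOF limb joints on them and on a pose attribute
    (the limbs' locations relative to the torso)."""

    def __init__(self, feat_dim=256):
        super().__init__()
        self.torso_head = nn.Linear(feat_dim, 3 * TORSO)
        self.attr_head = nn.Linear(feat_dim, 3 * LIMBS)   # limb-vs-torso offsets
        self.limb_head = nn.Linear(feat_dim + 3 * TORSO + 3 * LIMBS, 3 * LIMBS)

    def forward(self, feats):
        torso = self.torso_head(feats)
        attr = self.attr_head(feats)
        limbs = self.limb_head(torch.cat([feats, torso, attr], dim=1))
        return torso.view(-1, TORSO, 3), limbs.view(-1, LIMBS, 3)

torso, limbs = ProgressiveLifter()(torch.randn(2, 256))  # (2, 6, 3), (2, 11, 3)
```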
DF-Net: Unsupervised Joint Learning of Depth and Flow using Cross-Task Consistency
We present an unsupervised learning framework for simultaneously training
single-view depth prediction and optical flow estimation models using unlabeled
video sequences. Existing unsupervised methods often exploit brightness
constancy and spatial smoothness priors to train depth or flow models. In this
paper, we propose to leverage geometric consistency as additional supervisory
signals. Our core idea is that for rigid regions we can use the predicted scene
depth and camera motion to synthesize 2D optical flow by backprojecting the
induced 3D scene flow. The discrepancy between the rigid flow (from depth
prediction and camera motion) and the estimated flow (from optical flow model)
allows us to impose a cross-task consistency loss. While all the networks are
jointly optimized during training, they can be applied independently at test
time. Extensive experiments demonstrate that our depth and flow models compare
favorably with state-of-the-art unsupervised methods.
Comment: ECCV 2018. Project website: http://yuliang.vision/DF-Net/ Code:
https://github.com/vt-vl-lab/DF-Net
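The rigid-flow synthesis at the heart of the cross-task consistency loss can be written out directly; this is a minimal sketch (not the authors' code), assuming known intrinsics K and a 4x4 relative camera pose T from the motion network.

```python
import torch

def rigid_flow(depth, K, T):
    """Flow induced by camera motion over a rigid scene: backproject
    each pixel with its predicted depth, apply the estimated camera
    motion, reproject, and subtract the original pixel coordinates.

    depth: (B, H, W) predicted depth map.
    K:     (3, 3) camera intrinsics.
    T:     (B, 4, 4) relative camera pose (frame t -> t+1).
    """
    B, H, W = depth.shape
    ys, xs = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                            torch.arange(W, dtype=torch.float32), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)]).reshape(3, -1)  # (3, HW)
    rays = torch.linalg.inv(K) @ pix                                 # unit-depth rays
    pts = rays.unsqueeze(0) * depth.reshape(B, 1, -1)                # 3D points
    pts_h = torch.cat([pts, torch.ones(B, 1, H * W)], dim=1)         # homogeneous
    moved = (T @ pts_h)[:, :3]                                       # after motion
    proj = K @ moved
    uv = proj[:, :2] / proj[:, 2:].clamp(min=1e-6)                   # reprojected
    return (uv - pix[:2].unsqueeze(0)).reshape(B, 2, H, W)           # rigid flow
```

The discrepancy between this synthesized flow and the flow network's prediction over rigid regions is what supplies the cross-task consistency signal.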
Taskonomy: Disentangling Task Transfer Learning
Do visual tasks have a relationship, or are they unrelated? For instance,
could having surface normals simplify estimating the depth of an image?
Intuition answers these questions positively, implying the existence of a
structure among visual tasks. Knowing this structure has notable value; it is
the concept underlying transfer learning and provides a principled way to
identify redundancies across tasks, e.g., to seamlessly reuse supervision
among related tasks or to solve many tasks in one system without piling up
complexity.
We propose a fully computational approach for modeling the structure of the
space of visual tasks. This is done by finding (first- and higher-order)
transfer-learning dependencies across a dictionary of twenty-six 2D, 2.5D, 3D,
and semantic tasks in a latent space. The product is a computational taxonomic
map for task transfer learning. We study the consequences of this structure,
e.g., nontrivial emergent relationships, and exploit them to reduce the demand
for labeled data. For example, we show that the total number of labeled
datapoints needed for solving a set of 10 tasks can be reduced by roughly 2/3
(compared to training independently) while keeping the performance nearly the
same. We provide a set of tools for computing and probing this taxonomical
structure including a solver that users can employ to devise efficient
supervision policies for their use cases.
Comment: CVPR 2018 (Oral). See project website and live demos at
http://taskonomy.vision
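The first-order transfers behind the taxonomy can be probed with a simple readout experiment; the sketch below is illustrative (a linear readout rather than the paper's shallow transfer networks): fit a readout from frozen source-task features to target labels and record the achieved loss, then compare those losses across source tasks to obtain pairwise affinities.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def transfer_loss(src_feats, tgt_labels, epochs=100, lr=1e-2):
    """Fit a small readout from frozen source-task features to target
    labels; the achieved loss measures how useful the source
    representation is for the target (lower = stronger transfer)."""
    readout = nn.Linear(src_feats.shape[1], tgt_labels.shape[1])
    opt = torch.optim.Adam(readout.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = F.mse_loss(readout(src_feats), tgt_labels)
        loss.backward()
        opt.step()
    return float(loss.detach())

# Comparing transfer_loss(feats_from_task_A, labels_of_task_B) across all
# source tasks A gives one column of the task-affinity matrix.
```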
Patch-based 3D Human Pose Refinement
State-of-the-art 3D human pose estimation approaches typically estimate pose
from the entire RGB image in a single forward run. In this paper, we develop a
post-processing step to refine 3D human pose estimation from body part patches.
Using local patches as input has two advantages. First, the fine details
around body parts are zoomed in to high resolution for more precise 3D pose
prediction. Second, it enables part appearance to be shared across poses,
which benefits rare poses. To acquire an informative representation of the
patches, we explore different input modalities and validate the superiority of
fusing predicted segmentation with RGB. We show that our method consistently
boosts the accuracy of state-of-the-art 3D human pose methods.
Comment: Accepted by the CVPR 2019 Augmented Human: Human-centric Understanding
and 2D/3D Synthesis workshop and the third Look Into Person (LIP) Challenge
Workshop
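A sketch of the patch extraction and modality fusion step; the channel counts, patch size, and the assumption that the predicted joints lie inside the image are all illustrative.

```python
import torch
import torch.nn.functional as F

def crop_part_patches(rgb, seg, joints2d, size=32):
    """Crop a window around each estimated 2D joint, with the predicted
    segmentation stacked onto RGB as the fused input modality.

    rgb:      (B, 3, H, W) image.
    seg:      (B, 1, H, W) predicted segmentation (channel count assumed).
    joints2d: (B, J, 2) joint coordinates, assumed inside the image.
    """
    fused = torch.cat([rgb, seg], dim=1)                 # (B, 4, H, W)
    half = size // 2
    padded = F.pad(fused, (half, half, half, half))      # handle borders
    patches = []
    for b in range(fused.shape[0]):
        for x, y in joints2d[b].round().long():
            patches.append(padded[b, :, y:y + size, x:x + size])
    return torch.stack(patches)                          # (B*J, 4, size, size)
```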
Out of the Box: A combined approach for handling occlusion in Human Pose Estimation
Human pose estimation is a challenging problem, especially 3D pose estimation
from 2D images, owing to factors such as occlusion, depth ambiguity,
intertwining of people, and crowds in general. 2D multi-person human pose
estimation in the wild suffers from the same problems: occlusion, ambiguity,
and the disentanglement of people's body parts. As a fundamental problem with
many applications, including but not limited to surveillance, economical
motion capture for video games and movies, and physiotherapy, it is
interesting to solve from both a practical and an intellectual perspective.
Although there are cases where no pose estimator can ever predict with 100%
accuracy (cases where even humans would fail), several algorithms have set new
state-of-the-art performance for human pose estimation in the wild. We review
a few algorithms that take different approaches and also formulate our own
approach to tackle a persistent problem: occlusion.
Comment: 11 pages, 12 figures