14 research outputs found
VIBE: Video Inference for Human Body Pose and Shape Estimation
Human motion is fundamental to understanding behavior. Despite progress on
single-image 3D pose and shape estimation, existing video-based
state-of-the-art methods fail to produce accurate and natural motion sequences
due to a lack of ground-truth 3D motion data for training. To address this
problem, we propose Video Inference for Body Pose and Shape Estimation (VIBE),
which makes use of an existing large-scale motion capture dataset (AMASS)
together with unpaired, in-the-wild, 2D keypoint annotations. Our key novelty
is an adversarial learning framework that leverages AMASS to discriminate
between real human motions and those produced by our temporal pose and shape
regression networks. We define a temporal network architecture and show that
adversarial training, at the sequence level, produces kinematically plausible
motion sequences without in-the-wild ground-truth 3D labels. We perform
extensive experimentation to analyze the importance of motion and demonstrate
the effectiveness of VIBE on challenging 3D pose estimation datasets, achieving
state-of-the-art performance. Code and pretrained models are available at
https://github.com/mkocabas/VIBE.
Comment: CVPR-2020 camera ready.
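The sequence-level adversarial idea can be sketched in a few lines. This is a toy numpy illustration, not VIBE's actual architecture: the temporal regressor and the AMASS clip are replaced by random stand-in arrays, the motion discriminator by a single linear layer, and the dimensions (16 frames, 72-D pose) are assumptions for illustration. It shows only how an LSGAN-style objective is applied to a whole motion sequence rather than to single frames.

```python
import numpy as np

rng = np.random.default_rng(0)

def discriminator(motion, w):
    """Toy sequence-level motion discriminator: scores an entire
    (T, D) pose sequence with one linear layer + sigmoid.
    A stand-in for a learned motion discriminator."""
    logits = motion.reshape(-1) @ w
    return 1.0 / (1.0 + np.exp(-logits))

T, D = 16, 72                        # 16 frames, 72-D pose per frame (illustrative)
w = rng.normal(scale=0.01, size=T * D)

real = rng.normal(size=(T, D))       # stand-in for a mocap (AMASS-like) clip
fake = rng.normal(size=(T, D))       # stand-in for the temporal regressor's output

d_real = discriminator(real, w)
d_fake = discriminator(fake, w)

# Least-squares adversarial objectives at the *sequence* level:
# the discriminator pushes real clips toward 1 and generated clips toward 0,
# while the regressor (generator) is rewarded when its clips score as real.
d_loss = (d_real - 1.0) ** 2 + d_fake ** 2
g_loss = (d_fake - 1.0) ** 2
```

Because the discriminator sees whole sequences, it can penalize jitter and implausible dynamics that a per-frame critic would miss, which is the motivation for sequence-level training.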
PC-HMR: Pose Calibration for 3D Human Mesh Recovery from 2D Images/Videos
The end-to-end Human Mesh Recovery (HMR) approach has been successfully used
for 3D body reconstruction. However, most HMR-based frameworks reconstruct
human body by directly learning mesh parameters from images or videos, while
lacking explicit guidance of 3D human pose in visual data. As a result, the
generated mesh often exhibits incorrect pose for complex activities. To tackle
this problem, we propose to exploit 3D pose to calibrate human mesh.
Specifically, we develop two novel Pose Calibration frameworks, i.e., Serial
PC-HMR and Parallel PC-HMR. By coupling advanced 3D pose estimators and HMR in
a serial or parallel manner, these two frameworks can effectively correct human
mesh with guidance of a concise pose calibration module. Furthermore, since the
calibration module is designed via non-rigid pose transformation, our PC-HMR
frameworks can flexibly tackle bone length variations to alleviate misplacement
in the calibrated mesh. Finally, our frameworks are based on generic and
complementary integration of data-driven learning and geometrical modeling. Via
plug-and-play modules, they can be efficiently adapted for both
image/video-based human mesh recovery. Additionally, they require no extra 3D
pose annotations at test time, which eases inference in practice. We perform
extensive experiments on the popular benchmarks, i.e., Human3.6M, 3DPW and
SURREAL, where our PC-HMR frameworks achieve state-of-the-art results.
Comment: 9 pages, 7 figures. AAAI202
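The core calibration idea (correct the mesh's pose while preserving its bone lengths) can be sketched as follows. This is a simplified numpy stand-in, not PC-HMR's actual module: the 4-joint kinematic chain and the `calibrate` helper are hypothetical, and the "non-rigid" transformation is reduced to a per-bone retargeting that takes each bone's direction from the external 3D pose estimate while keeping the mesh's own bone lengths.

```python
import numpy as np

# Hypothetical toy kinematic chain: PARENT[i] is joint i's parent (-1 = root).
PARENT = [-1, 0, 1, 2]

def calibrate(mesh_joints, pose_joints):
    """Toy per-bone pose calibration: rebuild the mesh skeleton bone by
    bone, taking each bone's *direction* from the external 3D pose
    estimate while keeping the mesh's own bone *lengths*, so body shape
    is preserved but the pose is corrected. An illustrative stand-in
    for a pose calibration module."""
    out = np.empty_like(mesh_joints)
    for i, p in enumerate(PARENT):
        if p < 0:
            out[i] = pose_joints[i]          # anchor the root at the estimate
            continue
        bone = pose_joints[i] - pose_joints[p]          # estimated direction
        length = np.linalg.norm(mesh_joints[i] - mesh_joints[p])  # mesh length
        out[i] = out[p] + length * bone / np.linalg.norm(bone)
    return out
```

Keeping the mesh's bone lengths while borrowing directions from the pose estimator is one way to tolerate bone-length discrepancies between the two models, which is the misplacement issue the abstract raises.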
Graph and Temporal Convolutional Networks for 3D Multi-person Pose Estimation in Monocular Videos
Despite the recent progress, 3D multi-person pose estimation from monocular
videos is still challenging due to the commonly encountered problem of missing
information caused by occlusion, partially out-of-frame target persons, and
inaccurate person detection. To tackle this problem, we propose a novel
framework integrating graph convolutional networks (GCNs) and temporal
convolutional networks (TCNs) to robustly estimate camera-centric multi-person
3D poses that do not require camera parameters. In particular, we introduce a
human-joint GCN which, unlike existing GCNs, is based on a directed graph
that employs the 2D pose estimator's confidence scores to improve the pose
estimation results. We also introduce a human-bone GCN, which models the bone
connections and provides more information beyond human joints. The two GCNs
work together to estimate the spatial frame-wise 3D poses and can make use of
both visible joint and bone information in the target frame to estimate the
occluded or missing human-part information. To further refine the 3D pose
estimation, we use temporal convolutional networks (TCNs) to enforce
temporal and human-dynamics constraints. We use a joint-TCN to estimate
person-centric 3D poses across frames, and propose a velocity-TCN to estimate
the speed of 3D joints to ensure the consistency of the 3D pose estimation in
consecutive frames. Finally, to estimate the 3D human poses for multiple
persons, we propose a root-TCN that estimates camera-centric 3D poses without
requiring camera parameters. Quantitative and qualitative evaluations
demonstrate the effectiveness of the proposed method.
Comment: 10 pages, 3 figures, Accepted to AAAI 202
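The confidence-weighted directed-graph idea can be illustrated with a minimal numpy layer. This is not the paper's human-joint GCN: the 4-joint topology, the confidence values, and the `confidence_gcn_layer` helper are assumptions for illustration. It shows only the mechanism the abstract describes, i.e., scaling each message along a directed edge by the source joint's 2D-detection confidence so unreliable (e.g., occluded) joints contribute less.

```python
import numpy as np

# Hypothetical directed joint graph on a toy 4-joint skeleton: (dst, src) edges.
EDGES = [(1, 0), (2, 1), (3, 2)]
CONF  = np.array([0.9, 0.8, 0.1, 0.7])   # 2D detector confidence per joint

def confidence_gcn_layer(x, w_self, w_nbr):
    """Toy confidence-weighted directed-graph layer: each joint mixes its
    own (confidence-scaled) feature with messages arriving along directed
    edges, each message scaled by the *source* joint's 2D confidence.
    A simplified stand-in for a human-joint GCN."""
    out = CONF[:, None] * (x @ w_self)           # self term, scaled by own confidence
    for dst, src in EDGES:
        out[dst] += CONF[src] * (x[src] @ w_nbr)  # directed, confidence-scaled message
    return out
```

In this scheme a joint with a low-confidence detection (here joint 2) relies mostly on information propagated from its confident neighbors, which is how the occluded or missing joints described in the abstract can still receive a plausible estimate.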