Fusing Monocular Images and Sparse IMU Signals for Real-time Human Motion Capture
Either RGB images or inertial signals have been used for the task of motion
capture (mocap), but combining them is a new and interesting topic. We
believe that the combination is complementary and able to solve the inherent
difficulties of using one modality input, including occlusions, extreme
lighting/texture, and out-of-view for visual mocap and global drifts for
inertial mocap. To this end, we propose a method that fuses monocular images
and sparse IMUs for real-time human motion capture. Our method contains a dual
coordinate strategy to fully explore the IMU signals with different goals in
motion capture. To be specific, besides one branch transforming the IMU signals
to the camera coordinate system to combine with the image information, there is
another branch to learn from the IMU signals in the body root coordinate system
to better estimate body poses. Furthermore, a hidden state feedback mechanism
is proposed for both branches to compensate for their own drawbacks in
extreme input cases. Thus our method can easily switch between the two kinds of
signals or combine them in different cases to achieve robust mocap.
Quantitative and qualitative results demonstrate that by delicately
designing the fusion method, our technique significantly outperforms the
state-of-the-art vision, IMU, and combined methods on both global orientation
and local pose estimation. Our code is available for research at
https://shaohua-pan.github.io/robustcap-page/.
Comment: Accepted by SIGGRAPH ASIA 2023. Project page:
https://shaohua-pan.github.io/robustcap-page
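The dual coordinate idea in the abstract above can be illustrated with a small sketch. This is not the authors' implementation: the function names, the rotation-matrix representation, and the assumption that each IMU reports a world-frame orientation are all hypothetical choices made for illustration. It only shows the two changes of frame, one into the camera coordinate system and one into the body root coordinate system.

```python
import numpy as np

def rot_z(theta):
    """Rotation matrix for an angle theta (radians) about the z axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def to_camera_frame(R_world_sensor, R_cam_world):
    """Express sensor orientations (N, 3, 3) in the camera frame.

    R_cam_world is an assumed, known world-to-camera rotation (3, 3).
    """
    return R_cam_world @ R_world_sensor

def to_root_frame(R_world_sensor, R_world_root):
    """Express sensor orientations (N, 3, 3) relative to the root sensor.

    R_world_root is the root sensor orientation in the world frame (3, 3).
    """
    return R_world_root.T @ R_world_sensor

# Hypothetical readings: two sensors, the first one mounted on the root.
sensors = np.stack([rot_z(0.3), rot_z(1.1)])
root = sensors[0]
local = to_root_frame(sensors, root)
camera = to_camera_frame(sensors, rot_z(-0.3))
```

One reason a root-relative branch may be useful: if every world-frame reading is multiplied by the same heading error, the root-relative orientations are unchanged, whereas the camera-frame orientations inherit the error.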
In the Wild Human Pose Estimation Using Explicit 2D Features and Intermediate 3D Representations
Convolutional Neural Network based approaches for monocular 3D human pose
estimation usually require a large amount of training images with 3D pose
annotations. While it is feasible to provide 2D joint annotations for large
corpora of in-the-wild images with humans, providing accurate 3D annotations to
such in-the-wild corpora is hardly feasible in practice. Most existing 3D
labelled data sets are either synthetically created or feature in-studio
images. 3D pose estimation algorithms trained on such data often have limited
ability to generalize to real world scene diversity. We therefore propose a new
deep learning based method for monocular 3D human pose estimation that shows
high accuracy and generalizes better to in-the-wild scenes. It has a network
architecture that comprises a new disentangled hidden space encoding of
explicit 2D and 3D features, and uses supervision by a new learned projection
model from predicted 3D pose. Our algorithm can be jointly trained on image
data with 3D labels and image data with only 2D labels. It achieves
state-of-the-art accuracy on challenging in-the-wild data.
Comment: Accepted to CVPR 201
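Joint training on images with 3D labels and images with only 2D labels can be sketched as a loss that always penalizes 2D reprojection error and adds a 3D term only when 3D ground truth exists. The abstract describes a learned projection model; the fixed weak-perspective projection below is a simpler stand-in, and every name and parameter here is an assumption for illustration rather than the paper's formulation.

```python
import numpy as np

def weak_perspective_project(joints_3d, scale, trans_2d):
    """Project (J, 3) joints to (J, 2): drop depth, then scale and shift."""
    return scale * joints_3d[:, :2] + trans_2d

def mixed_loss(pred_3d, scale, trans_2d, gt_2d, gt_3d=None):
    """Mean squared 2D reprojection error, plus a 3D term if gt_3d is given."""
    pred_2d = weak_perspective_project(pred_3d, scale, trans_2d)
    loss = np.mean(np.sum((pred_2d - gt_2d) ** 2, axis=1))
    if gt_3d is not None:
        # Only samples with 3D annotations contribute this term.
        loss += np.mean(np.sum((pred_3d - gt_3d) ** 2, axis=1))
    return float(loss)

# Hypothetical sample with 4 joints.
gt_3d = np.array([[0.0, 0.0, 2.0], [0.1, 0.5, 2.1], [-0.1, 0.5, 1.9], [0.0, 1.0, 2.0]])
gt_2d = weak_perspective_project(gt_3d, scale=100.0, trans_2d=np.array([64.0, 64.0]))

# An in-the-wild sample supplies only gt_2d; a studio sample supplies both.
loss_2d_only = mixed_loss(gt_3d, 100.0, np.array([64.0, 64.0]), gt_2d)
loss_full = mixed_loss(gt_3d, 100.0, np.array([64.0, 64.0]), gt_2d, gt_3d)
```

With this stand-in projection, a pure depth error goes unpenalized on 2D-only samples, which is why the 3D-labelled samples are still needed to constrain depth.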
PACE: Human and Camera Motion Estimation from in-the-wild Videos
We present a method to estimate human motion in a global scene from moving
cameras. This is a highly challenging task due to the coupling of human and
camera motions in the video. To address this problem, we propose a joint
optimization framework that disentangles human and camera motions using both
foreground human motion priors and background scene features. Unlike existing
methods that use SLAM as initialization, we propose to tightly integrate SLAM
and human motion priors in an optimization that is inspired by bundle
adjustment. Specifically, we optimize human and camera motions to match both
the observed human pose and scene features. This design combines the strengths
of SLAM and motion priors, which leads to significant improvements in human and
camera motion estimation. We additionally introduce a motion prior that is
suitable for batch optimization, making our approach significantly more
efficient than existing approaches. Finally, we propose a novel synthetic
dataset that enables evaluating camera motion in addition to human motion from
dynamic videos. Experiments on the synthetic and real-world RICH datasets
demonstrate that our approach substantially outperforms prior art in recovering
both human and camera motions.
Comment: 3DV 2024. Project page: https://nvlabs.github.io/PACE
EMDB: The Electromagnetic Database of Global 3D Human Pose and Shape in the Wild
We present EMDB, the Electromagnetic Database of Global 3D Human Pose and
Shape in the Wild. EMDB is a novel dataset that contains high-quality 3D SMPL
pose and shape parameters with global body and camera trajectories for
in-the-wild videos. We use body-worn, wireless electromagnetic (EM) sensors and
a hand-held iPhone to record a total of 58 minutes of motion data, distributed
over 81 indoor and outdoor sequences and 10 participants. Together with
accurate body poses and shapes, we also provide global camera poses and body
root trajectories. To construct EMDB, we propose a multi-stage optimization
procedure, which first fits SMPL to the 6-DoF EM measurements and then refines
the poses via image observations. To achieve high-quality results, we leverage
a neural implicit avatar model to reconstruct detailed human surface geometry
and appearance, which allows for improved alignment and smoothness via a dense
pixel-level objective. Our evaluations, conducted with a multi-view volumetric
capture system, indicate that EMDB has an expected accuracy of 2.3 cm
positional and 10.6 degrees angular error, surpassing the accuracy of previous
in-the-wild datasets. We evaluate existing state-of-the-art monocular RGB
methods for camera-relative and global pose estimation on EMDB. EMDB is
publicly available under https://ait.ethz.ch/emdb
Comment: Accepted to ICCV 202
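Positional and angular errors of the kind quoted above are commonly computed as a mean Euclidean distance between corresponding points and a mean geodesic angle between corresponding rotations. The sketch below shows those two generic metrics; the abstract does not spell out the exact evaluation protocol, so the function names and the choice of averaging are assumptions.

```python
import numpy as np

def mean_position_error(pred, gt):
    """Mean Euclidean distance between (N, 3) point sets, in input units."""
    return float(np.mean(np.linalg.norm(pred - gt, axis=-1)))

def mean_angular_error_deg(R_pred, R_gt):
    """Mean geodesic angle in degrees between (N, 3, 3) rotation matrices."""
    R_rel = np.swapaxes(R_pred, -1, -2) @ R_gt
    trace = np.trace(R_rel, axis1=-2, axis2=-1)
    # The rotation angle satisfies trace = 1 + 2 cos(angle); clip for safety.
    cos_angle = np.clip((trace - 1.0) / 2.0, -1.0, 1.0)
    return float(np.mean(np.degrees(np.arccos(cos_angle))))

def rot_z(theta):
    """Rotation matrix for an angle theta (radians) about the z axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

# Hypothetical data: every point is off by (3, 4, 0) cm, every joint by 10 degrees.
gt_points = np.zeros((5, 3))
pred_points = gt_points + np.array([3.0, 4.0, 0.0])
gt_rots = np.stack([np.eye(3)] * 5)
pred_rots = np.stack([rot_z(np.radians(10.0))] * 5)

pos_err = mean_position_error(pred_points, gt_points)    # 5 cm
ang_err = mean_angular_error_deg(pred_rots, gt_rots)     # about 10 degrees
```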
Tex2Shape: Detailed Full Human Body Geometry From a Single Image
We present a simple yet effective method to infer detailed full human body shape from only a single photograph. Our model can infer full-body shape including face, hair, and clothing with wrinkles at interactive frame rates. Results feature details even on parts that are occluded in the input image. Our main idea is to turn shape regression into an aligned image-to-image translation problem. The input to our method is a partial texture map of the visible region obtained from off-the-shelf methods. From a partial texture, we estimate detailed normal and vector displacement maps, which can be applied to a low-resolution smooth body model to add detail and clothing. Despite being trained purely with synthetic data, our model generalizes well to real-world photographs. Numerous results demonstrate the versatility and robustness of our method.
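Applying a vector displacement map to a smooth body mesh can be sketched as a per-vertex lookup in UV space followed by an offset. This is a minimal illustration, not the Tex2Shape pipeline: the nearest-neighbour sampling, the UV convention (u indexes columns, v indexes rows, both in [0, 1]), and all names are assumptions.

```python
import numpy as np

def apply_displacement(vertices, uvs, disp_map):
    """Offset (V, 3) vertices by vectors sampled from an (H, W, 3) map.

    uvs holds one (u, v) coordinate in [0, 1] per vertex; sampling is
    nearest-neighbour for simplicity.
    """
    h, w, _ = disp_map.shape
    cols = np.clip(np.rint(uvs[:, 0] * (w - 1)).astype(int), 0, w - 1)
    rows = np.clip(np.rint(uvs[:, 1] * (h - 1)).astype(int), 0, h - 1)
    return vertices + disp_map[rows, cols]

# Hypothetical smooth mesh with 3 vertices and a 4x4 displacement map.
vertices = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
uvs = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
disp_map = np.zeros((4, 4, 3))
disp_map[0, 3] = [0.0, 0.0, 0.02]  # detail stored at u = 1, v = 0

detailed = apply_displacement(vertices, uvs, disp_map)
```

In this toy example only the second vertex moves, by 0.02 along z, because it is the only one whose UV coordinate lands on a non-zero texel.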