4D Human Body Capture from Egocentric Video via 3D Scene Grounding
We introduce a novel task of reconstructing a time series of second-person 3D
human body meshes from monocular egocentric videos. The unique viewpoint and
rapid embodied camera motion of egocentric videos raise additional technical
barriers for human body capture. To address those challenges, we propose a
simple yet effective optimization-based approach that leverages 2D observations
of the entire video sequence and human-scene interaction constraints to estimate
second-person human poses, shapes, and global motion that are grounded on the
3D environment captured from the egocentric view. We conduct detailed ablation
studies to validate our design choices. Moreover, we compare our method with the
previous state of the art in human motion capture from monocular video, and
show that our method estimates more accurate human body poses and shapes
under the challenging egocentric setting. In addition, we demonstrate that our
approach produces more realistic human-scene interactions.
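To make the described pipeline concrete, here is a minimal sketch (PyTorch, not the authors' released code) of this kind of sequence-level optimization: a 2D reprojection term over all frames, a human-scene contact term against the reconstructed environment, and a temporal smoothness prior. `body_model`, `project`, and all tensor shapes are illustrative placeholders.

```python
import torch

def reprojection_loss(joints_3d, keypoints_2d, conf, project):
    """Penalize distance between projected 3D joints and detected 2D keypoints."""
    joints_2d = project(joints_3d)                       # (T, J, 2)
    return (conf * (joints_2d - keypoints_2d).pow(2).sum(-1)).mean()

def scene_contact_loss(foot_verts, scene_points):
    """Keep foot vertices close to the reconstructed scene geometry."""
    d = torch.cdist(foot_verts.reshape(-1, 3), scene_points)  # (V, S) distances
    return d.min(dim=1).values.pow(2).mean()

def fit_sequence(pose_init, body_model, project, keypoints_2d, conf,
                 scene_points, n_iters=200, lr=0.01, w_scene=1.0, w_smooth=0.1):
    """Optimize the whole sequence of pose parameters (T, D) at once."""
    pose = pose_init.clone().requires_grad_(True)
    opt = torch.optim.Adam([pose], lr=lr)
    for _ in range(n_iters):
        opt.zero_grad()
        joints_3d, foot_verts = body_model(pose)         # placeholder body model
        loss = (reprojection_loss(joints_3d, keypoints_2d, conf, project)
                + w_scene * scene_contact_loss(foot_verts, scene_points)
                + w_smooth * (pose[1:] - pose[:-1]).pow(2).mean())  # smoothness
        loss.backward()
        opt.step()
    return pose.detach()
```

Optimizing over the whole sequence at once, rather than frame by frame, is what lets the scene-contact and smoothness terms disambiguate the global trajectory.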
Monocular Expressive Body Regression through Body-Driven Attention
To understand how people look, interact, or perform tasks, we need to quickly
and accurately capture their 3D body, face, and hands together from an RGB
image. Most existing methods focus only on parts of the body. A few recent
approaches reconstruct full expressive 3D humans from images using 3D body
models that include the face and hands. These methods are optimization-based
and thus slow, prone to local optima, and require 2D keypoints as input. We
address these limitations by introducing ExPose (EXpressive POse and Shape
rEgression), which directly regresses the body, face, and hands, in SMPL-X
format, from an RGB image. This is a hard problem due to the high
dimensionality of the body and the lack of expressive training data.
Additionally, hands and faces are much smaller than the body, occupying very
few image pixels. This makes hand and face estimation hard when body images are
downscaled for neural networks. We make three main contributions. First, we
account for the lack of training data by curating a dataset of SMPL-X fits on
in-the-wild images. Second, we observe that body estimation localizes the face
and hands reasonably well. We introduce body-driven attention for face and hand
regions in the original image to extract higher-resolution crops that are fed
to dedicated refinement modules. Third, these modules exploit part-specific
knowledge from existing face- and hand-only datasets. ExPose estimates
expressive 3D humans more accurately than existing optimization methods at a
small fraction of the computational cost. Our data, model and code are
available for research at https://expose.is.tue.mpg.de.
Comment: Accepted at ECCV'20. Project page: https://expose.is.tue.mpg.de
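The body-driven attention idea can be sketched roughly as follows (hypothetical helper names, not the released ExPose code): the coarse body estimate localizes the face and hands, and square crops around those keypoint groups are taken from the original full-resolution image before being passed to the dedicated refinement modules.

```python
import numpy as np

def part_bbox(keypoints_2d, indices, scale=1.8):
    """Square box (center, half-size) around a keypoint subset, enlarged by `scale`."""
    pts = keypoints_2d[indices]
    center = pts.mean(axis=0)
    half = scale * 0.5 * float((pts.max(axis=0) - pts.min(axis=0)).max())
    return center, max(half, 1.0)

def crop(image, center, half):
    """Square crop around `center`, clamped to the image bounds."""
    h, w = image.shape[:2]
    x0, x1 = int(max(center[0] - half, 0)), int(min(center[0] + half, w))
    y0, y1 = int(max(center[1] - half, 0)), int(min(center[1] + half, h))
    return image[y0:y1, x0:x1]

# Usage sketch; the networks and keypoint index groups are placeholders:
# body_kpts   = body_network(downscaled(image))        # coarse, low-res body pass
# face_crop   = crop(image, *part_bbox(body_kpts, FACE_IDX))  # full-res crops
# lhand_crop  = crop(image, *part_bbox(body_kpts, LHAND_IDX))
# face_params = face_refiner(face_crop)                # part-specific refinement
```

The key point is that the crops come from the original image, not the downscaled copy the body network sees, so the small face and hand regions keep enough pixels for accurate regression.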
PhysCap: Physically Plausible Monocular 3D Motion Capture in Real Time
Marker-less 3D human motion capture from a single colour camera has seen
significant progress. However, it is a very challenging and severely ill-posed
problem. In consequence, even the most accurate state-of-the-art approaches
have significant limitations. Purely kinematic formulations based on
individual joints or skeletons, and the frequent frame-wise reconstruction in
state-of-the-art methods, greatly limit 3D accuracy and temporal stability
compared to multi-view or marker-based motion capture. Further, captured 3D
poses are often physically incorrect and biomechanically implausible, or
exhibit implausible environment interactions (floor penetration, foot skating,
unnatural body leaning and strong shifting in depth), which is problematic for
any use case in computer graphics. We therefore present PhysCap, the first
algorithm for physically plausible, real-time and marker-less human 3D motion
capture with a single colour camera at 25 fps. Our algorithm first captures 3D
human poses purely kinematically. To this end, a CNN infers 2D and 3D joint
positions, and subsequently, an inverse kinematics step finds space-time
coherent joint angles and global 3D pose. Next, these kinematic reconstructions
are used as constraints in a real-time physics-based pose optimiser that
accounts for environment constraints (e.g., collision handling and floor
placement), gravity, and biophysical plausibility of human postures. Our
approach employs a combination of ground reaction force and residual force for
plausible root control, and uses a trained neural network to detect foot
contact events in images. Our method captures physically plausible and
temporally stable global 3D human motion, without physically implausible
postures, floor penetrations or foot skating, from video in real time and in
general scenes. The video is available at
http://gvv.mpi-inf.mpg.de/projects/PhysCap
Comment: 16 pages, 11 figures
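At a high level, the two-stage structure reads like the following sketch (illustrative only; `inverse_kinematics`, `foot_contact_net`, and the simplified dynamics stand in for the components the paper describes): stage 1 produces kinematic pose targets per frame, and stage 2 tracks them with a physics-based controller that applies ground reaction forces only when foot contact is detected.

```python
import numpy as np

def pd_torques(q, qdot, q_target, kp=300.0, kd=20.0):
    """PD control driving the simulated character toward the kinematic target."""
    return kp * (q_target - q) - kd * qdot

def physics_step(q, qdot, tau, grf, dt=0.04, mass_matrix=None, gravity=None):
    """One explicit-Euler step of simplified dynamics in generalized coordinates:
    M * qddot = tau + grf - g. Residual root forces are omitted for brevity."""
    n = len(q)
    M = np.eye(n) if mass_matrix is None else mass_matrix
    g = np.zeros(n) if gravity is None else gravity
    qddot = np.linalg.solve(M, tau + grf - g)
    qdot = qdot + dt * qddot
    return q + dt * qdot, qdot

# Per frame, at 25 fps (dt = 0.04 s); all callables below are placeholders:
# q_target = inverse_kinematics(cnn_2d_3d_joints(frame))  # stage 1: kinematics
# contact  = foot_contact_net(frame)                      # learned contact detection
# grf      = ground_reaction(q, contact)                  # nonzero only on contact
# tau      = pd_torques(q, qdot, q_target)
# q, qdot  = physics_step(q, qdot, tau, grf)              # stage 2: physics projection
```

Because the pose is only ever updated through the dynamics step, artifacts like floor penetration and foot skating are suppressed by construction rather than penalized after the fact.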
GRAB: A Dataset of Whole-Body Human Grasping of Objects
Training computers to understand, model, and synthesize human grasping
requires a rich dataset containing complex 3D object shapes, detailed contact
information, hand pose and shape, and the 3D body motion over time. While
"grasping" is commonly thought of as a single hand stably lifting an object, we
capture the motion of the entire body and adopt the generalized notion of
"whole-body grasps". Thus, we collect a new dataset, called GRAB (GRasping
Actions with Bodies), of whole-body grasps, containing full 3D shape and pose
sequences of 10 subjects interacting with 51 everyday objects of varying shape
and size. Given MoCap markers, we fit the full 3D body shape and pose,
including the articulated face and hands, as well as the 3D object pose. This
gives detailed 3D meshes over time, from which we compute contact between the
body and object. This is a unique dataset that goes well beyond existing ones
for modeling and understanding how humans grasp and manipulate objects, how
their full body is involved, and how interaction varies with the task. We
illustrate the practical value of GRAB with an example application; we train
GrabNet, a conditional generative network, to predict 3D hand grasps for unseen
3D object shapes. The dataset and code are available for research purposes at
https://grab.is.tue.mpg.de.
Comment: ECCV 2020
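As an illustration of the contact computation mentioned above, a minimal proximity-based version might look like this (hypothetical code, not the released GRAB tooling; the 5 mm threshold is an arbitrary choice for the sketch):

```python
import numpy as np

def contact_vertices(body_verts, object_verts, threshold=0.005):
    """Boolean mask over body vertices within `threshold` metres of the object."""
    mask = np.zeros(len(body_verts), dtype=bool)
    for i in range(0, len(body_verts), 1024):            # chunked to bound memory
        chunk = body_verts[i:i + 1024]                   # (n, 3)
        d = np.linalg.norm(chunk[:, None, :] - object_verts[None, :, :], axis=-1)
        mask[i:i + 1024] = d.min(axis=1) < threshold
    return mask

# body_verts and object_verts would come from the dataset's fitted body and
# object meshes per frame, e.g. in_contact = contact_vertices(bv, ov)
```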
Estimating 3D Motion and Forces of Person-Object Interactions From Monocular Video
In this paper, we introduce a method to automatically reconstruct the 3D motion of a person interacting with an object from a single RGB video. Our method estimates the 3D poses of the person and the object, contact positions, and the forces and torques actuated by the human limbs. The main contributions of this work are three-fold. First, we introduce an approach to jointly estimate the motion and the actuation forces of the person on the manipulated object by modeling contacts and the dynamics of their interactions. This is cast as a large-scale trajectory optimization problem. Second, we develop a method to automatically recognize from the input video the position and timing of contacts between the person and the object or the ground, thereby significantly simplifying the complexity of the optimization. Third, we validate our approach on a recent MoCap dataset with ground-truth contact forces and demonstrate its performance on a new dataset of Internet videos showing people manipulating a variety of tools in unconstrained environments.
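The trajectory-optimization formulation can be sketched in miniature as follows (toy dimensions, a unit mass matrix, and a fixed placeholder contact Jacobian; none of this is the paper's actual solver): poses and contact forces over all frames are the decision variables, a tracking term plays the role of the image evidence, and discretized dynamics enter as equality constraints.

```python
import numpy as np
from scipy.optimize import minimize

T, D, C, dt = 10, 4, 1, 1.0 / 30.0   # frames, pose dims, contacts, frame time
B = np.eye(C * 3, D)                 # placeholder "contact Jacobian" mapping

def unpack(x):
    return x[:T * D].reshape(T, D), x[T * D:].reshape(T, C * 3)

def objective(x, observations):
    q, f = unpack(x)
    track = np.sum((q - observations) ** 2)   # stand-in for 2D/3D reprojection terms
    effort = 1e-4 * np.sum(f ** 2)            # regularize contact forces
    return track + effort

def dynamics_residual(x):
    """Finite-difference accelerations must be explained by the contact forces
    (unit mass matrix and zero gravity, purely for the sketch)."""
    q, f = unpack(x)
    qddot = (q[2:] - 2 * q[1:-1] + q[:-2]) / dt ** 2
    return (qddot - f[1:-1] @ B).ravel()

obs = np.linspace(0, 1, T)[:, None] * np.ones((1, D))   # fake observed trajectory
x0 = np.zeros(T * D + T * C * 3)
res = minimize(objective, x0, args=(obs,), method="SLSQP",
               constraints={"type": "eq", "fun": dynamics_residual})
q_opt, f_opt = unpack(res.x)
```

The recognized contact positions and timings would shrink this problem in practice: forces are only free variables at frames and points where contact is actually detected.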