MonoPerfCap: Human Performance Capture from Monocular Video
We present the first marker-less approach for temporally coherent 3D
performance capture of a human with general clothing from monocular video. Our
approach reconstructs articulated human skeleton motion as well as medium-scale
non-rigid surface deformations in general scenes. Human performance capture is
a challenging problem due to the large range of articulation, potentially fast
motion, and considerable non-rigid deformations, even from multi-view data.
Reconstruction from monocular video alone is drastically more challenging,
since strong occlusions and the inherent depth ambiguity lead to a highly
ill-posed reconstruction problem. We tackle these challenges by a novel
approach that employs sparse 2D and 3D human pose detections from a
convolutional neural network using a batch-based pose estimation strategy.
Joint recovery of per-batch motion allows us to resolve the ambiguities of the
monocular reconstruction problem based on a low-dimensional trajectory
subspace. In addition, we propose refinement of the surface geometry based on
fully automatically extracted silhouettes to enable medium-scale non-rigid
alignment. We demonstrate state-of-the-art performance capture results that
enable exciting applications such as video editing and free viewpoint video,
previously infeasible from monocular video. Our qualitative and quantitative
evaluation demonstrates that our approach significantly outperforms previous
monocular methods in terms of accuracy, robustness, and the scene complexity
that can be handled.
Comment: Accepted to ACM TOG 2018, to be presented at SIGGRAPH 2018
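The low-dimensional trajectory subspace mentioned above can be illustrated with a truncated DCT basis, a common choice for trajectory subspaces in motion reconstruction; the abstract does not specify the basis, so treating it as a DCT is an assumption made here purely for illustration:

```python
import numpy as np

def dct_basis(T, K):
    """First K rows of the orthonormal DCT-II basis over T frames.
    NOTE: using a DCT basis is an illustrative assumption; the
    abstract only states that a low-dimensional trajectory
    subspace is used."""
    B = np.zeros((K, T))
    t = np.arange(T)
    B[0] = np.sqrt(1.0 / T)
    for k in range(1, K):
        B[k] = np.sqrt(2.0 / T) * np.cos(np.pi * (t + 0.5) * k / T)
    return B

def project_trajectory(y, K):
    """Constrain a per-batch trajectory y (length T) to the span of
    the first K basis vectors, suppressing high-frequency variation
    that monocular depth ambiguity would otherwise leave unresolved."""
    B = dct_basis(len(y), K)
    coeffs = B @ y          # low-dimensional representation
    return B.T @ coeffs     # smooth reconstruction
```

A smooth trajectory lying in the subspace is reproduced almost exactly, while high-frequency jitter is attenuated by the projection.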
PREF: Predictability Regularized Neural Motion Fields
Knowing the 3D motions in a dynamic scene is essential to many vision
applications. Recent progress is mainly focused on estimating the activity of
some specific elements like humans. In this paper, we leverage a neural motion
field for estimating the motion of all points in a multiview setting. Modeling
the motion from a dynamic scene with multiview data is challenging due to the
ambiguities in points of similar color and points with time-varying color. We
propose to regularize the estimated motion to be predictable. If the motion
from previous frames is known, then the motion in the near future should be
predictable. Therefore, we introduce a predictability regularization by first
conditioning the estimated motion on latent embeddings, then by adopting a
predictor network to enforce predictability on the embeddings. The proposed
framework PREF (Predictability REgularized Fields) achieves on-par or better
results than state-of-the-art neural motion field-based dynamic scene
representation methods, while requiring no prior knowledge of the scene.
Comment: Accepted at ECCV 2022 (oral). Paper + supplementary material
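The predictability regularization described above can be sketched as a loss term that penalizes latent embeddings whose near future cannot be predicted from the recent past. The paper uses a predictor network; the linear predictor `W` and the two-frame window below are illustrative stand-ins:

```python
import numpy as np

def predictability_loss(Z, W, window=2):
    """Sketch of PREF-style predictability regularization.
    Z : (T, D) array of per-frame latent motion embeddings
    W : (window * D, D) linear predictor weights (the actual
        predictor in the paper is a neural network; a linear map
        is an illustrative assumption here)
    Returns the mean squared error between predicted and actual
    embeddings; the training objective would add this term so that
    the estimated motion stays predictable from previous frames."""
    T, D = Z.shape
    errs = []
    for t in range(window, T):
        past = Z[t - window:t].reshape(-1)  # stack recent embeddings
        pred = past @ W                     # predicted next embedding
        errs.append(np.sum((pred - Z[t]) ** 2))
    return float(np.mean(errs))
```

Embedding sequences that follow consistent dynamics incur near-zero loss, while erratic embeddings are penalized, which is the mechanism the regularizer uses to disambiguate points of similar or time-varying color.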
Hand Keypoint Detection in Single Images using Multiview Bootstrapping
We present an approach that uses a multi-camera system to train fine-grained
detectors for keypoints that are prone to occlusion, such as the joints of a
hand. We call this procedure multiview bootstrapping: first, an initial
keypoint detector is used to produce noisy labels in multiple views of the
hand. The noisy detections are then triangulated in 3D using multiview geometry
or marked as outliers. Finally, the reprojected triangulations are used as new
labeled training data to improve the detector. We repeat this process,
generating more labeled data in each iteration. We derive a result analytically
relating the minimum number of views to achieve target true and false positive
rates for a given detector. The method is used to train a hand keypoint
detector for single images. The resulting keypoint detector runs in realtime on
RGB images and has accuracy comparable to methods that use depth sensors. The
single view detector, triangulated over multiple views, enables 3D markerless
hand motion capture with complex object interactions.
Comment: CVPR 2017
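The triangulate-or-mark-as-outlier step of multiview bootstrapping can be sketched with standard linear (DLT) triangulation. The RANSAC machinery and detector retraining of the full method are omitted, and the pixel threshold `thresh` is an illustrative parameter, not one from the paper:

```python
import numpy as np

def triangulate_dlt(projections, points2d):
    """Linear (DLT) triangulation of one keypoint from several views.
    projections : list of 3x4 camera projection matrices
    points2d    : list of (u, v) detections, one per view"""
    A = []
    for P, (u, v) in zip(projections, points2d):
        A.append(u * P[2] - P[0])   # each view contributes two
        A.append(v * P[2] - P[1])   # linear constraints on X
    _, _, Vt = np.linalg.svd(np.asarray(A))
    X = Vt[-1]                      # null-space solution (homogeneous)
    return X[:3] / X[3]

def relabel_views(projections, points2d, thresh=5.0):
    """One relabeling step in the spirit of multiview bootstrapping:
    triangulate the noisy detections, then either mark each view as
    an outlier or replace its label with the reprojected 3D point,
    yielding new training labels for the detector."""
    X = triangulate_dlt(projections, points2d)
    labels = []
    for P, x in zip(projections, points2d):
        proj = P @ np.append(X, 1.0)
        proj = proj[:2] / proj[2]
        err = np.linalg.norm(proj - np.asarray(x))
        labels.append(proj if err < thresh else None)  # None = outlier
    return X, labels
```

With consistent detections across views, the triangulated point reprojects onto every view, so all views receive refined labels; a grossly wrong detection would exceed the threshold and be dropped from the next training round.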