4,557 research outputs found
MonoPerfCap: Human Performance Capture from Monocular Video
We present the first marker-less approach for temporally coherent 3D
performance capture of a human with general clothing from monocular video. Our
approach reconstructs articulated human skeleton motion as well as medium-scale
non-rigid surface deformations in general scenes. Human performance capture is
a challenging problem due to the large range of articulation, potentially fast
motion, and considerable non-rigid deformations, even from multi-view data.
Reconstruction from monocular video alone is drastically more challenging,
since strong occlusions and the inherent depth ambiguity lead to a highly
ill-posed reconstruction problem. We tackle these challenges by a novel
approach that employs sparse 2D and 3D human pose detections from a
convolutional neural network using a batch-based pose estimation strategy.
Joint recovery of per-batch motion allows to resolve the ambiguities of the
monocular reconstruction problem based on a low dimensional trajectory
subspace. In addition, we propose refinement of the surface geometry based on
fully automatically extracted silhouettes to enable medium-scale non-rigid
alignment. We demonstrate state-of-the-art performance capture results that
enable exciting applications such as video editing and free viewpoint video,
previously infeasible from monocular video. Our qualitative and quantitative
evaluation demonstrates that our approach significantly outperforms previous
monocular methods in terms of accuracy, robustness and scene complexity that
can be handled.Comment: Accepted to ACM TOG 2018, to be presented on SIGGRAPH 201
TöRF: Time-of-Flight Radiance Fields for Dynamic Scene View Synthesis
Neural networks can represent and accurately reconstruct radiance fields for
static 3D scenes (e.g., NeRF). Several works extend these to dynamic scenes
captured with monocular video, with promising performance. However, the
monocular setting is known to be an under-constrained problem, and so methods
rely on data-driven priors for reconstructing dynamic content. We replace these
priors with measurements from a time-of-flight (ToF) camera, and introduce a
neural representation based on an image formation model for continuous-wave ToF
cameras. Instead of working with processed depth maps, we model the raw ToF
sensor measurements to improve reconstruction quality and avoid issues with low
reflectance regions, multi-path interference, and a sensor's limited
unambiguous depth range. We show that this approach improves robustness of
dynamic scene reconstruction to erroneous calibration and large motions, and
discuss the benefits and limitations of integrating RGB+ToF sensors that are
now available on modern smartphones.Comment: Accepted to NeurIPS 2021. Web page: https://imaging.cs.cmu.edu/torf/
NeurIPS camera ready updates -- added quantitative comparisons to new
methods, visual side-by-side comparisons performed on larger baseline camera
sequence
LiveCap: Real-time Human Performance Capture from Monocular Video
We present the first real-time human performance capture approach that
reconstructs dense, space-time coherent deforming geometry of entire humans in
general everyday clothing from just a single RGB video. We propose a novel
two-stage analysis-by-synthesis optimization whose formulation and
implementation are designed for high performance. In the first stage, a skinned
template model is jointly fitted to background subtracted input video, 2D and
3D skeleton joint positions found using a deep neural network, and a set of
sparse facial landmark detections. In the second stage, dense non-rigid 3D
deformations of skin and even loose apparel are captured based on a novel
real-time capable algorithm for non-rigid tracking using dense photometric and
silhouette constraints. Our novel energy formulation leverages automatically
identified material regions on the template to model the differing non-rigid
deformation behavior of skin and apparel. The two resulting non-linear
optimization problems per-frame are solved with specially-tailored
data-parallel Gauss-Newton solvers. In order to achieve real-time performance
of over 25Hz, we design a pipelined parallel architecture using the CPU and two
commodity GPUs. Our method is the first real-time monocular approach for
full-body performance capture. Our method yields comparable accuracy with
off-line performance capture techniques, while being orders of magnitude
faster
Depth Prediction Without the Sensors: Leveraging Structure for Unsupervised Learning from Monocular Videos
Learning to predict scene depth from RGB inputs is a challenging task both
for indoor and outdoor robot navigation. In this work we address unsupervised
learning of scene depth and robot ego-motion where supervision is provided by
monocular videos, as cameras are the cheapest, least restrictive and most
ubiquitous sensor for robotics.
Previous work in unsupervised image-to-depth learning has established strong
baselines in the domain. We propose a novel approach which produces higher
quality results, is able to model moving objects and is shown to transfer
across data domains, e.g. from outdoors to indoor scenes. The main idea is to
introduce geometric structure in the learning process, by modeling the scene
and the individual objects; camera ego-motion and object motions are learned
from monocular videos as input. Furthermore an online refinement method is
introduced to adapt learning on the fly to unknown domains.
The proposed approach outperforms all state-of-the-art approaches, including
those that handle motion e.g. through learned flow. Our results are comparable
in quality to the ones which used stereo as supervision and significantly
improve depth prediction on scenes and datasets which contain a lot of object
motion. The approach is of practical relevance, as it allows transfer across
environments, by transferring models trained on data collected for robot
navigation in urban scenes to indoor navigation settings. The code associated
with this paper can be found at https://sites.google.com/view/struct2depth.Comment: Thirty-Third AAAI Conference on Artificial Intelligence (AAAI'19
Dense Piecewise Planar RGB-D SLAM for Indoor Environments
The paper exploits weak Manhattan constraints to parse the structure of
indoor environments from RGB-D video sequences in an online setting. We extend
the previous approach for single view parsing of indoor scenes to video
sequences and formulate the problem of recovering the floor plan of the
environment as an optimal labeling problem solved using dynamic programming.
The temporal continuity is enforced in a recursive setting, where labeling from
previous frames is used as a prior term in the objective function. In addition
to recovery of piecewise planar weak Manhattan structure of the extended
environment, the orthogonality constraints are also exploited by visual
odometry and pose graph optimization. This yields reliable estimates in the
presence of large motions and absence of distinctive features to track. We
evaluate our method on several challenging indoors sequences demonstrating
accurate SLAM and dense mapping of low texture environments. On existing TUM
benchmark we achieve competitive results with the alternative approaches which
fail in our environments.Comment: International Conference on Intelligent Robots and Systems (IROS)
201
- …