18,935 research outputs found
Capturing Hands in Action using Discriminative Salient Points and Physics Simulation
Hand motion capture is a popular research field, recently gaining more
attention due to the ubiquity of RGB-D sensors. However, even most recent
approaches focus on the case of a single isolated hand. In this work, we focus
on hands that interact with other hands or objects and present a framework that
successfully captures motion in such interaction scenarios for both rigid and
articulated objects. Our framework combines a generative model with
discriminatively trained salient points to achieve a low tracking error and
with collision detection and physics simulation to achieve physically plausible
estimates even in case of occlusions and missing visual data. Since all
components are unified in a single objective function which is almost
everywhere differentiable, it can be optimized with standard optimization
techniques. Our approach works for monocular RGB-D sequences as well as setups
with multiple synchronized RGB cameras. For a qualitative and quantitative
evaluation, we captured 29 sequences with a large variety of interactions and
up to 150 degrees of freedom.Comment: Accepted for publication by the International Journal of Computer
Vision (IJCV) on 16.02.2016 (submitted on 17.10.14). A combination into a
single framework of an ECCV'12 multicamera-RGB and a monocular-RGBD GCPR'14
hand tracking paper with several extensions, additional experiments and
detail
MonoPerfCap: Human Performance Capture from Monocular Video
We present the first marker-less approach for temporally coherent 3D
performance capture of a human with general clothing from monocular video. Our
approach reconstructs articulated human skeleton motion as well as medium-scale
non-rigid surface deformations in general scenes. Human performance capture is
a challenging problem due to the large range of articulation, potentially fast
motion, and considerable non-rigid deformations, even from multi-view data.
Reconstruction from monocular video alone is drastically more challenging,
since strong occlusions and the inherent depth ambiguity lead to a highly
ill-posed reconstruction problem. We tackle these challenges by a novel
approach that employs sparse 2D and 3D human pose detections from a
convolutional neural network using a batch-based pose estimation strategy.
Joint recovery of per-batch motion allows to resolve the ambiguities of the
monocular reconstruction problem based on a low dimensional trajectory
subspace. In addition, we propose refinement of the surface geometry based on
fully automatically extracted silhouettes to enable medium-scale non-rigid
alignment. We demonstrate state-of-the-art performance capture results that
enable exciting applications such as video editing and free viewpoint video,
previously infeasible from monocular video. Our qualitative and quantitative
evaluation demonstrates that our approach significantly outperforms previous
monocular methods in terms of accuracy, robustness and scene complexity that
can be handled.Comment: Accepted to ACM TOG 2018, to be presented on SIGGRAPH 201
LiveCap: Real-time Human Performance Capture from Monocular Video
We present the first real-time human performance capture approach that
reconstructs dense, space-time coherent deforming geometry of entire humans in
general everyday clothing from just a single RGB video. We propose a novel
two-stage analysis-by-synthesis optimization whose formulation and
implementation are designed for high performance. In the first stage, a skinned
template model is jointly fitted to background subtracted input video, 2D and
3D skeleton joint positions found using a deep neural network, and a set of
sparse facial landmark detections. In the second stage, dense non-rigid 3D
deformations of skin and even loose apparel are captured based on a novel
real-time capable algorithm for non-rigid tracking using dense photometric and
silhouette constraints. Our novel energy formulation leverages automatically
identified material regions on the template to model the differing non-rigid
deformation behavior of skin and apparel. The two resulting non-linear
optimization problems per-frame are solved with specially-tailored
data-parallel Gauss-Newton solvers. In order to achieve real-time performance
of over 25Hz, we design a pipelined parallel architecture using the CPU and two
commodity GPUs. Our method is the first real-time monocular approach for
full-body performance capture. Our method yields comparable accuracy with
off-line performance capture techniques, while being orders of magnitude
faster
Occlusion-Robust MVO: Multimotion Estimation Through Occlusion Via Motion Closure
Visual motion estimation is an integral and well-studied challenge in
autonomous navigation. Recent work has focused on addressing multimotion
estimation, which is especially challenging in highly dynamic environments.
Such environments not only comprise multiple, complex motions but also tend to
exhibit significant occlusion.
Previous work in object tracking focuses on maintaining the integrity of
object tracks but usually relies on specific appearance-based descriptors or
constrained motion models. These approaches are very effective in specific
applications but do not generalize to the full multimotion estimation problem.
This paper presents a pipeline for estimating multiple motions, including the
camera egomotion, in the presence of occlusions. This approach uses an
expressive motion prior to estimate the SE (3) trajectory of every motion in
the scene, even during temporary occlusions, and identify the reappearance of
motions through motion closure. The performance of this occlusion-robust
multimotion visual odometry (MVO) pipeline is evaluated on real-world data and
the Oxford Multimotion Dataset.Comment: To appear at the 2020 IEEE/RSJ International Conference on
Intelligent Robots and Systems (IROS). An earlier version of this work first
appeared at the Long-term Human Motion Planning Workshop (ICRA 2019). 8
pages, 5 figures. Video available at
https://www.youtube.com/watch?v=o_N71AA6FR
RGBD Datasets: Past, Present and Future
Since the launch of the Microsoft Kinect, scores of RGBD datasets have been
released. These have propelled advances in areas from reconstruction to gesture
recognition. In this paper we explore the field, reviewing datasets across
eight categories: semantics, object pose estimation, camera tracking, scene
reconstruction, object tracking, human actions, faces and identification. By
extracting relevant information in each category we help researchers to find
appropriate data for their needs, and we consider which datasets have succeeded
in driving computer vision forward and why.
Finally, we examine the future of RGBD datasets. We identify key areas which
are currently underexplored, and suggest that future directions may include
synthetic data and dense reconstructions of static and dynamic scenes.Comment: 8 pages excluding references (CVPR style
Realtime Multilevel Crowd Tracking using Reciprocal Velocity Obstacles
We present a novel, realtime algorithm to compute the trajectory of each
pedestrian in moderately dense crowd scenes. Our formulation is based on an
adaptive particle filtering scheme that uses a multi-agent motion model based
on velocity-obstacles, and takes into account local interactions as well as
physical and personal constraints of each pedestrian. Our method dynamically
changes the number of particles allocated to each pedestrian based on different
confidence metrics. Additionally, we use a new high-definition crowd video
dataset, which is used to evaluate the performance of different pedestrian
tracking algorithms. This dataset consists of videos of indoor and outdoor
scenes, recorded at different locations with 30-80 pedestrians. We highlight
the performance benefits of our algorithm over prior techniques using this
dataset. In practice, our algorithm can compute trajectories of tens of
pedestrians on a multi-core desktop CPU at interactive rates (27-30 frames per
second). To the best of our knowledge, our approach is 4-5 times faster than
prior methods, which provide similar accuracy
Hand Keypoint Detection in Single Images using Multiview Bootstrapping
We present an approach that uses a multi-camera system to train fine-grained
detectors for keypoints that are prone to occlusion, such as the joints of a
hand. We call this procedure multiview bootstrapping: first, an initial
keypoint detector is used to produce noisy labels in multiple views of the
hand. The noisy detections are then triangulated in 3D using multiview geometry
or marked as outliers. Finally, the reprojected triangulations are used as new
labeled training data to improve the detector. We repeat this process,
generating more labeled data in each iteration. We derive a result analytically
relating the minimum number of views to achieve target true and false positive
rates for a given detector. The method is used to train a hand keypoint
detector for single images. The resulting keypoint detector runs in realtime on
RGB images and has accuracy comparable to methods that use depth sensors. The
single view detector, triangulated over multiple views, enables 3D markerless
hand motion capture with complex object interactions.Comment: CVPR 201
- …