Pushing the envelope for estimating poses and actions via full 3D reconstruction
Estimating the poses and actions of human bodies and hands is an important task in the computer vision community due to its vast applications, including human-computer interaction, virtual and augmented reality, and medical image analysis. Challenges: This task faces many in-the-wild challenges (see chapter 1). Among them, this thesis focuses on two that can be alleviated by incorporating 3D geometry: (1) the inherent 2D-to-3D ambiguity caused by the non-linear projection of 3D objects onto 2D images, and (2) the lack of sufficient, high-quality annotated datasets, due to the high dimensionality of subjects' attribute space and the inherent difficulty of annotating 3D coordinates. Contributions: We first jointly tackle the 2D-to-3D ambiguity and the data scarcity by (1) explicitly reconstructing 2.5D and 3D samples and using them as new training data for a pose estimator. Next, we (2) encode 3D geometry into the training process of the action recognizer to reduce the 2D-to-3D ambiguity. In the appendix, we propose (3) a new synthetic hand pose dataset that enables more complete attribute variations and multi-modal experiments in the future. Experiments: Our experiments show that: (1) 2.5D depth map reconstruction and data augmentation improve the accuracy of depth-based hand pose estimation, (2) 3D mesh reconstruction can be used to generate new RGB data that improves the accuracy of RGB-based dense hand pose estimation, and (3) 3D geometry from 3D poses and scene layouts can be exploited to reduce the 2D-to-3D ambiguity in action recognition.
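The 2D-to-3D ambiguity described in this abstract comes directly from perspective projection: every 3D point along a camera ray maps to the same pixel, so depth cannot be recovered from a single 2D observation without extra priors such as 3D geometry. A minimal sketch with a hypothetical intrinsics matrix `K` (not from the thesis):

```python
import numpy as np

def project(point3d, K):
    """Pinhole projection of a camera-space 3D point to pixel coordinates."""
    p = K @ point3d
    return p[:2] / p[2]  # perspective divide by depth

# Hypothetical camera intrinsics (focal length 500 px, principal point 160/120).
K = np.array([[500.0,   0.0, 160.0],
              [  0.0, 500.0, 120.0],
              [  0.0,   0.0,   1.0]])

near = np.array([0.2, 0.1, 1.0])
far = 3.0 * near  # a different 3D point on the same camera ray

# Both points project to the identical pixel: the 2D image alone cannot
# distinguish their depths.
print(project(near, K), project(far, K))
```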
Accurate 6D Object Pose Estimation by Pose Conditioned Mesh Reconstruction
Current 6D object pose methods consist of deep CNN models fully optimized for
a single object but with its architecture standardized among objects with
different shapes. In contrast to previous works, we explicitly exploit each
object's distinct topological information, i.e., dense 3D meshes, in the pose
estimation model, via an automated process and prior to any post-processing
refinement stage. In order to achieve this, we propose a learning framework in
which a Graph Convolutional Neural Network reconstructs a pose conditioned 3D
mesh of the object. A robust estimation of the allocentric orientation is
recovered by computing, in a differentiable manner, the Procrustes alignment
between the canonical and reconstructed dense 3D meshes. 6D egocentric pose is
then lifted using additional mask and 2D centroid projection estimations. Our
method is capable of self validating its pose estimation by measuring the
quality of the reconstructed mesh, which is invaluable in real life
applications. In our experiments on the LINEMOD, OCCLUSION and YCB-Video
benchmarks, the proposed method outperforms the state of the art.
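The alignment step this abstract describes can be illustrated with the classical SVD-based solution to the orthogonal Procrustes problem, which is differentiable almost everywhere and thus usable inside a learning framework. The sketch below uses NumPy and synthetic point sets; function and variable names are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def procrustes_rotation(canonical, reconstructed):
    """Rotation R best aligning the reconstructed mesh vertices to the
    canonical ones (both arrays of shape (N, 3)), via the SVD solution
    to the orthogonal Procrustes problem."""
    A = canonical - canonical.mean(axis=0)
    B = reconstructed - reconstructed.mean(axis=0)
    # SVD of the cross-covariance yields the optimal orthogonal alignment.
    U, _, Vt = np.linalg.svd(B.T @ A)
    # Fix a possible reflection so that det(R) = +1 (a proper rotation).
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])
    return U @ D @ Vt

# Sanity check: recover a known rotation from synthetic "mesh" vertices.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))            # canonical vertices
theta = 0.3
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0,            0.0,           1.0]])
Y = X @ R_true                               # "reconstructed" vertices
R_est = procrustes_rotation(X, Y)            # aligns Y back onto X
```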
3D Interacting Hand Pose Estimation by Hand De-occlusion and Removal
Estimating 3D interacting hand pose from a single RGB image is essential for
understanding human actions. Unlike most previous works that directly predict
the 3D poses of two interacting hands simultaneously, we propose to decompose
the challenging interacting hand pose estimation task and estimate the pose of
each hand separately. In this way, it is straightforward to take advantage of
the latest research progress on the single-hand pose estimation system.
However, hand pose estimation in interacting scenarios is very challenging, due
to (1) severe hand-hand occlusion and (2) ambiguity caused by the homogeneous
appearance of hands. To tackle these two challenges, we propose a novel Hand
De-occlusion and Removal (HDR) framework to perform hand de-occlusion and
distractor removal. We also propose the first large-scale synthetic amodal hand
dataset, termed Amodal InterHand Dataset (AIH), to facilitate model training
and promote the development of the related research. Experiments show that the
proposed method significantly outperforms previous state-of-the-art interacting
hand pose estimation approaches. Codes and data are available at
https://github.com/MengHao666/HDR (ECCV 2022).
End-to-end Weakly-supervised Multiple 3D Hand Mesh Reconstruction from Single Image
In this paper, we consider the challenging task of simultaneously locating
and recovering multiple hands from a single 2D image. Previous studies either
focus on single-hand reconstruction or solve this problem in a multi-stage way:
the conventional two-stage pipeline first detects hand areas and then estimates
the 3D hand pose from each cropped patch. To reduce the
computational redundancy in preprocessing and feature extraction, we propose a
concise but efficient single-stage pipeline. Specifically, we design a
multi-head auto-encoder structure for multi-hand reconstruction, where each
head network shares the same feature map and outputs the hand center, pose and
texture, respectively. Besides, we adopt a weakly-supervised scheme to
alleviate the burden of expensive 3D real-world data annotations. To this end,
we propose a series of losses optimized by a stage-wise training scheme, where
a multi-hand dataset with 2D annotations is generated based on the publicly
available single hand datasets. In order to further improve the accuracy of the
weakly supervised model, we adopt several feature consistency constraints in
both single and multiple hand settings. Specifically, the keypoints of each
hand estimated from local features should be consistent with the re-projected
points predicted from global features. Extensive experiments on public
benchmarks including FreiHAND, HO3D, InterHand2.6M and RHD demonstrate that our
method outperforms the state-of-the-art model-based methods in both
weakly-supervised and fully-supervised settings.
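The consistency constraint this abstract describes can be sketched as a simple reprojection loss: 2D keypoints predicted from local per-hand features should match the pinhole reprojection of 3D keypoints predicted from global features. The function name, the mean-L1 loss form, and the intrinsics below are assumptions for illustration, not the paper's code:

```python
import numpy as np

def keypoint_consistency_loss(local_kps_2d, global_kps_3d, K):
    """Mean L1 distance between 2D keypoints from local features
    (shape (J, 2), pixels) and the reprojection of 3D keypoints from
    global features (shape (J, 3), camera space) under intrinsics K."""
    proj = global_kps_3d @ K.T               # pinhole projection
    reproj_2d = proj[:, :2] / proj[:, 2:3]   # perspective divide by depth
    return np.abs(local_kps_2d - reproj_2d).mean()

# Hypothetical intrinsics and 3D keypoints.
K = np.array([[500.0,   0.0, 128.0],
              [  0.0, 500.0, 128.0],
              [  0.0,   0.0,   1.0]])
kps_3d = np.array([[0.10,  0.05, 0.9],
                   [0.02, -0.04, 1.1]])

# Perfectly consistent local predictions give zero loss.
reproj = kps_3d @ K.T
perfect_2d = reproj[:, :2] / reproj[:, 2:3]
loss = keypoint_consistency_loss(perfect_2d, kps_3d, K)
```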
State of the Art in Dense Monocular Non-Rigid 3D Reconstruction
3D reconstruction of deformable (or non-rigid) scenes from a set of monocular 2D image observations is a long-standing and actively researched area of computer vision and graphics. It is an ill-posed inverse problem, since, without additional prior assumptions, it permits infinitely many solutions leading to accurate projection to the input 2D images. Non-rigid reconstruction is a foundational building block for downstream applications like robotics, AR/VR, or visual content creation. The key advantage of using monocular cameras is their omnipresence and availability to the end users as well as their ease of use compared to more sophisticated camera set-ups such as stereo or multi-view systems. This survey focuses on state-of-the-art methods for dense non-rigid 3D reconstruction of various deformable objects and composite scenes from monocular videos or sets of monocular views. It reviews the fundamentals of 3D reconstruction and deformation modeling from 2D image observations. We then start from general methods that handle arbitrary scenes and make only a few prior assumptions, and proceed towards techniques making stronger assumptions about the observed objects and types of deformations (e.g. human faces, bodies, hands, and animals). A significant part of this STAR is also devoted to classification and a high-level comparison of the methods, as well as an overview of the datasets for training and evaluation of the discussed techniques. We conclude by discussing open challenges in the field and the social aspects associated with the usage of the reviewed methods.