AHRNet: Attention and heatmap-based regressor for hand pose estimation and mesh recovery
Estimating 3D hand pose and recovering the full hand surface mesh from a single RGB image is a challenging task due to self-occlusions, viewpoint changes, and the complexity of hand articulations. In this paper, we propose a novel framework that combines an attention mechanism with heatmap regression to accurately and efficiently predict 3D joint locations and reconstruct the hand mesh. We adopt a pooling attention module that learns to focus on relevant regions in the input image to extract better features for handling occlusions, while greatly reducing the computational cost. The multi-scale 2D heatmaps provide spatial constraints to guide the 3D vertex predictions. By exploiting the complementary strengths of sparse 2D supervision and dense mesh regression, our method accurately reconstructs hand meshes with realistic details. Extensive experiments on standard benchmarks demonstrate that the proposed method efficiently improves the performance of 3D hand pose estimation and mesh recovery. The reproducible recipes are available at https://github.com/SDiannn/AHRNET-Heatmap
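As a rough illustration of how a 2D heatmap can act as a spatial constraint on a joint location, the sketch below reads a sub-pixel coordinate out of a single heatmap with a soft-argmax. This is a generic technique, not necessarily the one used in AHRNet; the function name `soft_argmax_2d`, the `temperature` parameter, and the toy heatmap are all assumptions made for illustration.

```python
import math


def soft_argmax_2d(heatmap, temperature=1.0):
    """Return the expected (x, y) pixel location under a softmax over the heatmap.

    `heatmap` is a list of rows of raw scores; `temperature` (hypothetical knob)
    controls how sharply the softmax concentrates on the peak.
    """
    width = len(heatmap[0])
    scaled = [v / temperature for row in heatmap for v in row]

    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(scaled)
    weights = [math.exp(v - peak) for v in scaled]
    total = sum(weights)

    x = 0.0
    y = 0.0
    for index, weight in enumerate(weights):
        prob = weight / total
        x += prob * (index % width)   # column index
        y += prob * (index // width)  # row index
    return x, y


# Toy 3x4 heatmap with a single strong response at column 2, row 1.
heatmap = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 9.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
x, y = soft_argmax_2d(heatmap)
```

Because the result is a probability-weighted average of pixel coordinates, it is differentiable with respect to the heatmap scores, which is one common reason heatmap outputs are paired with coordinate regression.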
Generalizing Gaze Estimation with Weak-Supervision from Synthetic Views
Developing gaze estimation models that generalize well to unseen domains and
in-the-wild conditions remains a challenge with no known best solution. This is
mostly due to the difficulty of acquiring ground truth data that cover the
distribution of possible faces, head poses and environmental conditions that
exist in the real world. In this work, we propose to train general gaze
estimation models based on 3D geometry-aware gaze pseudo-annotations which we
extract from arbitrary unlabelled face images, which are abundantly available
on the internet. Additionally, we leverage the observation that head, body and
hand pose estimation benefit from being reformulated as dense 3D coordinate
prediction, and similarly express gaze estimation as regression of dense 3D eye
meshes. We overcome the absence of compatible ground truth by fitting rigid 3D
eyeballs on existing gaze datasets and design a multi-view supervision
framework to balance the effect of pseudo-labels during training. We test our
method on the task of gaze generalization, in which we demonstrate improvement
of up to compared to state-of-the-art when no ground truth data are
available, and up to when they are. The project material will become
available for research purposes.
Comment: 13 pages, 12 figures