Leveraging Photometric Consistency over Time for Sparsely Supervised Hand-Object Reconstruction
Modeling hand-object manipulations is essential for understanding how humans interact with their environment. While of practical importance, estimating the pose of hands and objects during interactions is challenging due to the large mutual occlusions that occur during manipulation. Recent efforts have been directed towards fully-supervised methods that require large amounts of labeled training samples. Collecting 3D ground-truth data for hand-object interactions, however, is costly, tedious, and error-prone. To overcome this challenge, we present a method that leverages photometric consistency across time when annotations are only available for a sparse subset of frames in a video. Our model is trained end-to-end on color images to jointly reconstruct hands and objects in 3D by inferring their poses. Given our estimated reconstructions, we differentiably render the optical flow between pairs of adjacent images and use it within the network to warp one frame to another. We then apply a self-supervised photometric loss that relies on the visual consistency between nearby images. We achieve state-of-the-art results on 3D hand-object reconstruction benchmarks and demonstrate that our approach improves pose estimation accuracy by leveraging information from neighboring frames in low-data regimes.
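To make the self-supervised term concrete, below is a minimal sketch of flow-based warping followed by a photometric loss, in the spirit of the approach above. This is not the authors' code: the PyTorch setting, the tensor shapes, and the convention that the flow stores pixel displacements from the reference frame to its neighbor are assumptions.

```python
# Minimal sketch (not the paper's implementation) of a photometric consistency
# loss: warp a neighboring frame with a rendered optical-flow field and
# penalize the appearance difference against the reference frame.
import torch
import torch.nn.functional as F

def warp_with_flow(img_next, flow):
    """Warp img_next (B,3,H,W) back to the reference frame using flow (B,2,H,W)."""
    B, _, H, W = img_next.shape
    # Base pixel grid.
    ys, xs = torch.meshgrid(
        torch.arange(H, dtype=img_next.dtype, device=img_next.device),
        torch.arange(W, dtype=img_next.dtype, device=img_next.device),
        indexing="ij",
    )
    grid = torch.stack((xs, ys), dim=0).unsqueeze(0) + flow   # (B,2,H,W) sample locations
    # Normalize to [-1, 1] as expected by grid_sample.
    grid_x = 2.0 * grid[:, 0] / (W - 1) - 1.0
    grid_y = 2.0 * grid[:, 1] / (H - 1) - 1.0
    grid_norm = torch.stack((grid_x, grid_y), dim=-1)         # (B,H,W,2)
    return F.grid_sample(img_next, grid_norm, align_corners=True)

def photometric_loss(img_ref, img_next, flow, mask=None):
    """L1 photometric consistency between the reference image and the warped neighbor."""
    warped = warp_with_flow(img_next, flow)
    diff = (img_ref - warped).abs()
    if mask is not None:            # e.g. restrict to rendered hand/object pixels
        diff = diff * mask
    return diff.mean()
```

In practice such a term would typically be masked to the rendered hand/object region and applied between adjacent frames only, as the abstract describes.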
Learning monocular 3D reconstruction of articulated categories from motion
Monocular 3D reconstruction of articulated object categories is challenging
due to the lack of training data and the inherent ill-posedness of the problem.
In this work we use video self-supervision, forcing the consistency of
consecutive 3D reconstructions by a motion-based cycle loss. This largely
improves both optimization-based and learning-based 3D mesh reconstruction. We
further introduce an interpretable model of 3D template deformations that
controls a 3D surface through the displacement of a small number of local,
learnable handles. We formulate this operation as a structured layer relying on
mesh-Laplacian regularization and show that it can be trained in an end-to-end
manner. Finally, we introduce a per-sample numerical optimization approach that
jointly optimizes over mesh displacements and cameras within a video, boosting
accuracy both during training and as test-time post-processing. While relying
exclusively on a small set of videos collected per category for supervision, we
obtain state-of-the-art reconstructions with diverse shapes, viewpoints and
textures for multiple articulated object categories.
Comment: For the project website, see https://fkokkinos.github.io/video_3d_reconstruction
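As an illustration of handle-based template deformation with Laplacian regularization, the sketch below displaces template vertices by a distance-weighted blend of a few handle displacements and penalizes changes in local surface detail. It is a simplified stand-in for the structured layer described above: the uniform graph Laplacian, the Gaussian handle weights, and all names are assumptions, not the paper's exact formulation.

```python
# Simplified sketch of handle-driven mesh deformation with Laplacian smoothness.
import torch

def uniform_laplacian(num_verts, faces):
    """Uniform graph Laplacian L = I - D^-1 A built from triangle faces (F,3), long tensor."""
    idx = torch.cat([faces[:, [0, 1]], faces[:, [1, 2]], faces[:, [2, 0]]], dim=0)
    idx = torch.cat([idx, idx.flip(1)], dim=0)                  # symmetrize edges
    A = torch.zeros(num_verts, num_verts)
    A[idx[:, 0], idx[:, 1]] = 1.0
    deg = A.sum(dim=1, keepdim=True).clamp(min=1.0)
    return torch.eye(num_verts) - A / deg

def deform_with_handles(template_verts, handle_pos, handle_disp, sigma=0.1):
    """Displace every vertex (V,3) by a soft blend of K handle displacements (K,3)."""
    d2 = torch.cdist(template_verts, handle_pos) ** 2           # (V,K) squared distances
    w = torch.softmax(-d2 / (2 * sigma ** 2), dim=1)            # soft assignment to handles
    return template_verts + w @ handle_disp                     # (V,3) deformed vertices

def laplacian_smoothness(L, verts_before, verts_after):
    """Penalize changes in local surface detail caused by the deformation."""
    return ((L @ verts_after) - (L @ verts_before)).pow(2).sum(dim=1).mean()
```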
PanopticNeRF-360: Panoramic 3D-to-2D Label Transfer in Urban Scenes
Training perception systems for self-driving cars requires substantial
annotations. However, manual labeling in 2D images is highly labor-intensive.
While existing datasets provide rich annotations for pre-recorded sequences,
they fall short in labeling rarely encountered viewpoints, potentially
hampering the generalization ability for perception models. In this paper, we
present PanopticNeRF-360, a novel approach that combines coarse 3D annotations
with noisy 2D semantic cues to generate consistent panoptic labels and
high-quality images from any viewpoint. Our key insight lies in exploiting the
complementarity of 3D and 2D priors to mutually enhance geometry and semantics.
Specifically, we propose to leverage noisy semantic and instance labels in both
3D and 2D spaces to guide geometry optimization. Simultaneously, the improved
geometry assists in filtering noise present in the 3D and 2D annotations by
merging them in 3D space via a learned semantic field. To further enhance
appearance, we combine MLP and hash grids to yield hybrid scene features,
striking a balance between high-frequency appearance and predominantly
contiguous semantics. Our experiments demonstrate PanopticNeRF-360's
state-of-the-art performance over existing label transfer methods on the
challenging urban scenes of the KITTI-360 dataset. Moreover, PanopticNeRF-360
enables omnidirectional rendering of high-fidelity, multi-view and
spatiotemporally consistent appearance, semantic and instance labels. We make
our code and data available at https://github.com/fuxiao0719/PanopticNeRF.
Comment: Project page: http://fuxiao0719.github.io/projects/panopticnerf360/
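The label-transfer idea of supervising a learned semantic field with noisy 2D labels can be sketched as follows: per-sample class logits are composited along each ray with standard volume-rendering weights and compared against the noisy 2D label of the corresponding pixel. This is an illustrative approximation, not the PanopticNeRF-360 implementation; shapes and names are assumptions.

```python
# Minimal sketch of rendering per-pixel semantic logits from a semantic field
# and supervising them with a (possibly noisy) 2D label.
import torch
import torch.nn.functional as F

def render_semantics(sigma, logits, deltas):
    """Composite per-sample class logits along one ray.
    sigma:  (N,)   densities at N ray samples
    logits: (N,C)  semantic logits at N ray samples
    deltas: (N,)   distances between consecutive samples
    """
    alpha = 1.0 - torch.exp(-sigma * deltas)                    # per-sample opacity
    ones = alpha.new_ones(1)
    trans = torch.cumprod(torch.cat([ones, 1.0 - alpha + 1e-10], dim=0), dim=0)[:-1]
    weights = alpha * trans                                     # volume-rendering weights
    return (weights.unsqueeze(-1) * logits).sum(dim=0)          # (C,) pixel-level logits

def semantic_2d_loss(pixel_logits, noisy_label):
    """Cross-entropy against the noisy 2D semantic label (scalar long tensor) for this pixel."""
    return F.cross_entropy(pixel_logits.unsqueeze(0), noisy_label.view(1))
```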
UV-Based 3D Hand-Object Reconstruction with Grasp Optimization
We propose a novel framework for 3D hand shape reconstruction and hand-object
grasp optimization from a single RGB image. The representation of hand-object
contact regions is critical for accurate reconstructions. Instead of
approximating the contact regions with sparse points, as in previous works, we
propose a dense representation in the form of a UV coordinate map. Furthermore,
we introduce inference-time optimization to fine-tune the grasp and improve
interactions between the hand and the object. Our pipeline increases hand shape
reconstruction accuracy and produces a vibrant hand texture. Experiments on
datasets such as HO3D, FreiHAND, and DexYCB reveal that our proposed method
outperforms the state-of-the-art.
Comment: BMVC 2022 Spotlight
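A rough sketch of what inference-time grasp refinement can look like is given below: gradient descent on the hand parameters so that vertices flagged as contacts by a dense contact map move toward the object surface, with a regularizer keeping the result near the network's initial prediction. The energy terms, the `hand_model` interface, and the loss weights are illustrative assumptions, not the paper's exact formulation.

```python
# Illustrative inference-time grasp refinement by gradient descent.
import torch

def refine_grasp(hand_params, hand_model, contact_mask, obj_points, steps=50, lr=1e-2):
    """hand_params:  (P,) pose/shape vector predicted by the network
       hand_model:   callable mapping params -> hand vertices (V,3)
       contact_mask: (V,) bool, vertices predicted to be in contact
       obj_points:   (M,3) sampled object surface points"""
    params = hand_params.clone().requires_grad_(True)
    opt = torch.optim.Adam([params], lr=lr)
    for _ in range(steps):
        verts = hand_model(params)                              # (V,3) current hand mesh
        # Attraction: contact vertices should touch the object surface.
        d = torch.cdist(verts[contact_mask], obj_points)        # (Vc,M)
        loss_contact = d.min(dim=1).values.mean()
        # Regularizer: stay close to the network's initial prediction.
        loss_reg = (params - hand_params).pow(2).mean()
        loss = loss_contact + 0.1 * loss_reg
        opt.zero_grad()
        loss.backward()
        opt.step()
    return params.detach()
```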
End-to-end Weakly-supervised Multiple 3D Hand Mesh Reconstruction from Single Image
In this paper, we consider the challenging task of simultaneously locating
and recovering multiple hands from a single 2D image. Previous studies either
focus on single-hand reconstruction or solve this problem in a multi-stage way.
In particular, the conventional two-stage pipeline first detects hand areas and
then estimates the 3D hand pose from each cropped patch. To reduce the
computational redundancy in preprocessing and feature extraction, we propose a
concise yet efficient single-stage pipeline. Specifically, we design a
multi-head auto-encoder structure for multi-hand reconstruction, in which each
head network shares the same feature map and outputs the hand center, pose, and
texture, respectively. In addition, we adopt a weakly-supervised scheme to
alleviate the burden of expensive 3D real-world data annotations. To this end,
we propose a series of losses optimized by a stage-wise training scheme, where
a multi-hand dataset with 2D annotations is generated based on the publicly
available single hand datasets. In order to further improve the accuracy of the
weakly supervised model, we adopt several feature consistency constraints in
both single and multiple hand settings. Specifically, the keypoints of each
hand estimated from local features should be consistent with the re-projected
points predicted from global features. Extensive experiments on public
benchmarks including FreiHAND, HO3D, InterHand2.6M and RHD demonstrate that our
method outperforms state-of-the-art model-based methods in both
weakly-supervised and fully-supervised settings.
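The local/global keypoint consistency constraint can be sketched as an L1 agreement between 2D keypoints decoded from local features and the re-projection of 3D joints predicted from global features, assuming a pinhole camera with known intrinsics; the helper names below are illustrative, not the paper's API.

```python
# Illustrative local/global keypoint consistency loss.
import torch

def project(points_3d, K):
    """Project (J,3) camera-space joints to (J,2) pixels with intrinsics K (3,3)."""
    uvw = points_3d @ K.t()
    return uvw[:, :2] / uvw[:, 2:3].clamp(min=1e-6)

def keypoint_consistency_loss(kpts_local_2d, joints_global_3d, K):
    """L1 agreement between keypoints estimated from local features and the
    re-projection of 3D joints predicted from global features."""
    kpts_global_2d = project(joints_global_3d, K)
    return (kpts_local_2d - kpts_global_2d).abs().mean()
```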