3D Object Reconstruction from Hand-Object Interactions
Recent advances have enabled 3d object reconstruction approaches using a
single off-the-shelf RGB-D camera. Although these approaches are successful for
a wide range of object classes, they rely on stable and distinctive geometric
or texture features. Many objects like mechanical parts, toys, household or
decorative articles, however, are textureless and characterized by minimalistic
shapes that are simple and symmetric. Existing in-hand scanning systems and 3d
reconstruction techniques fail for such symmetric objects in the absence of
highly distinctive features. In this work, we show that extracting 3d hand
motion for in-hand scanning effectively facilitates the reconstruction of even
featureless and highly symmetric objects and we present an approach that fuses
the rich additional information of hands into a 3d reconstruction pipeline,
significantly contributing to the state-of-the-art of in-hand scanning.
Comment: International Conference on Computer Vision (ICCV) 2015,
http://files.is.tue.mpg.de/dtzionas/In-Hand-Scannin
ShapeGraFormer: GraFormer-Based Network for Hand-Object Reconstruction from a Single Depth Map
3D reconstruction of hand-object manipulations is important for emulating
human actions. Most methods dealing with challenging object manipulation
scenarios focus on hand reconstruction in isolation, ignoring physical and
kinematic constraints due to object contact. Some approaches produce more
realistic results by jointly reconstructing 3D hand-object interactions.
However, they focus on coarse pose estimation or rely upon known hand and
object shapes. We propose the first approach for realistic 3D hand-object shape
and pose reconstruction from a single depth map. Unlike previous work, our
voxel-based reconstruction network regresses the vertex coordinates of a hand
and an object and reconstructs more realistic interaction. Our pipeline
additionally predicts voxelized hand-object shapes, having a one-to-one mapping
to the input voxelized depth. Thereafter, we exploit the graph nature of the
hand and object shapes, by utilizing the recent GraFormer network with
positional embedding to reconstruct shapes from template meshes. In addition,
we show the impact of adding another GraFormer component that refines the
reconstructed shapes based on the hand-object interactions and its ability to
reconstruct more accurate object shapes. We perform an extensive evaluation on
the HO-3D and DexYCB datasets and show that our method outperforms existing
approaches in hand reconstruction and produces plausible reconstructions for
the object.
HOLD: Category-agnostic 3D Reconstruction of Interacting Hands and Objects from Video
Since humans interact with diverse objects every day, the holistic 3D capture
of these interactions is important to understand and model human behaviour.
However, most existing methods for hand-object reconstruction from RGB either
assume pre-scanned object templates or heavily rely on limited 3D hand-object
data, restricting their ability to scale and generalize to more unconstrained
interaction settings. To this end, we introduce HOLD -- the first
category-agnostic method that reconstructs an articulated hand and object
jointly from a monocular interaction video. We develop a compositional
articulated implicit model that can reconstruct disentangled 3D hand and object
from 2D images. We further incorporate hand-object constraints to improve
hand-object poses and consequently the reconstruction quality. Our method does
not rely on 3D hand-object annotations while outperforming fully-supervised
baselines in both in-the-lab and challenging in-the-wild settings. Moreover, we
qualitatively show its robustness in reconstructing from in-the-wild videos.
Code: https://github.com/zc-alexfan/hol
MOHO: Learning Single-view Hand-held Object Reconstruction with Multi-view Occlusion-Aware Supervision
Previous works on single-view hand-held object reconstruction
typically rely on supervision from 3D ground-truth models, which are hard to
collect in the real world. In contrast, abundant videos depicting hand-object
interactions can be accessed easily and at low cost, although they give only
partial object observations with complex occlusion. In this paper, we present
MOHO, which reconstructs a hand-held object from a single image with multi-view
supervision from hand-object videos, tackling two predominant challenges:
object self-occlusion and hand-induced occlusion. MOHO takes as input
semantic features indicating visible object parts and geometric embeddings
provided by hand articulations as partial-to-full cues to counter object
self-occlusion and recover the full shape of the object. Meanwhile, a novel
2D-3D hand-occlusion-aware training scheme following the synthetic-to-real
paradigm is proposed to alleviate hand-induced occlusion. In the synthetic
pre-training stage, 2D-3D hand-object correlations are constructed by
supervising MOHO with rendered images to complete the hand-concealed regions of
the object in both 2D and 3D space. Subsequently, MOHO is fine-tuned in the real
world with mask-weighted volume rendering supervision, adopting the hand-object
correlations obtained during pre-training. Extensive experiments on the HO3D and
DexYCB datasets demonstrate that 2D-supervised MOHO outperforms
3D-supervised methods by a large margin. Code and key assets will be
released soon.
UV-Based 3D Hand-Object Reconstruction with Grasp Optimization
We propose a novel framework for 3D hand shape reconstruction and hand-object
grasp optimization from a single RGB image. The representation of hand-object
contact regions is critical for accurate reconstructions. Instead of
approximating the contact regions with sparse points, as in previous works, we
propose a dense representation in the form of a UV coordinate map. Furthermore,
we introduce inference-time optimization to fine-tune the grasp and improve
interactions between the hand and the object. Our pipeline increases hand shape
reconstruction accuracy and produces a vibrant hand texture. Experiments on
datasets such as HO3D, FreiHAND, and DexYCB reveal that our proposed method
outperforms the state-of-the-art.
Comment: BMVC 2022 Spotlight
HOISDF: Constraining 3D Hand-Object Pose Estimation with Global Signed Distance Fields
Human hands are highly articulated and versatile at handling objects. Jointly
estimating the 3D poses of a hand and the object it manipulates from a
monocular camera is challenging due to frequent occlusions. Thus, existing
methods often rely on intermediate 3D shape representations to increase
performance. These representations are typically explicit, such as 3D point
clouds or meshes, and thus provide information in the direct surroundings of
the intermediate hand pose estimate. To address this, we introduce HOISDF, a
Signed Distance Field (SDF) guided hand-object pose estimation network, which
jointly exploits hand and object SDFs to provide a global, implicit
representation over the complete reconstruction volume. Specifically, the role
of the SDFs is threefold: equip the visual encoder with implicit shape
information, help to encode hand-object interactions, and guide the hand and
object pose regression via SDF-based sampling and by augmenting the feature
representations. We show that HOISDF achieves state-of-the-art results on
hand-object pose estimation benchmarks (DexYCB and HO3Dv2). Code is available
at https://github.com/amathislab/HOISDF
Comment: Accepted at CVPR 2024. 9 figures, many tables
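The idea of a signed distance field as a global, implicit shape representation can be illustrated with a toy example. The sketch below is not the HOISDF network: it assumes an analytic sphere as a stand-in for a learned hand SDF, and the names `sphere_sdf`, `sample_near_surface`, and `hand_sdf` are hypothetical. It only shows that an SDF can be queried at any point in the volume, and that query points can be filtered by their distance to the surface.

```python
import math

def sphere_sdf(point, center, radius):
    """Signed distance to a sphere: negative inside, zero on the surface, positive outside."""
    return math.dist(point, center) - radius

def sample_near_surface(points, sdf, threshold):
    """Keep only the query points whose absolute signed distance is below `threshold`."""
    return [p for p in points if abs(sdf(p)) < threshold]

# Hypothetical stand-in for a learned hand SDF: a unit sphere at the origin.
def hand_sdf(point):
    return sphere_sdf(point, (0.0, 0.0, 0.0), 1.0)

# Query points along the x-axis, from x = -2.0 to x = 3.5 in steps of 0.5.
grid = [(x * 0.5, 0.0, 0.0) for x in range(-4, 8)]

# The field is defined everywhere in the volume, not only near a surface estimate.
near_hand = sample_near_surface(grid, hand_sdf, threshold=0.3)
```

In a learned setting the analytic function would be replaced by a network that predicts the signed distance at each query point; the sampling step itself would stay the same under that assumption.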
TOCH: Spatio-Temporal Object Correspondence to Hand for Motion Refinement
We present TOCH, a method for refining incorrect 3D hand-object interaction sequences using a data prior. Existing hand trackers, especially those that rely on very few cameras, often produce visually unrealistic results with hand-object intersection or missing contacts. Although correcting such errors requires reasoning about temporal aspects of interaction, most previous work focuses on static grasps and contacts. At the core of our method are TOCH fields, a novel spatio-temporal representation for modeling correspondences between hands and objects during interaction. The key component is a point-wise object-centric representation which encodes the hand position relative to the object. Leveraging this novel representation, we learn a latent manifold of plausible TOCH fields with a temporal denoising auto-encoder. Experiments demonstrate that TOCH outperforms state-of-the-art (SOTA) 3D hand-object interaction models, which are limited to static grasps and contacts. More importantly, our method produces smooth interactions even before and after contact. Using a single trained TOCH model, we quantitatively and qualitatively demonstrate its usefulness for 1) correcting erroneous reconstruction results from off-the-shelf RGB/RGB-D hand-object reconstruction methods, 2) de-noising, and 3) grasp transfer across objects. We will release our code and trained model on our project page at http://virtualhumans.mpi-inf.mpg.de/toch
TOCH: Spatio-Temporal Object-to-Hand Correspondence for Motion Refinement
We present TOCH, a method for refining incorrect 3D hand-object interaction
sequences using a data prior. Existing hand trackers, especially those that
rely on very few cameras, often produce visually unrealistic results with
hand-object intersection or missing contacts. Although correcting such errors
requires reasoning about temporal aspects of interaction, most previous works
focus on static grasps and contacts. At the core of our method are TOCH fields, a
novel spatio-temporal representation for modeling correspondences between hands
and objects during interaction. TOCH fields are a point-wise, object-centric
representation, which encodes the hand position relative to the object.
Leveraging this novel representation, we learn a latent manifold of plausible
TOCH fields with a temporal denoising auto-encoder. Experiments demonstrate
that TOCH outperforms state-of-the-art 3D hand-object interaction models, which
are limited to static grasps and contacts. More importantly, our method
produces smooth interactions even before and after contact. Using a single
trained TOCH model, we quantitatively and qualitatively demonstrate its
usefulness for correcting erroneous sequences from off-the-shelf RGB/RGB-D
hand-object reconstruction methods and transferring grasps across objects.
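A point-wise, object-centric encoding of the hand can be sketched in a few lines. This is a simplified illustration and not the TOCH field formulation from the paper: it assumes nearest-neighbour correspondence between object points and hand points, and the names `object_centric_field` and `sequence_field` are hypothetical. It shows only the general idea of storing, for every object point and every frame, where the hand is relative to that point.

```python
import math

def object_centric_field(object_points, hand_points):
    """For each object point, return (offset to the nearest hand point, distance).

    Assumption: nearest-neighbour matching stands in for the paper's
    correspondence scheme, which is not specified in the abstract.
    """
    field = []
    for o in object_points:
        nearest = min(hand_points, key=lambda h: math.dist(o, h))
        offset = tuple(hc - oc for hc, oc in zip(nearest, o))
        field.append((offset, math.dist(o, nearest)))
    return field

def sequence_field(object_points, hand_frames):
    """Stack per-frame fields to obtain a spatio-temporal representation."""
    return [object_centric_field(object_points, frame) for frame in hand_frames]

# Toy data: two object points and a hand (two points) approaching over two frames.
object_points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
hand_frames = [
    [(0.0, 2.0, 0.0), (1.0, 3.0, 0.0)],  # frame 0: hand farther away
    [(0.0, 1.0, 0.0), (1.0, 1.0, 0.0)],  # frame 1: hand closer
]
fields = sequence_field(object_points, hand_frames)
```

Because every entry is expressed relative to the object, such a representation does not change when the hand and object move rigidly together, which is one plausible reason to prefer an object-centric encoding for a temporal denoising model.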
Diffusion-Guided Reconstruction of Everyday Hand-Object Interaction Clips
We tackle the task of reconstructing hand-object interactions from short
video clips. Given an input video, our approach casts 3D inference as a
per-video optimization and recovers a neural 3D representation of the object
shape, as well as the time-varying motion and hand articulation. While the
input video naturally provides some multi-view cues to guide 3D inference,
these are insufficient on their own due to occlusions and limited viewpoint
variations. To obtain accurate 3D, we augment the multi-view signals with
generic data-driven priors to guide reconstruction. Specifically, we learn a
diffusion network to model the conditional distribution of (geometric)
renderings of objects conditioned on hand configuration and category label, and
leverage it as a prior to guide the novel-view renderings of the reconstructed
scene. We empirically evaluate our approach on egocentric videos across 6
object categories, and observe significant improvements over prior single-view
and multi-view methods. Finally, we demonstrate our system's ability to
reconstruct arbitrary clips from YouTube, showing both 1st and 3rd person
interactions.
Comment: Accepted to ICCV23 (Oral). Project Page:
https://judyye.github.io/diffhoi-www