9 research outputs found
Keypoint Transformer: Solving Joint Identification in Challenging Hands and Object Interactions for Accurate 3D Pose Estimation
We propose a robust and accurate method for estimating the 3D poses of two
hands in close interaction from a single color image. This is a very
challenging problem, as large occlusions and many confusions between the joints
may happen. State-of-the-art methods solve this problem by regressing a heatmap
for each joint, which requires solving two problems simultaneously: localizing
the joints and recognizing them. In this work, we propose to separate these
tasks by relying on a CNN to first localize joints as 2D keypoints, and on
self-attention between the CNN features at these keypoints to associate them
with the corresponding hand joint. The resulting architecture, which we call
"Keypoint Transformer", is highly efficient as it achieves state-of-the-art
performance with roughly half the number of model parameters on the
InterHand2.6M dataset. We also show it can be easily extended to estimate the
3D pose of an object manipulated by one or two hands with high performance.
Moreover, we created a new dataset of more than 75,000 images of two hands
manipulating an object fully annotated in 3D and will make it publicly
available.Comment: Accepted at CVPR202
In-Hand 3D Object Scanning from an RGB Sequence
We propose a method for in-hand 3D scanning of an unknown object with a
monocular camera. Our method relies on a neural implicit surface representation
that captures both the geometry and the appearance of the object, however, by
contrast with most NeRF-based methods, we do not assume that the camera-object
relative poses are known. Instead, we simultaneously optimize both the object
shape and the pose trajectory. As direct optimization over all shape and pose
parameters is prone to fail without coarse-level initialization, we propose an
incremental approach that starts by splitting the sequence into carefully
selected overlapping segments within which the optimization is likely to
succeed. We reconstruct the object shape and track its poses independently
within each segment, then merge all the segments before performing a global
optimization. We show that our method is able to reconstruct the shape and
color of both textured and challenging texture-less objects, outperforms
classical methods that rely only on appearance features, and that its
performance is close to recent methods that assume known camera poses.Comment: CVPR 202
Honnotate: A method for 3D annotation of hand and object pose
International audienc
Keypoint Transformer: Solving Joint Identification in Challenging Hands and Object Interactions for Accurate 3D Pose Estimation
Accepted at CVPR2022International audienceWe propose a robust and accurate method for estimating the 3D poses of two hands in close interaction from a single color image. This is a very challenging problem, as large occlusions and many confusions between the joints may happen. State-of-the-art methods solve this problem by regressing a heatmap for each joint, which requires solving two problems simultaneously: localizing the joints and recognizing them. In this work, we propose to separate these tasks by relying on a CNN to first localize joints as 2D keypoints, and on self-attention between the CNN features at these keypoints to associate them with the corresponding hand joint. The resulting architecture, which we call "Keypoint Transformer", is highly efficient as it achieves state-of-the-art performance with roughly half the number of model parameters on the InterHand2.6M dataset. We also show it can be easily extended to estimate the 3D pose of an object manipulated by one or two hands with high performance. Moreover, we created a new dataset of more than 75,000 images of two hands manipulating an object fully annotated in 3D and will make it publicly available
Keypoint Transformer: Solving Joint Identification in Challenging Hands and Object Interactions for Accurate 3D Pose Estimation
Accepted at CVPR2022International audienceWe propose a robust and accurate method for estimating the 3D poses of two hands in close interaction from a single color image. This is a very challenging problem, as large occlusions and many confusions between the joints may happen. State-of-the-art methods solve this problem by regressing a heatmap for each joint, which requires solving two problems simultaneously: localizing the joints and recognizing them. In this work, we propose to separate these tasks by relying on a CNN to first localize joints as 2D keypoints, and on self-attention between the CNN features at these keypoints to associate them with the corresponding hand joint. The resulting architecture, which we call "Keypoint Transformer", is highly efficient as it achieves state-of-the-art performance with roughly half the number of model parameters on the InterHand2.6M dataset. We also show it can be easily extended to estimate the 3D pose of an object manipulated by one or two hands with high performance. Moreover, we created a new dataset of more than 75,000 images of two hands manipulating an object fully annotated in 3D and will make it publicly available
Monte Carlo Scene Search for 3D Scene Understanding
International audienc
DiffH2O: Diffusion-Based Synthesis of Hand-Object Interactions from Textual Descriptions
Generating natural hand-object interactions in 3D is challenging as the
resulting hand and object motions are expected to be physically plausible and
semantically meaningful. Furthermore, generalization to unseen objects is
hindered by the limited scale of available hand-object interaction datasets. We
propose DiffH2O, a novel method to synthesize realistic, one or two-handed
object interactions from provided text prompts and geometry of the object. The
method introduces three techniques that enable effective learning from limited
data. First, we decompose the task into a grasping stage and a text-based
interaction stage and use separate diffusion models for each. In the grasping
stage, the model only generates hand motions, whereas in the interaction phase
both hand and object poses are synthesized. Second, we propose a compact
representation that tightly couples hand and object poses. Third, we propose
two different guidance schemes to allow more control of the generated motions:
grasp guidance and detailed textual guidance. Grasp guidance takes a single
target grasping pose and guides the diffusion model to reach this grasp at the
end of the grasping stage, which provides control over the grasping pose. Given
a grasping motion from this stage, multiple different actions can be prompted
in the interaction phase. For textual guidance, we contribute comprehensive
text descriptions to the GRAB dataset and show that they enable our method to
have more fine-grained control over hand-object interactions. Our quantitative
and qualitative evaluation demonstrates that the proposed method outperforms
baseline methods and leads to natural hand-object motions. Moreover, we
demonstrate the practicality of our framework by utilizing a hand pose estimate
from an off-the-shelf pose estimator for guidance, and then sampling multiple
different actions in the interaction stage.Comment: Project Page: https://diffh2o.github.io