6 research outputs found
Joint Hand-object 3D Reconstruction from a Single Image with Cross-branch Feature Fusion
Accurate 3D reconstruction of the hand and object shape from a hand-object
image is important for understanding human-object interaction as well as human
daily activities. Unlike bare-hand pose estimation, hand-object
interaction poses a strong constraint on both the hand and its manipulated
object, which suggests that hand configuration may be crucial contextual
information for the object, and vice versa. However, current approaches address
this task by training a two-branch network to reconstruct the hand and object
separately with little communication between the two branches. In this work, we
propose to consider hand and object jointly in feature space and explore the
reciprocity of the two branches. We extensively investigate cross-branch
feature fusion architectures with MLP or LSTM units. Among the investigated
architectures, a variant with LSTM units that enhances the object feature with
the hand feature shows the best performance gain. Moreover, we employ an auxiliary depth
estimation module to augment the input RGB image with the estimated depth map,
which further improves the reconstruction accuracy. Experiments conducted on
public datasets demonstrate that our approach significantly outperforms
existing approaches in terms of the reconstruction accuracy of objects.
Comment: Accepted by IEEE Transactions on Image Processing (TIP)
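The cross-branch fusion the abstract describes, a variant where LSTM units enhance the object feature with the hand feature, can be sketched as a single LSTM-style gating step. This is a minimal illustrative sketch, not the paper's actual architecture; the feature dimension, weight shapes, and the choice to carry the object feature as the recurrent state are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_fuse(obj_feat, hand_feat, params):
    """One LSTM-cell-style step that enhances the object feature with the
    hand feature: hand_feat acts as the input and obj_feat as the carried
    state, so the gates decide how much hand context flows into the object
    branch. obj_feat, hand_feat: (d,); weights: (d, 2d); biases: (d,)."""
    x = np.concatenate([hand_feat, obj_feat])       # (2d,)
    f = sigmoid(params["Wf"] @ x + params["bf"])    # forget gate
    i = sigmoid(params["Wi"] @ x + params["bi"])    # input gate
    o = sigmoid(params["Wo"] @ x + params["bo"])    # output gate
    g = np.tanh(params["Wg"] @ x + params["bg"])    # candidate update
    c = f * obj_feat + i * g                        # fused state
    return o * np.tanh(c)                           # enhanced object feature

d = 8
rng = np.random.default_rng(0)
params = {k: rng.standard_normal((d, 2 * d)) * 0.1 for k in ("Wf", "Wi", "Wo", "Wg")}
params.update({b: np.zeros(d) for b in ("bf", "bi", "bo", "bg")})
enhanced = lstm_fuse(rng.standard_normal(d), rng.standard_normal(d), params)
print(enhanced.shape)  # (8,)
```

In a trained network the gates learn when hand context is informative (e.g. during grasping) and when to pass the object feature through largely unchanged.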
InterTracker: Discovering and Tracking General Objects Interacting with Hands in the Wild
Understanding human interaction with objects is an important research topic
for embodied Artificial Intelligence and identifying the objects that humans
are interacting with is a primary problem for interaction understanding.
Existing methods rely on frame-based detectors to locate interacting objects.
However, this approach is susceptible to heavy occlusion, background clutter,
and distracting objects. To address these limitations, in this paper, we propose
to leverage spatio-temporal information of hand-object interaction to track
interactive objects under these challenging cases. Unlike standard object
tracking, which assumes prior knowledge of the object to be tracked, we first
utilize the spatial relation between hands and objects to adaptively discover
the interacting objects from the scene. Second, the consistency and continuity
of the appearance of objects between successive frames are exploited to track
the objects. With this tracking formulation, our method also benefits from
training on large-scale general object-tracking datasets. We further curate a
video-level hand-object interaction dataset for testing and evaluation from
100DOH. The quantitative results demonstrate that our proposed method
outperforms the state-of-the-art methods. Specifically, in scenes with
continuous interaction with different objects, we achieve an impressive
improvement of about 10% as evaluated using the Average Precision (AP) metric.
Our qualitative findings also illustrate that our method can produce more
continuous trajectories for interacting objects.
Comment: IROS 202
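The first stage above, discovering the interacting object from the spatial relation between hands and objects, can be illustrated with a toy scoring rule over candidate boxes. This is a hypothetical sketch, not the paper's method; the score (overlap plus center proximity) and the box format are assumptions.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def discover_interacting(hand_box, candidates):
    """Score each candidate object box by its spatial relation to the
    hand (overlap plus inverse center distance) and return the best."""
    def center(r):
        return ((r[0] + r[2]) / 2, (r[1] + r[3]) / 2)
    hx, hy = center(hand_box)
    def score(box):
        cx, cy = center(box)
        dist = ((hx - cx) ** 2 + (hy - cy) ** 2) ** 0.5
        return iou(hand_box, box) + 1.0 / (1.0 + dist)
    return max(candidates, key=score)

hand = (10, 10, 30, 30)
objs = [(100, 100, 120, 120), (25, 25, 45, 45)]
print(discover_interacting(hand, objs))  # (25, 25, 45, 45)
```

A learned model would replace this hand-crafted score, but the principle is the same: the hand's location supplies the prior that frame-based detectors lack.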
TOCH: Spatio-Temporal Object-to-Hand Correspondence for Motion Refinement
We present TOCH, a method for refining incorrect 3D hand-object interaction
sequences using a data prior. Existing hand trackers, especially those that
rely on very few cameras, often produce visually unrealistic results with
hand-object intersection or missing contacts. Although correcting such errors
requires reasoning about temporal aspects of interaction, most previous works
focus on static grasps and contacts. The core of our method is TOCH fields, a
novel spatio-temporal representation for modeling correspondences between hands
and objects during interaction. TOCH fields are a point-wise, object-centric
representation that encodes the hand position relative to the object.
Leveraging this novel representation, we learn a latent manifold of plausible
TOCH fields with a temporal denoising auto-encoder. Experiments demonstrate
that TOCH outperforms state-of-the-art 3D hand-object interaction models, which
are limited to static grasps and contacts. More importantly, our method
produces smooth interactions even before and after contact. Using a single
trained TOCH model, we quantitatively and qualitatively demonstrate its
usefulness for correcting erroneous sequences from off-the-shelf RGB/RGB-D
hand-object reconstruction methods and transferring grasps across objects. We
will release our code and trained model on our project page at
http://virtualhumans.mpi-inf.mpg.de/toch
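A point-wise, object-centric encoding of the hand position relative to the object, in the spirit of TOCH fields, can be sketched with nearest-neighbor correspondences between point clouds. This is a loose illustrative sketch, not the paper's actual field definition; the choice of distance-plus-index encoding is an assumption.

```python
import numpy as np

def toch_like_field(object_points, hand_points):
    """For every object point, record the distance to its nearest hand
    point and that hand point's index, i.e. the hand expressed relative
    to the object surface. object_points: (n_obj, 3); hand_points:
    (n_hand, 3). Returns ((n_obj,) distances, (n_obj,) indices)."""
    diffs = object_points[:, None, :] - hand_points[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)  # (n_obj, n_hand) pairwise
    nearest = dists.argmin(axis=1)          # corresponding hand point
    return dists.min(axis=1), nearest

obj = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
hand = np.array([[0.0, 0.0, 1.0], [1.0, 0.0, 0.5]])
d, idx = toch_like_field(obj, hand)
# distances [1.0, 0.5], correspondences [0, 1]
```

Because such an encoding lives on the object rather than the hand, a denoising autoencoder over sequences of these fields can learn which hand-object configurations are plausible, independent of the particular hand mesh.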
SwinFace: A Multi-task Transformer for Face Recognition, Expression Recognition, Age Estimation and Attribute Estimation
In recent years, vision transformers have been introduced into face
recognition and analysis and have achieved performance breakthroughs. However,
most previous methods train a single model or an ensemble of models
to perform the desired task, which ignores the synergy among different tasks
and forgoes potential gains in prediction accuracy, data efficiency,
and training time. This paper presents a multi-purpose algorithm for
simultaneous face recognition, facial expression recognition, age estimation,
and face attribute estimation (40 attributes including gender) based on a
single Swin Transformer. Our design, the SwinFace, consists of a single shared
backbone together with a subnet for each set of related tasks. To address the
conflicts among multiple tasks and meet the different demands of tasks, a
Multi-Level Channel Attention (MLCA) module is integrated into each
task-specific analysis subnet, which can adaptively select the features from
optimal levels and channels to perform the desired tasks. Extensive experiments
show that the proposed model has a better understanding of the face and
achieves excellent performance for all tasks. In particular, it achieves 90.97%
accuracy on RAF-DB and 0.22 ε-error on CLAP2015, which are
state-of-the-art results for facial expression recognition and age estimation,
respectively. The code and models will be made publicly available at
https://github.com/lxq1000/SwinFace
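The Multi-Level Channel Attention idea, adaptively selecting features from optimal levels and channels before each task-specific subnet, can be sketched as input-dependent attention over pooled per-level feature vectors. This is a simplified numpy sketch, not SwinFace's actual MLCA module; the shapes and the single weight matrix standing in for learned parameters are assumptions.

```python
import numpy as np

def softmax(x, axis=0):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mlca_sketch(feats, w):
    """feats: (n_levels, C) channel vectors pooled from each backbone
    level; w: (C, C) stand-in for learned attention parameters.
    Computes per-channel attention over levels from the features
    themselves, then returns the attention-weighted (C,) vector that
    a task-specific subnet would consume."""
    feats = np.asarray(feats)
    logits = feats @ w.T                 # (n_levels, C) per-channel scores
    attn = softmax(logits, axis=0)       # attention over levels, per channel
    return (attn * feats).sum(axis=0)    # (C,) adaptively selected features

levels, C = 3, 4
rng = np.random.default_rng(1)
feats = rng.standard_normal((levels, C))
out = mlca_sketch(feats, rng.standard_normal((C, C)) * 0.1)
print(out.shape)  # (4,)
```

Each channel of the output is a convex combination of that channel across levels, so recognition-oriented subnets can favor deep, semantic levels while, say, an age-estimation subnet draws on shallower ones.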