Uncertainty-aware State Space Transformer for Egocentric 3D Hand Trajectory Forecasting
Hand trajectory forecasting from egocentric views is crucial for enabling a
prompt understanding of human intentions when interacting with AR/VR systems.
However, existing methods handle this problem in a 2D image space which is
inadequate for 3D real-world applications. In this paper, we set up an
egocentric 3D hand trajectory forecasting task that aims to predict hand
trajectories in a 3D space from early observed RGB videos in a first-person
view. To fulfill this goal, we propose an uncertainty-aware state space
Transformer (USST) that combines the merits of the attention mechanism and
aleatoric uncertainty within the framework of the classical state-space model.
The model can be further enhanced by a velocity constraint and visual prompt
tuning (VPT) on large vision transformers. Moreover, we develop an annotation
workflow to collect 3D hand trajectories with high quality. Experimental
results on H2O and EgoPAT3D datasets demonstrate the superiority of USST for
both 2D and 3D trajectory forecasting. The code and datasets are publicly
released: https://actionlab-cv.github.io/EgoHandTrajPred
Comment: ICCV 2023 Accepted (Camera Ready)
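As background on the aleatoric-uncertainty component mentioned in the abstract, the following is a minimal, hypothetical sketch of a trajectory regression head that predicts a mean and a log-variance per 3D waypoint and is trained with a Gaussian negative log-likelihood. The module and variable names are illustrative assumptions, not taken from the authors' USST code.

```python
# Sketch: heteroscedastic (aleatoric) uncertainty for 3D trajectory regression.
# Names (UncertainTrajectoryHead, aleatoric_nll) are illustrative only.
import torch
import torch.nn as nn

class UncertainTrajectoryHead(nn.Module):
    def __init__(self, d_model: int = 256):
        super().__init__()
        self.mean = nn.Linear(d_model, 3)      # predicted 3D waypoint
        self.log_var = nn.Linear(d_model, 3)   # per-coordinate log-variance

    def forward(self, h):                      # h: (B, T, d_model) features
        return self.mean(h), self.log_var(h)

def aleatoric_nll(mu, log_var, target):
    # Gaussian negative log-likelihood; a large predicted variance down-weights
    # hard-to-predict waypoints instead of forcing an overconfident guess.
    return (0.5 * torch.exp(-log_var) * (target - mu) ** 2 + 0.5 * log_var).mean()

# Toy usage with random features standing in for Transformer outputs.
head = UncertainTrajectoryHead()
feats = torch.randn(2, 10, 256)                # batch of 2, 10 future steps
mu, log_var = head(feats)
loss = aleatoric_nll(mu, log_var, torch.randn(2, 10, 3))
loss.backward()
```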
Embodied Scene-aware Human Pose Estimation
We propose embodied scene-aware human pose estimation where we estimate 3D
poses based on a simulated agent's proprioception and scene awareness, along
with external third-person observations. Unlike prior methods that often resort
to multistage optimization, non-causal inference, and complex contact modeling
to estimate human pose and human scene interactions, our method is one stage,
causal, and recovers global 3D human poses in a simulated environment. Since 2D
third-person observations are coupled with the camera pose, we propose to
disentangle the camera pose and use a multi-step projection gradient defined in
the global coordinate frame as the movement cue for our embodied agent.
Leveraging a physics simulation and prescanned scenes (e.g., 3D mesh), we
simulate our agent in everyday environments (libraries, offices, bedrooms,
etc.) and equip our agent with environmental sensors to intelligently navigate
and interact with scene geometries. Our method also relies only on 2D keypoints
and can be trained on synthetic datasets derived from popular human motion
databases. To evaluate, we use the popular H36M and PROX datasets and, for the
first time, achieve a success rate of 96.7% on the challenging PROX dataset
without ever using PROX motion sequences for training.
Comment: Project website: https://embodiedscene.github.io/embodiedpose/
Zhengyi Luo and Shun Iwase contributed equally
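To illustrate the "multi-step projection gradient" movement cue described in the abstract, here is a hedged sketch: a few gradient steps on the 2D reprojection error of global 3D joints, with the accumulated displacement (expressed in world coordinates) used as the cue. The pinhole camera model, function names, step count, and step size are assumptions for illustration, not the authors' implementation.

```python
# Sketch: gradient of 2D reprojection error w.r.t. global 3D joints as a
# movement cue. All names and parameters here are hypothetical.
import torch

def project(points_3d, K, R, t):
    # Pinhole projection of global 3D joints (J, 3) into the image plane.
    cam = (R @ points_3d.T + t[:, None]).T          # world -> camera frame
    uv = (K @ cam.T).T                              # camera -> homogeneous pixels
    return uv[:, :2] / uv[:, 2:3]

def projection_gradient_cue(joints_3d, kp_2d, K, R, t, steps=3, lr=0.05):
    # Take a few gradient steps on the reprojection error and return the total
    # displacement in the global (world) frame as the movement cue.
    x = joints_3d.clone().requires_grad_(True)
    for _ in range(steps):
        err = ((project(x, K, R, t) - kp_2d) ** 2).sum()
        (g,) = torch.autograd.grad(err, x)
        x = (x - lr * g).detach().requires_grad_(True)
    return (x - joints_3d).detach()

# Toy usage with a fronto-parallel camera and random 2D keypoints.
K = torch.tensor([[500., 0., 320.], [0., 500., 240.], [0., 0., 1.]])
R, t = torch.eye(3), torch.tensor([0., 0., 3.])
joints = torch.randn(24, 3) * 0.5
cue = projection_gradient_cue(joints, torch.rand(24, 2) * 480, K, R, t)
```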