Uncertainty-aware State Space Transformer for Egocentric 3D Hand Trajectory Forecasting
Hand trajectory forecasting from egocentric views is crucial for enabling a
prompt understanding of human intentions when interacting with AR/VR systems.
However, existing methods handle this problem in a 2D image space which is
inadequate for 3D real-world applications. In this paper, we set up an
egocentric 3D hand trajectory forecasting task that aims to predict hand
trajectories in a 3D space from early observed RGB videos in a first-person
view. To fulfill this goal, we propose an uncertainty-aware state space
Transformer (USST) that combines the merits of the attention mechanism and
aleatoric uncertainty within the framework of the classical state-space model.
The model can be further enhanced by the velocity constraint and visual prompt
tuning (VPT) on large vision transformers. Moreover, we develop an annotation
workflow to collect 3D hand trajectories with high quality. Experimental
results on H2O and EgoPAT3D datasets demonstrate the superiority of USST for
both 2D and 3D trajectory forecasting. The code and datasets are publicly
released: https://actionlab-cv.github.io/EgoHandTrajPred
Comment: ICCV 2023 Accepted (Camera Ready)
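The core idea behind aleatoric-uncertainty-aware forecasting — predicting a per-step variance alongside the trajectory and training with a Gaussian negative log-likelihood — can be sketched as follows. This is an illustrative sketch, not the USST implementation; the function name and toy data are hypothetical:

```python
import numpy as np

def gaussian_nll(pred_mean, pred_log_var, target):
    """Per-step negative log-likelihood of a heteroscedastic Gaussian.

    pred_mean, pred_log_var, target: arrays of shape (T, 3) for a
    T-step 3D hand trajectory. Predicting the log-variance keeps the
    variance positive without an explicit constraint.
    """
    var = np.exp(pred_log_var)
    nll = 0.5 * (pred_log_var + (target - pred_mean) ** 2 / var)
    return nll.mean()

# Toy example: 4 future steps of a 3D trajectory.
rng = np.random.default_rng(0)
target = rng.normal(size=(4, 3))
mean = target + 0.1          # slightly-off prediction
log_var = np.zeros((4, 3))   # unit variance everywhere
loss = gaussian_nll(mean, log_var, target)
```

Minimizing this loss lets the model attenuate the penalty on inherently ambiguous steps by predicting a larger variance there, which is the standard way aleatoric uncertainty enters a regression objective.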
Enhancing egocentric 3D pose estimation with third person views
© 2023 The Author(s). Published by Elsevier Ltd. This is an open access article under the CC BY-NC-ND license.
We propose a novel approach to enhance the 3D body pose estimation of a person computed from videos captured from a single wearable camera. The main technical contribution consists of leveraging high-level features linking first- and third-views in a joint embedding space. To learn such an embedding space we introduce First2Third-Pose, a new paired synchronized dataset of nearly 2000 videos depicting human activities captured from both first- and third-view perspectives. We explicitly consider spatial- and motion-domain features, combined using a semi-Siamese architecture trained in a self-supervised fashion. Experimental results demonstrate that the joint multi-view embedded space learned with our dataset is useful to extract discriminatory features from arbitrary single-view egocentric videos, with no need to perform any sort of domain adaptation or knowledge of camera parameters. An extensive evaluation demonstrates that we achieve significant improvement in egocentric 3D body pose estimation performance on two unconstrained datasets, over three supervised state-of-the-art approaches. The collected dataset and pre-trained model are available for research purposes.
This work has been partially supported by projects PID2020-120049RB-I00 and PID2019-110977GA-I00 funded by MCIN/AEI/10.13039/501100011033 and by the "European Union NextGenerationEU/PRTR", as well as by grant RYC-2017-22563 funded by MCIN/AEI/10.13039/501100011033 and by "ESF Investing in your future", and network RED2018-102511-T funded by MCIN/AEI.
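The idea of pulling paired first- and third-view features together in a joint embedding space can be illustrated with a minimal alignment loss. This is a hypothetical sketch only; the paper's semi-Siamese architecture and full self-supervised objective are more involved:

```python
import numpy as np

def cosine_alignment_loss(ego_feat, exo_feat):
    """1 - cosine similarity between a paired first-view (ego) and
    third-view (exo) feature vector. Zero when the two embeddings
    point in the same direction, up to 2 when they are opposed."""
    e = ego_feat / np.linalg.norm(ego_feat)
    x = exo_feat / np.linalg.norm(exo_feat)
    return 1.0 - float(e @ x)

# Toy features: a perfectly aligned pair gives zero loss.
ego = np.array([1.0, 0.0, 0.0])
exo = np.array([2.0, 0.0, 0.0])   # same direction, different scale
loss_same = cosine_alignment_loss(ego, exo)
```

In practice such an alignment term would be combined with negatives (a contrastive formulation) so the embedding space stays discriminative rather than collapsing.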
Skeleton2Humanoid: Animating Simulated Characters for Physically-plausible Motion In-betweening
Human motion synthesis is a long-standing problem with various applications
in digital twins and the Metaverse. However, modern deep learning based motion
synthesis approaches barely consider the physical plausibility of synthesized
motions and consequently they usually produce unrealistic human motions. In
order to solve this problem, we propose a system "Skeleton2Humanoid" which
performs physics-oriented motion correction at test time by regularizing
synthesized skeleton motions in a physics simulator. Concretely, our system
consists of three sequential stages: (I) test time motion synthesis network
adaptation, (II) skeleton to humanoid matching and (III) motion imitation based
on reinforcement learning (RL). Stage I introduces a test time adaptation
strategy, which improves the physical plausibility of synthesized human
skeleton motions by optimizing skeleton joint locations. Stage II performs an
analytical inverse kinematics strategy, which converts the optimized human
skeleton motions to humanoid robot motions in a physics simulator; the
converted humanoid robot motions then serve as reference motions for the RL
policy to imitate. Stage III introduces a curriculum residual force control
policy, which drives the humanoid robot to mimic complex converted reference
motions in accordance with physical laws. We verify our system on a typical
human motion synthesis task, motion-in-betweening. Experiments on the
challenging LaFAN1 dataset show our system can outperform prior methods
significantly in terms of both physical plausibility and accuracy. Code will be
released for research purposes at:
https://github.com/michaelliyunhao/Skeleton2Humanoid
Comment: Accepted by ACMMM202
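Stage II's analytical inverse kinematics can be illustrated on the simplest case, a planar two-link limb (e.g. hip-knee-ankle) solved with the law of cosines. This is a generic textbook sketch, not the system's actual skeleton-to-humanoid matching code:

```python
import numpy as np

def two_link_ik(x, y, l1, l2):
    """Analytical IK for a planar two-link chain with link lengths l1, l2.

    Returns (theta1, theta2), the joint angles placing the end effector
    at (x, y); elbow-down solution via the law of cosines.
    """
    d2 = x * x + y * y
    cos_t2 = (d2 - l1 * l1 - l2 * l2) / (2 * l1 * l2)
    cos_t2 = np.clip(cos_t2, -1.0, 1.0)   # guard against numerical drift
    t2 = np.arccos(cos_t2)
    t1 = np.arctan2(y, x) - np.arctan2(l2 * np.sin(t2), l1 + l2 * np.cos(t2))
    return t1, t2

# Fully extended limb reaching along the x-axis: both angles are zero.
t1, t2 = two_link_ik(2.0, 0.0, 1.0, 1.0)
```

Because the solution is closed-form, each frame of the optimized skeleton motion can be converted independently and cheaply, which is what makes an analytical (rather than iterative) IK stage attractive before RL imitation.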
SMPL-Based 3D Pedestrian Pose Prediction
Modeling human motion is a long-standing problem in computer vision. The rapid development of deep learning technologies for computer vision problems resulted in increased attention in the area of pose prediction due to its vital role in a multitude of applications, for example, behavior analysis, autonomous vehicles, and visual surveillance. In 3D pedestrian pose prediction, joint-rotation-based pose representation is extensively used due to the unconstrained degree of freedom for each joint and its ability to regress the 3D statistical wireframe. However, all the existing joint-rotation-based pose prediction approaches ignore the centrality of the distinct pose parameter components and are consequently prone to suffer from error accumulation along the kinematic chain, which results in unnatural human poses. In joint-rotation-based pose prediction, Skinned Multi-Person Linear (SMPL) parameters are widely used to represent pedestrian pose. In this work, a novel SMPL-based pose prediction network is proposed to address the centrality of each SMPL component by distributing the network weights among them. Furthermore, to constrain the network to generate only plausible human poses, an adversarial training approach is employed. The effectiveness of the proposed network is evaluated using the PedX and BEHAVE datasets. The proposed approach significantly outperforms state-of-the-art methods with improved prediction accuracy and generates plausible human pose predictions.
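Adversarial training for pose plausibility typically amounts to standard GAN-style losses on a discriminator that scores real motion-capture poses high and predicted poses low. A minimal sketch, assuming a non-saturating objective (the paper's exact formulation may differ):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def adversarial_losses(d_real_logit, d_fake_logit):
    """Non-saturating GAN losses from discriminator logits.

    d_real_logit: score for a ground-truth pose.
    d_fake_logit: score for a predicted pose.
    Returns (discriminator_loss, generator_loss); the predictor is
    penalized when its poses are judged implausible.
    """
    eps = 1e-12  # avoid log(0)
    d_loss = (-np.log(sigmoid(d_real_logit) + eps)
              - np.log(1.0 - sigmoid(d_fake_logit) + eps))
    g_loss = -np.log(sigmoid(d_fake_logit) + eps)
    return float(d_loss), float(g_loss)

# A confident discriminator (real scored +2, fake scored -2) leaves the
# predictor with a large plausibility penalty.
d_loss, g_loss = adversarial_losses(d_real_logit=2.0, d_fake_logit=-2.0)
```

In the full system this generator-side term would be added to the prediction (reconstruction) loss, so the network trades off accuracy against producing poses the discriminator accepts as human-like.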
- …