Disentangling Human Dynamics for Pedestrian Locomotion Forecasting with Noisy Supervision
We tackle the problem of Human Locomotion Forecasting, a task for jointly
predicting the spatial positions of several keypoints on the human body in the
near future in an egocentric setting. In contrast to previous work that
aims to solve either the task of pose prediction or trajectory forecasting in
isolation, we propose a framework to unify the two problems and address the
practically useful task of pedestrian locomotion prediction in the wild. Among
the major challenges in solving this task is the scarcity of annotated
egocentric video datasets with dense annotations for pose, depth, or egomotion.
To surmount this difficulty, we use state-of-the-art models to generate (noisy)
annotations and propose robust forecasting models that can learn from this
noisy supervision. We present a method to disentangle the overall pedestrian motion into easier-to-learn subparts by utilizing a pose completion module and a decomposition module. The completion module fills in the missing keypoint annotations, and the decomposition module breaks the cleaned locomotion down into global (trajectory) and local (pose keypoint) movements. Further, with a Quasi-RNN as our backbone, we propose a novel hierarchical trajectory forecasting
network that utilizes low-level, domain-specific vision signals such as egomotion
and depth to predict the global trajectory. Our method leads to
state-of-the-art results for the prediction of human locomotion in the
egocentric view. Project page: https://karttikeya.github.io/publication/plf/
Comment: Accepted to WACV 2020 (Oral)
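As a concrete illustration of the decomposition idea, here is a minimal NumPy sketch, assuming the global component is the per-frame keypoint centroid; the function names and the centroid choice are illustrative assumptions, not the paper's actual module.

```python
import numpy as np

def decompose_locomotion(keypoints):
    """Split locomotion into a global trajectory and local pose motion.

    keypoints: (T, K, 2) array of 2D keypoint positions over T frames.
    Returns (trajectory, local_pose): the per-frame body centroid and the
    keypoints expressed relative to that centroid.
    """
    # Assumption: the centroid stands in for the paper's global component.
    trajectory = keypoints.mean(axis=1, keepdims=True)   # (T, 1, 2) global part
    local_pose = keypoints - trajectory                  # (T, K, 2) local part
    return trajectory.squeeze(1), local_pose

def recompose(trajectory, local_pose):
    """Inverse of decompose_locomotion: add the trajectory back in."""
    return local_pose + trajectory[:, None, :]

# Toy usage: 10 frames, 17 keypoints
kps = np.random.rand(10, 17, 2)
traj, pose = decompose_locomotion(kps)
assert np.allclose(recompose(traj, pose), kps)
```

The appeal of such a split is that the two subparts have different statistics: the trajectory varies smoothly while local pose motion is quasi-periodic, so separate predictors can each learn an easier problem.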
Indoor future person localization from an egocentric wearable camera
Accurate prediction of future person location and movement trajectory from an egocentric wearable camera can benefit a wide range of applications, such as assisting visually impaired people in navigation and developing mobility assistance for people with disabilities. In this work, a new egocentric dataset was constructed using a wearable camera, with 8,250 short clips of a targeted person either walking 1) toward, 2) away, or 3) across the camera wearer in indoor environments, or 4) staying still in the scene, and 13,817 person bounding boxes were manually labelled. Apart from the bounding boxes, the dataset also contains the estimated pose of the targeted person as well as the IMU signal of the wearable camera at each time point. An LSTM-based encoder-decoder framework was designed to predict the future location and movement trajectory of the targeted person in this egocentric setting. Extensive experiments on the new dataset have shown that the proposed method predicts future person location and trajectory in egocentric videos captured by the wearable camera more reliably than three baselines.
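For readers unfamiliar with this architecture family, the following is a minimal PyTorch sketch of an LSTM-based encoder-decoder forecaster of the kind described; the layer sizes, the use of the last observed position to seed decoding, and the class name are assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn

class Seq2SeqForecaster(nn.Module):
    """Encode past bounding-box centers (plus any extra features such as
    pose or IMU) with an LSTM, then decode future positions step by step."""

    def __init__(self, in_dim=2, hidden=128, out_dim=2, horizon=10):
        super().__init__()
        self.horizon = horizon
        self.encoder = nn.LSTM(in_dim, hidden, batch_first=True)
        self.decoder = nn.LSTMCell(out_dim, hidden)
        self.head = nn.Linear(hidden, out_dim)

    def forward(self, past):                 # past: (B, T_obs, in_dim)
        _, (h, c) = self.encoder(past)
        h, c = h[-1], c[-1]                  # last-layer hidden/cell states
        step = past[:, -1, :2]               # seed with last observed position
        outputs = []
        for _ in range(self.horizon):
            h, c = self.decoder(step, (h, c))
            step = self.head(h)              # predicted next position
            outputs.append(step)
        return torch.stack(outputs, dim=1)   # (B, horizon, out_dim)

model = Seq2SeqForecaster()
future = model(torch.randn(4, 8, 2))          # 8 observed steps -> 10 predicted
print(future.shape)                           # torch.Size([4, 10, 2])
```

In practice the encoder input would concatenate the bounding-box center with the estimated pose and IMU features the dataset provides, with `in_dim` enlarged accordingly.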
Toward Reliable Human Pose Forecasting with Uncertainty
Recently, there has been an arms race of pose forecasting methods aimed at
solving the spatio-temporal task of predicting a sequence of future 3D poses of
a person given a sequence of past observed ones. However, the lack of unified
benchmarks and limited uncertainty analysis have hindered progress in the
field. To address this, we first develop an open-source library for human pose
forecasting, featuring multiple models, datasets, and standardized evaluation
metrics, with the aim of promoting research and moving toward a unified and
fair evaluation. Second, we devise two types of uncertainty in the problem to
increase performance and convey better trust: 1) we propose a method for
modeling aleatoric uncertainty by using uncertainty priors to inject knowledge
about the behavior of uncertainty. This focuses the capacity of the model in
the direction of more meaningful supervision while reducing the number of
learned parameters and improving stability; 2) we introduce a novel approach
for quantifying the epistemic uncertainty of any model through clustering and
measuring the entropy of its assignments. Our experiments demonstrate improvements in accuracy and better performance in uncertainty estimation.
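A minimal sketch of the clustering-plus-entropy idea for epistemic uncertainty, assuming k-means over flattened predictions; the function name, the choice of k-means, and the cluster count are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np
from sklearn.cluster import KMeans

def epistemic_uncertainty(predictions, n_clusters=5, seed=0):
    """Cluster a model's (flattened) predictions and score epistemic
    uncertainty as the entropy of the cluster-assignment histogram:
    spread-out assignments -> high entropy -> high uncertainty."""
    flat = predictions.reshape(len(predictions), -1)
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(flat)
    probs = np.bincount(labels, minlength=n_clusters) / len(labels)
    probs = probs[probs > 0]                 # drop empty clusters before log
    return float(-(probs * np.log(probs)).sum())

# Toy usage: 200 predicted pose sequences of shape (T=10, J=17, 3)
preds = np.random.randn(200, 10, 17, 3)
print(epistemic_uncertainty(preds))
```

The attraction of this measure is that it is model-agnostic: it needs only the predictions themselves, not access to the model's internals or an ensemble.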
TRiPOD: Human Trajectory and Pose Dynamics Forecasting in the Wild
Joint forecasting of human trajectory and pose dynamics is a fundamental
building block of various applications ranging from robotics and autonomous
driving to surveillance systems. Predicting body dynamics requires capturing
subtle information embedded in the humans' interactions with each other and
with the objects present in the scene. In this paper, we propose a novel
TRajectory and POse Dynamics (nicknamed TRiPOD) method based on graph
attentional networks to model the human-human and human-object interactions
both in the input space and the output space (decoded future output). The model
is supplemented by a message passing interface over the graphs to fuse these
different levels of interactions efficiently. Furthermore, to incorporate a
real-world challenge, we propose to learn an indicator representing whether an estimated body joint is visible or invisible at each frame, e.g., due to occlusion
or being outside the sensor field of view. Finally, we introduce a new
benchmark for this joint task based on two challenging datasets (PoseTrack and
3DPW) and propose evaluation metrics to measure the effectiveness of
predictions in the global space, even when there are invisible cases of joints.
Our evaluation shows that TRiPOD outperforms all prior work, including state-of-the-art methods designed specifically for either the trajectory or the pose forecasting task.
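To illustrate how such a visibility indicator can enter training or evaluation, here is a minimal PyTorch sketch of a visibility-masked joint error; the loss form and the tensor shapes are assumptions for illustration, not TRiPOD's exact objective or metrics.

```python
import torch

def visibility_masked_l2(pred, target, visible):
    """L2 joint-position error averaged over visible joints only.

    pred, target: (B, T, J, 3) predicted and ground-truth joint positions.
    visible:      (B, T, J) binary mask, 1 where the joint is observed.
    """
    err = ((pred - target) ** 2).sum(dim=-1)       # (B, T, J) squared distance
    masked = err * visible                         # zero out invisible joints
    return masked.sum() / visible.sum().clamp(min=1)

pred = torch.randn(2, 10, 14, 3)
target = torch.randn(2, 10, 14, 3)
visible = (torch.rand(2, 10, 14) > 0.3).float()    # ~70% of joints visible
print(visibility_masked_l2(pred, target, visible))
```

Masking in this way keeps occluded or out-of-view joints from injecting noise into the gradient, while the learned indicator lets the model flag those same cases at inference time.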
A Quadruple Diffusion Convolutional Recurrent Network for Human Motion Prediction
Recurrent neural networks (RNNs) have become popular for human motion prediction thanks to their ability to capture temporal dependencies. However, they have limited capacity for modeling the complex spatial relationships in the human skeletal structure. In this work, we present a novel diffusion convolutional recurrent predictor for spatial and temporal movement forecasting, with multi-step random walks traversing bidirectionally along an adaptive graph to model interdependency among body joints. In the temporal domain, existing methods rely on a single forward predictor, whose produced motion drifts over time and leads to error accumulation. We propose to supplement the forward predictor with a forward discriminator to alleviate such motion drift in the long term under adversarial training. The solution is further enhanced by a backward predictor and a backward discriminator to effectively reduce the error, such that the system can also look into the past to improve the prediction at early frames. The two-way spatial diffusion convolutions and two-way temporal predictors together form a quadruple network. Furthermore, we train our framework by modeling velocities derived from the observed motion dynamics, rather than static poses, which effectively reduces discontinuity at the start of the prediction. Our method outperforms the state of the art on both 3D and 2D datasets, including the Human3.6M, CMU Motion Capture and Penn Action datasets. The results also show that our method correctly predicts both high-dynamic and low-dynamic moving trends with less motion drift.
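A minimal PyTorch sketch of the bidirectional (forward/backward random-walk) diffusion convolution idea, using a fixed adjacency in place of the paper's adaptive graph; the per-step weight scheme and the shapes are illustrative assumptions.

```python
import torch

def diffusion_conv(x, adj, weights_fwd, weights_bwd):
    """Bidirectional diffusion convolution over a joint graph.

    x:   (B, J, C) per-joint features.
    adj: (J, J) adjacency (learned/adaptive in the paper; fixed here).
    weights_*: list of (C, C_out) matrices, one per diffusion step.
    """
    d_out = adj.sum(dim=1, keepdim=True).clamp(min=1e-6)
    d_in = adj.sum(dim=0, keepdim=True).clamp(min=1e-6)
    p_fwd = adj / d_out            # forward random-walk transition matrix
    p_bwd = adj.t() / d_in.t()     # backward walk on the reversed edges
    out = 0.0
    h_f, h_b = x, x
    for w_f, w_b in zip(weights_fwd, weights_bwd):
        h_f = torch.einsum('ij,bjc->bic', p_fwd, h_f)   # one more hop forward
        h_b = torch.einsum('ij,bjc->bic', p_bwd, h_b)   # one more hop backward
        out = out + h_f @ w_f + h_b @ w_b
    return out                     # (B, J, C_out)

x = torch.randn(2, 17, 8)
adj = torch.rand(17, 17)
wf = [torch.randn(8, 16) for _ in range(2)]             # 2 diffusion steps
wb = [torch.randn(8, 16) for _ in range(2)]
print(diffusion_conv(x, adj, wf, wb).shape)             # torch.Size([2, 17, 16])
```

Walking in both directions lets a joint aggregate evidence from joints that influence it as well as joints it influences, which a single row-normalized propagation would miss on an asymmetric graph.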
Towards Explainable, Privacy-Preserved Human-Motion Affect Recognition
Human motion characteristics are used to monitor the progression of neurological diseases and mood disorders. Since perceptions of emotions are also interleaved with body posture and movements, emotion recognition from human gait can be used to quantitatively monitor mood changes. Many existing solutions use shallow machine learning models with raw positional data or manually extracted features to achieve this. However, gait is composed of many highly expressive characteristics that can be used to identify human subjects, and most solutions fail to address this, disregarding the subject's privacy. This work introduces a novel deep neural network architecture to disentangle human emotions and biometrics. In particular, we propose a cross-subject transfer learning technique for training a multi-encoder autoencoder deep neural network to learn disentangled latent representations of human motion features. By disentangling subject biometrics from the gait data, we show that the subject's privacy is preserved while affect recognition performance outperforms traditional methods. Furthermore, we exploit Guided Grad-CAM to provide global explanations of the model's decisions across gait cycles. We evaluate the effectiveness of our method against existing methods at recognizing emotions, using both 3D temporal joint signals and manually extracted features. We also show that this data can easily be exploited to expose a subject's identity. Our method achieves up to a 7% improvement and highlights the joints with the most significant influence across the average gait cycle.
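A minimal PyTorch sketch of a multi-encoder autoencoder that separates an affect code from a subject (biometric) code; the layer sizes, the four-class affect head, and the flattened input are illustrative assumptions, and the paper's cross-subject transfer learning and Guided Grad-CAM steps are not shown.

```python
import torch
import torch.nn as nn

class DisentanglingAE(nn.Module):
    """Two encoders split gait features into an affect code and a subject
    (biometric) code; a shared decoder reconstructs from both. Downstream,
    only the affect code is used, so identity cues stay private."""

    def __init__(self, in_dim=51, z_affect=16, z_subject=16):
        super().__init__()
        self.enc_affect = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(),
                                        nn.Linear(64, z_affect))
        self.enc_subject = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(),
                                         nn.Linear(64, z_subject))
        self.decoder = nn.Sequential(nn.Linear(z_affect + z_subject, 64),
                                     nn.ReLU(), nn.Linear(64, in_dim))
        self.affect_head = nn.Linear(z_affect, 4)   # assumed 4 emotion classes

    def forward(self, x):
        za, zs = self.enc_affect(x), self.enc_subject(x)
        recon = self.decoder(torch.cat([za, zs], dim=-1))
        return recon, self.affect_head(za)

model = DisentanglingAE()
x = torch.randn(8, 51)                 # e.g. 17 joints x 3 coords, flattened
recon, logits = model(x)
print(recon.shape, logits.shape)       # (8, 51) (8, 4)
```

Training would combine a reconstruction loss on `recon` with a classification loss on `logits`, plus a disentanglement objective that discourages the affect code from predicting subject identity.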