Deep Video Generation, Prediction and Completion of Human Action Sequences
Current deep learning results on video generation are limited, there are only a
few initial results on video prediction, and there are no significant results on
video completion. This is due to the severe ill-posedness inherent in these
three problems. In this paper, we focus on human action videos and propose a
general, two-stage deep framework that generates human action videos with no
constraints or with an arbitrary number of constraints, and which uniformly addresses
the three problems: video generation given no input frames, video prediction
given the first few frames, and video completion given the first and last
frames. To make the problem tractable, in the first stage we train a deep
generative model that generates a human pose sequence from random noise. In the
second stage, a skeleton-to-image network is trained, which is used to generate
a human action video given the complete human pose sequence generated in the
first stage. By introducing the two-stage strategy, we sidestep the original
ill-posed problems while producing for the first time high-quality video
generation/prediction/completion results of much longer duration. We present
quantitative and qualitative evaluation to show that our two-stage approach
outperforms state-of-the-art methods in video generation, prediction and video
completion. Our video result demonstration can be viewed at
https://iamacewhite.github.io/supp/index.html
Comment: Under review for CVPR 2018. Haoye and Chunyan have equal contribution.
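As an illustration only (not the authors' trained networks), the two-stage idea can be sketched with toy stand-ins: a stage-1 function that maps a noise vector to a temporally coherent pose sequence, and a stage-2 function that rasterizes each pose into a frame. All names, sizes, and the smoothing trick are assumptions for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

N_JOINTS = 15   # assumed skeleton size
SEQ_LEN = 32    # assumed sequence length
IMG_SIZE = 64   # assumed frame resolution

def pose_generator(z, seq_len=SEQ_LEN, n_joints=N_JOINTS):
    """Stage 1 (toy stand-in): map a noise vector to a smooth 2-D pose sequence.

    A real model would be a learned deep generative network; here cumulative
    noise simply produces a temporally coherent tensor of the right shape.
    """
    raw = rng.normal(size=(seq_len, n_joints, 2)) * 0.1
    drift = np.cumsum(raw, axis=0)              # temporally coherent motion
    base = z[: n_joints * 2].reshape(n_joints, 2)
    return base + drift                         # (seq_len, n_joints, 2)

def skeleton_to_image(pose, img_size=IMG_SIZE):
    """Stage 2 (toy stand-in): rasterize one pose into a binary frame."""
    frame = np.zeros((img_size, img_size))
    xy = ((pose - pose.min()) / (np.ptp(pose) + 1e-8) * (img_size - 1)).astype(int)
    frame[xy[:, 1], xy[:, 0]] = 1.0
    return frame

z = rng.normal(size=(N_JOINTS * 2,))
poses = pose_generator(z)                       # complete pose sequence
video = np.stack([skeleton_to_image(p) for p in poses])
print(video.shape)  # (32, 64, 64)
```

The point of the split is visible even in the toy version: stage 1 works in the low-dimensional pose space, where long sequences are tractable, and stage 2 only ever maps one pose to one frame.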
Stochastic Dynamics for Video Infilling
In this paper, we introduce a stochastic dynamics video infilling (SDVI)
framework to generate frames between long intervals in a video. Our task, video
infilling, differs from video interpolation, which produces transitional frames
over the short interval between every two frames to increase the temporal
resolution; infilling instead aims to fill long intervals with plausible frame
sequences. Our framework models the infilling as a constrained
stochastic generation process and sequentially samples dynamics from the
inferred distribution. SDVI consists of two parts: (1) a bi-directional
constraint propagation module to guarantee the spatial-temporal coherence among
frames, (2) a stochastic sampling process to generate dynamics from the
inferred distributions. Experimental results show that SDVI can generate clear
frame sequences with varying contents. Moreover, motions in the generated
sequence are realistic and able to transfer smoothly from the given start frame
to the terminal frame. Our project site is
https://xharlie.github.io/projects/project_sites/SDVI/video_results.html
Comment: Winter Conference on Applications of Computer Vision (WACV 2020).
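A minimal sketch of the two ingredients the abstract names, bi-directional constraints plus stochastic sampling, can be written without any learned model: blend a forward estimate from the start frame with a backward estimate from the end frame so both endpoints anchor the sequence, and add per-step noise so each draw is a different plausible dynamic. The blending weights and noise scale are assumptions, not SDVI's inferred distributions.

```python
import numpy as np

rng = np.random.default_rng(1)

def infill(start, end, n_missing, noise_scale=0.05):
    """Toy constrained stochastic infilling (illustration, not the SDVI model).

    Each missing frame is a blend of the forward-propagated start constraint
    and the backward-propagated end constraint, plus sampled noise.
    """
    frames = []
    for t in range(1, n_missing + 1):
        w = t / (n_missing + 1)               # weight of the end constraint
        mean = (1 - w) * start + w * end      # bi-directional blend
        frames.append(mean + rng.normal(scale=noise_scale, size=start.shape))
    return np.stack(frames)

start = np.zeros((8, 8))
end = np.ones((8, 8))
sample_a = infill(start, end, n_missing=6)
sample_b = infill(start, end, n_missing=6)
print(sample_a.shape)                   # (6, 8, 8)
print(np.allclose(sample_a, sample_b))  # False: each draw differs
```

Because the mean trajectory is pinned to both endpoints, every sample transfers smoothly from the given start frame to the terminal frame, while the noise term supplies the varying content.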
Long-Term Human Video Generation of Multiple Futures Using Poses
Predicting future human behavior from an input human video is a useful task
for applications such as autonomous driving and robotics. While most previous
works predict a single future, multiple futures with different behavior can
potentially occur. Moreover, if the predicted future is too short (e.g., less
than one second), it may not be fully usable by a human or other systems. In
this paper, we propose a novel method for future human pose prediction capable
of predicting multiple long-term futures. This makes the predictions more
suitable for real applications. Also, from the input video and the predicted
human behavior, we generate future videos. First, from an input human video, we
generate sequences of future human poses (i.e., the image coordinates of their
body-joints) via adversarial learning. Adversarial learning suffers from mode
collapse, which makes it difficult to generate a variety of multiple poses. We
solve this problem by utilizing two additional inputs to the generator to make
the outputs diverse, namely, a latent code (to reflect various behaviors) and
an attraction point (to reflect various trajectories). In addition, we generate
long-term future human poses using a novel approach based on unidimensional
convolutional neural networks. Last, we generate an output video based on the
generated poses for visualization. We evaluate the generated future poses and
videos using three criteria (i.e., realism, diversity and accuracy), and show
that our proposed method outperforms other state-of-the-art methods.
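The abstract's anti-mode-collapse device, giving the generator a latent code for behaviour and an attraction point for trajectory, can be illustrated with a toy rollout; the update rule and scales below are assumptions, not the paper's adversarially trained network.

```python
import numpy as np

rng = np.random.default_rng(2)

def predict_future_poses(past_poses, z, attraction_point, horizon=30):
    """Toy sketch: condition a pose rollout on a latent code (behaviour)
    and an attraction point (trajectory), the two extra generator inputs
    used to diversify outputs. Illustration only.
    """
    pose = past_poses[-1].copy()                         # (n_joints, 2)
    futures = []
    for _ in range(horizon):
        pull = (attraction_point - pose.mean(axis=0)) / horizon  # trajectory term
        style = 0.02 * z[: pose.size].reshape(pose.shape)        # behaviour term
        pose = pose + pull + style
        futures.append(pose.copy())
    return np.stack(futures)                             # (horizon, n_joints, 2)

past = rng.normal(size=(10, 15, 2))
futures = [predict_future_poses(past, rng.normal(size=30), np.array([2.0, 0.0]))
           for _ in range(3)]
print(futures[0].shape)  # (30, 15, 2)
```

Varying z changes the motion style while the attraction point steers where the whole body drifts, so repeated calls on the same input yield distinct futures, which is exactly the diversity property the evaluation measures.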
Am I Done? Predicting Action Progress in Videos
In this paper we deal with the problem of predicting action progress in
videos. We argue that this is an extremely important task since it can be
valuable for a wide range of interaction applications. To this end we introduce
a novel approach, named ProgressNet, capable of predicting when an action takes
place in a video, where it is located within the frames, and how far it has
progressed during its execution. To provide a general definition of action
progress, we ground our work in the linguistics literature, borrowing terms and
concepts to understand which actions can be the subject of progress estimation.
As a result, we define a categorization of actions and their phases. Motivated
by the recent success obtained from the interaction of Convolutional and
Recurrent Neural Networks, our model is based on a combination of the Faster
R-CNN framework, to make frame-wise predictions, and LSTM networks, to estimate
action progress through time. After introducing two evaluation protocols for
the task at hand, we demonstrate the capability of our model to effectively
predict action progress on the UCF-101 and J-HMDB datasets.
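One way to make the task concrete is the frame-wise training target such a model could regress: 0 before the action starts, a linear ramp while it runs, 1 after it ends. This label definition is an assumption for illustration, not ProgressNet itself.

```python
def progress_target(t, start, end):
    """Assumed frame-wise progress label for an action spanning [start, end]:
    0 before the action, linear during it, 1 after it."""
    if t <= start:
        return 0.0
    if t >= end:
        return 1.0
    return (t - start) / (end - start)

# Labels every 20 frames for an action occupying frames 20-60 of a 100-frame clip.
labels = [round(progress_target(t, start=20, end=60), 2) for t in range(0, 100, 20)]
print(labels)  # [0.0, 0.0, 0.5, 1.0, 1.0]
```

A per-frame detector (the Faster R-CNN part) answers where the action is, and a recurrent head trained against targets like these answers how far it has progressed.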
HP-GAN: Probabilistic 3D human motion prediction via GAN
Predicting and understanding human motion dynamics has many applications,
such as motion synthesis, augmented reality, security, and autonomous vehicles.
Due to the recent success of generative adversarial networks (GAN), there has
been much interest in probabilistic estimation and synthetic data generation
using deep neural network architectures and learning algorithms.
We propose a novel sequence-to-sequence model for probabilistic human motion
prediction, trained with a modified version of improved Wasserstein generative
adversarial networks (WGAN-GP), in which we use a custom loss function designed
for human motion prediction. Our model, which we call HP-GAN, learns a
probability density function of future human poses conditioned on previous
poses. It predicts multiple sequences of possible future human poses, each from
the same input sequence but a different vector z drawn from a random
distribution. Furthermore, to quantify the quality of the non-deterministic
predictions, we simultaneously train a motion-quality-assessment model that
learns the probability that a given skeleton sequence is a real human motion.
We test our algorithm on two of the largest skeleton datasets: NTURGB-D and
Human3.6M. We train our model on both single and multiple action types. Its
predictive power for long-term motion estimation is demonstrated by generating
multiple plausible futures of more than 30 frames from just 10 frames of input.
We show that most sequences generated from the same input have a greater than
50% probability of being judged as a real human sequence. We will release all
the code used in this paper on GitHub.
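The probabilistic prediction scheme, one conditioning sequence, many sampled z vectors, many futures, can be sketched with a toy generator; the shapes follow the abstract (10 input frames, 30-frame futures, 3-D skeletons), but the generator body is a stand-in, not HP-GAN's trained WGAN-GP network.

```python
import numpy as np

rng = np.random.default_rng(3)

def generator(prior_poses, z, horizon=30):
    """Toy stand-in for G(prior, z): different z vectors yield different
    future pose sequences for the same prior. Illustration only."""
    last = prior_poses[-1]                           # (n_joints, 3)
    steps = 0.05 * z.reshape(horizon, 1, 1) + 0.01   # z-dependent per-step motion
    return last + np.cumsum(np.broadcast_to(steps, (horizon,) + last.shape), axis=0)

prior = rng.normal(size=(10, 25, 3))   # 10 frames, 25 joints, 3-D (assumed sizes)
futures = [generator(prior, rng.normal(size=30)) for _ in range(5)]
print(futures[0].shape)                     # (30, 25, 3)
print(np.allclose(futures[0], futures[1]))  # False: distinct sampled futures
```

In the full method, a separately trained motion-quality-assessment critic scores each sampled future, which is how the paper quantifies that most samples pass as real human motion.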