Deep Video Generation, Prediction and Completion of Human Action Sequences
Current deep learning results on video generation are limited, there are
only a few first results on video prediction, and there are no significant
results on video completion. This is due to the severe ill-posedness inherent
in these three problems. In this paper, we focus on human action videos, and
propose a general, two-stage deep framework to generate human action videos
with no constraints or an arbitrary number of constraints, which uniformly addresses
the three problems: video generation given no input frames, video prediction
given the first few frames, and video completion given the first and last
frames. To make the problem tractable, in the first stage we train a deep
generative model that generates a human pose sequence from random noise. In the
second stage, a skeleton-to-image network is trained, which is used to generate
a human action video given the complete human pose sequence generated in the
first stage. By introducing the two-stage strategy, we sidestep the original
ill-posed problems while producing for the first time high-quality video
generation/prediction/completion results of much longer duration. We present
quantitative and qualitative evaluation to show that our two-stage approach
outperforms state-of-the-art methods in video generation, prediction and video
completion. Our video result demonstration can be viewed at
https://iamacewhite.github.io/supp/index.html
Comment: Under review for CVPR 2018. Haoye and Chunyan have equal contribution.
Personalized Cinemagraphs using Semantic Understanding and Collaborative Learning
Cinemagraphs are a compelling way to convey dynamic aspects of a scene. In
these media, dynamic and still elements are juxtaposed to create an artistic
and narrative experience. Creating a high-quality, aesthetically pleasing
cinemagraph requires isolating objects in a semantically meaningful way and
then selecting good start times and looping periods for those objects to
minimize visual artifacts (such as tearing). To achieve this, we present a new
technique that uses object recognition and semantic segmentation as part of an
optimization method to automatically create cinemagraphs from videos that are
both visually appealing and semantically meaningful. Given a scene with
multiple objects, there are many cinemagraphs one could create. Our method
evaluates these multiple candidates and presents the best one, as determined by
a model trained to predict human preferences in a collaborative way. We
demonstrate the effectiveness of our approach with multiple results and a user
study.
Comment: To appear in ICCV 2017. Total 17 pages including the supplementary
material.
Long-Term Human Video Generation of Multiple Futures Using Poses
Predicting future human behavior from an input human video is a useful task
for applications such as autonomous driving and robotics. While most previous
works predict a single future, multiple futures with different behavior can
potentially occur. Moreover, if the predicted future is too short (e.g., less
than one second), it may not be fully usable by a human or other systems. In
this paper, we propose a novel method for future human pose prediction capable
of predicting multiple long-term futures. This makes the predictions more
suitable for real applications. Also, from the input video and the predicted
human behavior, we generate future videos. First, from an input human video, we
generate sequences of future human poses (i.e., the image coordinates of their
body-joints) via adversarial learning. Adversarial learning suffers from mode
collapse, which makes it difficult to generate a variety of multiple poses. We
solve this problem by utilizing two additional inputs to the generator to make
the outputs diverse, namely, a latent code (to reflect various behaviors) and
an attraction point (to reflect various trajectories). In addition, we generate
long-term future human poses using a novel approach based on unidimensional
convolutional neural networks. Last, we generate an output video based on the
generated poses for visualization. We evaluate the generated future poses and
videos using three criteria (i.e., realism, diversity and accuracy), and show
that our proposed method outperforms other state-of-the-art works.