5,867 research outputs found
From Here to There: Video Inbetweening Using Direct 3D Convolutions
We consider the problem of generating plausible and diverse video sequences,
when we are only given a start and an end frame. This task is also known as
inbetweening, and it belongs to the broader area of stochastic video
generation, which is generally approached by means of recurrent neural networks
(RNNs). In this paper, we propose instead a fully convolutional model to
generate video sequences directly in the pixel domain. We first obtain a latent
video representation using a stochastic fusion mechanism that learns how to
incorporate information from the start and end frames. Our model learns to
produce such a latent representation by progressively increasing the temporal
resolution, and then decodes it in the spatiotemporal domain using 3D convolutions.
The model is trained end-to-end by minimizing an adversarial loss. Experiments
on several widely-used benchmark datasets show that it is able to generate
meaningful and diverse in-between video sequences, according to both
quantitative and qualitative evaluations.
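As a rough illustration of the decoding step described above, the sketch below (PyTorch; all module and parameter names are assumptions, and the progressive temporal upsampling is omitted) fuses start- and end-frame encodings with a noise-driven gate and decodes a fixed-length latent video with 3D convolutions:

```python
import torch
import torch.nn as nn

class FuseAndDecode(nn.Module):
    """Toy stand-in: stochastic fusion of two frame encodings + 3D-conv decoding."""
    def __init__(self, feat_ch=64, z_dim=16, frames=8):
        super().__init__()
        self.frames = frames
        # noise-conditioned gate that mixes start/end frame features
        self.gate = nn.Conv2d(feat_ch * 2 + z_dim, feat_ch, kernel_size=1)
        # 3D-convolutional decoder over (N, C, T, H, W)
        self.decoder = nn.Sequential(
            nn.Conv3d(feat_ch, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv3d(32, 3, kernel_size=3, padding=1), nn.Tanh(),
        )

    def forward(self, f_start, f_end, z):
        # f_start, f_end: (N, C, H, W) encodings of the two given frames; z: noise map
        g = torch.sigmoid(self.gate(torch.cat([f_start, f_end, z], dim=1)))
        fused = g * f_start + (1 - g) * f_end                        # latent video "seed"
        latent = fused.unsqueeze(2).repeat(1, 1, self.frames, 1, 1)  # add a time axis
        return self.decoder(latent)                                  # (N, 3, T, H, W)

video = FuseAndDecode()(torch.randn(2, 64, 16, 16),
                        torch.randn(2, 64, 16, 16),
                        torch.randn(2, 16, 16, 16))
print(video.shape)  # torch.Size([2, 3, 8, 16, 16])
```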
Deep Learned Frame Prediction for Video Compression
Motion compensation is one of the most essential methods for any video
compression algorithm. Video frame prediction is a task analogous to motion
compensation. In recent years, the task of frame prediction has been undertaken by
deep neural networks (DNNs). In this thesis we create a DNN to perform learned
frame prediction and additionally implement a codec that contains our DNN. We
train our network using two methods for two different goals. First, we train
our network based on mean square error (MSE) only, aiming to obtain the highest
PSNR values in frame prediction and video compression. Second, we use
adversarial training to produce visually more realistic frame predictions. For
frame prediction, we compare our method with the baseline methods of frame
difference and 16x16 block motion compensation. For video compression we
further include x264 video codec in the comparison. We show that in frame
prediction, adversarial training produces frames that look sharper and more
realistic, compared to MSE-based training, but in video compression it
consistently performs worse. This shows that even though adversarial training
is useful for generating video frames that are more pleasing to the human eye,
it should not be employed for video compression. Moreover, our network
trained with MSE produces accurate frame predictions, and in quantitative
results, for both tasks, it produces comparable results in all videos and
outperforms other methods on average. More specifically, learned frame
prediction outperforms other methods in terms of rate-distortion performance in
case of high motion video, while the rate-distortion performance of our method
is competitive with that of x264 in low motion video.
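The two training objectives can be contrasted in a short sketch. The following is an assumed formulation, not the thesis code: `net` stands for the frame-prediction DNN and `disc` for a frame discriminator; the first function trains purely for MSE (favoring high PSNR), the second adds a non-saturating GAN term with an assumed weight `lam` for sharper-looking frames:

```python
import torch
import torch.nn.functional as F

def mse_step(net, past_frames, target_frame):
    # Objective 1: plain MSE, which favors high-PSNR predictions.
    pred = net(past_frames)
    return F.mse_loss(pred, target_frame)

def adversarial_step(net, disc, past_frames, target_frame, lam=0.01):
    # Objective 2: MSE plus a non-saturating GAN term so predictions look sharper.
    pred = net(past_frames)
    logits = disc(pred)
    g_loss = F.binary_cross_entropy_with_logits(logits, torch.ones_like(logits))
    return F.mse_loss(pred, target_frame) + lam * g_loss
```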
Hierarchical Long-term Video Prediction without Supervision
Much of recent research has been devoted to video prediction and generation,
yet most of the previous works have demonstrated only limited success in
generating videos on short-term horizons. The hierarchical video prediction
method by Villegas et al. (2017) is an example of a state-of-the-art method for
long-term video prediction, but their method is limited because it requires
ground truth annotation of high-level structures (e.g., human joint landmarks)
at training time. Our network encodes the input frame, predicts a high-level
encoding into the future, and then a decoder with access to the first frame
produces the predicted image from the predicted encoding. The decoder also
produces a mask that outlines the predicted foreground object (e.g., person) as
a by-product. Unlike Villegas et al. (2017), we develop a novel training method
that jointly trains the encoder, the predictor, and the decoder together
without high-level supervision; we further improve upon this by using an
adversarial loss in the feature space to train the predictor. Our method can
predict about 20 seconds into the future and provides better results compared
to Denton and Fergus (2018) and Finn et al. (2016) on the Human 3.6M dataset.
Comment: International Conference on Machine Learning (ICML) 201
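A hedged sketch of the unsupervised training idea, with illustrative names (`enc`, `pred`, `feat_disc`) and an assumed loss weighting: the predictor is trained to match the encoder's encoding of the future frame, while an adversarial loss applied in feature space pushes the predicted encoding toward the distribution of real encodings:

```python
import torch
import torch.nn.functional as F

def predictor_step(enc, pred, feat_disc, frames_past, frame_future, lam=0.1):
    h_past = enc(frames_past)          # high-level encoding of the observed frames
    h_hat = pred(h_past)               # predicted future encoding
    with torch.no_grad():
        h_true = enc(frame_future)     # target encoding of the actual future frame
    # adversarial term in feature space: make h_hat indistinguishable from real encodings
    logits = feat_disc(h_hat)
    g_loss = F.binary_cross_entropy_with_logits(logits, torch.ones_like(logits))
    return F.mse_loss(h_hat, h_true) + lam * g_loss
```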
Towards Accurate Generative Models of Video: A New Metric & Challenges
Recent advances in deep generative models have led to remarkable progress in
synthesizing high quality images. Following their successful application in
image processing and representation learning, an important next step is to
consider videos. Learning generative models of video is a much harder task,
requiring a model to capture the temporal dynamics of a scene, in addition to
the visual presentation of objects. While recent attempts at formulating
generative models of video have had some success, current progress is hampered
by (1) the lack of quantitative metrics that consider visual quality, temporal
coherence, and diversity of samples, and (2) the wide gap between purely
synthetic video data sets and challenging real-world data sets in terms of
complexity. To this end, we propose Fréchet Video Distance (FVD), a new
metric for generative models of video, and StarCraft 2 Videos (SCV), a
benchmark of game play from custom StarCraft 2 scenarios that challenge the
current capabilities of generative models of video. We contribute a large-scale
human study, which confirms that FVD correlates well with qualitative human
judgment of generated videos, and provide initial benchmark results on SCV.
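FVD is built on the Fréchet distance between Gaussian fits of real and generated video features (in the paper these features come from a pretrained video network; the snippet below simply takes two feature arrays and is a generic sketch, not the reference implementation):

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(feats_real, feats_fake):
    # Fit a Gaussian (mean, covariance) to each feature set and compute
    # the Frechet distance between the two Gaussians.
    mu_r, mu_f = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_f = np.cov(feats_fake, rowvar=False)
    covmean = sqrtm(cov_r @ cov_f)
    if np.iscomplexobj(covmean):       # tiny imaginary parts can appear numerically
        covmean = covmean.real
    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean))

print(frechet_distance(np.random.randn(256, 32), np.random.randn(256, 32) + 0.5))
```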
The Pose Knows: Video Forecasting by Generating Pose Futures
Current approaches in video forecasting attempt to generate videos directly
in pixel space using Generative Adversarial Networks (GANs) or Variational
Autoencoders (VAEs). However, since these approaches try to model all the
structure and scene dynamics at once, in unconstrained settings they often
generate uninterpretable results. Our insight is to model the forecasting
problem at a higher level of abstraction. Specifically, we exploit human pose
detectors as a free source of supervision and break the video forecasting
problem into two discrete steps. First, we explicitly model the high-level
structure of active objects in the scene---humans---and use a VAE to model the
possible future movements of humans in the pose space. We then use the future
poses generated as conditional information to a GAN to predict the future
frames of the video in pixel space. By using the structured space of pose as an
intermediate representation, we sidestep the problems that GANs have in
generating video pixels directly. We show through quantitative and qualitative
evaluation that our method outperforms state-of-the-art methods for video
prediction.
Comment: Project Website: http://www.cs.cmu.edu/~jcwalker/POS/POS.htm
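The two-step pipeline can be summarized in a few lines; `pose_vae` and `frame_gan` are placeholder modules standing in for the paper's pose VAE and pose-conditioned GAN, and the latent dimension is an assumption:

```python
import torch

def forecast(pose_vae, frame_gan, past_frames, past_poses, z_dim=32):
    z = torch.randn(past_poses.size(0), z_dim)             # stochastic future sample
    future_poses = pose_vae.decode(z, past_poses)          # step 1: futures in pose space
    future_frames = frame_gan(past_frames, future_poses)   # step 2: pose-conditioned pixels
    return future_poses, future_frames
```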
Multi-View Frame Reconstruction with Conditional GAN
Multi-view frame reconstruction is an important problem, particularly when
multiple frames are missing and the past and future frames within the same camera
are far apart from the missing ones. Realistic, coherent frames can still be
reconstructed using corresponding frames from other overlapping cameras. We
propose an adversarial approach to learn the spatio-temporal representation of
the missing frame using conditional Generative Adversarial Network (cGAN). The
conditional input to each cGAN is the preceding or following frames within the
camera or the corresponding frames in other overlapping cameras, all of which
are merged together using a weighted average. Representations learned from
frames within the camera are given more weight compared to the ones learned
from other cameras when they are close to the missing frames and vice versa.
Experiments on two challenging datasets demonstrate that our framework produces
results comparable to the state-of-the-art reconstruction method in a single
camera and achieves promising performance in the multi-camera scenario.
Comment: 5 pages, 4 figures, 3 tables, Accepted at IEEE Global Conference on Signal and Information Processing, 201
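A small sketch of the distance-dependent weighting described above; the exponential weighting function and its scale `tau` are assumptions used only to illustrate how same-camera representations dominate when temporally close and cross-camera ones take over otherwise:

```python
import numpy as np

def merge_representations(rep_same_cam, rep_other_cams, frame_gap, tau=5.0):
    # rep_same_cam: representation learned from frames within the same camera
    # rep_other_cams: representations from corresponding frames in other cameras
    # frame_gap: temporal distance (frames) to the nearest available same-camera frame
    w_same = np.exp(-frame_gap / tau)            # closer same-camera frames -> more weight
    w_other = (1.0 - w_same) / max(len(rep_other_cams), 1)
    return w_same * rep_same_cam + w_other * sum(rep_other_cams)

print(merge_representations(np.ones(4), [np.zeros(4), 2 * np.ones(4)], frame_gap=2.0))
```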
Time-Agnostic Prediction: Predicting Predictable Video Frames
Prediction is arguably one of the most basic functions of an intelligent
system. In general, the problem of predicting events in the future or between
two waypoints is exceedingly difficult. However, most phenomena naturally pass
through relatively predictable bottlenecks---while we cannot predict the
precise trajectory of a robot arm between being at rest and holding an object
up, we can be certain that it must have picked the object up. To exploit this,
we decouple visual prediction from a rigid notion of time. While conventional
approaches predict frames at regularly spaced temporal intervals, our
time-agnostic predictors (TAP) are not tied to specific times so that they may
instead discover predictable "bottleneck" frames no matter when they occur. We
evaluate our approach for future and intermediate frame prediction across three
robotic manipulation tasks. Our predictions are not only of higher visual
quality, but also correspond to coherent semantic subgoals in temporally
extended tasks.
Comment: 8 pages, plus appendices
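The time-agnostic idea can be stated compactly: the prediction is penalized against the best-matching frame among all candidate target times rather than the frame at one fixed time. A minimal sketch of this minimum-over-time loss (tensor shapes and names are illustrative, not the authors' full objective):

```python
import torch

def time_agnostic_loss(pred_frame, target_frames):
    # pred_frame: (N, C, H, W); target_frames: (N, T, C, H, W) candidate ground truths
    per_time = ((target_frames - pred_frame.unsqueeze(1)) ** 2).mean(dim=(2, 3, 4))  # (N, T)
    return per_time.min(dim=1).values.mean()     # penalize only the best-matched time

print(time_agnostic_loss(torch.randn(2, 3, 8, 8), torch.randn(2, 5, 3, 8, 8)))
```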
Consistent Generative Query Networks
Stochastic video prediction models take in a sequence of image frames, and
generate a sequence of consecutive future image frames. These models typically
generate future frames in an autoregressive fashion, which is slow and requires
the input and output frames to be consecutive. We introduce a model that
overcomes these drawbacks by generating a latent representation from an
arbitrary set of frames that can then be used to simultaneously and efficiently
sample temporally consistent frames at arbitrary time-points. For example, our
model can "jump" and directly sample frames at the end of the video, without
sampling intermediate frames. Synthetic video evaluations confirm substantial
gains in speed and functionality without loss in fidelity. We also apply our
framework to a 3D scene reconstruction dataset. Here, our model is conditioned
on camera location and can sample consistent sets of images for what an
occluded region of a 3D scene might look like, even if there are multiple
possibilities for what that region might contain. Reconstructions and videos
are available at https://bit.ly/2O4Pc4R
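A hedged sketch of the "jumpy" sampling interface described above, with placeholder `encoder` and `renderer` modules: a single latent is inferred from an arbitrary set of context frames and then queried at arbitrary target times, with no autoregressive rollout over intermediate frames:

```python
import torch

def jumpy_sample(encoder, renderer, context_frames, context_times, query_times):
    # context_frames: (N, K, C, H, W); context_times: (N, K); query_times: (N, M)
    latent = encoder(context_frames, context_times)   # one latent for the whole sequence
    # every queried time is rendered independently from the same latent, so distant
    # frames can be sampled without generating the intermediate ones
    frames = [renderer(latent, query_times[:, i]) for i in range(query_times.size(1))]
    return torch.stack(frames, dim=1)                 # (N, M, C, H, W)
```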
Future Frame Prediction for Anomaly Detection -- A New Baseline
Anomaly detection in videos refers to the identification of events that do
not conform to expected behavior. However, almost all existing methods tackle
the problem by minimizing the reconstruction errors of training data, which
cannot guarantee a larger reconstruction error for an abnormal event. In this
paper, we propose to tackle the anomaly detection problem within a video
prediction framework. To the best of our knowledge, this is the first work that
leverages the difference between a predicted future frame and its ground truth
to detect an abnormal event. To predict a future frame with higher quality for
normal events, in addition to the commonly used appearance (spatial) constraints
on intensity and gradient, we also introduce a motion (temporal) constraint into
video prediction by enforcing the optical flow between predicted frames and
ground-truth frames to be consistent; this is the first work that introduces a
temporal constraint into the video prediction task. Such spatial and motion
constraints facilitate future frame prediction for normal events, and
consequently help identify those abnormal events that do not conform to the
expectation. Extensive experiments on both a toy dataset and
some publicly available datasets validate the effectiveness of our method in
terms of robustness to the uncertainty in normal events and the sensitivity to
abnormal events.
Comment: IEEE Conference on Computer Vision and Pattern Recognition 201
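A sketch of how a prediction-error anomaly score might be computed in this setting: frames whose predictions have low PSNR against the ground truth receive high anomaly scores. The min-max normalization over the sequence is a common convention; constants and names here are assumptions rather than the paper's exact scoring code:

```python
import numpy as np

def psnr(pred, real, max_val=1.0):
    mse = np.mean((pred - real) ** 2)
    return 10.0 * np.log10(max_val ** 2 / (mse + 1e-12))

def anomaly_scores(pred_frames, real_frames):
    p = np.array([psnr(p_, r_) for p_, r_ in zip(pred_frames, real_frames)])
    regularity = (p - p.min()) / (p.max() - p.min() + 1e-12)  # 1 = well predicted
    return 1.0 - regularity                                   # high = likely anomalous

print(anomaly_scores(np.random.rand(4, 64, 64), np.random.rand(4, 64, 64)))
```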
Video-to-Video Synthesis
We study the problem of video-to-video synthesis, whose goal is to learn a
mapping function from an input source video (e.g., a sequence of semantic
segmentation masks) to an output photorealistic video that precisely depicts
the content of the source video. While its image counterpart, the
image-to-image synthesis problem, is a popular topic, the video-to-video
synthesis problem is less explored in the literature. Without understanding
temporal dynamics, directly applying existing image synthesis approaches to an
input video often results in temporally incoherent videos of low visual
quality. In this paper, we propose a novel video-to-video synthesis approach
under the generative adversarial learning framework. Through carefully-designed
generator and discriminator architectures, coupled with a spatio-temporal
adversarial objective, we achieve high-resolution, photorealistic, temporally
coherent video results on a diverse set of input formats including segmentation
masks, sketches, and poses. Experiments on multiple benchmarks show the
advantage of our method compared to strong baselines. In particular, our model
is capable of synthesizing 2K resolution videos of street scenes up to 30
seconds long, which significantly advances the state-of-the-art of video
synthesis. Finally, we apply our approach to future video prediction,
outperforming several state-of-the-art competing systems.
Comment: In NeurIPS, 2018. Code, models, and more results are available at
https://github.com/NVIDIA/vid2vi
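Loosely sketching the kind of spatio-temporal adversarial objective described above (not NVIDIA's implementation; `img_disc` and `vid_disc` are assumed discriminators): an image discriminator scores individual frames, a video discriminator scores the clip, and the generator is trained against both:

```python
import torch
import torch.nn.functional as F

def generator_loss(img_disc, vid_disc, fake_frames):
    # fake_frames: (N, T, C, H, W) generated clip
    n, t, c, h, w = fake_frames.shape
    frame_logits = img_disc(fake_frames.reshape(n * t, c, h, w))  # per-frame realism
    clip_logits = vid_disc(fake_frames)                           # temporal realism
    return (F.binary_cross_entropy_with_logits(frame_logits, torch.ones_like(frame_logits))
            + F.binary_cross_entropy_with_logits(clip_logits, torch.ones_like(clip_logits)))
```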