Folded Recurrent Neural Networks for Future Video Prediction
Future video prediction is an ill-posed Computer Vision problem that has recently
received much attention. Its main challenges are the high variability in video
content, the propagation of errors through time, and the non-specificity of the
future frames: given a sequence of past frames there is a continuous
distribution of possible futures. This work introduces bijective Gated
Recurrent Units, a double mapping between the input and output of a GRU layer.
This allows for recurrent auto-encoders with state sharing between encoder and
decoder, stratifying the sequence representation and helping to prevent
capacity problems. We show how with this topology only the encoder or decoder
needs to be applied for input encoding and prediction, respectively. This
reduces the computational cost and avoids re-encoding the predictions when
generating a sequence of frames, mitigating the propagation of errors.
Furthermore, it is possible to remove layers from an already trained model,
giving insight into the role performed by each layer and making the model more
explainable. We evaluate our approach on three video datasets, outperforming
state-of-the-art prediction results on MMNIST and UCF101, and obtaining
competitive results on KTH with 2 and 3 times less memory usage and
computational cost than the best-scoring approach.
Comment: Submitted to European Conference on Computer Vision
Visual Dynamics: Probabilistic Future Frame Synthesis via Cross Convolutional Networks
We study the problem of synthesizing a number of likely future frames from a
single input image. In contrast to traditional methods, which have tackled this
problem in a deterministic or non-parametric way, we propose a novel approach
that models future frames in a probabilistic manner. Our probabilistic model
makes it possible for us to sample and synthesize many possible future frames
from a single input image. Future frame synthesis is challenging, as it
involves low- and high-level image and motion understanding. We propose a novel
network structure, namely a Cross Convolutional Network, to aid in synthesizing
future frames; this network structure encodes image and motion information as
feature maps and convolutional kernels, respectively. In experiments, our model
performs well on synthetic data, such as 2D shapes and animated game sprites,
as well as on real-world videos. We also show that our model can be applied to
tasks such as visual analogy-making, and present an analysis of the learned
network representations.
Comment: The first two authors contributed equally to this work