PreCNet: Next Frame Video Prediction Based on Predictive Coding
Predictive coding, currently a highly influential theory in neuroscience, has
not been widely adopted in machine learning yet. In this work, we transform the
seminal model of Rao and Ballard (1999) into a modern deep learning framework
while remaining maximally faithful to the original schema. The resulting
network we propose (PreCNet) is tested on a widely used next frame video
prediction benchmark, which consists of images from an urban environment
recorded from a car-mounted camera. On this benchmark (training: 41k images
from KITTI dataset; testing: Caltech Pedestrian dataset), we achieve, to our
knowledge, the best performance to date as measured by the Structural
Similarity Index (SSIM). Performance on all measures improved further when a
larger training set (2M images from BDD100k) was used, pointing to the
limitations of the KITTI training set. This work demonstrates that an
architecture carefully grounded in a neuroscience model, without being
explicitly tailored to the task at hand, can exhibit unprecedented performance.
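The Rao and Ballard (1999) scheme the abstract builds on can be sketched as a simple error-driven update: a top-down prediction is compared with the input, and the latent representation is nudged to reduce the prediction error. The sketch below is ours, with illustrative names (`x`, `W`, `r`, `lr`); it is not PreCNet's actual architecture.

```python
import numpy as np

# Toy sketch of a Rao & Ballard (1999)-style predictive coding update.
# All names here are illustrative assumptions, not PreCNet parameters.

rng = np.random.default_rng(0)
x = rng.standard_normal(16)              # observed signal (e.g., a flattened patch)
W = rng.standard_normal((16, 8)) / 4.0   # generative weights, scaled for stability
r = np.zeros(8)                          # latent representation
lr = 0.05                                # step size for the error-driven update

initial_error = np.linalg.norm(x - W @ r)
for _ in range(200):
    prediction = W @ r                   # top-down prediction of the input
    error = x - prediction               # bottom-up prediction error signal
    r += lr * (W.T @ error)              # gradient step on ||x - W r||^2

final_error = np.linalg.norm(x - W @ r)  # smaller than the initial error after settling
```

In the full model this error-minimization loop is stacked hierarchically, with each layer predicting the activity of the layer below.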
MIMO Is All You Need : A Strong Multi-In-Multi-Out Baseline for Video Prediction
The mainstream of the existing approaches for video prediction builds up
their models based on a Single-In-Single-Out (SISO) architecture, which takes
the current frame as input to predict the next frame in a recursive manner.
This approach often leads to severe performance degradation when models try to
extrapolate further into the future, thus limiting the practical use of the
prediction model. Alternatively, a Multi-In-Multi-Out (MIMO) architecture that
outputs all the future frames at one shot naturally breaks the recursive manner
and therefore prevents error accumulation. However, only a few MIMO models for
video prediction are proposed and they only achieve inferior performance due to
the date. The real strength of the MIMO model in this area is not well noticed
and is largely under-explored. Motivated by that, we conduct a comprehensive
investigation in this paper to see how far a simple MIMO architecture can go.
Surprisingly, our empirical studies reveal that a simple MIMO model can
outperform the state of the art by a margin much larger than expected,
especially in mitigating long-term error accumulation.
After exploring a number of ways and designs, we propose a new MIMO
architecture based on extending the pure Transformer with local spatio-temporal
blocks and a new multi-output decoder, namely MIMO-VP, to establish a new
standard in video prediction. We evaluate our model on four highly competitive
benchmarks (Moving MNIST, Human3.6M, Weather, KITTI). Extensive experiments
show that our model achieves 1st place on all benchmarks with remarkable
performance gains and surpasses the best SISO model in all aspects, including
efficiency and quantitative and qualitative results. We believe our model can
serve as a new baseline to facilitate future research on video prediction. The
code will be released.
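The SISO-vs-MIMO contrast described above can be illustrated with a toy predictor: a recursive rollout feeds its own noisy output back in, so per-step errors compound, whereas one-shot prediction from the observed frame carries only a single step's error. The `noisy_step` function is a stand-in assumption, not MIMO-VP itself.

```python
import numpy as np

# Illustrative contrast between SISO recursive rollout and MIMO one-shot
# prediction; the "model" is a toy identity-plus-noise predictor.

rng = np.random.default_rng(0)
T = 10                       # number of future frames to predict
frame = np.zeros((4, 4))     # last observed frame (stationary ground truth here)

def noisy_step(x, rng):
    """Stand-in one-step predictor whose output carries a small error."""
    return x + 0.1 * rng.standard_normal(x.shape)

# SISO: each prediction becomes the next input, so errors accumulate
siso_preds = []
x = frame
for _ in range(T):
    x = noisy_step(x, rng)
    siso_preds.append(x)

# MIMO: all T frames are predicted from the observed frame in one shot
mimo_preds = [noisy_step(frame, rng) for _ in range(T)]

# at the final horizon, the recursive rollout has drifted much further
siso_err = np.linalg.norm(siso_preds[-1] - frame)
mimo_err = np.linalg.norm(mimo_preds[-1] - frame)
```

The final-horizon SISO error grows roughly with the square root of the horizon length, while the MIMO error stays at the single-step level, which is the error-accumulation effect the abstract refers to.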
Disentangling Physical Dynamics from Unknown Factors for Unsupervised Video Prediction
Leveraging physical knowledge described by partial differential equations
(PDEs) is an appealing way to improve unsupervised video prediction methods.
Since physics is too restrictive for describing the full visual content of
generic videos, we introduce PhyDNet, a two-branch deep architecture, which
explicitly disentangles PDE dynamics from unknown complementary information. A
second contribution is to propose a new recurrent physical cell (PhyCell),
inspired by data assimilation techniques, for performing PDE-constrained
prediction in latent space. Extensive experiments conducted on four diverse
datasets show the ability of PhyDNet to outperform state-of-the-art methods.
Ablation studies also highlight the important gains brought by both
disentanglement and PDE-constrained prediction. Finally, we show that PhyDNet
presents interesting features for dealing with missing data and long-term
forecasting.
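A minimal sketch of the two-branch idea, assuming a diffusion (heat-equation) prior for the physical branch and a zero placeholder for the residual branch; in PhyDNet both branches are learned recurrent networks, unlike this toy.

```python
import numpy as np

# Two-branch sketch in the spirit of PhyDNet: one branch applies a known
# physical prior (a discrete 2D diffusion step with periodic boundaries),
# the other models residual content the PDE cannot capture.
# The residual branch below is a placeholder assumption.

def diffusion_step(u, alpha=0.1):
    """One explicit finite-difference step of du/dt = alpha * laplacian(u)."""
    lap = (np.roll(u, 1, 0) + np.roll(u, -1, 0)
           + np.roll(u, 1, 1) + np.roll(u, -1, 1) - 4 * u)
    return u + alpha * lap

def residual_branch(u):
    """Placeholder for the learned branch handling non-physical content."""
    return np.zeros_like(u)

rng = np.random.default_rng(0)
u = rng.random((8, 8))

# the predicted next frame is the sum of the two disentangled branches
next_u = diffusion_step(u) + residual_branch(u)
```

With periodic boundaries the diffusion step conserves the total intensity, so any change in overall content must come from the residual branch, which is one way to see the disentanglement at work.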
- …