95,500 research outputs found
Spatio-temporal video autoencoder with differentiable memory
We describe a new spatio-temporal video autoencoder, based on a classic
spatial image autoencoder and a novel nested temporal autoencoder. The temporal
encoder is represented by a differentiable visual memory composed of
convolutional long short-term memory (LSTM) cells that integrate changes over
time. Here we target motion changes and use as temporal decoder a robust
optical flow prediction module together with an image sampler serving as
built-in feedback loop. The architecture is end-to-end differentiable. At each
time step, the system receives as input a video frame, predicts the optical
flow based on the current observation and the LSTM memory state as a dense
transformation map, and applies it to the current frame to generate the next
frame. By minimising the reconstruction error between the predicted next frame
and the corresponding ground truth next frame, we train the whole system to
extract features useful for motion estimation without any supervision effort.
We present one direct application of the proposed framework in
weakly-supervised semantic segmentation of videos through label propagation
using optical flow
Beyond Short Snippets: Deep Networks for Video Classification
Convolutional neural networks (CNNs) have been extensively applied for image
recognition problems giving state-of-the-art results on recognition, detection,
segmentation and retrieval. In this work we propose and evaluate several deep
neural network architectures to combine image information across a video over
longer time periods than previously attempted. We propose two methods capable
of handling full length videos. The first method explores various convolutional
temporal feature pooling architectures, examining the various design choices
which need to be made when adapting a CNN for this task. The second proposed
method explicitly models the video as an ordered sequence of frames. For this
purpose we employ a recurrent neural network that uses Long Short-Term Memory
(LSTM) cells which are connected to the output of the underlying CNN. Our best
networks exhibit significant performance improvements over previously published
results on the Sports 1 million dataset (73.1% vs. 60.9%) and the UCF-101
datasets with (88.6% vs. 88.0%) and without additional optical flow information
(82.6% vs. 72.8%)
- …