Spatio-temporal video autoencoder with differentiable memory
We describe a new spatio-temporal video autoencoder, based on a classic
spatial image autoencoder and a novel nested temporal autoencoder. The temporal
encoder is represented by a differentiable visual memory composed of
convolutional long short-term memory (LSTM) cells that integrate changes over
time. Here we target motion changes and use as temporal decoder a robust
optical flow prediction module together with an image sampler serving as a
built-in feedback loop. The architecture is end-to-end differentiable. At each
time step, the system receives as input a video frame, predicts the optical
flow based on the current observation and the LSTM memory state as a dense
transformation map, and applies it to the current frame to generate the next
frame. By minimising the reconstruction error between the predicted next frame
and the corresponding ground truth next frame, we train the whole system to
extract features useful for motion estimation without any supervision effort.
We present one direct application of the proposed framework in
weakly-supervised semantic segmentation of videos through label propagation
using optical flow.
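The per-step loop described above (predict a dense flow from the current frame and the memory state, warp the current frame with it, and compare against the ground-truth next frame) can be sketched as follows. This is a minimal NumPy illustration only: it uses nearest-neighbour sampling where the actual model uses a differentiable (bilinear) image sampler, it omits the ConvLSTM entirely, and `warp_frame` and `reconstruction_loss` are hypothetical helper names, not the paper's code.

```python
import numpy as np

def warp_frame(frame, flow):
    """Backward-warp a frame with a dense flow field.

    frame: (H, W) array; flow: (H, W, 2) array giving, for each output
    pixel, the (dy, dx) offset of the source pixel in the input frame.
    Nearest-neighbour sampling is used here for clarity; a differentiable
    model would use bilinear sampling instead.
    """
    h, w = frame.shape
    ys, xs = np.mgrid[0:h, 0:w]
    src_y = np.clip(np.round(ys + flow[..., 0]).astype(int), 0, h - 1)
    src_x = np.clip(np.round(xs + flow[..., 1]).astype(int), 0, w - 1)
    return frame[src_y, src_x]

def reconstruction_loss(pred_next, true_next):
    """Mean squared reconstruction error between predicted and true frame."""
    return float(np.mean((pred_next - true_next) ** 2))

# Toy example: content shifted one pixel right, with a matching flow field.
frame = np.zeros((4, 4)); frame[1, 1] = 1.0
true_next = np.zeros((4, 4)); true_next[1, 2] = 1.0
flow = np.zeros((4, 4, 2)); flow[..., 1] = -1.0  # sample from one pixel left
pred = warp_frame(frame, flow)
loss = reconstruction_loss(pred, true_next)
```

With a correct flow field the warped frame matches the next frame and the loss is zero; training minimises this loss over the flow-prediction network without any motion supervision.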
Learning to Detect Violent Videos using Convolutional Long Short-Term Memory
Developing a technique for the automatic analysis of surveillance videos in
order to identify the presence of violence is of broad interest. In this work,
we propose a deep neural network for the purpose of recognizing violent videos.
A convolutional neural network is used to extract frame-level features from a
video. The frame-level features are then aggregated using a variant of the long
short-term memory (LSTM) that uses convolutional gates. The convolutional neural
network together with the convolutional LSTM is capable of
capturing localized spatio-temporal features, which enables the analysis of
local motion taking place in the video. We also propose to use adjacent frame
differences as the input to the model thereby forcing it to encode the changes
occurring in the video. The performance of the proposed feature extraction
pipeline is evaluated on three standard benchmark datasets in terms of
recognition accuracy. Comparison of the results obtained with the
state-of-the-art techniques revealed the promising capability of the proposed
method in recognizing violent videos. Comment: Accepted at the International
Conference on Advanced Video and Signal Based Surveillance (AVSS 2017).
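The adjacent-frame-difference input proposed above amounts to a simple preprocessing step before the CNN + convolutional LSTM pipeline. The NumPy sketch below shows only that step; `frame_differences` is a hypothetical helper name, and the feature-extraction and aggregation stages are omitted.

```python
import numpy as np

def frame_differences(frames):
    """Adjacent frame differences: (T, H, W) -> (T-1, H, W).

    Feeding differences instead of raw frames forces the downstream
    model to encode the changes occurring in the video rather than
    static appearance.
    """
    return frames[1:] - frames[:-1]

# Toy clip whose brightness increases by one unit per frame.
frames = np.stack([np.full((2, 2), t, dtype=float) for t in range(5)])
diffs = frame_differences(frames)
```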
Learning Blind Motion Deblurring
As handheld video cameras are now commonplace and available in every
smartphone, images and videos can be recorded almost everywhere at anytime.
However, taking a quick shot frequently yields a blurry result due to unwanted
camera shake during recording or moving objects in the scene. Removing these
artifacts from the blurry recordings is a highly ill-posed problem as neither
the sharp image nor the motion blur kernel is known. Propagating information
between multiple consecutive blurry observations can help restore the desired
sharp image or video. Solutions for blind deconvolution based on neural
networks rely on a massive amount of ground-truth data which is hard to
acquire. In this work, we propose an efficient approach to produce a
significant amount of realistic training data and introduce a novel recurrent
network architecture to deblur frames taking temporal information into account,
which can efficiently handle arbitrary spatial and temporal input sizes. We
demonstrate the versatility of our approach in a comprehensive comparison on a
number of challenging real-world examples. Comment: International Conference on
Computer Vision (ICCV) 2017.
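One common way to produce realistic training pairs for blind deblurring is to average runs of consecutive sharp frames from a high-frame-rate video, mimicking the temporal integration a slow shutter performs during camera shake. The abstract does not spell out this paper's exact generation procedure, so the sketch below is an assumption, and `synthesize_blur` is a hypothetical helper name.

```python
import numpy as np

def synthesize_blur(sharp_frames, window=3):
    """Average `window` consecutive sharp frames to mimic motion blur.

    sharp_frames: (T, H, W) array. Returns (T - window + 1, H, W) of
    synthetic blurry frames; each sharp centre frame can then serve as
    the ground truth for its blurry counterpart.
    """
    t = sharp_frames.shape[0]
    return np.stack([sharp_frames[i:i + window].mean(axis=0)
                     for i in range(t - window + 1)])

# Toy sequence whose intensity ramps linearly over time.
sharp = np.stack([np.full((2, 2), t, dtype=float) for t in range(5)])
blurry = synthesize_blur(sharp, window=3)
```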
Temporal Interpolation via Motion Field Prediction
Navigated 2D multi-slice dynamic Magnetic Resonance (MR) imaging enables high
contrast 4D MR imaging during free breathing and provides in-vivo observations
for treatment planning and guidance. Navigator slices are vital for
retrospective stacking of 2D data slices in this method. However, they also
prolong the acquisition sessions. Temporal interpolation of navigator slices can
be used to reduce the number of navigator acquisitions without degrading
specificity in stacking. In this work, we propose a convolutional neural
network (CNN) based method for temporal interpolation via motion field
prediction. The proposed formulation incorporates the prior knowledge that a
motion field underlies changes in the image intensities over time. Previous
approaches that interpolate directly in the intensity space are prone to
produce blurry images or even remove structures in the images. Our method
avoids such problems and faithfully preserves the information in the image.
Further, an important advantage of our formulation is that it provides an
unsupervised estimation of bi-directional motion fields. We show that these
motion fields can be used to halve the number of registrations required during
4D reconstruction, thus substantially reducing the reconstruction time. Comment:
Submitted to the 1st Conference on Medical Imaging with Deep Learning (MIDL
2018), Amsterdam, The Netherlands.
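Given the bi-directional motion fields the method estimates, a midpoint frame can be interpolated by sampling each endpoint frame halfway along its (negated) forward flow and averaging the two warps. The NumPy sketch below assumes linear motion and nearest-neighbour sampling; `warp` and `interpolate_midpoint` are hypothetical names, not the paper's implementation.

```python
import numpy as np

def warp(frame, flow):
    """Nearest-neighbour backward warp; flow gives per-pixel (dy, dx)
    offsets of the source pixel in the input frame."""
    h, w = frame.shape
    ys, xs = np.mgrid[0:h, 0:w]
    sy = np.clip(np.round(ys + flow[..., 0]).astype(int), 0, h - 1)
    sx = np.clip(np.round(xs + flow[..., 1]).astype(int), 0, w - 1)
    return frame[sy, sx]

def interpolate_midpoint(frame_a, frame_b, flow_ab, flow_ba):
    """Midpoint frame under a linear-motion assumption: sample each
    endpoint frame halfway against its forward flow and average."""
    mid_a = warp(frame_a, -0.5 * flow_ab)
    mid_b = warp(frame_b, -0.5 * flow_ba)
    return 0.5 * (mid_a + mid_b)

# Toy example: a bright pixel moves two columns right between a and b,
# so the interpolated midpoint should place it one column right.
frame_a = np.zeros((3, 5)); frame_a[1, 1] = 1.0
frame_b = np.zeros((3, 5)); frame_b[1, 3] = 1.0
flow_ab = np.zeros((3, 5, 2)); flow_ab[..., 1] = 2.0
flow_ba = np.zeros((3, 5, 2)); flow_ba[..., 1] = -2.0
mid = interpolate_midpoint(frame_a, frame_b, flow_ab, flow_ba)
```

Because both directions of the flow come out of a single forward pass, intermediate slices can be synthesised without running a separate registration per skipped navigator, which is the source of the halved registration count mentioned above.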