Video Imagination from a Single Image with Transformation Generation
In this work, we focus on a challenging task: synthesizing multiple imaginary
videos given a single image. The major problems are the high dimensionality of
pixel space and the ambiguity of potential motions. To overcome these problems,
we propose a new framework that produces imaginary videos by transformation
generation. The generated transformations are applied to the original image in
a novel volumetric merge network to reconstruct the frames of the imaginary
video. By sampling different latent variables, our method can output different
imaginary video samples. The framework is trained adversarially with
unsupervised learning. For evaluation, we propose a new assessment metric. In
experiments, we test on 3 datasets ranging from synthetic data to natural
scenes. Our framework achieves promising performance in image quality
assessment, and visual inspection indicates that it can successfully generate
diverse five-frame videos of acceptable perceptual quality.
Comment: 9 pages, 10 figures
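To make the transformation-generation idea concrete, here is a minimal PyTorch sketch, not the paper's architecture: a latent sample is decoded into several dense flow fields, each warps the input image, and a per-pixel softmax merges the warped candidates, standing in for the volumetric merge network. All module names, sizes, and the flow-based transformation family are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TransformAndMerge(nn.Module):
    def __init__(self, z_dim=64, n_transforms=4, size=64):
        super().__init__()
        self.n = n_transforms
        # Map a latent sample to n dense flow fields (2 channels each).
        self.to_flows = nn.Sequential(
            nn.Linear(z_dim, 256), nn.ReLU(),
            nn.Linear(256, n_transforms * 2 * size * size), nn.Tanh())
        # Per-pixel merge weights over the n warped candidates.
        self.to_weights = nn.Conv2d(3 * n_transforms, n_transforms, 1)

    def forward(self, image, z):
        b, _, h, w = image.shape
        flows = self.to_flows(z).view(b * self.n, 2, h, w)
        # Base sampling grid in [-1, 1].
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, h), torch.linspace(-1, 1, w), indexing="ij")
        grid = torch.stack([xs, ys], dim=-1).to(image.device)
        grid = grid + flows.permute(0, 2, 3, 1) * 0.1  # small displacements
        warped = F.grid_sample(
            image.repeat_interleave(self.n, dim=0), grid, align_corners=False)
        warped = warped.view(b, self.n, 3, h, w)
        # "Volumetric" merge: softmax weights pick among candidates per pixel.
        wts = self.to_weights(warped.flatten(1, 2)).softmax(dim=1)
        return (warped * wts.unsqueeze(2)).sum(dim=1)

frame = TransformAndMerge()(torch.rand(1, 3, 64, 64), torch.randn(1, 64))
```

Sampling different `z` vectors yields different warped-and-merged frames, which is the mechanism behind generating multiple video samples from one image.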
Sensor Transformation Attention Networks
Recent work on encoder-decoder models for sequence-to-sequence mapping has
shown that integrating both temporal and spatial attention mechanisms into
neural networks increases the performance of the system substantially. In this
work, we apply an attentional signal not to temporal or spatial regions of the
input, but as a mechanism for switching among the inputs themselves. We
evaluate the role of attentional switching in the
presence of dynamic noise in the sensors, and demonstrate how the attentional
signal responds dynamically to changing noise levels in the environment to
achieve increased performance on both audio and visual tasks in three
commonly-used datasets: TIDIGITS, Wall Street Journal, and GRID. Moreover, the
proposed sensor transformation network architecture naturally introduces a
number of advantages that merit exploration, including ease of adding new
sensors to existing architectures, attentional interpretability, and increased
robustness in a variety of noisy environments not seen during training.
Finally, we demonstrate that the sensor selection attention mechanism of a
model trained only on the small TIDIGITS dataset can be transferred directly to
a pre-existing larger network trained on the Wall Street Journal dataset,
maintaining functionality of switching between sensors to yield a dramatic
reduction of error in the presence of noise.
Comment: 8 pages, 5 figures, 3 tables
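A minimal sketch of the core mechanism, attention used to switch among sensor streams rather than over time or space; the two-sensor setup, layer sizes, and fusion by weighted sum are illustrative assumptions, not the paper's exact network.

```python
import torch
import torch.nn as nn

class SensorAttention(nn.Module):
    def __init__(self, feat_dim=128):
        super().__init__()
        # One scalar attention score per sensor and time step.
        self.score = nn.Linear(feat_dim, 1)

    def forward(self, sensor_feats):
        # sensor_feats: (batch, n_sensors, time, feat_dim)
        scores = self.score(sensor_feats).squeeze(-1)  # (b, s, t)
        alpha = scores.softmax(dim=1)  # attention over sensors, not time
        fused = (alpha.unsqueeze(-1) * sensor_feats).sum(dim=1)  # (b, t, f)
        return fused, alpha

clean = torch.randn(8, 1, 50, 128)
noisy = torch.randn(8, 1, 50, 128) * 3.0  # simulated noisier sensor
fused, alpha = SensorAttention()(torch.cat([clean, noisy], dim=1))
```

Because the softmax runs over the sensor axis, inspecting `alpha` shows which input the model trusts at each time step, which is the interpretability property the abstract highlights.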
MT-VAE: Learning Motion Transformations to Generate Multimodal Human Dynamics
Long-term human motion can be represented as a series of motion modes, i.e.,
motion sequences that capture short-term temporal dynamics, with transitions
between them. We leverage this structure and present a novel Motion
Transformation Variational Auto-Encoder (MT-VAE) for learning motion sequence
generation. Our model jointly learns a feature embedding for motion modes
(from which the motion sequence can be reconstructed) and a feature
transformation that represents the transition from one motion mode to the
next. Our model is able to generate multiple diverse and plausible future
motion sequences from the same input. We apply our approach to both facial and
full body motion, and demonstrate applications such as analogy-based motion
transfer and video synthesis.
Comment: Published at ECCV 2018
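The following is a minimal sketch of the MT-VAE idea under simplifying assumptions: a latent variable encodes the feature transformation between consecutive motion modes, and the decoder applies it additively to the current mode embedding. The dimensions and the additive form are illustrative, not the paper's exact design.

```python
import torch
import torch.nn as nn

class MotionTransformVAE(nn.Module):
    def __init__(self, feat_dim=128, z_dim=32):
        super().__init__()
        self.enc = nn.Linear(2 * feat_dim, 2 * z_dim)     # -> (mu, logvar)
        self.dec = nn.Linear(z_dim + feat_dim, feat_dim)  # -> feature delta

    def forward(self, cur_mode, next_mode):
        mu, logvar = self.enc(torch.cat([cur_mode, next_mode], -1)).chunk(2, -1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
        delta = self.dec(torch.cat([z, cur_mode], -1))
        pred_next = cur_mode + delta  # apply the learned transformation
        kld = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1).mean()
        recon = (pred_next - next_mode).pow(2).mean()
        return recon + kld, pred_next

loss, pred = MotionTransformVAE()(torch.randn(4, 128), torch.randn(4, 128))
```

At test time, sampling several `z` vectors for the same `cur_mode` produces multiple plausible next modes, matching the multimodal-generation claim.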
An Unsupervised Algorithm For Learning Lie Group Transformations
We present several theoretical contributions that allow Lie groups to be fit
to high-dimensional datasets. Transformation operators are represented in
their eigen-basis, reducing the computational complexity of parameter
estimation to that of training a linear transformation model. A
transformation-specific "blurring" operator is introduced that allows
inference to escape local minima via a smoothing of the transformation space.
A penalty on traversed manifold distance is added, encouraging the discovery
of sparse, minimal-distance transformations between states. Both learning and
inference are demonstrated using these methods for the full set of affine
transformations on natural image patches. Transformation operators are then
trained on natural video sequences. It is shown that the learned video
transformations provide a better description of inter-frame differences than
the standard motion model based on rigid translation.
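As a concrete toy instance of fitting a one-parameter Lie group, the sketch below learns a generator A such that x' ≈ exp(sA)x by gradient descent on synthetic pairs; it omits the paper's eigen-basis parameterization, blurring operator, and manifold-distance penalty, so it is only a baseline illustration of the setup.

```python
import torch

d = 16
# Synthetic data: pairs (x, y) related by a one-parameter group action.
A_true = torch.randn(d, d) * 0.1
x = torch.randn(256, d)
s = torch.rand(256)  # per-sample group parameter
y = (torch.matrix_exp(s.view(-1, 1, 1) * A_true) @ x.unsqueeze(-1)).squeeze(-1)

# Learn the generator A by gradient descent on reconstruction error.
A = (torch.randn(d, d) * 0.01).requires_grad_()
opt = torch.optim.Adam([A], lr=1e-2)
for step in range(500):
    pred = (torch.matrix_exp(s.view(-1, 1, 1) * A) @ x.unsqueeze(-1)).squeeze(-1)
    loss = (pred - y).pow(2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
print(loss.item())
```

The naive version differentiates through the matrix exponential at every step; working in the generator's eigen-basis, as the paper proposes, reduces this to the cost of a linear model.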
Predicting the Future with Transformational States
An intelligent observer looks at the world and sees not only what is, but
what is moving and what can be moved. In other words, the observer sees how the
present state of the world can transform in the future. We propose a model that
predicts future images by learning to represent the present state and its
transformation given only a sequence of images. To do so, we introduce an
architecture with a latent state composed of two components designed to capture
(i) the present image state and (ii) the transformation between present and
future states, respectively. We couple this latent state with a recurrent
neural network (RNN) core that predicts future frames by applying the
accumulated state transformation to past states with a learned operator. We
describe how this model can be integrated into an encoder-decoder
convolutional neural network (CNN) architecture that uses weighted residual
connections to integrate representations of the past with representations of
the future. Qualitatively, our approach generates image sequences that are
stable and capture realistic motion over multiple predicted frames, without
requiring adversarial training. Quantitatively, our method achieves prediction
results comparable to the state of the art on standard image prediction
benchmarks (Moving MNIST, KTH, and UCF101).
Comment: 24 pages, including supplement
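A minimal sketch of the two-component latent design, assuming a GRU accumulates the transformation code and a bilinear layer stands in for the learned operator that applies it to the present state; both choices are illustrative, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class TransformationalPredictor(nn.Module):
    def __init__(self, c_dim=128, t_dim=64):
        super().__init__()
        self.rnn = nn.GRUCell(t_dim, t_dim)  # accumulates the transformation
        self.apply_op = nn.Bilinear(c_dim, t_dim, c_dim)  # learned operator

    def forward(self, content, transforms):
        # content: (b, c_dim) present image state;
        # transforms: (b, T, t_dim) per-step transformation codes.
        h = torch.zeros(transforms.size(0), transforms.size(2))
        for t in range(transforms.size(1)):
            h = self.rnn(transforms[:, t], h)
        # Apply the accumulated transformation to the present state,
        # with a residual connection to stabilize the prediction.
        return content + self.apply_op(content, h)

pred = TransformationalPredictor()(torch.randn(4, 128), torch.randn(4, 5, 64))
```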
Switchable Temporal Propagation Network
Videos contain highly redundant information between frames. Such redundancy
has been extensively studied in video compression and encoding, but is less
explored for more advanced video processing. In this paper, we propose a
learnable unified framework for propagating a variety of visual properties of
video images, including but not limited to color, high dynamic range (HDR), and
segmentation information, where the properties are available for only a few
key-frames. Our approach is based on a temporal propagation network (TPN),
which models the transition-related affinity between a pair of frames in a
purely data-driven manner. We theoretically establish two essential properties
of the TPN: (a) if the global transformation matrix is regularized to be
orthogonal, the "style energy" of the property is well preserved during
propagation; (b)
such regularization can be achieved by the proposed switchable TPN with
bi-directional training on pairs of frames. We apply the switchable TPN to
three tasks: colorizing a gray-scale video based on a few color key-frames,
generating an HDR video from a low dynamic range (LDR) video and a few HDR
frames, and propagating a segmentation mask from the first frame in videos.
Experimental results show that our approach is significantly more accurate and
efficient than state-of-the-art methods.
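The orthogonality regularization in (a) can be written down directly; below is a minimal sketch in which a global matrix G propagates a flattened property map and is penalized toward orthogonality. The linear propagation is a simplification of the learned pairwise affinity, and the loss weighting is an illustrative assumption.

```python
import torch

def propagate(prop_key, G):
    # prop_key: (b, c, h*w) flattened property map of a key-frame;
    # G: (c, c) global transformation matrix.
    return G @ prop_key

def tpn_loss(prop_key, prop_target, G, lam=0.1):
    recon = (propagate(prop_key, G) - prop_target).pow(2).mean()
    eye = torch.eye(G.size(0), device=G.device)
    # Penalize deviation from orthogonality: G^T G should stay close to I,
    # which preserves the property's "style energy" under propagation.
    ortho = (G.t() @ G - eye).pow(2).mean()
    return recon + lam * ortho
```

Training the same loss on frame pairs in both temporal directions gives the bi-directional scheme that, per property (b), realizes this regularization in the switchable TPN.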
Deep Learned Frame Prediction for Video Compression
Motion compensation is one of the most essential methods for any video
compression algorithm. Video frame prediction is a task analogous to motion
compensation. In recent years, the task of frame prediction has been
undertaken by deep neural networks (DNNs). In this thesis, we create a DNN to
perform learned frame prediction and additionally implement a codec that
contains our DNN. We train our network using two methods for two different
goals. First, we train our network based on mean square error (MSE) only,
aiming for the highest PSNR values in frame prediction and video compression.
Second, we use adversarial training to produce visually more realistic frame
predictions. For frame prediction, we compare our method with the baseline
methods of frame difference and 16x16 block motion compensation. For video
compression, we further include the x264 video codec in the comparison. We
show that in frame prediction, adversarial training produces frames that look
sharper and more realistic than MSE-based training, but in video compression
it consistently performs worse. This shows that even though adversarial
training is useful for generating video frames that are more pleasing to the
human eye, it should not be employed for video compression. Moreover, our
network trained with MSE produces accurate frame predictions and, in
quantitative results, performs comparably on all videos and outperforms the
other methods on average for both tasks. More specifically, learned frame
prediction outperforms the other methods in terms of rate-distortion
performance for high-motion video, while its rate-distortion performance is
competitive with that of x264 for low-motion video.
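A minimal sketch of the MSE-trained setup, assuming two past frames as input and a toy CNN as the predictor; in a codec, only the residual between the true and predicted frame would be encoded. The network and training loop are illustrative stand-ins, not the thesis's architecture.

```python
import torch
import torch.nn as nn

# Toy CNN mapping two past frames to a prediction of the next frame.
predictor = nn.Sequential(
    nn.Conv2d(2, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 1, 3, padding=1))
opt = torch.optim.Adam(predictor.parameters(), lr=1e-4)

def train_step(f_prev2, f_prev1, f_cur):
    pred = predictor(torch.cat([f_prev2, f_prev1], dim=1))
    loss = (pred - f_cur).pow(2).mean()  # MSE objective, targets high PSNR
    opt.zero_grad(); loss.backward(); opt.step()
    residual = (f_cur - pred).detach()  # what a codec would actually encode
    return loss.item(), residual

l, r = train_step(*(torch.rand(1, 1, 64, 64) for _ in range(3)))
```

Swapping the MSE term for an adversarial loss sharpens the predicted frames but, as the abstract reports, enlarges the residuals the codec must transmit, hurting rate-distortion performance.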
Robust Online Matrix Factorization for Dynamic Background Subtraction
We propose an effective online background subtraction method, which can be
robustly applied to practical videos that have variations in both foreground
and background. Unlike previous methods, which often model the foreground as a
Gaussian or Laplacian distribution, we model the foreground of each frame with
a specific mixture-of-Gaussians (MoG) distribution, which is updated online
frame by frame. In particular, the MoG model in each frame is regularized by
the foreground/background knowledge learned in previous frames. This makes our
online MoG model highly robust, stable, and adaptive to practical foreground
and background variations. The proposed model can be formulated as a concise
probabilistic MAP model, which can be readily solved by an EM algorithm. We
further embed an affine transformation operator into the proposed model, which
can be automatically adjusted to fit a wide range of video background
transformations and makes the method more robust to camera movements. Using a
sub-sampling technique, the proposed method can be accelerated to process more
than 250 frames per second on average, meeting the requirement of real-time
background subtraction for practical video processing tasks. The superiority
of the proposed method is substantiated by extensive experiments on synthetic
and real videos, in comparison with state-of-the-art online and offline
background subtraction methods.
Comment: 14 pages, 13 figures
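A minimal sketch of one online, regularized MoG update, assuming scalar residuals and a single shrinkage weight toward the previous frame's parameters in place of the paper's full MAP-EM derivation; `rho` and the shrinkage form are illustrative assumptions.

```python
import numpy as np

def online_mog_update(resid, mu, var, w, rho=0.1, iters=5):
    # resid: (n,) foreground residuals of the current frame;
    # mu, var, w: (k,) MoG parameters carried over from previous frames.
    mu0, var0, w0 = mu.copy(), var.copy(), w.copy()
    for _ in range(iters):
        # E-step: component responsibilities for each residual.
        ll = (-0.5 * (resid[:, None] - mu) ** 2 / var
              - 0.5 * np.log(2 * np.pi * var) + np.log(w))
        r = np.exp(ll - ll.max(axis=1, keepdims=True))
        r /= r.sum(axis=1, keepdims=True)
        nk = r.sum(axis=0) + 1e-8
        # M-step, shrunk toward the previous frame's parameters: this is
        # the regularization that keeps the online model stable.
        mu = (1 - rho) * (r * resid[:, None]).sum(0) / nk + rho * mu0
        var = (1 - rho) * (r * (resid[:, None] - mu) ** 2).sum(0) / nk + rho * var0
        w = (1 - rho) * nk / nk.sum() + rho * w0
    return mu, var, w
```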
Metric Learning Driven Multi-Task Structured Output Optimization for Robust Keypoint Tracking
As an important and challenging problem in computer vision and graphics,
keypoint-based object tracking is typically formulated in a spatio-temporal
statistical learning framework. However, most existing keypoint trackers are
incapable of effectively modeling and balancing the following three aspects in
a simultaneous manner: temporal model coherence across frames, spatial model
consistency within frames, and discriminative feature construction. To address
this issue, we propose a robust keypoint tracker based on spatio-temporal
multi-task structured output optimization driven by discriminative metric
learning. Consequently, temporal model coherence is characterized by multi-task
structured keypoint model learning over several adjacent frames, while spatial
model consistency is modeled by solving a geometric verification based
structured learning problem. Discriminative feature construction is enabled by
metric learning to ensure intra-class compactness and inter-class
separability. Finally, the above three modules are simultaneously optimized in
a joint learning scheme. Experimental results demonstrate the effectiveness of
our tracker.
Comment: Accepted by AAAI-15
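The metric-learning component can be illustrated with a standard triplet loss; the sketch below combines it with the structured-output terms through an illustrative weight, which is an assumption rather than the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def joint_loss(anchor, positive, negative, structured_loss, w_metric=0.5):
    # anchor/positive: descriptors of the same keypoint in adjacent frames;
    # negative: a descriptor of a different keypoint. The triplet term
    # enforces intra-class compactness and inter-class separability.
    metric = F.triplet_margin_loss(anchor, positive, negative, margin=1.0)
    return structured_loss + w_metric * metric

a, p, n = (torch.randn(32, 128) for _ in range(3))
loss = joint_loss(a, p, n, structured_loss=torch.tensor(0.7))
```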
Functionally Modular and Interpretable Temporal Filtering for Robust Segmentation
The performance of autonomous systems heavily relies on their ability to
generate a robust representation of the environment. Deep neural networks have
greatly improved vision-based perception systems but still fail in challenging
situations, e.g. sensor outages or heavy weather. These failures are often
introduced by data-inherent perturbations, which significantly reduce the
information provided to the perception system. We propose a functionally
modularized temporal filter, which stabilizes an abstract feature
representation of a single-frame segmentation model using information from
previous time steps. Our filter module splits the filtering task into multiple
less complex and more interpretable subtasks. The basic structure of the filter
is inspired by a Bayes estimator consisting of a prediction and an update step.
To make the prediction more transparent, we implement it using a geometric
projection and estimate its parameters. This additionally enables the
decomposition of the filter task into static representation filtering and
low-dimensional motion filtering. Our model can cope with missing frames and is
trainable in an end-to-end fashion. Using photorealistic, synthetic video data,
we show the ability of the proposed architecture to overcome data-inherent
perturbations. The experiments especially highlight the advantages introduced
by an interpretable and explicit filter module.
Comment: In Proceedings of the 29th British Machine Vision Conference (BMVC),
Newcastle upon Tyne, UK, 2018
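A minimal sketch of the prediction/update split, assuming the geometric projection is approximated by a pure translation warp and the update step is a learned per-pixel gate; both are illustrative simplifications of the Bayes-estimator-inspired structure, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalFeatureFilter(nn.Module):
    def __init__(self, c=64):
        super().__init__()
        self.gate = nn.Conv2d(2 * c, c, 1)  # per-pixel update weight

    def forward(self, feat_prev, feat_cur, shift):
        # Prediction step: warp previous filtered features by a
        # low-dimensional motion estimate (here a 2-D translation).
        b = feat_prev.size(0)
        theta = torch.zeros(b, 2, 3)
        theta[:, 0, 0] = theta[:, 1, 1] = 1.0
        theta[:, :, 2] = shift  # (b, 2) translation in normalized coords
        grid = F.affine_grid(theta, list(feat_prev.shape), align_corners=False)
        pred = F.grid_sample(feat_prev, grid, align_corners=False)
        # Update step: gated convex combination with the new observation,
        # so missing or degraded frames can fall back on the prediction.
        g = torch.sigmoid(self.gate(torch.cat([pred, feat_cur], dim=1)))
        return g * feat_cur + (1 - g) * pred

out = TemporalFeatureFilter()(torch.randn(1, 64, 32, 32),
                              torch.randn(1, 64, 32, 32),
                              torch.tensor([[0.05, 0.0]]))
```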