Adversarial Framework for Unsupervised Learning of Motion Dynamics in Videos
Human behavior understanding in videos is a complex, still unsolved problem
that requires accurately modeling motion at both the local (pixel-wise dense
prediction) and global (aggregation of motion cues) levels. Current approaches
based on supervised learning require large amounts of annotated data, whose
scarce availability is one of the main limiting factors to the development of
general solutions. Unsupervised learning can instead leverage the vast amount
of videos available on the web and is a promising solution for overcoming
the existing limitations. In this paper, we propose an adversarial GAN-based
framework that learns video representations and dynamics through a
self-supervision mechanism in order to perform dense and global prediction in
videos. Our approach synthesizes videos by 1) factorizing the process into the
generation of static visual content and motion, 2) learning a suitable
representation of a motion latent space in order to enforce spatio-temporal
coherency of object trajectories, and 3) incorporating motion estimation and
pixel-wise dense prediction into the training procedure. Self-supervision is
enforced by using motion masks produced by the generator, as a co-product of
its generation process, to supervise the discriminator network in performing
dense prediction. Performance evaluation, carried out on standard benchmarks,
shows that our approach is able to learn, in an unsupervised way, both local
and global video dynamics. The learned representations, in turn, support the
training of video object segmentation methods with considerably fewer (about 50%)
annotations, yielding performance comparable to the state of the art.
Furthermore, the proposed method achieves promising performance in generating
realistic videos, outperforming state-of-the-art approaches especially on
motion-related metrics.
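The core trick here, using the generator's own motion masks as free supervision for a dense-prediction head on the discriminator, can be pictured with a toy sketch. The snippet below is a minimal, hypothetical PyTorch illustration (tiny placeholder networks and names, not the authors' architecture): the generator emits an RGB frame plus a motion mask, and that mask becomes the target for the discriminator's pixel-wise head.

```python
# Toy sketch of the self-supervision loop described above (illustrative only;
# layer sizes and module names are placeholders, not the authors' model).
import torch
import torch.nn as nn
import torch.nn.functional as F

class Generator(nn.Module):
    def __init__(self, z_dim=128):
        super().__init__()
        self.fc = nn.Linear(z_dim, 64 * 4 * 4)
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(32, 4, 4, 2, 1))   # 3 RGB channels + 1 motion-mask channel

    def forward(self, z):
        out = self.deconv(self.fc(z).view(-1, 64, 4, 4))
        return out[:, :3], torch.sigmoid(out[:, 3:])   # frame, motion mask

class Discriminator(nn.Module):
    """Two heads: a global real/fake score and a pixel-wise dense prediction."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, 1, 1), nn.LeakyReLU(0.2),
            nn.Conv2d(32, 64, 3, 1, 1), nn.LeakyReLU(0.2))
        self.score = nn.Linear(64, 1)
        self.dense = nn.Conv2d(64, 1, 1)

    def forward(self, x):
        h = self.features(x)
        return self.score(h.mean(dim=(2, 3))), self.dense(h)

G, D = Generator(), Discriminator()
frame, mask = G(torch.randn(8, 128))
score, dense = D(frame)
# Self-supervision: the generator's motion mask, a co-product of generation,
# is the target for the discriminator's dense-prediction head.
dense_loss = F.binary_cross_entropy_with_logits(dense, mask.detach())
```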
MoCoGAN: Decomposing Motion and Content for Video Generation
Visual signals in a video can be divided into content and motion. While
content specifies which objects are in the video, motion describes their
dynamics. Based on this prior, we propose the Motion and Content decomposed
Generative Adversarial Network (MoCoGAN) framework for video generation. The
proposed framework generates a video by mapping a sequence of random vectors to
a sequence of video frames. Each random vector consists of a content part and a
motion part. While the content part is kept fixed, the motion part is realized
as a stochastic process. To learn motion and content decomposition in an
unsupervised manner, we introduce a novel adversarial learning scheme utilizing
both image and video discriminators. Extensive experimental results on several
challenging datasets with qualitative and quantitative comparison to the
state-of-the-art approaches, verify effectiveness of the proposed framework. In
addition, we show that MoCoGAN allows one to generate videos with the same
content but different motion, as well as videos with different content and the same motion.
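As a concrete, hedged reading of this decomposition, the sketch below builds per-frame latent codes the way the abstract describes: one content vector sampled once per clip, and a motion vector evolving as a stochastic recurrent process. Module names and dimensions are illustrative, not the released MoCoGAN code.

```python
# Illustrative sketch of the content/motion latent decomposition
# (not the official MoCoGAN implementation; dimensions are arbitrary).
import torch
import torch.nn as nn

class MotionContentLatent(nn.Module):
    def __init__(self, content_dim=50, motion_dim=10):
        super().__init__()
        self.motion_rnn = nn.GRUCell(motion_dim, motion_dim)
        self.content_dim, self.motion_dim = content_dim, motion_dim

    def forward(self, batch, num_frames):
        z_content = torch.randn(batch, self.content_dim)    # fixed for the whole clip
        h = torch.zeros(batch, self.motion_dim)
        codes = []
        for _ in range(num_frames):
            eps = torch.randn(batch, self.motion_dim)        # fresh noise each step
            h = self.motion_rnn(eps, h)                      # stochastic motion process
            codes.append(torch.cat([z_content, h], dim=1))   # [content | motion]
        return torch.stack(codes, dim=1)                     # (batch, T, content+motion dims)

codes = MotionContentLatent()(batch=4, num_frames=16)        # each code feeds an image generator
```

Reusing the same content code with a resampled motion sequence yields "same content, different motion" videos, and vice versa, which is exactly the control the abstract highlights.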
Improving Video Generation for Multi-functional Applications
In this paper, we aim to improve the state-of-the-art video generative
adversarial networks (GANs) with a view towards multi-functional applications.
Our improved video GAN model does not separate foreground from background nor
dynamic from static patterns, but learns to generate the entire video clip
conjointly. Our model can thus be trained to generate - and learn from - a
broad set of videos with no restriction. This is achieved by designing a robust
one-stream video generation architecture with an extension of the
state-of-the-art Wasserstein GAN framework that allows for better convergence.
The experimental results show that our improved video GAN model outperforms
state-of-the-art video generative models on multiple challenging datasets.
Furthermore, we demonstrate the superiority of our model by successfully
extending it to three challenging problems: video colorization, video
inpainting, and future prediction. To the best of our knowledge, this is the
first work using GANs to colorize and inpaint video clips.
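The two key ingredients named above, a single-stream 3D-convolutional critic over whole clips and an improved Wasserstein objective, can be sketched as follows. This assumes the "extension of the Wasserstein GAN framework" behaves like a gradient-penalty loss; the critic is a deliberately tiny placeholder, not the paper's architecture.

```python
# Hedged sketch: a one-stream 3D-conv video critic with a WGAN gradient-penalty
# loss (an assumed stand-in for the paper's improved Wasserstein objective).
import torch
import torch.nn as nn

critic = nn.Sequential(                      # one stream over the entire clip
    nn.Conv3d(3, 32, 4, 2, 1), nn.LeakyReLU(0.2),
    nn.Conv3d(32, 1, 4, 2, 1))

def critic_loss(critic, real, fake, lam=10.0):
    eps = torch.rand(real.size(0), 1, 1, 1, 1)
    interp = (eps * real + (1 - eps) * fake).requires_grad_(True)
    score = critic(interp).mean(dim=(1, 2, 3, 4))
    grads = torch.autograd.grad(score.sum(), interp, create_graph=True)[0]
    penalty = ((grads.flatten(1).norm(2, dim=1) - 1) ** 2).mean()
    return critic(fake).mean() - critic(real).mean() + lam * penalty

real = torch.randn(2, 3, 16, 32, 32)         # (batch, channels, frames, H, W)
fake = torch.randn(2, 3, 16, 32, 32)         # would come from the generator
critic_loss(critic, real, fake).backward()
```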
Dual Motion GAN for Future-Flow Embedded Video Prediction
Future frame prediction in videos is a promising avenue for unsupervised
video representation learning. Video frames are naturally generated by the
inherent pixel flows from preceding frames based on the appearance and motion
dynamics in the video. However, existing methods focus on directly
hallucinating pixel values, resulting in blurry predictions. In this paper, we
develop a dual motion Generative Adversarial Net (GAN) architecture, which
learns to explicitly enforce future-frame predictions to be consistent with the
pixel-wise flows in the video through a dual-learning mechanism. The primal
future-frame prediction and dual future-flow prediction form a closed loop,
generating informative feedback signals to each other for better video
prediction. To make both synthesized future frames and flows indistinguishable
from reality, a dual adversarial training method is proposed to ensure that the
future-flow prediction is able to help infer realistic future-frames, while the
future-frame prediction in turn leads to realistic optical flows. Our dual
motion GAN also handles natural motion uncertainty in different pixel locations
with a new probabilistic motion encoder, which is based on variational
autoencoders. Extensive experiments demonstrate that the proposed dual motion
GAN significantly outperforms state-of-the-art approaches on synthesizing new
video frames and predicting future flows. Our model generalizes well across
diverse visual scenes and shows superiority in unsupervised video
representation learning.
Comment: ICCV 17 camera-ready.
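The closed loop between the primal frame branch and the dual flow branch can be made concrete with a small warping check: the last observed frame, warped by the predicted flow, should agree with the directly predicted frame. The helper below is a generic backward-warping sketch, not the paper's code, and it omits the dual discriminators and the variational motion encoder.

```python
# Hedged sketch of the dual-learning consistency: warp the last observed frame
# with the dual branch's flow and compare against the primal branch's frame.
import torch
import torch.nn.functional as F

def warp(frame, flow):
    """Backward-warp `frame` (B,C,H,W) with `flow` (B,2,H,W) given in pixels."""
    b, _, h, w = frame.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    gx = 2.0 * (xs + flow[:, 0]) / (w - 1) - 1.0    # normalize x to [-1, 1]
    gy = 2.0 * (ys + flow[:, 1]) / (h - 1) - 1.0    # normalize y to [-1, 1]
    return F.grid_sample(frame, torch.stack((gx, gy), dim=-1), align_corners=True)

last_frame = torch.rand(1, 3, 64, 64)
pred_frame = torch.rand(1, 3, 64, 64)      # output of the primal frame branch
pred_flow  = torch.zeros(1, 2, 64, 64)     # output of the dual flow branch
# Closed-loop consistency: the two branches supervise each other.
cycle_loss = F.l1_loss(warp(last_frame, pred_flow), pred_frame)
```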
Video-to-Video Synthesis
We study the problem of video-to-video synthesis, whose goal is to learn a
mapping function from an input source video (e.g., a sequence of semantic
segmentation masks) to an output photorealistic video that precisely depicts
the content of the source video. While its image counterpart, the
image-to-image synthesis problem, is a popular topic, the video-to-video
synthesis problem is less explored in the literature. Without understanding
temporal dynamics, directly applying existing image synthesis approaches to an
input video often results in temporally incoherent videos of low visual
quality. In this paper, we propose a novel video-to-video synthesis approach
under the generative adversarial learning framework. Through carefully-designed
generator and discriminator architectures, coupled with a spatio-temporal
adversarial objective, we achieve high-resolution, photorealistic, temporally
coherent video results on a diverse set of input formats including segmentation
masks, sketches, and poses. Experiments on multiple benchmarks show the
advantage of our method compared to strong baselines. In particular, our model
is capable of synthesizing 2K resolution videos of street scenes up to 30
seconds long, which significantly advances the state-of-the-art of video
synthesis. Finally, we apply our approach to future video prediction,
outperforming several state-of-the-art competing systems.
Comment: In NeurIPS, 2018. Code, models, and more results are available at
https://github.com/NVIDIA/vid2vi
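A hedged schematic of the sequential conditioning described above is given below: each output frame is generated from the current segmentation map plus a short history of previously generated frames. The real model additionally uses flow-based warping and multi-scale spatio-temporal discriminators; the module and sizes here are toy placeholders.

```python
# Schematic of sequential, history-conditioned video-to-video generation
# (illustrative only; not the released vid2vid architecture).
import torch
import torch.nn as nn

class SequentialGenerator(nn.Module):
    def __init__(self, seg_channels=1):
        super().__init__()
        # Input: current segmentation map plus the two previously generated frames.
        self.net = nn.Sequential(
            nn.Conv2d(seg_channels + 2 * 3, 64, 3, 1, 1), nn.ReLU(),
            nn.Conv2d(64, 3, 3, 1, 1), nn.Tanh())

    def forward(self, seg_seq):                      # seg_seq: (B, T, C_seg, H, W)
        prev = torch.zeros(seg_seq.size(0), 6, *seg_seq.shape[-2:])
        frames = []
        for t in range(seg_seq.size(1)):
            frame = self.net(torch.cat([seg_seq[:, t], prev], dim=1))
            frames.append(frame)
            prev = torch.cat([prev[:, 3:], frame], dim=1)   # roll the frame history
        return torch.stack(frames, dim=1)                   # (B, T, 3, H, W)

video = SequentialGenerator()(torch.rand(2, 8, 1, 64, 64))
```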
Video Imagination from a Single Image with Transformation Generation
In this work, we focus on a challenging task: synthesizing multiple imaginary
videos given a single image. Major problems come from high dimensionality of
pixel space and the ambiguity of potential motions. To overcome those problems,
we propose a new framework that produces imaginary videos by transformation
generation. The generated transformations are applied to the original image in
a novel volumetric merge network to reconstruct the frames of the imaginary video.
Through sampling different latent variables, our method can output different
imaginary video samples. The framework is trained in an adversarial way with
unsupervised learning. For evaluation, we propose a new assessment metric. In
experiments, we test on 3 datasets ranging from synthetic data to natural
scenes. Our framework achieves promising performance in image quality
assessment. The visual inspection indicates that it can successfully generate
diverse five-frame videos of acceptable perceptual quality.
Comment: 9 pages, 10 figures.
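The transformation-generation idea can be sketched as follows: a latent sample is decoded into a small set of affine transforms, each transform warps the input image, and the warped copies are merged with learned weights. This is a simplified, hypothetical stand-in for the volumetric merge network; names and sizes are illustrative.

```python
# Rough sketch of "transformation generation" with a softmax merge
# (a simplified placeholder for the paper's volumetric merge network).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TransformImagine(nn.Module):
    def __init__(self, z_dim=16, n_transforms=4):
        super().__init__()
        self.n = n_transforms
        self.theta_head = nn.Linear(z_dim, n_transforms * 6)   # one 2x3 affine each
        self.weight_head = nn.Linear(z_dim, n_transforms)

    def forward(self, image, z):
        b, _, h, w = image.shape
        theta = self.theta_head(z).view(b * self.n, 2, 3)
        grid = F.affine_grid(theta, (b * self.n, 3, h, w), align_corners=False)
        warped = F.grid_sample(image.repeat_interleave(self.n, 0), grid,
                               align_corners=False).view(b, self.n, 3, h, w)
        weights = F.softmax(self.weight_head(z), dim=1).view(b, self.n, 1, 1, 1)
        return (weights * warped).sum(dim=1)                   # one imagined frame

frame = TransformImagine()(torch.rand(2, 3, 32, 32), torch.randn(2, 16))
```

Sampling a different latent z yields a different set of transforms and hence a different imagined frame, matching the multi-sample behaviour the abstract describes.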
Visual Forecasting by Imitating Dynamics in Natural Sequences
We introduce a general framework for visual forecasting, which directly
imitates visual sequences without additional supervision. As a result, our
model can be applied at several semantic levels and does not require any domain
knowledge or handcrafted features. We achieve this by formulating visual
forecasting as an inverse reinforcement learning (IRL) problem, and directly
imitate the dynamics in natural sequences from their raw pixel values. The key
challenge is the high-dimensional and continuous state-action space that
prohibits the application of previous IRL algorithms. We address this
computational bottleneck by extending recent progress in model-free imitation
with trainable deep feature representations, which (1) bypasses the exhaustive
state-action pair visits in dynamic programming by using a dual formulation and
(2) avoids explicit state sampling at gradient computation using a deep feature
reparametrization. This allows us to apply IRL at scale and directly imitate
the dynamics in high-dimensional continuous visual sequences from the raw pixel
values. We evaluate our approach at three different levels of abstraction, from
low-level pixels to higher-level semantics: future frame generation, action
anticipation, and visual story forecasting. At all levels, our approach outperforms
existing methods.
Comment: 10 pages, 9 figures, accepted to ICCV 201
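One minimal way to picture this imitation view of forecasting is a GAIL-style surrogate: a discriminator scores (state, next-state) transitions in feature space, and the forecaster is rewarded when its predicted transitions look real. This is an assumed simplification for illustration, not the paper's exact dual, sampling-free formulation.

```python
# GAIL-style surrogate for forecasting-as-imitation (illustrative assumption,
# not the paper's dual formulation); feature dimensions are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

disc = nn.Sequential(nn.Linear(2 * 128, 64), nn.ReLU(), nn.Linear(64, 1))

def transition_reward(state_feat, predicted_next_feat):
    """Reward for the forecaster: high when the transition looks 'real'."""
    logit = disc(torch.cat([state_feat, predicted_next_feat], dim=-1))
    return F.logsigmoid(logit)          # log D(s, s') as the imitation reward

real_s, real_s_next = torch.randn(8, 128), torch.randn(8, 128)
fake_s_next = torch.randn(8, 128)       # forecaster's prediction (placeholder)
# Discriminator objective: real transitions vs. forecasted ones.
d_loss = (F.binary_cross_entropy_with_logits(
              disc(torch.cat([real_s, real_s_next], -1)), torch.ones(8, 1))
          + F.binary_cross_entropy_with_logits(
              disc(torch.cat([real_s, fake_s_next], -1)), torch.zeros(8, 1)))
```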
Self-supervised Visual Feature Learning with Deep Neural Networks: A Survey
Large-scale labeled data are generally required to train deep neural networks
in order to obtain better performance in visual feature learning from images or
videos for computer vision applications. To avoid the extensive cost of collecting
and annotating large-scale datasets, self-supervised learning methods, a subset
of unsupervised learning methods, have been proposed to learn general image
and video features from large-scale unlabeled data without using any
human-annotated labels. This paper provides an extensive review of deep
learning-based self-supervised general visual feature learning methods from
images or videos. First, the motivation, general pipeline, and terminologies of
this field are described. Then the common deep neural network architectures
that are used for self-supervised learning are summarized. Next, the main
components and evaluation metrics of self-supervised learning methods are
reviewed followed by the commonly used image and video datasets and the
existing self-supervised visual feature learning methods. Finally, quantitative
performance comparisons of the reviewed methods on benchmark datasets are
summarized and discussed for both image and video feature learning. Lastly,
the paper concludes with a set of promising future directions for
self-supervised visual feature learning.
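As a concrete instance of the pretext-task pipeline this survey covers, the sketch below trains an encoder to predict which of four rotations was applied to an unlabeled image; the pseudo-label comes from the data itself, so no human annotation is needed. The networks are tiny placeholders chosen for illustration only.

```python
# Minimal example of a self-supervised pretext task (rotation prediction):
# labels are generated from the data itself, so the encoder trains label-free.
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(nn.Conv2d(3, 32, 3, 2, 1), nn.ReLU(),
                        nn.AdaptiveAvgPool2d(1), nn.Flatten())
head = nn.Linear(32, 4)                          # predict the rotation class

images = torch.rand(16, 3, 32, 32)               # unlabeled images
k = torch.randint(0, 4, (16,))                   # pseudo-label: 0/90/180/270 degrees
rotated = torch.stack([torch.rot90(img, int(r), dims=(1, 2))
                       for img, r in zip(images, k)])
loss = F.cross_entropy(head(encoder(rotated)), k)
loss.backward()                                  # encoder learns transferable features
```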
Unsupervised Bi-directional Flow-based Video Generation from one Snapshot
Imagining multiple consecutive frames given one single snapshot is
challenging, since it is difficult to simultaneously predict diverse motions
from a single image and faithfully generate novel frames without visual
distortions. In this work, we leverage an unsupervised variational model to
learn rich motion patterns in the form of long-term bi-directional flow fields,
and apply the predicted flows to generate high-quality video sequences. In
contrast to the state-of-the-art approach, our method does not require external
flow supervision for learning. This is achieved through a novel module that
performs bi-directional flow prediction from a single image. In addition, with
the bi-directional flow consistency check, our method can handle occlusion and
warping artifacts in a principled manner. Our method can be trained end-to-end
based on arbitrarily sampled natural video clips, and it is able to capture
multi-modal motion uncertainty and synthesize photo-realistic novel sequences.
Quantitative and qualitative evaluations over synthetic and real-world datasets
demonstrate the effectiveness of the proposed approach over the
state-of-the-art methods.
Comment: 11 pages, 12 figures. Technical report for a project in progress.
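The forward-backward consistency check the abstract relies on can be sketched as below: follow the forward flow, sample the backward flow at the target location, and flag as occluded any pixel whose round trip does not return near its start. The threshold and helper name are illustrative assumptions, not the paper's values.

```python
# Sketch of a bi-directional (forward-backward) flow consistency check used to
# mask occluded pixels; thresholds and names are illustrative.
import torch
import torch.nn.functional as F

def fb_consistency_mask(flow_fwd, flow_bwd, thresh=1.0):
    """flow_fwd, flow_bwd: (B, 2, H, W) in pixel units. Returns 1 where consistent."""
    b, _, h, w = flow_fwd.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    gx = 2.0 * (xs + flow_fwd[:, 0]) / (w - 1) - 1.0
    gy = 2.0 * (ys + flow_fwd[:, 1]) / (h - 1) - 1.0
    # Backward flow sampled at the forward-flow target location.
    bwd_at_target = F.grid_sample(flow_bwd, torch.stack((gx, gy), dim=-1),
                                  align_corners=True)
    residual = (flow_fwd + bwd_at_target).norm(dim=1)   # ~0 when the round trip closes
    return (residual < thresh).float()                  # (B, H, W) visibility mask

mask = fb_consistency_mask(torch.zeros(1, 2, 64, 64), torch.zeros(1, 2, 64, 64))
```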
Disentangling Motion, Foreground and Background Features in Videos
This paper introduces an unsupervised framework to extract semantically rich
features for video representation. Inspired by how the human visual system
groups objects based on motion cues, we propose a deep convolutional neural
network that disentangles motion, foreground and background information. The
proposed architecture consists of a 3D convolutional feature encoder for blocks
of 16 frames, which is trained for reconstruction tasks over the first and last
frames of the sequence. A preliminary supervised experiment was conducted to
verify the feasibility of the proposed method by training the model on a fraction
of videos from the UCF-101 dataset, taking as ground truth the bounding boxes
around the activity regions. Qualitative results indicate that the network can
successfully segment foreground and background in videos as well as update the
foreground appearance based on disentangled motion features. The benefits of
these learned features are shown in a discriminative classification task, where
initializing the network with the proposed pretraining method outperforms both
random initialization and autoencoder pretraining. Our model and source code
are publicly available at https://imatge-upc.github.io/unsupervised-2017-cvprw/ .
Comment: Poster presented at the CVPR 2017 Workshop Brave New Ideas for Motion
Representations in Video.
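A loose sketch of the disentangling setup follows: a 3D-convolutional encoder over a 16-frame block is split into motion, foreground, and background factors; the first frame is reconstructed from appearance alone, and the last frame from appearance plus motion. Layer sizes and the exact branch wiring are assumptions for illustration, not the released model.

```python
# Loose sketch of disentangling motion / foreground / background factors from a
# 16-frame block (illustrative layer sizes and wiring, not the authors' model).
import torch
import torch.nn as nn

class DisentangleNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Conv3d(3, 24, 3, padding=1)    # 3D conv over the 16-frame block
        self.dec_first = nn.Conv2d(16, 3, 3, 1, 1)       # appearance-only decoder
        self.dec_last = nn.Conv2d(24, 3, 3, 1, 1)        # appearance + motion decoder

    def forward(self, clip):                             # clip: (B, 3, 16, H, W)
        feats = self.encoder(clip).mean(dim=2)           # collapse time: (B, 24, H, W)
        motion, fg, bg = feats.chunk(3, dim=1)           # three 8-channel factors
        first = self.dec_first(torch.cat([fg, bg], dim=1))
        last = self.dec_last(torch.cat([fg, bg, motion], dim=1))
        return first, last                               # reconstruction targets

first, last = DisentangleNet()(torch.rand(2, 3, 16, 32, 32))
```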