Video-to-Video Synthesis
We study the problem of video-to-video synthesis, whose goal is to learn a
mapping function from an input source video (e.g., a sequence of semantic
segmentation masks) to an output photorealistic video that precisely depicts
the content of the source video. While its image counterpart, the
image-to-image synthesis problem, is a popular topic, the video-to-video
synthesis problem is less explored in the literature. Without understanding
temporal dynamics, directly applying existing image synthesis approaches to an
input video often results in temporally incoherent videos of low visual
quality. In this paper, we propose a novel video-to-video synthesis approach
under the generative adversarial learning framework. Through carefully designed
generator and discriminator architectures, coupled with a spatio-temporal
adversarial objective, we achieve high-resolution, photorealistic, temporally
coherent video results on a diverse set of input formats including segmentation
masks, sketches, and poses. Experiments on multiple benchmarks show the
advantage of our method compared to strong baselines. In particular, our model
is capable of synthesizing 2K resolution videos of street scenes up to 30
seconds long, which significantly advances the state-of-the-art of video
synthesis. Finally, we apply our approach to future video prediction,
outperforming several state-of-the-art competing systems.
Comment: In NeurIPS, 2018. Code, models, and more results are available at https://github.com/NVIDIA/vid2vi
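The spatio-temporal adversarial objective pairs per-frame discrimination with a clip-level discriminator that scores short stacks of consecutive frames conditioned on the source video. The sketch below illustrates only that clip-level term, assuming a hypothetical clip discriminator D_V and plain BCE losses; the authors' implementation uses its own multi-scale discriminators and additional loss terms not shown here.

import torch
import torch.nn.functional as F

def spatio_temporal_gan_losses(D_V, src_maps, real_clip, fake_clip):
    """All inputs have shape (B, T, C, H, W); fake_clip comes from the generator."""
    B, T = real_clip.shape[:2]
    H, W = real_clip.shape[-2:]
    # Stack the T frames (and conditioning maps) along channels so D_V scores a whole clip.
    cond_real = torch.cat([src_maps, real_clip], dim=2).reshape(B, -1, H, W)
    cond_fake = torch.cat([src_maps, fake_clip], dim=2).reshape(B, -1, H, W)
    # Discriminator update: real clips -> 1, generated clips (detached) -> 0.
    d_real = D_V(cond_real)
    d_fake = D_V(cond_fake.detach())
    d_loss = F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real)) \
           + F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake))
    # Generator update: make the clip discriminator label generated clips as real.
    g_fake = D_V(cond_fake)
    g_loss = F.binary_cross_entropy_with_logits(g_fake, torch.ones_like(g_fake))
    return d_loss, g_loss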
Visual Affordance and Function Understanding: A Survey
Robots are increasingly prevalent in the manufacturing, entertainment and
healthcare industries. Robot vision aims to equip robots with the ability to
discover information, understand it and interact with the environment. These
capabilities require an agent to effectively understand object affordances and
functionalities in complex visual domains. In this literature survey, we first
focus on visual affordances and summarize the state of the art as well as open
problems and research gaps. Specifically, we discuss sub-problems such as
affordance detection, categorization, segmentation and high-level reasoning.
Furthermore, we cover functional scene understanding and the prevalent
functional descriptors used in the literature. The survey also provides
necessary background to the problem, sheds light on its significance and
highlights the existing challenges for affordance and functionality learning.
Comment: 26 pages, 22 images
Learning Video Object Segmentation with Visual Memory
This paper addresses the task of segmenting moving objects in unconstrained
videos. We introduce a novel two-stream neural network with an explicit memory
module to achieve this. The two streams of the network encode spatial and
temporal features in a video sequence respectively, while the memory module
captures the evolution of objects over time. The module to build a "visual
memory" in video, i.e., a joint representation of all the video frames, is
realized with a convolutional recurrent unit learned from a small number of
training video sequences. Given a video frame as input, our approach assigns
each pixel an object or background label based on the learned spatio-temporal
features as well as the "visual memory" specific to the video, acquired
automatically without any manually-annotated frames. The visual memory is
implemented with convolutional gated recurrent units, which allow spatial
information to be propagated over time. We evaluate our method extensively on two
benchmarks, DAVIS and Freiburg-Berkeley motion segmentation datasets, and show
state-of-the-art results. For example, our approach outperforms the top method
on the DAVIS dataset by nearly 6%. We also provide an extensive ablative
analysis to investigate the influence of each component in the proposed
framework.
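For concreteness, a convolutional gated recurrent unit of the kind used for the visual memory can be sketched as below; the kernel size and channel counts are illustrative assumptions, not the authors' configuration. Per-frame two-stream features would be fed in as x, and the hidden state h carries the memory across frames.

import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.hid_ch = hid_ch
        self.gates = nn.Conv2d(in_ch + hid_ch, 2 * hid_ch, k, padding=k // 2)  # reset, update
        self.cand = nn.Conv2d(in_ch + hid_ch, hid_ch, k, padding=k // 2)       # candidate state

    def forward(self, x, h=None):
        if h is None:
            h = x.new_zeros(x.size(0), self.hid_ch, x.size(2), x.size(3))
        r, z = torch.sigmoid(self.gates(torch.cat([x, h], dim=1))).chunk(2, dim=1)
        h_new = torch.tanh(self.cand(torch.cat([x, r * h], dim=1)))
        return (1 - z) * h + z * h_new  # updated memory, same spatial size as the input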
An end-to-end generative framework for video segmentation and recognition
We describe an end-to-end generative approach for the segmentation and
recognition of human activities. In this approach, a visual representation
based on reduced Fisher Vectors is combined with a structured temporal model
for recognition. We show that the statistical properties of Fisher Vectors make
them an especially suitable front-end for generative models such as Gaussian
mixtures. The system is evaluated for both the recognition of complex
activities as well as their parsing into action units. Using a variety of video
datasets ranging from human cooking activities to animal behaviors, our
experiments demonstrate that the resulting architecture outperforms
state-of-the-art approaches for larger datasets, i.e., when a sufficient amount of
data is available for training structured generative models.
Comment: Proc. of IEEE Winter Conference on Applications of Computer Vision (WACV), 201
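As a rough illustration of the front-end, the sketch below PCA-reduces per-frame descriptors, fits a diagonal-covariance GMM, and encodes a window of frames by mean-only Fisher Vector gradients. The dimensions and the mean-only simplification are assumptions for illustration, and the structured temporal model used for parsing is not shown.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

def fit_frontend(descriptors, dim=64, n_gauss=32):
    """descriptors: (N, D) local/frame descriptors pooled over the training videos."""
    pca = PCA(n_components=dim).fit(descriptors)
    gmm = GaussianMixture(n_components=n_gauss, covariance_type="diag")
    gmm.fit(pca.transform(descriptors))
    return pca, gmm

def fisher_vector(frames, pca, gmm):
    """frames: (T, D) descriptors for one window; returns a (n_gauss * dim,) encoding."""
    x = pca.transform(frames)
    q = gmm.predict_proba(x)                             # soft assignments, shape (T, K)
    diff = x[:, None, :] - gmm.means_[None, :, :]        # (T, K, dim)
    grad = (q[:, :, None] * diff / gmm.covariances_[None, :, :]).sum(axis=0)
    grad /= frames.shape[0] * np.sqrt(gmm.weights_)[:, None]   # standard FV normalization
    return grad.ravel()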
Event segmentation and biological motion perception in watching dance
We used a combination of behavioral, computational vision and fMRI methods to examine human brain activity while viewing a 386 s video of a solo Bharatanatyam dance. A computational analysis provided us with a Motion Index (MI) quantifying the silhouette motion of the dancer throughout the dance. A behavioral analysis using 30 naïve observers provided us with the time points where observers were most likely to report event boundaries, where one movement segment ended and another began. These behavioral and computational data were used to interpret the brain activity of a different set of 11 naïve observers who viewed the dance video while brain activity was measured using fMRI. Results showed that the Motion Index was related to brain activity in a single cluster in the right Inferior Temporal Gyrus (ITG) in the vicinity of the Extrastriate Body Area (EBA). Perception of event boundaries in the video was related to the BA44 region of the right Inferior Frontal Gyrus, as well as extensive clusters of bilateral activity in the Inferior Occipital Gyrus, which extended in the right hemisphere towards the posterior Superior Temporal Sulcus (pSTS).
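The abstract does not spell out how the Motion Index is computed; a simple silhouette-based motion measure in the same spirit could look like the sketch below, where the fraction of silhouette pixels that change between consecutive frames serves as the per-frame index.

import numpy as np

def motion_index(silhouettes):
    """silhouettes: binary array (T, H, W), one mask of the dancer per frame."""
    sil = silhouettes.astype(bool)
    changed = np.logical_xor(sil[1:], sil[:-1]).sum(axis=(1, 2))
    area = np.maximum(sil[1:].sum(axis=(1, 2)) + sil[:-1].sum(axis=(1, 2)), 1)
    return changed / area   # length T-1 time series; higher values mean more movement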
Tukey-Inspired Video Object Segmentation
We investigate the problem of strictly unsupervised video object
segmentation, i.e., the separation of a primary object from background in video
without a user-provided object mask or any training on an annotated dataset. We
find foreground objects in low-level vision data using a John Tukey-inspired
measure of "outlierness". This Tukey-inspired measure also estimates the
reliability of each data source as video characteristics change (e.g., a camera
starts moving). The proposed method achieves state-of-the-art results for
strictly unsupervised video object segmentation on the challenging DAVIS
dataset. Finally, we use a variant of the Tukey-inspired measure to combine the
output of multiple segmentation methods, including those using supervision
during training, runtime, or both. This collectively more robust method of
segmentation improves the Jaccard measure of its constituent methods by as much
as 28%.
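Tukey's classic outlier rule flags values lying beyond 1.5 interquartile ranges of the quartiles; a per-pixel "outlierness" score in that spirit can be sketched as below. Using optical-flow magnitude as the low-level cue is an assumption for illustration, not necessarily the paper's exact measure.

import numpy as np

def tukey_outlierness(values):
    """values: per-pixel measurements for one frame (e.g., flow magnitude), any shape."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = max(q3 - q1, 1e-6)
    # Zero inside the inner fences, growing linearly the further a pixel lies outside them.
    above = np.maximum(values - (q3 + 1.5 * iqr), 0.0)
    below = np.maximum((q1 - 1.5 * iqr) - values, 0.0)
    return (above + below) / iqr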
A Hajj And Umrah Location Classification System For Video Crowded Scenes
In this paper, a new automatic system for classifying ritual locations in
diverse Hajj and Umrah video scenes is investigated. This challenging subject
has mostly been ignored in the past due to several problems, one of which is the
lack of realistic annotated video datasets. The HUER dataset is defined to model
six different Hajj and Umrah ritual locations [26].
The proposed Hajj and Umrah ritual location classification system consists of
four main phases: preprocessing, segmentation, feature extraction, and location
classification. Shot boundary detection and background/foreground segmentation
algorithms are applied to prepare the input video scenes for the KNN, ANN, and
SVM classifiers. The system improves on the state-of-the-art results for Hajj
and Umrah location classification, and successfully recognizes the six Hajj
rituals with more than 90% accuracy. The various experiments demonstrate
promising results.
Comment: 9 pages, 10 figures, 2 tables, 3 algorithms
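The final classification phase can be illustrated with a small sketch that compares KNN, an MLP (standing in for the ANN), and an SVM on pre-extracted per-scene feature vectors; feature extraction and HUER data handling are assumed to happen elsewhere, and the hyperparameters shown are placeholders.

from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def compare_classifiers(X, y):
    """X: (n_scenes, n_features) feature matrix; y: ritual-location labels (six classes)."""
    models = {
        "KNN": KNeighborsClassifier(n_neighbors=5),
        "ANN": MLPClassifier(hidden_layer_sizes=(64,), max_iter=1000),
        "SVM": SVC(kernel="rbf", C=1.0),
    }
    # Mean 5-fold cross-validated accuracy for each classifier.
    return {name: cross_val_score(m, X, y, cv=5).mean() for name, m in models.items()}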
End-to-end Learning of Driving Models from Large-scale Video Datasets
Robust perception-action models should be learned from training data with
diverse visual appearances and realistic behaviors, yet current approaches to
deep visuomotor policy learning have been generally limited to in-situ models
learned from a single vehicle or a simulation environment. We advocate learning
a generic vehicle motion model from large scale crowd-sourced video data, and
develop an end-to-end trainable architecture for learning to predict a
distribution over future vehicle egomotion from instantaneous monocular camera
observations and previous vehicle state. Our model incorporates a novel
FCN-LSTM architecture, which can be learned from large-scale crowd-sourced
vehicle action data, and leverages available scene segmentation side tasks to
improve performance under a privileged learning paradigm.
Comment: camera ready for CVPR201
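A minimal sketch of the FCN-LSTM idea is given below: a convolutional encoder summarizes each monocular frame, an LSTM integrates the frame features together with the previous vehicle state, and a softmax head outputs a distribution over future egomotion. The tiny encoder, the discretized action space, and all layer sizes are assumptions for illustration; the paper's architecture and privileged segmentation side task are not reproduced here.

import torch
import torch.nn as nn

class FCNLSTM(nn.Module):
    def __init__(self, n_actions=4, state_dim=2, feat_dim=256):
        super().__init__()
        self.encoder = nn.Sequential(                     # stand-in for a fully convolutional backbone
            nn.Conv2d(3, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, feat_dim))
        self.lstm = nn.LSTM(feat_dim + state_dim, 128, batch_first=True)
        self.head = nn.Linear(128, n_actions)

    def forward(self, frames, prev_state):
        """frames: (B, T, 3, H, W); prev_state: (B, T, state_dim), e.g. past speed and yaw."""
        B, T = frames.shape[:2]
        f = self.encoder(frames.flatten(0, 1)).view(B, T, -1)
        h, _ = self.lstm(torch.cat([f, prev_state], dim=-1))
        return self.head(h).softmax(dim=-1)               # per-step distribution over egomotion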
Adaptive Binarization for Weakly Supervised Affordance Segmentation
The concept of affordance is important for understanding the relevance of object
parts to a particular functional interaction. Affordance types generalize across
object categories and are not mutually exclusive. This makes the segmentation
of affordance regions of objects in images a difficult task. In this work, we
build on an iterative approach that learns a convolutional neural network for
affordance segmentation from sparse keypoints. During this process, the
predictions of the network need to be binarized. In this work, we propose an
adaptive approach for binarization and estimate the parameters for
initialization by approximated cross validation. We evaluate our approach on
two affordance datasets where our approach outperforms the state-of-the-art for
weakly supervised affordance segmentation.
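As a toy illustration of adaptive binarization, the sketch below sets the threshold for each affordance prediction map from that map's own score statistics rather than using a fixed global value. The relative-threshold rule and its parameter tau are assumptions, not the paper's estimator (which initializes its parameters by approximated cross validation).

import numpy as np

def binarize_adaptive(prob_map, tau=0.5):
    """prob_map: (H, W) per-pixel affordance scores in [0, 1]."""
    fg = prob_map[prob_map > prob_map.mean()]                  # rough foreground candidates
    if fg.size == 0:
        return np.zeros_like(prob_map, dtype=bool)
    thresh = tau * fg.mean() + (1 - tau) * prob_map.mean()     # between map mean and foreground mean
    return prob_map >= thresh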
Human Motion Capture Data Tailored Transform Coding
Human motion capture (mocap) is a widely used technique for digitalizing
human movements. With growing usage, compressing mocap data has received
increasing attention, since compact data size enables efficient storage and
transmission. Our analysis shows that mocap data have some unique
characteristics that distinguish them from images and videos. Therefore,
directly borrowing image or video compression techniques, such as discrete
cosine transform, does not work well. In this paper, we propose a novel
mocap-tailored transform coding algorithm that takes advantage of these
features. Our algorithm segments the input mocap sequences into clips, which
are represented as 2D matrices. Then it computes a set of data-dependent
orthogonal bases to transform the matrices to the frequency domain, in which the
transform coefficients have significantly less dependency. Finally, the
compression is obtained by entropy coding of the quantized coefficients and the
bases. Our method has low computational cost and can be easily extended to
compress mocap databases. It also requires neither training nor complicated
parameter setting. Experimental results demonstrate that the proposed scheme
significantly outperforms state-of-the-art algorithms in terms of compression
performance and speed.
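The core coding idea can be sketched as follows: segment the mocap sequence into clips, represent each clip as a 2D matrix, compute a data-dependent orthogonal basis with an SVD, and quantize the transform coefficients. The basis size and quantization step below are placeholder assumptions, and the final entropy-coding stage is omitted.

import numpy as np

def encode_clip(clip, n_basis=16, step=0.01):
    """clip: (frames, joints*dims) 2D matrix for one mocap segment."""
    mean = clip.mean(axis=0)
    u, s, vt = np.linalg.svd(clip - mean, full_matrices=False)
    basis = vt[:n_basis]                              # data-dependent orthogonal basis
    coeffs = (clip - mean) @ basis.T
    q = np.round(coeffs / step).astype(np.int32)      # uniform quantization before entropy coding
    return q, basis, mean, step

def decode_clip(q, basis, mean, step):
    return (q * step) @ basis + mean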