
    Video-to-Video Synthesis

    We study the problem of video-to-video synthesis, whose goal is to learn a mapping function from an input source video (e.g., a sequence of semantic segmentation masks) to an output photorealistic video that precisely depicts the content of the source video. While its image counterpart, the image-to-image synthesis problem, is a popular topic, the video-to-video synthesis problem is less explored in the literature. Without understanding temporal dynamics, directly applying existing image synthesis approaches to an input video often results in temporally incoherent videos of low visual quality. In this paper, we propose a novel video-to-video synthesis approach under the generative adversarial learning framework. Through carefully designed generator and discriminator architectures, coupled with a spatio-temporal adversarial objective, we achieve high-resolution, photorealistic, temporally coherent video results on a diverse set of input formats including segmentation masks, sketches, and poses. Experiments on multiple benchmarks show the advantage of our method compared to strong baselines. In particular, our model is capable of synthesizing 2K-resolution videos of street scenes up to 30 seconds long, which significantly advances the state of the art in video synthesis. Finally, we apply our approach to future video prediction, outperforming several state-of-the-art competing systems.
    Comment: In NeurIPS, 2018. Code, models, and more results are available at https://github.com/NVIDIA/vid2vi
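    As a concrete illustration of a spatio-temporal adversarial objective, the sketch below (PyTorch, not the authors' vid2vid code) pairs a 2D-convolutional frame discriminator with a 3D-convolutional clip discriminator and combines both in hinge GAN losses; the layer sizes and the hinge formulation are illustrative assumptions.

        # Illustrative spatio-temporal GAN losses (hypothetical architectures).
        import torch
        import torch.nn as nn

        class FrameDiscriminator(nn.Module):          # judges single frames
            def __init__(self, channels=3):
                super().__init__()
                self.net = nn.Sequential(
                    nn.Conv2d(channels, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
                    nn.Conv2d(64, 1, 4, stride=2, padding=1))
            def forward(self, frames):                # frames: (N, C, H, W)
                return self.net(frames)

        class VideoDiscriminator(nn.Module):          # judges short clips with 3D convs
            def __init__(self, channels=3):
                super().__init__()
                self.net = nn.Sequential(
                    nn.Conv3d(channels, 64, (3, 4, 4), stride=(1, 2, 2), padding=1),
                    nn.LeakyReLU(0.2),
                    nn.Conv3d(64, 1, (3, 4, 4), stride=(1, 2, 2), padding=1))
            def forward(self, clips):                 # clips: (B, C, T, H, W)
                return self.net(clips)

        def adversarial_losses(real_clip, fake_clip, d_img, d_vid):
            """Hinge GAN losses combining per-frame and per-clip terms."""
            b, c, t, h, w = real_clip.shape
            real_frames = real_clip.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w)
            fake_frames = fake_clip.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w)
            d_loss = (torch.relu(1 - d_img(real_frames)).mean()
                      + torch.relu(1 + d_img(fake_frames.detach())).mean()
                      + torch.relu(1 - d_vid(real_clip)).mean()
                      + torch.relu(1 + d_vid(fake_clip.detach())).mean())
            g_loss = -(d_img(fake_frames).mean() + d_vid(fake_clip).mean())
            return d_loss, g_loss

        d_img, d_vid = FrameDiscriminator(), VideoDiscriminator()
        real = torch.randn(2, 3, 8, 64, 64)           # dummy real and generated clips
        fake = torch.randn(2, 3, 8, 64, 64)
        d_loss, g_loss = adversarial_losses(real, fake, d_img, d_vid)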

    Visual Affordance and Function Understanding: A Survey

    Robots now play a major role in the manufacturing, entertainment and healthcare industries. Robot vision aims to equip robots with the ability to discover information, understand it and interact with the environment. These capabilities require an agent to effectively understand object affordances and functionalities in complex visual domains. In this literature survey, we first focus on visual affordances and summarize the state of the art as well as open problems and research gaps. Specifically, we discuss sub-problems such as affordance detection, categorization, segmentation and high-level reasoning. Furthermore, we cover functional scene understanding and the prevalent functional descriptors used in the literature. The survey also provides the necessary background to the problem, sheds light on its significance and highlights the existing challenges for affordance and functionality learning.
    Comment: 26 pages, 22 images

    Learning Video Object Segmentation with Visual Memory

    This paper addresses the task of segmenting moving objects in unconstrained videos. We introduce a novel two-stream neural network with an explicit memory module to achieve this. The two streams of the network encode spatial and temporal features in a video sequence respectively, while the memory module captures the evolution of objects over time. The module that builds a "visual memory" of the video, i.e., a joint representation of all the video frames, is realized with a convolutional recurrent unit learned from a small number of training video sequences. Given a video frame as input, our approach assigns each pixel an object or background label based on the learned spatio-temporal features as well as the "visual memory" specific to the video, acquired automatically without any manually annotated frames. The visual memory is implemented with convolutional gated recurrent units, which allow spatial information to be propagated over time. We evaluate our method extensively on two benchmarks, the DAVIS and Freiburg-Berkeley motion segmentation datasets, and show state-of-the-art results. For example, our approach outperforms the top method on the DAVIS dataset by nearly 6%. We also provide an extensive ablative analysis to investigate the influence of each component in the proposed framework.
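    A convolutional gated recurrent unit of the kind that can serve as such a "visual memory" is sketched below; the 3x3 kernels and channel sizes are assumptions for illustration, not the paper's exact module.

        # Sketch of a convolutional GRU cell that propagates spatial
        # information over time (illustrative, not the paper's exact module).
        import torch
        import torch.nn as nn

        class ConvGRUCell(nn.Module):
            def __init__(self, in_ch, hid_ch, k=3):
                super().__init__()
                p = k // 2
                self.gates = nn.Conv2d(in_ch + hid_ch, 2 * hid_ch, k, padding=p)  # update/reset gates
                self.cand = nn.Conv2d(in_ch + hid_ch, hid_ch, k, padding=p)       # candidate state
                self.hid_ch = hid_ch

            def forward(self, x, h=None):                 # x: (B, C, H, W)
                if h is None:
                    h = x.new_zeros(x.size(0), self.hid_ch, x.size(2), x.size(3))
                z, r = torch.sigmoid(self.gates(torch.cat([x, h], dim=1))).chunk(2, dim=1)
                h_tilde = torch.tanh(self.cand(torch.cat([x, r * h], dim=1)))
                return (1 - z) * h + z * h_tilde          # new hidden state

        # Usage: run the cell over per-frame features of a video clip.
        cell = ConvGRUCell(in_ch=64, hid_ch=64)
        features = torch.randn(8, 4, 64, 32, 32)          # (T, B, C, H, W) dummy features
        h = None
        for t in range(features.size(0)):
            h = cell(features[t], h)                      # h accumulates the visual memory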

    An end-to-end generative framework for video segmentation and recognition

    We describe an end-to-end generative approach for the segmentation and recognition of human activities. In this approach, a visual representation based on reduced Fisher Vectors is combined with a structured temporal model for recognition. We show that the statistical properties of Fisher Vectors make them an especially suitable front-end for generative models such as Gaussian mixtures. The system is evaluated both for the recognition of complex activities and for their parsing into action units. Using a variety of video datasets ranging from human cooking activities to animal behaviors, our experiments demonstrate that the resulting architecture outperforms state-of-the-art approaches on larger datasets, i.e., when a sufficient amount of data is available for training structured generative models.
    Comment: Proc. of IEEE Winter Conference on Applications of Computer Vision (WACV), 201
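    The sketch below shows a simplified Fisher Vector encoding (mean gradients only, with power and L2 normalization) computed against a diagonal Gaussian mixture, as a rough stand-in for the reduced-Fisher-Vector front end described above; descriptor dimensions and component counts are arbitrary.

        # Simplified Fisher Vector encoding of per-frame descriptors
        # (illustrative, not the paper's exact reduced-FV front end).
        import numpy as np
        from sklearn.mixture import GaussianMixture

        def fisher_vector(frames, gmm):
            """frames: (N, D) local descriptors; gmm: fitted diagonal GMM."""
            gamma = gmm.predict_proba(frames)                          # (N, K) soft assignments
            mu, var, w = gmm.means_, gmm.covariances_, gmm.weights_
            n = frames.shape[0]
            diffs = (frames[:, None, :] - mu[None]) / np.sqrt(var)[None]   # (N, K, D)
            fv = (gamma[:, :, None] * diffs).sum(axis=0) / (n * np.sqrt(w)[:, None])
            fv = fv.ravel()
            fv = np.sign(fv) * np.sqrt(np.abs(fv))                     # power normalization
            return fv / (np.linalg.norm(fv) + 1e-12)                   # L2 normalization

        rng = np.random.default_rng(0)
        descriptors = rng.normal(size=(5000, 16))                      # dummy training descriptors
        gmm = GaussianMixture(n_components=8, covariance_type="diag").fit(descriptors)
        encoding = fisher_vector(rng.normal(size=(300, 16)), gmm)      # one video segment -> FV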

    Event segmentation and biological motion perception in watching dance

    We used a combination of behavioral, computational vision and fMRI methods to examine human brain activity while viewing a 386 s video of a solo Bharatanatyam dance. A computational analysis provided us with a Motion Index (MI) quantifying the silhouette motion of the dancer throughout the dance. A behavioral analysis using 30 naïve observers provided us with the time points where observers were most likely to report event boundaries, where one movement segment ended and another began. These behavioral and computational data were used to interpret the brain activity of a different set of 11 naïve observers who viewed the dance video while brain activity was measured using fMRI. Results showed that the Motion Index was related to brain activity in a single cluster in the right Inferior Temporal Gyrus (ITG), in the vicinity of the Extrastriate Body Area (EBA). Perception of event boundaries in the video was related to the BA44 region of the right Inferior Frontal Gyrus, as well as to extensive clusters of bilateral activity in the Inferior Occipital Gyrus, which extended in the right hemisphere towards the posterior Superior Temporal Sulcus (pSTS).
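    One simple way such a Motion Index could be computed is sketched below: the fraction of silhouette pixels that change between consecutive frames. This is an assumed formulation for illustration, not necessarily the study's exact definition.

        # Sketch of a silhouette-based motion index (assumed formulation).
        import numpy as np

        def motion_index(silhouettes):
            """silhouettes: (T, H, W) boolean masks of the dancer, one per frame."""
            sil = silhouettes.astype(bool)
            changed = np.logical_xor(sil[1:], sil[:-1])    # pixels that flipped between frames
            return changed.mean(axis=(1, 2))               # (T-1,) motion index per frame pair

        frames = np.random.rand(100, 120, 160) > 0.5       # dummy silhouette masks
        mi = motion_index(frames)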

    Tukey-Inspired Video Object Segmentation

    We investigate the problem of strictly unsupervised video object segmentation, i.e., the separation of a primary object from the background in video without a user-provided object mask or any training on an annotated dataset. We find foreground objects in low-level vision data using a John Tukey-inspired measure of "outlierness". This Tukey-inspired measure also estimates the reliability of each data source as video characteristics change (e.g., a camera starts moving). The proposed method achieves state-of-the-art results for strictly unsupervised video object segmentation on the challenging DAVIS dataset. Finally, we use a variant of the Tukey-inspired measure to combine the output of multiple segmentation methods, including those using supervision during training, runtime, or both. This collectively more robust method of segmentation improves the Jaccard measure of its constituent methods by as much as 28%.
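    A Tukey-style "outlierness" score can be sketched as the distance of each value outside the interquartile range, measured in IQR units; the snippet below applies it to a hypothetical per-pixel motion cue and is illustrative rather than the paper's exact measure.

        # Tukey-fence-style outlierness on a low-level cue (illustrative).
        import numpy as np

        def tukey_outlierness(values):
            q1, q3 = np.percentile(values, [25, 75])
            iqr = max(q3 - q1, 1e-12)
            above = np.clip(values - q3, 0, None) / iqr    # distance above the upper quartile
            below = np.clip(q1 - values, 0, None) / iqr    # distance below the lower quartile
            return above + below                           # 0 inside the IQR, grows outside it

        flow_magnitude = np.random.gamma(2.0, 1.0, size=(240, 320))    # dummy per-pixel cue
        score = tukey_outlierness(flow_magnitude.ravel()).reshape(flow_magnitude.shape)
        foreground = score > 1.5                           # classic Tukey fence at 1.5 IQR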

    A Hajj And Umrah Location Classification System For Video Crowded Scenes

    In this paper, a new automatic system for classifying ritual locations in diverse Hajj and Umrah video scenes is investigated. This challenging subject has mostly been ignored in the past due to several problems, one of which is the lack of realistic annotated video datasets. The HUER dataset is defined to model six different Hajj and Umrah ritual locations [26]. The proposed Hajj and Umrah ritual location classification system consists of four main phases: preprocessing, segmentation, feature extraction, and location classification. Shot boundary detection and background/foreground segmentation algorithms are applied to prepare the input video scenes for the KNN, ANN, and SVM classifiers. The system improves on the state-of-the-art results for Hajj and Umrah location classification and successfully recognizes the six Hajj and Umrah ritual locations with more than 90% accuracy. The experiments demonstrate promising results.
    Comment: 9 pages, 10 figures, 2 tables, 3 algorithms
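    The final classification phase can be pictured with off-the-shelf KNN, ANN (multi-layer perceptron) and SVM classifiers on per-scene feature vectors, as sketched below with scikit-learn; the feature vectors and labels are dummies standing in for the paper's extracted features.

        # Sketch of the location-classification phase with three classifiers
        # (dummy features; not the paper's implementation).
        import numpy as np
        from sklearn.neighbors import KNeighborsClassifier
        from sklearn.neural_network import MLPClassifier
        from sklearn.svm import SVC
        from sklearn.model_selection import cross_val_score

        rng = np.random.default_rng(0)
        X = rng.normal(size=(600, 128))            # dummy per-scene descriptors
        y = rng.integers(0, 6, size=600)           # six ritual-location labels

        classifiers = {
            "KNN": KNeighborsClassifier(n_neighbors=5),
            "ANN": MLPClassifier(hidden_layer_sizes=(64,), max_iter=500),
            "SVM": SVC(kernel="rbf", C=10.0),
        }
        for name, clf in classifiers.items():
            acc = cross_val_score(clf, X, y, cv=5).mean()  # 5-fold accuracy per classifier
            print(f"{name}: {acc:.3f}")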

    End-to-end Learning of Driving Models from Large-scale Video Datasets

    Robust perception-action models should be learned from training data with diverse visual appearances and realistic behaviors, yet current approaches to deep visuomotor policy learning have been generally limited to in-situ models learned from a single vehicle or a simulation environment. We advocate learning a generic vehicle motion model from large-scale crowd-sourced video data, and develop an end-to-end trainable architecture for learning to predict a distribution over future vehicle egomotion from instantaneous monocular camera observations and previous vehicle state. Our model incorporates a novel FCN-LSTM architecture, which can be learned from large-scale crowd-sourced vehicle action data, and leverages available scene segmentation side tasks to improve performance under a privileged learning paradigm.
    Comment: camera ready for CVPR201
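    An FCN-LSTM style model of this general shape is sketched below: a convolutional encoder per frame, an LSTM over time conditioned on previous vehicle state, and a distribution over discretized future egomotion actions; all sizes and the discretization are assumptions, not the paper's architecture.

        # Sketch of an FCN-LSTM style egomotion predictor (hypothetical sizes).
        import torch
        import torch.nn as nn

        class FCNLSTM(nn.Module):
            def __init__(self, n_actions=9, hidden=256):
                super().__init__()
                self.encoder = nn.Sequential(
                    nn.Conv2d(3, 32, 5, stride=2, padding=2), nn.ReLU(),
                    nn.Conv2d(32, 64, 5, stride=2, padding=2), nn.ReLU(),
                    nn.AdaptiveAvgPool2d(1))                           # per-frame feature
                self.lstm = nn.LSTM(64 + 2, hidden, batch_first=True)  # +2 for previous state
                self.head = nn.Linear(hidden, n_actions)

            def forward(self, frames, prev_state):
                # frames: (B, T, 3, H, W); prev_state: (B, T, 2), e.g. speed and yaw rate
                b, t = frames.shape[:2]
                feats = self.encoder(frames.flatten(0, 1)).flatten(1).reshape(b, t, -1)
                out, _ = self.lstm(torch.cat([feats, prev_state], dim=-1))
                return self.head(out).log_softmax(dim=-1)              # log-distribution over actions

        model = FCNLSTM()
        logp = model(torch.randn(2, 8, 3, 96, 160), torch.randn(2, 8, 2))   # (2, 8, 9)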

    Adaptive Binarization for Weakly Supervised Affordance Segmentation

    The concept of affordance is important for understanding the relevance of object parts to a certain functional interaction. Affordance types generalize across object categories and are not mutually exclusive. This makes the segmentation of affordance regions of objects in images a difficult task. In this work, we build on an iterative approach that learns a convolutional neural network for affordance segmentation from sparse keypoints. During this process, the predictions of the network need to be binarized. We propose an adaptive approach for binarization and estimate the parameters for initialization by approximated cross-validation. We evaluate our approach on two affordance datasets, where it outperforms the state of the art for weakly supervised affordance segmentation.
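    Adaptive binarization of a per-pixel affordance score map can be illustrated with a data-dependent threshold chosen to maximize between-class variance (an Otsu-style criterion used here as a stand-in for the paper's approximated cross-validation), as in the sketch below.

        # Data-dependent binarization of an affordance score map (illustrative).
        import numpy as np

        def adaptive_binarize(scores, n_candidates=64):
            flat = scores.ravel()
            candidates = np.quantile(flat, np.linspace(0.05, 0.95, n_candidates))
            best_t, best_sep = candidates[0], -np.inf
            for t in candidates:
                fg, bg = flat[flat >= t], flat[flat < t]
                if fg.size == 0 or bg.size == 0:
                    continue
                w_fg, w_bg = fg.size / flat.size, bg.size / flat.size
                sep = w_fg * w_bg * (fg.mean() - bg.mean()) ** 2   # between-class variance
                if sep > best_sep:
                    best_t, best_sep = t, sep
            return scores >= best_t, best_t

        heatmap = np.random.beta(2, 5, size=(64, 64))              # dummy network predictions
        mask, threshold = adaptive_binarize(heatmap)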

    Human Motion Capture Data Tailored Transform Coding

    Human motion capture (mocap) is a widely used technique for digitizing human movements. With growing usage, compressing mocap data has received increasing attention, since compact data size enables efficient storage and transmission. Our analysis shows that mocap data have some unique characteristics that distinguish them from images and videos. Therefore, directly borrowing image or video compression techniques, such as the discrete cosine transform, does not work well. In this paper, we propose a novel mocap-tailored transform coding algorithm that takes advantage of these features. Our algorithm segments the input mocap sequences into clips, which are represented as 2D matrices. It then computes a set of data-dependent orthogonal bases to transform the matrices to the frequency domain, in which the transform coefficients have significantly less dependency. Finally, compression is obtained by entropy coding of the quantized coefficients and the bases. Our method has low computational cost and can be easily extended to compress mocap databases. It also requires neither training nor complicated parameter setting. Experimental results demonstrate that the proposed scheme significantly outperforms state-of-the-art algorithms in terms of both compression performance and speed.
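    The overall coding scheme can be illustrated with clip-wise transform coding that derives data-dependent orthogonal bases from an SVD of each clip matrix and uniformly quantizes the transform coefficients; entropy coding is omitted, and the sketch below is illustrative rather than the paper's codec.

        # Clip-wise transform coding with data-dependent bases (illustrative).
        import numpy as np

        def encode_clip(clip, rank, step=0.05):
            """clip: (frames, joint_dofs) matrix for one mocap segment."""
            mean = clip.mean(axis=0)
            u, s, vt = np.linalg.svd(clip - mean, full_matrices=False)
            basis = vt[:rank]                            # data-dependent orthogonal basis
            coeffs = (clip - mean) @ basis.T             # transform into that basis
            q = np.round(coeffs / step).astype(np.int32) # uniform quantization
            return q, basis, mean, step                  # entropy coding of q would follow

        def decode_clip(q, basis, mean, step):
            return (q * step) @ basis + mean

        rng = np.random.default_rng(0)
        mocap = np.cumsum(rng.normal(scale=0.01, size=(240, 93)), axis=0)   # smooth dummy motion
        q, basis, mean, step = encode_clip(mocap, rank=12)
        reconstruction = decode_clip(q, basis, mean, step)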