1,307 research outputs found
Stochastic Dynamics for Video Infilling
In this paper, we introduce a stochastic dynamics video infilling (SDVI)
framework to generate frames between long intervals in a video. Our task
differs from video interpolation which aims to produce transitional frames for
a short interval between every two frames and increase the temporal resolution.
Our task, namely video infilling, however, aims to infill long intervals with
plausible frame sequences. Our framework models the infilling as a constrained
stochastic generation process and sequentially samples dynamics from the
inferred distribution. SDVI consists of two parts: (1) a bi-directional
constraint propagation module to guarantee the spatial-temporal coherence among
frames, (2) a stochastic sampling process to generate dynamics from the
inferred distributions. Experimental results show that SDVI can generate clear
frame sequences with varying contents. Moreover, motions in the generated
sequence are realistic and able to transfer smoothly from the given start frame
to the terminal frame. Our project site is
https://xharlie.github.io/projects/project_sites/SDVI/video_results.htmlComment: Winter Conference on Applications of Computer Vision (WACV 2020
Deep Video Generation, Prediction and Completion of Human Action Sequences
Current deep learning results on video generation are limited while there are
only a few first results on video prediction and no relevant significant
results on video completion. This is due to the severe ill-posedness inherent
in these three problems. In this paper, we focus on human action videos, and
propose a general, two-stage deep framework to generate human action videos
with no constraints or arbitrary number of constraints, which uniformly address
the three problems: video generation given no input frames, video prediction
given the first few frames, and video completion given the first and last
frames. To make the problem tractable, in the first stage we train a deep
generative model that generates a human pose sequence from random noise. In the
second stage, a skeleton-to-image network is trained, which is used to generate
a human action video given the complete human pose sequence generated in the
first stage. By introducing the two-stage strategy, we sidestep the original
ill-posed problems while producing for the first time high-quality video
generation/prediction/completion results of much longer duration. We present
quantitative and qualitative evaluation to show that our two-stage approach
outperforms state-of-the-art methods in video generation, prediction and video
completion. Our video result demonstration can be viewed at
https://iamacewhite.github.io/supp/index.htmlComment: Under review for CVPR 2018. Haoye and Chunyan have equal contributio
Recommended from our members
Understanding the Dynamic Visual World: From Motion to Semantics
We live in a dynamic world, which is continuously in motion. Perceiving and interpreting the dynamic surroundings is an essential capability for an intelligent agent. Human beings have the remarkable capability to learn from limited data, with partial or little annotation, in sharp contrast to computational perception models that rely on large-scale, manually labeled data. Reliance on strongly supervised models with manually labeled data inherently prohibits us from modeling the dynamic visual world, as manual annotations are tedious, expensive, and not scalable, especially if we would like to solve multiple scene understanding tasks at the same time. Even worse, in some cases, manual annotations are completely infeasible, such as the motion vector of each pixel (i.e., optical flow) since humans cannot reliably produce these types of labeling. In fact, living in a dynamic world, when we move around, motion information, as a result of moving camera, independently moving objects, and scene geometry, consists of abundant information, revealing the structure and complexity of our dynamic visual world. As the famous psychologist James J. Gibson suggested, “we must perceive in order to move, but we also must move in order to perceive”. In this thesis, we investigate how to use the motion information contained in unlabeled or partially labeled videos to better understand and synthesize the dynamic visual world.
This thesis consists of three parts. In the first part, we focus on the “move to perceive” aspect. When moving through the world, it is natural for an intelligent agent to associate image patterns with the magnitude of their displacement over time: as the agent moves, far away mountains don’t move much; nearby trees move a lot. This natural relationship between the appearance of objects and their apparent motion is a rich source of information about the relationship between the distance of objects and their appearance in images. We present a pretext task of estimating the relative depth of elements of a scene (i.e., ordering the pixels in an image according to distance from the viewer) recovered from motion field of unlabeled videos. The goal of this pretext task was to induce useful feature representations in deep Convolutional Neural Networks (CNNs). These induced representations, using 1.1 million video frames crawled from YouTube within one hour without any manual labeling, provide valuable starting features for the training of neural networks for downstream tasks. It is promising to match or even surpass what ImageNet pre-training gives us today, which needs a huge amount of manual labeling, on tasks such as semantic image segmentation as all of our training data comes almost for free.
In the second part, we study the “perceive to move” aspect. As we humans look around, we do not solve a single vision task at a time. Instead, we perceive our surroundings in a holistic manner, doing visual understanding using all visual cues jointly. By simultaneously solving multiple tasks together, one task can influence another. In specific, we propose a neural network architecture, called SENSE, which shares common feature representations among four closely-related tasks: optical flow estimation, disparity estimation from stereo, occlusion detection, and semantic segmentation. The key insight is that sharing features makes the network more compact and induces better feature representations. For real-world data, however, not all an- notations of the four tasks mentioned above are always available at the same time. To this end, loss functions are designed to exploit interactions of different tasks and do not need manual annotations, to better handle partially labeled data in a semi- supervised manner, leading to superior understanding performance of the dynamic visual world.
Understanding the motion contained in a video enables us to perceive the dynamic visual world in a novel manner. In the third part, we present an approach, called SuperSloMo, which synthesizes slow-motion videos from a standard frame-rate video. Converting a plain video into a slow-motion version enables us to see memorable moments in our life that are hard to see clearly otherwise with naked eyes: a difficult skateboard trick, a dog catching a ball, etc. Such a technique also has wide applications such as generating smooth view transition on a head-mounted virtual reality (VR) devices, compressing videos, synthesizing videos with motion blur, etc
Learning to Generate 3D Training Data
Human-level visual 3D perception ability has long been pursued by researchers in computer vision, computer graphics, and robotics. Recent years have seen an emerging line of works using synthetic images to train deep networks for single image 3D perception. Synthetic images rendered by graphics engines are a promising source for training deep neural networks because it comes with perfect 3D ground truth for free. However, the 3D shapes and scenes to be rendered are largely made manual. Besides, it is challenging to ensure that synthetic images collected this way can help train a deep network to perform well on real images. This is because graphics generation pipelines require numerous design decisions such as the selection of 3D shapes and the placement of the camera.
In this dissertation, we propose automatic generation pipelines of synthetic data that aim to improve the task performance of a trained network. We explore both supervised and unsupervised directions for automatic optimization of 3D decisions. For supervised learning, we demonstrate how to optimize 3D parameters such that a trained network can generalize well to real images. We first show that we can construct a pure synthetic 3D shape to achieve state-of-the-art performance on a shape-from-shading benchmark. We further parameterize the decisions as a vector and propose a hybrid gradient approach to efficiently optimize the vector towards usefulness. Our hybrid gradient is able to outperform classic black-box approaches on a wide selection of 3D perception tasks. For unsupervised learning, we propose a novelty metric for 3D parameter evolution based on deep autoregressive models. We show that without any extrinsic motivation, the novelty computed from autoregressive models alone is helpful. Our novelty metric can consistently encourage a random synthetic generator to produce more useful training data for downstream 3D perception tasks.PHDComputer Science & EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/163240/1/ydawei_1.pd
- …