2,009 research outputs found
VORNet: Spatio-temporally Consistent Video Inpainting for Object Removal
Video object removal is a challenging task in video processing that often
requires massive human effort. Given the mask of the foreground object in each
frame, the goal is to complete (inpaint) the object region and generate a video
without the target object. While deep learning based methods have recently
achieved great success on the image inpainting task, they often produce
inconsistent results across frames when applied to videos. In this work, we
propose a novel learning-based Video Object Removal Network (VORNet) to solve
the video object removal task in a spatio-temporally consistent manner, by
combining the optical flow warping and image-based inpainting model.
Experiments are done on our Synthesized Video Object Removal (SVOR) dataset
based on the YouTube-VOS video segmentation dataset, and both the objective and
subjective evaluation demonstrate that our VORNet generates more spatially and
temporally consistent videos compared with existing methods.
Comment: Accepted to CVPRW 2019
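Below is a minimal, hypothetical PyTorch sketch of this kind of combination: a
previously completed frame is warped with optical flow and blended with a
single-image inpainting result inside the object mask. The function names, the
confidence-weighted blend, and the assumption that flow, masks, and the
image-inpainting output are given are ours for illustration, not the authors'
VORNet architecture.

import torch
import torch.nn.functional as F

def flow_warp(prev_frame, flow):
    # Backward-warp the previously completed frame with a dense optical flow field.
    n, _, h, w = prev_frame.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=-1).float().to(prev_frame.device)  # (H, W, 2)
    grid = grid.unsqueeze(0) + flow.permute(0, 2, 3, 1)                 # add flow offsets
    grid[..., 0] = 2.0 * grid[..., 0] / (w - 1) - 1.0                   # normalize to [-1, 1]
    grid[..., 1] = 2.0 * grid[..., 1] / (h - 1) - 1.0
    return F.grid_sample(prev_frame, grid, align_corners=True)

def fuse(frame, mask, warped_prev, inpainted, confidence):
    # Inside the hole (mask == 1), mix temporally warped content with the
    # image-inpainting result according to a flow-confidence map; keep the
    # original pixels everywhere else.
    hole_content = confidence * warped_prev + (1.0 - confidence) * inpainted
    return mask * hole_content + (1.0 - mask) * frame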
Align-and-Attend Network for Globally and Locally Coherent Video Inpainting
We propose a novel feed-forward network for video inpainting. We use a set of
sampled video frames as the reference to take visible contents to fill the hole
of a target frame. Our video inpainting network consists of two stages. The
first stage is an alignment module that uses computed homographies between the
reference frames and the target frame. The visible patches are then aggregated
based on the frame similarity to fill in the target holes roughly. The second
stage is a non-local attention module that matches the generated patches with
known reference patches (in space and time) to refine the previous global
alignment stage. Both stages use a large spatio-temporal window over the
reference frames and thus enable modeling long-range correlations between
distant information and the hole regions. Therefore, even challenging scenes
with large or slowly moving holes, which existing flow-based approaches can
hardly handle, can be completed. Our network is also designed with a recurrent
propagation stream to encourage temporal consistency in video results.
Experiments on video object removal demonstrate that our method inpaints the
holes with globally and locally coherent contents.
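As a rough OpenCV illustration of the first, alignment-based stage (helper
names and the simple averaging are our assumptions; the learned aggregation and
the second attention stage are omitted), reference frames can be warped to the
target view with estimated homographies and their visible pixels averaged into
the target hole:

import cv2
import numpy as np

def align_and_aggregate(target, target_mask, refs, ref_masks):
    # target: HxWx3 uint8, target_mask: HxW (1 = hole); refs/ref_masks: lists of the same.
    h, w = target_mask.shape
    acc = np.zeros((h, w, 3), dtype=np.float32)
    weight = np.zeros((h, w, 1), dtype=np.float32)
    orb = cv2.ORB_create()
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    kp_t, des_t = orb.detectAndCompute(target, None)
    for ref, ref_mask in zip(refs, ref_masks):
        kp_r, des_r = orb.detectAndCompute(ref, None)
        matches = matcher.match(des_r, des_t)
        src = np.float32([kp_r[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
        dst = np.float32([kp_t[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
        H, _ = cv2.findHomography(src, dst, cv2.RANSAC)      # reference -> target homography
        warped = cv2.warpPerspective(ref, H, (w, h)).astype(np.float32)
        visible = cv2.warpPerspective((1 - ref_mask).astype(np.float32), H, (w, h))[..., None]
        acc += warped * visible                               # accumulate visible pixels
        weight += visible
    filled = acc / np.maximum(weight, 1e-6)
    hole = target_mask[..., None].astype(np.float32)
    return (1 - hole) * target + hole * filled                # fill only inside the hole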
Unsupervised Deep Context Prediction for Background Foreground Separation
In many advanced video-based applications, such as tracking or video
surveillance, background modeling is a pre-processing step used to eliminate
redundant data. Over the past years, background subtraction has usually been
based on low-level or hand-crafted features such as raw color components,
gradients, or local binary patterns. The performance of background subtraction
algorithms suffers in the presence of challenges such as dynamic backgrounds,
photometric variations, camera jitter, and shadows. To handle these challenges
and achieve accurate background modeling, we propose a unified framework based
on image inpainting: an unsupervised visual feature learning hybrid generative
adversarial algorithm based on context prediction. We also present a solution
for random region inpainting that fuses center region inpainting and random
region inpainting using Poisson blending. Furthermore, we evaluate foreground
object detection by combining our proposed method with morphological
operations. A comparison of our proposed method with 12 state-of-the-art
methods shows its stability for background estimation and foreground
detection.
Comment: 17 pages
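The Poisson-blending fusion step mentioned above can be illustrated with
OpenCV's seamless cloning; the function below is a hedged sketch with our own
names, not the authors' implementation:

import cv2
import numpy as np

def poisson_fuse(frame, inpainted, mask):
    # Blend an inpainted region back into the frame with Poisson (seamless) cloning.
    # frame, inpainted: HxWx3 uint8; mask: HxW uint8 with 255 inside the region.
    ys, xs = np.where(mask > 0)
    center = (int(xs.mean()), int(ys.mean()))     # clone around the region's centroid
    return cv2.seamlessClone(inpainted, frame, mask, center, cv2.NORMAL_CLONE)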
Deep Video Inpainting
Video inpainting aims to fill spatio-temporal holes with plausible content in
a video. Despite tremendous progress of deep neural networks for image
inpainting, it is challenging to extend these methods to the video domain due
to the additional time dimension. In this work, we propose a novel deep network
architecture for fast video inpainting. Built upon an image-based
encoder-decoder model, our framework is designed to collect and refine
information from neighbor frames and synthesize still-unknown regions. At the
same time, the output is enforced to be temporally consistent by recurrent
feedback and a temporal memory module. Compared with the state-of-the-art image
inpainting algorithm, our method produces videos that are much more
semantically correct and temporally smooth. In contrast to the prior video
completion method which relies on time-consuming optimization, our method runs
in near real-time while generating competitive video results. Finally, we
apply our framework to the video retargeting task and obtain visually pleasing
results.
Comment: Accepted at CVPR 2019
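A minimal PyTorch sketch of the idea (an assumption-laden simplification, not
the published architecture): an image-based encoder-decoder takes the
neighboring frames plus the previously inpainted output as recurrent feedback
and keeps a convolutional memory state across time.

import torch
import torch.nn as nn

class VideoInpaintNet(nn.Module):
    def __init__(self, n_refs=4, ch=32):
        super().__init__()
        in_ch = 4 * (n_refs + 1) + 3             # RGB + mask per frame, plus previous output
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, ch, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(ch, 2 * ch, 3, stride=2, padding=1), nn.ReLU())
        self.memory = nn.Conv2d(4 * ch, 2 * ch, 3, padding=1)   # crude temporal memory update
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(2 * ch, ch, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(ch, 3, 4, stride=2, padding=1))

    def forward(self, frames_and_masks, prev_output, state):
        x = torch.cat([frames_and_masks, prev_output], dim=1)
        feat = self.encoder(x)
        state = torch.tanh(self.memory(torch.cat([feat, state], dim=1)))
        return self.decoder(state), state

# Example step for 4 reference frames plus the target at 128x128:
net = VideoInpaintNet()
frames = torch.randn(1, 20, 128, 128)            # 5 frames, each RGB + hole mask
prev = torch.zeros(1, 3, 128, 128)               # previous output (zeros for the first frame)
state = torch.zeros(1, 64, 32, 32)               # temporal memory at 1/4 resolution
out, state = net(frames, prev, state)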
Free-form Video Inpainting with 3D Gated Convolution and Temporal PatchGAN
Free-form video inpainting is a very challenging task that could be widely
used for video editing such as text removal. Existing patch-based methods could
not handle non-repetitive structures such as faces, while directly applying
image-based inpainting models to videos will result in temporal inconsistency
(see http://bit.ly/2Fu1n6b ). In this paper, we introduce a deep learning
based free-form video inpainting model, with proposed 3D gated convolutions to
tackle the uncertainty of free-form masks and a novel Temporal PatchGAN loss to
enhance temporal consistency. In addition, we collect videos and design a
free-form mask generation algorithm to build the free-form video inpainting
(FVI) dataset for training and evaluation of video inpainting models. We
demonstrate the benefits of these components, and experiments on both the
FaceForensics and our FVI datasets suggest that our method is superior to
existing ones. Related source code, full-resolution result videos, and the FVI
dataset can be found on GitHub at
https://github.com/amjltc295/Free-Form-Video-Inpainting
Comment: Accepted to ICCV 2019
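A hedged sketch of a 3D gated convolution block in this spirit (the gating
follows the common feature-times-sigmoid-gate pattern; the paper's exact layer
may differ):

import torch
import torch.nn as nn

class GatedConv3d(nn.Module):
    # A 3D convolution whose output is modulated by a learned soft gate, letting the
    # network down-weight hole voxels in both space and time.
    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1, padding=1):
        super().__init__()
        self.feature = nn.Conv3d(in_ch, out_ch, kernel_size, stride, padding)
        self.gate = nn.Conv3d(in_ch, out_ch, kernel_size, stride, padding)

    def forward(self, x):                         # x: (N, C, T, H, W)
        return torch.relu(self.feature(x)) * torch.sigmoid(self.gate(x))

# Example: a clip with RGB + mask channels, 8 frames of 64x64.
x = torch.randn(1, 4, 8, 64, 64)
y = GatedConv3d(4, 16)(x)                         # -> (1, 16, 8, 64, 64)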
Learning the Depths of Moving People by Watching Frozen People
We present a method for predicting dense depth in scenarios where both a
monocular camera and people in the scene are freely moving. Existing methods
for recovering depth for dynamic, non-rigid objects from monocular video impose
strong assumptions on the objects' motion and may only recover sparse depth. In
this paper, we take a data-driven approach and learn human depth priors from a
new source of data: thousands of Internet videos of people imitating
mannequins, i.e., freezing in diverse, natural poses, while a hand-held camera
tours the scene. Because people are stationary, training data can be generated
using multi-view stereo reconstruction. At inference time, our method uses
motion parallax cues from the static areas of the scenes to guide the depth
prediction. We demonstrate our method on real-world sequences of complex human
actions captured by a moving hand-held camera, show improvement over
state-of-the-art monocular depth prediction methods, and show various 3D
effects produced using our predicted depth.
Comment: CVPR 2019 (Oral)
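A loose sketch of the input composition this implies (channel layout and names
are our assumptions for illustration only): the network sees the RGB frame, a
human mask, and a depth map from motion parallax that is trusted only in the
static, non-human regions.

import torch

def build_depth_input(rgb, human_mask, parallax_depth):
    # rgb: (3, H, W); human_mask: (1, H, W), 1 on people;
    # parallax_depth: (1, H, W) depth estimated from two-frame motion parallax.
    masked_depth = parallax_depth * (1.0 - human_mask)      # trust parallax depth only where static
    return torch.cat([rgb, human_mask, masked_depth], dim=0)   # (5, H, W) network input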
Multi-View Inpainting for RGB-D Sequence
In this work we propose a novel approach to remove undesired objects from
RGB-D sequences captured with freely moving cameras, which enables static 3D
reconstruction. Our method jointly uses existing information from multiple
frames and generates new content via inpainting techniques. We use balanced
rules to select source frames, a local homography based image warping method
for alignment, and a Markov random field (MRF) based approach for combining
existing information. For the remaining holes, we employ an exemplar based
multi-view inpainting method on the color image and coherently use it as
guidance to complete the corresponding depth. Experiments show that our
approach is capable of removing the undesired objects and inpainting the holes.
Comment: 10 pages
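Not the authors' exemplar-based method, but a tiny color-guided depth fill (a
joint-bilateral style weighting, with our own parameter names) illustrates how
an inpainted color image can guide completion of the corresponding depth holes:

import numpy as np

def guided_depth_fill(color, depth, hole_mask, radius=5, sigma_c=10.0):
    # color: HxWx3 float; depth: HxW float; hole_mask: HxW bool (True = missing depth).
    out = depth.copy()
    h, w = depth.shape
    for y, x in zip(*np.where(hole_mask)):
        y0, y1 = max(0, y - radius), min(h, y + radius + 1)
        x0, x1 = max(0, x - radius), min(w, x + radius + 1)
        known = ~hole_mask[y0:y1, x0:x1]                  # neighbors with valid depth
        diff = color[y0:y1, x0:x1] - color[y, x]
        wgt = np.exp(-(diff ** 2).sum(-1) / (2 * sigma_c ** 2)) * known
        if wgt.sum() > 0:                                 # weight neighbors by color similarity
            out[y, x] = (wgt * depth[y0:y1, x0:x1]).sum() / wgt.sum()
    return out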
My camera can see through fences: A deep learning approach for image de-fencing
In recent times, the availability of inexpensive image capturing devices such
as smartphones/tablets has led to an exponential increase in the number of
images/videos captured. However, sometimes the amateur photographer is hindered
by fences in the scene which have to be removed after the image has been
captured. Conventional approaches to image de-fencing suffer from inaccurate
and non-robust fence detection, in addition to being limited to processing
images of only static occluded scenes. In this paper, we propose a semi-automated
de-fencing algorithm using a video of the dynamic scene. We use convolutional
neural networks for detecting fence pixels. We provide qualitative as well as
quantitative comparison results with existing lattice detection algorithms on
the existing PSU NRT data set and a proposed challenging fenced image dataset.
The inverse problem of fence removal is solved using the split Bregman
technique, with the total variation of the de-fenced image as the
regularization constraint.
Comment: ACPR 2015, Kuala Lumpur
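As a hedged illustration (the notation is ours, not taken from the paper), a
total-variation regularized de-fencing objective of this kind can be written,
with x the de-fenced image, y_k the observed frames, O_k the detected non-fence
pixels, and W_k the warp relating x to frame k, as

\min_{x} \; \sum_{k} \big\| O_k \,(y_k - W_k x) \big\|_2^2 \;+\; \lambda\, \|\nabla x\|_1 .

The split Bregman technique then introduces an auxiliary variable d \approx \nabla x
and a Bregman variable b and alternates

x^{t+1} = \arg\min_{x} \sum_{k} \| O_k\,(y_k - W_k x) \|_2^2
          + \tfrac{\mu}{2}\, \| d^{t} - \nabla x - b^{t} \|_2^2 ,
d^{t+1} = \operatorname{shrink}\!\big(\nabla x^{t+1} + b^{t},\, \lambda/\mu\big) ,
\qquad b^{t+1} = b^{t} + \nabla x^{t+1} - d^{t+1} .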
Deep Blind Video Decaptioning by Temporal Aggregation and Recurrence
Blind video decaptioning is a problem of automatically removing text overlays
and inpainting the occluded parts in videos without any input masks. While
recent deep learning based inpainting methods deal with a single image and
mostly assume that the positions of the corrupted pixels are known, we aim at
automatic text removal in video sequences without mask information. In this
paper, we propose a simple yet effective framework for fast blind video
decaptioning. We construct an encoder-decoder model, where the encoder takes
multiple source frames that can provide visible pixels revealed from the scene
dynamics. These hints are aggregated and fed into the decoder. We apply a
residual connection from the input frame to the decoder output to encourage our
network to focus only on the corrupted regions. Our proposed model ranked first
in the ECCV ChaLearn 2018 LAP Inpainting Competition Track 2: Video
Decaptioning. In addition, we further improve this strong model by applying
recurrent feedback. The recurrent feedback not only enforces temporal coherence
but also provides strong clues on where the corrupted pixels are. Both
qualitative and quantitative experiments demonstrate that our full model
produces accurate and temporally consistent video results in real time
(50+ fps).
Comment: Accepted at CVPR 2019
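A minimal sketch (our assumptions, not the published model) of the residual
design described above: the decoder predicts a correction that is added to the
current input frame, so the network only has to act on the corrupted,
text-overlaid regions even though no mask is given.

import torch
import torch.nn as nn

class BlindDecaptionNet(nn.Module):
    def __init__(self, n_sources=5, ch=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3 * n_sources, ch, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(ch, 2 * ch, 3, stride=2, padding=1), nn.ReLU())
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(2 * ch, ch, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(ch, 3, 4, stride=2, padding=1))

    def forward(self, source_frames, target_frame):
        # source_frames: (N, 3 * n_sources, H, W) stacked neighbors; no mask is provided.
        residual = self.decoder(self.encoder(source_frames))
        return target_frame + residual            # residual connection from the input frame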
Improving Video Generation for Multi-functional Applications
In this paper, we aim to improve the state-of-the-art video generative
adversarial networks (GANs) with a view towards multi-functional applications.
Our improved video GAN model does not separate foreground from background nor
dynamic from static patterns, but learns to generate the entire video clip
conjointly. Our model can thus be trained to generate - and learn from - a
broad set of videos with no restriction. This is achieved by designing a robust
one-stream video generation architecture with an extension of the
state-of-the-art Wasserstein GAN framework that allows for better convergence.
The experimental results show that our improved video GAN model outperforms
state-of-the-art video generative models on multiple challenging datasets.
Furthermore, we demonstrate the superiority of our model by successfully
extending it to three challenging problems: video colorization, video
inpainting, and future prediction. To the best of our knowledge, this is the
first work using GANs to colorize and inpaint video clips.
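For illustration only (not the authors' training code), a Wasserstein critic
loss with gradient penalty for video clips, the stabilization this line of work
builds on, can be sketched as:

import torch

def critic_loss(critic, real, fake, gp_weight=10.0):
    # real, fake: (N, C, T, H, W) video clips; critic maps a clip to a scalar score.
    eps = torch.rand(real.size(0), 1, 1, 1, 1, device=real.device)
    mixed = (eps * real + (1 - eps) * fake).requires_grad_(True)
    grad = torch.autograd.grad(critic(mixed).sum(), mixed, create_graph=True)[0]
    penalty = ((grad.flatten(1).norm(2, dim=1) - 1) ** 2).mean()  # unit-gradient penalty
    return critic(fake).mean() - critic(real).mean() + gp_weight * penalty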
- …