Deep Video Inpainting
Video inpainting aims to fill spatio-temporal holes with plausible content in
a video. Despite tremendous progress of deep neural networks for image
inpainting, it is challenging to extend these methods to the video domain due
to the additional time dimension. In this work, we propose a novel deep network
architecture for fast video inpainting. Built upon an image-based
encoder-decoder model, our framework is designed to collect and refine
information from neighbor frames and synthesize still-unknown regions. At the
same time, the output is enforced to be temporally consistent by a recurrent
feedback and a temporal memory module. Compared with the state-of-the-art image
inpainting algorithm, our method produces videos that are much more
semantically correct and temporally smooth. In contrast to the prior video
completion method which relies on time-consuming optimization, our method runs
in near real-time while generating competitive video results. Finally, we
apply our framework to the video retargeting task and obtain visually pleasing
results. Comment: Accepted at CVPR 201
Free-form Video Inpainting with 3D Gated Convolution and Temporal PatchGAN
Free-form video inpainting is a very challenging task that could be widely
used for video editing such as text removal. Existing patch-based methods could
not handle non-repetitive structures such as faces, while directly applying
image-based inpainting models to videos will result in temporal inconsistency
(see http://bit.ly/2Fu1n6b ). In this paper, we introduce a deep learning
based free-form video inpainting model, with proposed 3D gated convolutions to
tackle the uncertainty of free-form masks and a novel Temporal PatchGAN loss to
enhance temporal consistency. In addition, we collect videos and design a
free-form mask generation algorithm to build the free-form video inpainting
(FVI) dataset for training and evaluation of video inpainting models. We
demonstrate the benefits of these components, and experiments on both the
FaceForensics and our FVI dataset suggest that our method is superior to
existing ones. Related source code, full-resolution result videos and the FVI
dataset can be found on GitHub:
https://github.com/amjltc295/Free-Form-Video-Inpainting . Comment: Accepted to ICCV 201
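The gating idea mentioned in the abstract above can be illustrated with a toy example. The sketch below is a 1-D, single-channel simplification written for this listing, not the paper's 3D layer: it assumes the common gated-convolution form in which a feature branch is multiplied element-wise by a sigmoid gate computed from the same input. The function names, the ReLU activation, and the kernels are all illustrative assumptions.

```python
import math


def conv1d_valid(signal, kernel):
    """Plain 'valid' 1-D correlation: no padding, stride 1."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


def gated_conv1d(signal, feature_kernel, gate_kernel):
    """Toy gated convolution: activation(features) * sigmoid(gates).

    Assumption: ReLU is used as the feature activation for simplicity;
    real models may use a different activation and learn both kernels.
    """
    features = conv1d_valid(signal, feature_kernel)
    gates = conv1d_valid(signal, gate_kernel)
    return [
        max(0.0, f) * (1.0 / (1.0 + math.exp(-g)))
        for f, g in zip(features, gates)
    ]


# A zero gate kernel gives sigmoid(0) = 0.5, so every feature is halved.
example = gated_conv1d([1.0, 2.0, 3.0], [1.0, 1.0], [0.0, 0.0])
```

Because the gate is computed per position, a trained layer can learn to suppress responses that come from masked (invalid) inputs, which is the motivation usually given for gating under irregular masks.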
VORNet: Spatio-temporally Consistent Video Inpainting for Object Removal
Video object removal is a challenging task in video processing that often
requires massive human efforts. Given the mask of the foreground object in each
frame, the goal is to complete (inpaint) the object region and generate a video
without the target object. While recently deep learning based methods have
achieved great success on the image inpainting task, they often lead to
inconsistent results between frames when applied to videos. In this work, we
propose a novel learning-based Video Object Removal Network (VORNet) to solve
the video object removal task in a spatio-temporally consistent manner, by
combining optical flow warping with an image-based inpainting model.
Experiments are done on our Synthesized Video Object Removal (SVOR) dataset
based on the YouTube-VOS video segmentation dataset, and both the objective and
subjective evaluation demonstrate that our VORNet generates more spatially and
temporally consistent videos compared with existing methods. Comment: Accepted to CVPRW 201
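Optical flow warping, which the abstract above combines with image inpainting, can be sketched in a few lines. This is a generic backward-warping illustration under stated assumptions, not VORNet's implementation: frames are nested lists, the flow field stores a hypothetical (dx, dy) offset per target pixel, sampling is nearest-neighbor, and out-of-range coordinates are clamped to the border.

```python
def warp_with_flow(previous_frame, flow):
    """Backward-warp previous_frame toward the current frame.

    flow[y][x] = (dx, dy) says that the content of target pixel (x, y)
    is found at (x + dx, y + dy) in previous_frame (an assumed convention).
    """
    height = len(previous_frame)
    width = len(previous_frame[0])
    warped = []
    for y in range(height):
        row = []
        for x in range(width):
            dx, dy = flow[y][x]
            # Nearest-neighbor sampling with border clamping.
            src_x = min(width - 1, max(0, round(x + dx)))
            src_y = min(height - 1, max(0, round(y + dy)))
            row.append(previous_frame[src_y][src_x])
        warped.append(row)
    return warped


# Content moved one pixel to the right, so each pixel looks one step left.
previous = [[1, 2, 3]]
flow = [[(-1, 0), (-1, 0), (-1, 0)]]
warped = warp_with_flow(previous, flow)
```

A warped previous result of this kind gives the network a temporally aligned candidate for the hole region, which is one way flow can encourage frame-to-frame consistency.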
Frame-Recurrent Video Inpainting by Robust Optical Flow Inference
In this paper, we present a new inpainting framework for recovering missing
regions of video frames. Compared with image inpainting, performing this task
on video presents new challenges, such as how to preserve temporal consistency
and spatial details, as well as how to handle arbitrary input video size and
length quickly and efficiently. To this end, we propose a novel deep learning
architecture which incorporates ConvLSTM and optical flow for modeling the
spatio-temporal consistency in videos. It also saves considerable computational
resources, so that our method can handle videos with larger frame sizes and
arbitrary length in a streaming fashion in real time. Furthermore, to generate an accurate
optical flow from corrupted frames, we propose a robust flow generation module,
where two sources of flows are fed and a flow blending network is trained to
fuse them. We conduct extensive experiments to evaluate our method in various
scenarios and different datasets, both qualitatively and quantitatively. The
experimental results demonstrate the superiority of our method compared with the
state-of-the-art inpainting approaches.
Deep Blind Video Decaptioning by Temporal Aggregation and Recurrence
Blind video decaptioning is a problem of automatically removing text overlays
and inpainting the occluded parts in videos without any input masks. While
recent deep learning based inpainting methods deal with a single image and
mostly assume that the positions of the corrupted pixels are known, we aim at
automatic text removal in video sequences without mask information. In this
paper, we propose a simple yet effective framework for fast blind video
decaptioning. We construct an encoder-decoder model, where the encoder takes
multiple source frames that can provide visible pixels revealed from the scene
dynamics. These hints are aggregated and fed into the decoder. We apply a
residual connection from the input frame to the decoder output to force our
network to focus on the corrupted regions only. Our proposed model ranked
first in the ECCV Chalearn 2018 LAP Inpainting Competition Track2:
Video decaptioning. In addition, we further improve this strong model by
applying a recurrent feedback. The recurrent feedback not only enforces
temporal coherence but also provides strong clues on where the corrupted pixels
are. Both qualitative and quantitative experiments demonstrate that our full
model produces accurate and temporally consistent video results in real time
(50+ fps). Comment: Accepted at CVPR 201
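The residual connection described in the abstract above (input frame added to the decoder output) can be shown with a minimal sketch. It assumes pixel values in [0, 1] stored as nested lists and a hypothetical `predicted_residual` standing in for the decoder output; it is an illustration of the idea, not the authors' network.

```python
def apply_residual(input_frame, predicted_residual):
    """Add a predicted residual to the input frame and clamp to [0, 1].

    Where the residual is zero the input pixel passes through unchanged,
    so the model only has to learn corrections for corrupted pixels.
    """
    return [
        [min(1.0, max(0.0, pixel + delta)) for pixel, delta in zip(frame_row, res_row)]
        for frame_row, res_row in zip(input_frame, predicted_residual)
    ]


# The first pixel is clean (zero residual); the second is corrected.
restored = apply_residual([[0.5, 0.75]], [[0.0, -0.25]])
```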
Copy-and-Paste Networks for Deep Video Inpainting
We present a novel deep learning based algorithm for video inpainting. Video
inpainting is a process of completing corrupted or missing regions in videos.
Video inpainting has additional challenges compared to image inpainting due to
the extra temporal information as well as the need for maintaining the temporal
coherency. We propose a novel DNN-based framework called the Copy-and-Paste
Networks for video inpainting that takes advantage of additional information in
other frames of the video. The network is trained to copy corresponding
contents in reference frames and paste them to fill the holes in the target
frame. Our network also includes an alignment network that computes affine
matrices between frames for the alignment, enabling the network to take
information from more distant frames for robustness. Our method produces
visually pleasing and temporally coherent results while running faster than the
state-of-the-art optimization-based method. In addition, we extend our
framework for enhancing over/under exposed frames in videos. Using this
enhancement technique, we were able to significantly improve the lane detection
accuracy on road videos. Comment: ICCV 201
Deep Long Audio Inpainting
Long (> 200 ms) audio inpainting, to recover a long missing part in an audio
segment, could be widely applied to audio editing tasks and transmission loss
recovery. It is a very challenging problem due to the high dimensional, complex
and non-correlated audio features. While deep learning models have made
tremendous progress in image and video inpainting, audio inpainting has not
attracted much attention. In this work, we take a pioneering step, exploring the
possibility of adapting deep learning frameworks from various domains, including
audio synthesis and image inpainting, for audio inpainting. Also, as the
first to systematically analyze factors affecting audio inpainting performance,
we explore how factors such as mask size, receptive field and audio
representation affect the performance. We also set up a benchmark for
long audio inpainting. The code will be available on GitHub upon acceptance.
Align-and-Attend Network for Globally and Locally Coherent Video Inpainting
We propose a novel feed-forward network for video inpainting. We use a set of
sampled video frames as the reference to take visible contents to fill the hole
of a target frame. Our video inpainting network consists of two stages. The
first stage is an alignment module that uses computed homographies between the
reference frames and the target frame. The visible patches are then aggregated
based on the frame similarity to fill in the target holes roughly. The second
stage is a non-local attention module that matches the generated patches with
known reference patches (in space and time) to refine the previous global
alignment stage. Both stages use a large spatio-temporal window for
the reference and thus enable modeling long-range correlations between distant
information and the hole regions. Therefore, even challenging scenes with large
or slowly moving holes can be handled, which existing flow-based approaches
have hardly been able to model. Our network is also designed with a recurrent
propagation stream to encourage temporal consistency in video results.
Experiments on video object removal demonstrate that our method inpaints the
holes with globally and locally coherent contents.
Improving Consistency and Correctness of Sequence Inpainting using Semantically Guided Generative Adversarial Network
Contemporary benchmark methods for image inpainting are based on deep
generative models and specifically leverage adversarial loss for yielding
realistic reconstructions. However, these models cannot be directly applied to
image/video sequences because of an intrinsic drawback: the reconstructions
might be independently realistic but, when visualized as a sequence, often
lack fidelity to the original uncorrupted sequence. The fundamental reason is
that these methods try to find the best matching latent space representation
near the natural image manifold without any explicit distance-based loss. In
this paper, we present a semantically conditioned Generative Adversarial
Network (GAN) for sequence inpainting. The conditional information constrains
the GAN to map a latent representation to a point on the image manifold respecting
the underlying pose and semantics of the scene. To the best of our knowledge,
this is the first work which simultaneously addresses consistency and
correctness of generative model based inpainting. We show that our generative
model learns to disentangle pose and appearance information; this independence
is exploited by our model to generate highly consistent reconstructions. The
conditional information also aids the generator network of the GAN in producing
sharper images compared to the original GAN formulation. This helps in
achieving more appealing inpainting performance. Though generic, our algorithm
was targeted at inpainting faces. When applied to the CelebA and YouTube Faces
datasets, the proposed method results in a significant improvement over the
current benchmark, both in terms of quantitative evaluation (Peak Signal to
Noise Ratio) and human visual scoring over diversified combinations of
resolutions and deformations.
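For readers unfamiliar with the Peak Signal to Noise Ratio mentioned in the abstract above, the standard definition is PSNR = 10 * log10(MAX^2 / MSE), where MSE is the mean squared error between the reference and the reconstruction. The sketch below assumes flattened pixel lists and an 8-bit peak value of 255; the function name and the evaluation protocol are illustrative and not taken from the paper.

```python
import math


def psnr(reference, reconstruction, max_value=255.0):
    """Peak Signal to Noise Ratio in dB between two equal-length pixel lists."""
    if not reference or len(reference) != len(reconstruction):
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - x) ** 2 for r, x in zip(reference, reconstruction)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals: PSNR is unbounded
    return 10.0 * math.log10(max_value ** 2 / mse)


# Each pixel is off by 10 grey levels, so MSE = 100.
score = psnr([100, 100], [110, 90])
```

Higher values indicate a reconstruction closer to the reference; identical inputs give infinity under this definition.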
Improving Video Generation for Multi-functional Applications
In this paper, we aim to improve the state-of-the-art video generative
adversarial networks (GANs) with a view towards multi-functional applications.
Our improved video GAN model does not separate foreground from background nor
dynamic from static patterns, but learns to generate the entire video clip
conjointly. Our model can thus be trained to generate - and learn from - a
broad set of videos with no restriction. This is achieved by designing a robust
one-stream video generation architecture with an extension of the
state-of-the-art Wasserstein GAN framework that allows for better convergence.
The experimental results show that our improved video GAN model outperforms
state-of-the-art video generative models on multiple challenging datasets.
Furthermore, we demonstrate the superiority of our model by successfully
extending it to three challenging problems: video colorization, video
inpainting, and future prediction. To the best of our knowledge, this is the
first work using GANs to colorize and inpaint video clips.