Deep Blind Video Decaptioning by Temporal Aggregation and Recurrence
Blind video decaptioning is the problem of automatically removing text overlays
and inpainting the occluded parts in videos without any input masks. While
recent deep learning based inpainting methods deal with a single image and
mostly assume that the positions of the corrupted pixels are known, we aim at
automatic text removal in video sequences without mask information. In this
paper, we propose a simple yet effective framework for fast blind video
decaptioning. We construct an encoder-decoder model, where the encoder takes
multiple source frames that can provide visible pixels revealed from the scene
dynamics. These hints are aggregated and fed into the decoder. We apply a
residual connection from the input frame to the decoder output to force our
network to focus on the corrupted regions only. Our proposed model ranked
first in the ECCV ChaLearn 2018 LAP Inpainting Competition Track 2:
Video decaptioning. In addition, we further improve this strong model by
applying recurrent feedback, which not only enforces
temporal coherence but also provides strong clues on where the corrupted pixels
are. Both qualitative and quantitative experiments demonstrate that our full
model produces accurate and temporally consistent video results in real time
(50+ fps). Comment: Accepted at CVPR 2019.
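As a rough illustration of the residual design described above (a minimal sketch, not the authors' released code; the layer sizes, the max-based temporal aggregation, and the module names are assumptions), an encoder-decoder can aggregate hints from multiple source frames and add its output to the center input frame, so that only the corrupted regions need to be modeled:

```python
import torch
import torch.nn as nn

class BlindDecaptionNet(nn.Module):
    """Hypothetical encoder-decoder with a residual connection from the input frame."""
    def __init__(self, ch=64):
        super().__init__()
        # Encode each source frame independently, then aggregate across time.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, ch, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch * 2, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(ch * 2, ch, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(ch, 3, 4, stride=2, padding=1),
        )

    def forward(self, frames):
        # frames: (B, T, 3, H, W); the center frame is the one to restore.
        b, t, c, h, w = frames.shape
        feats = self.encoder(frames.view(b * t, c, h, w)).view(b, t, -1, h // 4, w // 4)
        hints = feats.max(dim=1).values        # aggregate pixels revealed by scene dynamics
        correction = self.decoder(hints)       # decoder predicts only a correction
        return frames[:, t // 2] + correction  # residual connection from the input frame
```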
Deep Video Inpainting
Video inpainting aims to fill spatio-temporal holes with plausible content in
a video. Despite tremendous progress of deep neural networks for image
inpainting, it is challenging to extend these methods to the video domain due
to the additional time dimension. In this work, we propose a novel deep network
architecture for fast video inpainting. Built upon an image-based
encoder-decoder model, our framework is designed to collect and refine
information from neighboring frames and synthesize still-unknown regions. At the
same time, the output is enforced to be temporally consistent by recurrent
feedback and a temporal memory module. Compared with the state-of-the-art image
inpainting algorithm, our method produces videos that are much more
semantically correct and temporally smooth. In contrast to the prior video
completion method which relies on time-consuming optimization, our method runs
in near real-time while generating competitive video results. Finally, we
apply our framework to the video retargeting task and obtain visually pleasing
results. Comment: Accepted at CVPR 2019.
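As a rough sketch of the recurrent feedback and temporal memory idea (an illustration under assumed shapes and names, not the paper's implementation), a ConvLSTM-style cell can carry a memory state across frames while the previously inpainted output is fed back into the next step:

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """A standard ConvLSTM cell usable as a temporal memory module."""
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = self.gates(torch.cat([x, h], dim=1)).chunk(4, dim=1)
        c = f.sigmoid() * c + i.sigmoid() * g.tanh()
        h = o.sigmoid() * c.tanh()
        return h, (h, c)

# Hypothetical per-frame loop with recurrent feedback (names are placeholders):
# state = (torch.zeros(B, 64, H // 4, W // 4),) * 2
# for feat_t in per_frame_features:
#     x = torch.cat([feat_t, prev_output_feat], dim=1)  # feed back the previous output
#     h, state = cell(x, state)                          # update temporal memory
#     prev_output_feat = decode(h)                       # placeholder decoder
```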
VORNet: Spatio-temporally Consistent Video Inpainting for Object Removal
Video object removal is a challenging task in video processing that often
requires massive human efforts. Given the mask of the foreground object in each
frame, the goal is to complete (inpaint) the object region and generate a video
without the target object. While recently deep learning based methods have
achieved great success on the image inpainting task, they often lead to
inconsistent results between frames when applied to videos. In this work, we
propose a novel learning-based Video Object Removal Network (VORNet) to solve
the video object removal task in a spatio-temporally consistent manner, by
combining optical flow warping with an image-based inpainting model.
Experiments are done on our Synthesized Video Object Removal (SVOR) dataset
based on the YouTube-VOS video segmentation dataset, and both the objective and
subjective evaluation demonstrate that our VORNet generates more spatially and
temporally consistent videos compared with existing methods. Comment: Accepted to CVPRW 2019.
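Below is a minimal sketch of the optical-flow-warping step that such a spatio-temporally consistent pipeline builds on; the helper name flow_warp and the final blending line are illustrative assumptions rather than VORNet's actual code:

```python
import torch
import torch.nn.functional as F

def flow_warp(prev_frame, flow):
    """Warp prev_frame (B, 3, H, W) to the current frame using backward flow (B, 2, H, W)."""
    b, _, h, w = prev_frame.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(prev_frame)   # (2, H, W), x then y
    coords = grid.unsqueeze(0) + flow                            # follow the flow vectors
    # Normalize sampling coordinates to [-1, 1] for grid_sample.
    coords_x = 2.0 * coords[:, 0] / (w - 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid_norm = torch.stack((coords_x, coords_y), dim=-1)        # (B, H, W, 2)
    return F.grid_sample(prev_frame, grid_norm, align_corners=True)

# Illustrative blending of the warped candidate with a single-image inpainting result:
# blended = mask * image_inpainted + (1 - mask) * flow_warp(prev_output, flow)
```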
Frame-Recurrent Video Inpainting by Robust Optical Flow Inference
In this paper, we present a new inpainting framework for recovering missing
regions of video frames. Compared with image inpainting, performing this task
on video presents new challenges, such as how to preserve temporal consistency
and spatial details, and how to handle arbitrary input video sizes and lengths
quickly and efficiently. Towards this end, we propose a novel deep learning
architecture which incorporates ConvLSTM and optical flow for modeling the
spatial-temporal consistency in videos. It also saves considerable computational
resources, so our method can handle videos with larger frame sizes and
arbitrary lengths in a streaming fashion in real time. Furthermore, to generate an accurate
optical flow from corrupted frames, we propose a robust flow generation module,
where two sources of flows are fed and a flow blending network is trained to
fuse them. We conduct extensive experiments to evaluate our method in various
scenarios and different datasets, both qualitatively and quantitatively. The
experimental results demonstrate the superiority of our method compared with
state-of-the-art inpainting approaches.
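As an illustration of the flow blending idea (an assumed architecture sketch, not the paper's module), two candidate flows for the same frame pair can be fused by a small CNN that predicts per-pixel blending weights:

```python
import torch
import torch.nn as nn

class FlowBlend(nn.Module):
    """Fuse two candidate optical flows into one robust flow via learned per-pixel weights."""
    def __init__(self, ch=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, 1, 3, padding=1), nn.Sigmoid(),   # per-pixel blending weight
        )

    def forward(self, flow_a, flow_b):
        # flow_a, flow_b: (B, 2, H, W) flows from the two sources (e.g. corrupted vs. completed frames)
        w = self.net(torch.cat([flow_a, flow_b], dim=1))
        return w * flow_a + (1 - w) * flow_b
```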
Free-form Video Inpainting with 3D Gated Convolution and Temporal PatchGAN
Free-form video inpainting is a very challenging task that could be widely
used for video editing tasks such as text removal. Existing patch-based methods
cannot handle non-repetitive structures such as faces, while directly applying
image-based inpainting models to videos will result in temporal inconsistency
(see http://bit.ly/2Fu1n6b ). In this paper, we introduce a deep learning
based free-form video inpainting model, with proposed 3D gated convolutions to
tackle the uncertainty of free-form masks and a novel Temporal PatchGAN loss to
enhance temporal consistency. In addition, we collect videos and design a
free-form mask generation algorithm to build the free-form video inpainting
(FVI) dataset for training and evaluation of video inpainting models. We
demonstrate the benefits of these components, and experiments on both the
FaceForensics and our FVI datasets suggest that our method is superior to
existing ones. Related source code, full-resolution result videos and the FVI
dataset can be found on GitHub at
https://github.com/amjltc295/Free-Form-Video-Inpainting . Comment: Accepted to ICCV 2019.
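The 3D gated convolution mentioned above can be sketched as a feature branch and a sigmoid gating branch over (T, H, W) that are multiplied elementwise, so the layer learns to down-weight activations coming from free-form masked regions. This is a minimal illustrative version, not the released implementation:

```python
import torch
import torch.nn as nn

class GatedConv3d(nn.Module):
    """Gated 3D convolution: a feature branch modulated by a learned soft gate."""
    def __init__(self, in_ch, out_ch, k=3, stride=1):
        super().__init__()
        pad = k // 2
        self.feature = nn.Conv3d(in_ch, out_ch, k, stride=stride, padding=pad)
        self.gate = nn.Conv3d(in_ch, out_ch, k, stride=stride, padding=pad)

    def forward(self, x):
        # x: (B, C, T, H, W) video features; the gate suppresses unreliable activations.
        return torch.tanh(self.feature(x)) * torch.sigmoid(self.gate(x))
```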
Improving Consistency and Correctness of Sequence Inpainting using Semantically Guided Generative Adversarial Network
Contemporary benchmark methods for image inpainting are based on deep
generative models and specifically leverage adversarial loss for yielding
realistic reconstructions. However, these models cannot be directly applied to
image/video sequences because of an intrinsic drawback: the reconstructions
might be independently realistic, but, when visualized as a sequence, they often
lack fidelity to the original uncorrupted sequence. The fundamental reason is
that these methods try to find the best matching latent space representation
near the natural image manifold without any explicit distance-based loss. In
this paper, we present a semantically conditioned Generative Adversarial
Network (GAN) for sequence inpainting. The conditional information constrains
the GAN to map a latent representation to a point in image manifold respecting
the underlying pose and semantics of the scene. To the best of our knowledge,
this is the first work which simultaneously addresses consistency and
correctness of generative model based inpainting. We show that our generative
model learns to disentangle pose and appearance information; this independence
is exploited by our model to generate highly consistent reconstructions. The
conditional information also helps the generator produce sharper images than
the original GAN formulation, which leads to more appealing inpainting
performance. Though generic, our algorithm was targeted at inpainting faces.
When applied to the CelebA and YouTube Faces
datasets, the proposed method results in a significant improvement over the
current benchmark, both in terms of quantitative evaluation (Peak Signal to
Noise Ratio) and human visual scoring over diversified combinations of
resolutions and deformations.
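As a loose sketch of the conditioning idea (the shapes, layer sizes, and concatenation scheme are assumptions, not the paper's architecture), the generator can receive a latent appearance code concatenated with a pose/semantic condition vector, so generated frames respect the underlying pose while appearance is carried by the latent code:

```python
import torch
import torch.nn as nn

class CondGenerator(nn.Module):
    """Hypothetical conditional generator: latent appearance code + pose/semantic condition."""
    def __init__(self, z_dim=128, cond_dim=32, ch=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(z_dim + cond_dim, ch * 4, 4, 1, 0), nn.ReLU(True),  # 4x4
            nn.ConvTranspose2d(ch * 4, ch * 2, 4, 2, 1), nn.ReLU(True),            # 8x8
            nn.ConvTranspose2d(ch * 2, ch, 4, 2, 1), nn.ReLU(True),                # 16x16
            nn.ConvTranspose2d(ch, 3, 4, 2, 1), nn.Tanh(),                         # 32x32
        )

    def forward(self, z, cond):
        # z: (B, z_dim) appearance code; cond: (B, cond_dim) pose/semantic code.
        x = torch.cat([z, cond], dim=1)[:, :, None, None]
        return self.net(x)
```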
Align-and-Attend Network for Globally and Locally Coherent Video Inpainting
We propose a novel feed-forward network for video inpainting. We use a set of
sampled video frames as references, from which visible content is taken to fill
the holes of a target frame. Our video inpainting network consists of two stages. The
first stage is an alignment module that uses computed homographies between the
reference frames and the target frame. The visible patches are then aggregated
based on the frame similarity to fill in the target holes roughly. The second
stage is a non-local attention module that matches the generated patches with
known reference patches (in space and time) to refine the previous global
alignment stage. Both stages use a large spatial-temporal window for the
reference and thus enable modeling long-range correlations between distant
information and the hole regions. Therefore, even challenging scenes with large
or slowly moving holes can be handled, which are hardly modeled by existing
flow-based approaches. Our network is also designed with a recurrent
propagation stream to encourage temporal consistency in video results.
Experiments on video object removal demonstrate that our method inpaints the
holes with globally and locally coherent contents.
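The second-stage refinement can be pictured as a non-local attention block in which features of the target frame attend to features of the aligned reference frames across space and time; the sketch below is an illustrative approximation with assumed shapes, not the authors' code:

```python
import torch
import torch.nn as nn

class NonLocalRefine(nn.Module):
    """Target-frame features attend to aligned reference features across space and time."""
    def __init__(self, ch=128):
        super().__init__()
        self.query = nn.Conv2d(ch, ch // 2, 1)
        self.key = nn.Conv2d(ch, ch // 2, 1)
        self.value = nn.Conv2d(ch, ch, 1)

    def forward(self, target, refs):
        # target: (B, C, H, W); refs: (B, T, C, H, W) reference features after alignment.
        b, t, c, h, w = refs.shape
        q = self.query(target).flatten(2).transpose(1, 2)               # (B, HW, C/2)
        k = self.key(refs.view(b * t, c, h, w)).view(b, t, -1, h * w)   # (B, T, C/2, HW)
        v = self.value(refs.view(b * t, c, h, w)).view(b, t, c, h * w)  # (B, T, C, HW)
        k = k.permute(0, 2, 1, 3).reshape(b, -1, t * h * w)             # (B, C/2, T*HW)
        v = v.permute(0, 2, 1, 3).reshape(b, c, t * h * w)              # (B, C, T*HW)
        attn = torch.softmax(q @ k, dim=-1)                             # (B, HW, T*HW)
        out = (attn @ v.transpose(1, 2)).transpose(1, 2).view(b, c, h, w)
        return target + out                                             # residual refinement
```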
Improving Video Generation for Multi-functional Applications
In this paper, we aim to improve the state-of-the-art video generative
adversarial networks (GANs) with a view towards multi-functional applications.
Our improved video GAN model does not separate foreground from background nor
dynamic from static patterns, but learns to generate the entire video clip
conjointly. Our model can thus be trained to generate - and learn from - a
broad set of videos with no restriction. This is achieved by designing a robust
one-stream video generation architecture with an extension of the
state-of-the-art Wasserstein GAN framework that allows for better convergence.
The experimental results show that our improved video GAN model outperforms
state-of-the-art video generative models on multiple challenging datasets.
Furthermore, we demonstrate the superiority of our model by successfully
extending it to three challenging problems: video colorization, video
inpainting, and future prediction. To the best of our knowledge, this is the
first work using GANs to colorize and inpaint video clips.
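A hedged sketch of a Wasserstein-GAN-with-gradient-penalty style critic objective that such a one-stream video GAN could rely on for more stable convergence is shown below; the critic is assumed to score whole clips of shape (B, C, T, H, W), and this is not the paper's exact loss:

```python
import torch

def wgan_gp_critic_loss(critic, real, fake, gp_weight=10.0):
    """WGAN-GP critic loss over video clips (real, fake: (B, C, T, H, W))."""
    # Wasserstein critic term: push fake scores down, real scores up.
    loss = critic(fake).mean() - critic(real).mean()
    # Gradient penalty on random interpolations between real and fake clips.
    eps = torch.rand(real.size(0), 1, 1, 1, 1, device=real.device)
    interp = (eps * real + (1 - eps) * fake).requires_grad_(True)
    grad = torch.autograd.grad(critic(interp).sum(), interp, create_graph=True)[0]
    gp = ((grad.flatten(1).norm(2, dim=1) - 1) ** 2).mean()
    return loss + gp_weight * gp
```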
Yes, we GAN: Applying Adversarial Techniques for Autonomous Driving
Generative Adversarial Networks (GANs) have gained great popularity since
their introduction in 2014. Research on GANs is growing rapidly, and there are
many variants of the original GAN focusing on various aspects of deep
learning. GANs are perceived as the most impactful direction of machine learning
in the last decade. This paper focuses on the application of GANs in autonomous
driving including topics such as advanced data augmentation, loss function
learning, semi-supervised learning, etc. We formalize and review key
applications of adversarial techniques and discuss challenges and open problems
to be addressed. Comment: Accepted for publication in Electronic Imaging, Autonomous Vehicles
and Machines 2019. arXiv admin note: text overlap with arXiv:1606.05908 by
other authors.
The Angel is in the Priors: Improving GAN based Image and Sequence Inpainting with Better Noise and Structural Priors
Contemporary deep learning based inpainting algorithms are mainly based on a
hybrid dual stage training policy of supervised reconstruction loss followed by
an unsupervised adversarial critic loss. However, there is a dearth of
literature on fully unsupervised GAN-based inpainting frameworks. The primary
obstacle to the latter approach is its prohibitively slow iterative
optimization requirement during inference to find a matching noise prior for a
masked image. In this paper, we show that priors matter in GANs: we learn a
data-driven parametric network to predict a matching prior for a given image. This
converts an iterative paradigm into a single feed-forward inference pipeline with
a massive 1500X speedup and a simultaneous improvement in reconstruction quality.
We show that an additional structural prior imposed on the GAN model results in
higher-fidelity outputs. To extend our model to sequence inpainting, we
propose a recurrent net based grouped noise prior learning. To our knowledge,
this is the first demonstration of an unsupervised GAN based sequence
inpainting. A further improvement in sequence inpainting is achieved with an
additional subsequence consistency loss. These contributions improve the
spatio-temporal characteristics of reconstructed sequences. Extensive
experiments conducted on the SVHN, Stanford Cars, CelebA, and CelebA-HQ image
datasets, synthetic sequences, and the VidTIMIT video dataset reveal that we
consistently improve upon previous unsupervised baselines and also achieve
performance comparable (and sometimes superior) to hybrid benchmarks.
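As a small sketch of the feed-forward prior-prediction idea (the names, shapes, and layer choices are assumptions for illustration), an encoder can be trained to map a masked image directly to a matching noise prior, which a fixed pretrained generator then decodes in a single pass instead of per-image iterative optimization:

```python
import torch
import torch.nn as nn

class PriorPredictor(nn.Module):
    """Hypothetical encoder that predicts a matching latent prior from a masked image."""
    def __init__(self, z_dim=128, ch=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, ch, 4, 2, 1), nn.LeakyReLU(0.2, inplace=True),      # masked RGB + mask
            nn.Conv2d(ch, ch * 2, 4, 2, 1), nn.LeakyReLU(0.2, inplace=True),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(ch * 2, z_dim),
        )

    def forward(self, masked_img, mask):
        # masked_img: (B, 3, H, W); mask: (B, 1, H, W) with 1 marking missing pixels.
        return self.net(torch.cat([masked_img, mask], dim=1))

# Inference becomes a single forward pass instead of per-image optimization:
# z = prior_net(masked_img, mask)   # predicted noise prior
# completed = generator(z)          # fixed pretrained GAN generator (assumed)
```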