Free-Form Image Inpainting with Gated Convolution
We present a generative image inpainting system to complete images with
free-form mask and guidance. The system is based on gated convolutions learned
from millions of images without additional labelling efforts. The proposed
gated convolution solves the issue of vanilla convolution that treats all input
pixels as valid ones and generalizes partial convolution by providing a learnable
dynamic feature selection mechanism for each channel at each spatial location
across all layers. Moreover, as free-form masks may appear anywhere in images
with any shape, global and local GANs designed for a single rectangular mask
are not applicable. Thus, we also present a patch-based GAN loss, named
SN-PatchGAN, by applying a spectral-normalized discriminator on dense image
patches. SN-PatchGAN is simple in formulation, fast and stable in training.
Results on automatic image inpainting and user-guided extension demonstrate
that our system generates higher-quality and more flexible results than
previous methods. Our system helps users quickly remove distracting objects,
modify image layouts, clear watermarks and edit faces. Code, demo and models
are available at: https://github.com/JiahuiYu/generative_inpainting
Comment: Accepted in ICCV 2019 Oral; open sourced; interactive demo available:
http://jiahuiyu.com/deepfill
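A minimal PyTorch sketch of the gated convolution described above: one convolution produces features while a parallel convolution produces a per-channel, per-location sigmoid gate. Layer names, channel counts, and the ELU activation are illustrative assumptions, not taken from the released code.

```python
import torch
import torch.nn as nn

class GatedConv2d(nn.Module):
    """Soft, learnable feature selection per channel and spatial location."""
    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1, dilation=1):
        super().__init__()
        padding = dilation * (kernel_size - 1) // 2
        # Feature branch and gating branch share the same receptive field.
        self.feature = nn.Conv2d(in_ch, out_ch, kernel_size, stride, padding, dilation)
        self.gate = nn.Conv2d(in_ch, out_ch, kernel_size, stride, padding, dilation)
        self.act = nn.ELU()

    def forward(self, x):
        return self.act(self.feature(x)) * torch.sigmoid(self.gate(x))

# Usage: an RGB image concatenated with its free-form mask (4 channels in).
layer = GatedConv2d(4, 32)
out = layer(torch.randn(1, 4, 256, 256))   # -> (1, 32, 256, 256)
```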
Free-form Video Inpainting with 3D Gated Convolution and Temporal PatchGAN
Free-form video inpainting is a very challenging task that could be widely
used for video editing such as text removal. Existing patch-based methods could
not handle non-repetitive structures such as faces, while directly applying
image-based inpainting models to videos will result in temporal inconsistency
(see http://bit.ly/2Fu1n6b ). In this paper, we introduce a deep learning
based free-form video inpainting model, with proposed 3D gated convolutions to
tackle the uncertainty of free-form masks and a novel Temporal PatchGAN loss to
enhance temporal consistency. In addition, we collect videos and design a
free-form mask generation algorithm to build the free-form video inpainting
(FVI) dataset for training and evaluation of video inpainting models. We
demonstrate the benefits of these components and experiments on both the
FaceForensics and our FVI dataset suggest that our method is superior to
existing ones. Related source code, full-resolution result videos, and the FVI
dataset can be found on GitHub:
https://github.com/amjltc295/Free-Form-Video-Inpainting
Comment: Accepted to ICCV 2019
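A minimal sketch of the two components named above, assuming the 3D gated convolution mirrors the 2D feature/gate split over (time, height, width) and that the Temporal PatchGAN applies a hinge loss to dense spatio-temporal patch scores; all names and sizes are illustrative.

```python
import torch
import torch.nn as nn

class GatedConv3d(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        self.feature = nn.Conv3d(in_ch, out_ch, kernel_size, padding=pad)
        self.gate = nn.Conv3d(in_ch, out_ch, kernel_size, padding=pad)

    def forward(self, x):                      # x: (N, C, T, H, W)
        return torch.relu(self.feature(x)) * torch.sigmoid(self.gate(x))

def temporal_patchgan_d_loss(real_scores, fake_scores):
    # Hinge loss averaged over every spatio-temporal patch score.
    return (torch.relu(1.0 - real_scores).mean()
            + torch.relu(1.0 + fake_scores).mean())

clip = torch.randn(1, 4, 8, 128, 128)          # 8 frames, RGB + mask channels
print(GatedConv3d(4, 32)(clip).shape)          # torch.Size([1, 32, 8, 128, 128])
```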
Learnable Gated Temporal Shift Module for Deep Video Inpainting
How to efficiently utilize temporal information to recover videos in a
consistent way is the main issue for video inpainting problems. Conventional 2D
CNNs have achieved good performance on image inpainting but often lead to
temporally inconsistent results where frames will flicker when applied to
videos (see
https://www.youtube.com/watch?v=87Vh1HDBjD0&list=PLPoVtv-xp_dL5uckIzz1PKwNjg1yI0I94&index=1);
3D CNNs can capture temporal information but are computationally intensive and
hard to train. In this paper, we present a novel component termed Learnable
Gated Temporal Shift Module (LGTSM) for video inpainting models that could
effectively tackle arbitrary video masks without additional parameters from 3D
convolutions. LGTSM is designed to let 2D convolutions make use of neighboring
frames more efficiently, which is crucial for video inpainting. Specifically,
in each layer, LGTSM learns to shift some channels to its temporal neighbors so
that 2D convolutions could be enhanced to handle temporal information.
Meanwhile, a gated convolution is applied to the layer to identify the masked
areas that are harmful to conventional convolutions. On the FaceForensics
and Free-form Video Inpainting (FVI) datasets, our model achieves
state-of-the-art results with only 33% of the parameters and inference time.
Comment: Accepted to BMVC 2019
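A minimal sketch of the underlying idea: part of the channels are shifted to neighboring frames before a gated 2D convolution, so spatial convolutions see temporal context. The actual LGTSM makes the shift learnable; the fixed shift, the 1/4 fraction, and all names here are illustrative assumptions.

```python
import torch
import torch.nn as nn

def temporal_shift(x, fold_div=4):
    # x: (N, T, C, H, W). One fold of channels comes from the previous frame,
    # one fold from the next frame, and the remaining channels stay in place.
    n, t, c, h, w = x.shape
    fold = c // fold_div
    out = torch.zeros_like(x)
    out[:, 1:, :fold] = x[:, :-1, :fold]                    # from the past
    out[:, :-1, fold:2 * fold] = x[:, 1:, fold:2 * fold]    # from the future
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]               # unchanged
    return out

class GatedTemporalShiftBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.feature = nn.Conv2d(ch, ch, 3, padding=1)
        self.gate = nn.Conv2d(ch, ch, 3, padding=1)

    def forward(self, x):                                   # x: (N, T, C, H, W)
        n, t, c, h, w = x.shape
        y = temporal_shift(x).reshape(n * t, c, h, w)
        y = torch.relu(self.feature(y)) * torch.sigmoid(self.gate(y))
        return y.reshape(n, t, c, h, w)
```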
PEPSI++: Fast and Lightweight Network for Image Inpainting
Among the various generative adversarial network (GAN)-based image inpainting
methods, a coarse-to-fine network with a contextual attention module (CAM) has
shown remarkable performance. However, owing to two stacked generative
networks, the coarse-to-fine network needs numerous computational resources
such as convolution operations and network parameters, which result in low
speed. To address this problem, we propose a novel network architecture called
PEPSI: parallel extended-decoder path for semantic inpainting network, which
aims at reducing the hardware costs and improving the inpainting performance.
PEPSI consists of a single shared encoding network and parallel decoding
networks called coarse and inpainting paths. The coarse path produces a
preliminary inpainting result to train the encoding network for the prediction
of features for the CAM. Simultaneously, the inpainting path generates a
higher-quality inpainting result using the refined features reconstructed via
the CAM. In
addition, we propose Diet-PEPSI that significantly reduces the network
parameters while maintaining the performance. In Diet-PEPSI, to capture the
global contextual information with low hardware costs, we propose novel
rate-adaptive dilated convolutional layers, which employ the common weights but
produce dynamic features depending on the given dilation rates. Extensive
experiments comparing the performance with state-of-the-art image inpainting
methods demonstrate that both PEPSI and Diet-PEPSI improve the quantitative
scores, i.e. the peak signal-to-noise ratio (PSNR) and structural similarity
(SSIM), as well as significantly reduce hardware costs such as computational
time and the number of network parameters.
Comment: Accepted to IEEE Transactions on Neural Networks and Learning Systems. To be published.
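A minimal sketch of the shared-weight idea behind the rate-adaptive dilated layers: a single kernel is reused at several dilation rates via the functional convolution, so parameters do not grow with the number of rates. The per-rate feature modulation from the paper is omitted, and all names here are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedDilatedConv(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size=3):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(out_ch, in_ch, kernel_size, kernel_size))
        nn.init.kaiming_uniform_(self.weight, a=5 ** 0.5)
        self.kernel_size = kernel_size

    def forward(self, x, dilation):
        pad = dilation * (self.kernel_size - 1) // 2
        # The same weight tensor serves every dilation rate.
        return F.conv2d(x, self.weight, padding=pad, dilation=dilation)

layer = SharedDilatedConv(32, 32)
x = torch.randn(1, 32, 64, 64)
feats = [layer(x, d) for d in (1, 2, 4, 8)]   # growing receptive field, one kernel
```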
Coherent Semantic Attention for Image Inpainting
The latest deep learning-based approaches have shown promising results for
the challenging task of inpainting missing regions of an image. However, the
existing methods often generate contents with blurry textures and distorted
structures due to the discontinuity of the local pixels. From a semantic-level
perspective, the local pixel discontinuity is mainly because these methods
ignore the semantic relevance and feature continuity of hole regions. To handle
this problem, we investigate the human behavior in repairing pictures and
propose a refined deep generative model-based approach with a novel coherent
semantic attention (CSA) layer, which can not only preserve contextual
structure but also make more effective predictions of missing parts by modeling
the semantic relevance between the hole features. The task is divided into
two steps, rough and refinement, and each step is modeled with a neural network
under the U-Net architecture, where the CSA layer is embedded into the encoder
of the refinement step. To stabilize the network training process and encourage
the CSA layer to learn more effective parameters, we propose a consistency loss
that enforces both the CSA layer and the corresponding CSA layer in the decoder
to be close to the VGG feature layer of a ground-truth image simultaneously.
Experiments on the CelebA, Places2, and Paris StreetView datasets validate the
effectiveness of the proposed method for image inpainting and show that it
obtains higher-quality images than the existing state-of-the-art approaches.
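A minimal sketch of a consistency-style loss as described above: features from the CSA layer and from the corresponding decoder layer are both pulled toward VGG features of the ground-truth image. The choice of VGG layer, the use of an L2 distance, and the assumption that the feature shapes match are all illustrative.

```python
import torch
import torch.nn.functional as F
import torchvision

# Frozen VGG-16 feature extractor used only to produce the target features.
vgg = torchvision.models.vgg16(weights="IMAGENET1K_V1").features[:16].eval()
for p in vgg.parameters():
    p.requires_grad_(False)

def consistency_loss(csa_feat, decoder_feat, ground_truth):
    with torch.no_grad():
        target = vgg(ground_truth)      # VGG feature map of the ground-truth image
    return F.mse_loss(csa_feat, target) + F.mse_loss(decoder_feat, target)
```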
Foreground-aware Image Inpainting
Existing image inpainting methods typically fill holes by borrowing
information from surrounding pixels. They often produce unsatisfactory results
when the holes overlap with or touch foreground objects due to lack of
information about the actual extent of foreground and background regions within
the holes. These scenarios, however, are very important in practice, especially
for applications such as the removal of distracting objects. To address the
problem, we propose a foreground-aware image inpainting system that explicitly
disentangles structure inference and content completion. Specifically, our
model learns to predict the foreground contour first, and then inpaints the
missing region using the predicted contour as guidance. We show that by such
disentanglement, the contour completion model predicts reasonable contours of
objects, and further substantially improves the performance of image
inpainting. Experiments show that our method significantly outperforms existing
methods and achieves superior inpainting results on challenging cases with
complex compositions.
Comment: Camera-ready version of CVPR 2019 with supplementary material.
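A minimal sketch of the two-stage disentanglement described above: a contour network first predicts the foreground contour within the hole, then the inpainting network completes the image conditioned on that contour. Both sub-networks are placeholders; only the wiring is implied by the abstract, and the channel arrangements are assumptions.

```python
import torch
import torch.nn as nn

class ForegroundAwareInpainter(nn.Module):
    def __init__(self, contour_net: nn.Module, inpaint_net: nn.Module):
        super().__init__()
        self.contour_net = contour_net    # expects image-with-hole + mask
        self.inpaint_net = inpaint_net    # expects image-with-hole + mask + contour

    def forward(self, image, mask):
        holed = image * (1 - mask)
        # Stage 1: infer the structure (foreground contour) inside the hole.
        contour = torch.sigmoid(self.contour_net(torch.cat([holed, mask], dim=1)))
        # Stage 2: complete the content, guided by the predicted contour.
        completed = self.inpaint_net(torch.cat([holed, mask, contour], dim=1))
        return completed, contour
```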
Texture Modeling with Convolutional Spike-and-Slab RBMs and Deep Extensions
We apply the spike-and-slab Restricted Boltzmann Machine (ssRBM) to texture
modeling. The ssRBM with tiled-convolution weight sharing (TssRBM) achieves or
surpasses the state-of-the-art on texture synthesis and inpainting by
parametric models. We also develop a novel RBM model with a spike-and-slab
visible layer and binary variables in the hidden layer. This model is designed
to be stacked on top of the TssRBM. We show the resulting deep belief network
(DBN) is a powerful generative model that improves on single-layer models and
is capable of modeling not only single high-resolution and challenging textures
but also multiple textures.
Deep Blind Video Decaptioning by Temporal Aggregation and Recurrence
Blind video decaptioning is a problem of automatically removing text overlays
and inpainting the occluded parts in videos without any input masks. While
recent deep learning based inpainting methods deal with a single image and
mostly assume that the positions of the corrupted pixels are known, we aim at
automatic text removal in video sequences without mask information. In this
paper, we propose a simple yet effective framework for fast blind video
decaptioning. We construct an encoder-decoder model, where the encoder takes
multiple source frames that can provide visible pixels revealed from the scene
dynamics. These hints are aggregated and fed into the decoder. We apply a
residual connection from the input frame to the decoder output to make our
network focus on the corrupted regions only. Our proposed model ranked first
in the ECCV ChaLearn 2018 LAP Inpainting Competition Track 2: Video
decaptioning. In addition, we further improve this strong model by
applying a recurrent feedback. The recurrent feedback not only enforces
temporal coherence but also provides strong clues on where the corrupted pixels
are. Both qualitative and quantitative experiments demonstrate that our full
model produces accurate and temporally consistent video results in real time
(50+ fps).
Comment: Accepted at CVPR 2019.
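A minimal sketch of the residual connection described above: the decoder predicts only a correction that is added back to the input target frame, so the network can concentrate on the corrupted (text-overlay) regions. The encoder/decoder internals and the choice of the middle frame as the target are assumptions.

```python
import torch
import torch.nn as nn

class BlindDecaptioner(nn.Module):
    def __init__(self, encoder: nn.Module, decoder: nn.Module):
        super().__init__()
        self.encoder, self.decoder = encoder, decoder

    def forward(self, frames):
        # frames: (N, T, C, H, W); restore the middle frame using its neighbors.
        target = frames[:, frames.shape[1] // 2]
        hints = self.encoder(frames)        # aggregate visible pixels over time
        residual = self.decoder(hints)      # correction for the corrupted regions
        return target + residual            # residual (skip) connection from input
```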
Contextual Attention Mechanism, SRGAN Based Inpainting System for Eliminating Interruptions from Images
The new alternative is to use deep learning to inpaint any image by utilizing
image classification and computer vision techniques. In general, image
inpainting is a task of recreating or reconstructing any broken image which
could be a photograph or oil/acrylic painting. With the advancement in the
field of Artificial Intelligence, this topic has become popular among AI
enthusiasts. With our approach, we propose an initial end-to-end pipeline for
inpainting images using a complete Machine Learning approach instead of a
conventional application-based approach. We first use the YOLO model to
automatically identify and localize the object we wish to remove from the
image. Using the result obtained from the model we can generate a mask for the
same. After this, we provide the masked image and original image to the GAN
model which uses the Contextual Attention method to fill in the region. It
consists of two generator networks and two discriminator networks and is also
called a coarse-to-fine network structure. The two generators use fully
convolutional networks; the global discriminator takes the entire image as
input, while the local discriminator takes the filled region as input. The
contextual attention mechanism is proposed to effectively borrow neighboring
information from distant spatial locations for reconstructing the missing
pixels. The third part of our implementation uses SRGAN to upscale the
inpainted image back to its original resolution. Our work is inspired by the
papers Free-Form Image Inpainting with Gated Convolution and Generative Image
Inpainting with Contextual Attention.
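A minimal sketch of the mask-generation step in the pipeline described above: bounding boxes returned by the detector are rasterized into a binary mask marking the object to remove, which is then passed with the original image to the inpainting GAN. The (x1, y1, x2, y2) box format is an assumption; the detector and GAN stages themselves are not shown.

```python
import numpy as np

def boxes_to_mask(boxes, height, width):
    """Rasterize detector boxes (x1, y1, x2, y2) into a binary inpainting mask."""
    mask = np.zeros((height, width), dtype=np.uint8)
    for x1, y1, x2, y2 in boxes:
        mask[int(y1):int(y2), int(x1):int(x2)] = 1   # 1 = region to fill
    return mask

# Example: one detected object to remove from a 256x256 image.
mask = boxes_to_mask([(40, 60, 120, 180)], height=256, width=256)
```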
Align-and-Attend Network for Globally and Locally Coherent Video Inpainting
We propose a novel feed-forward network for video inpainting. We use a set of
sampled video frames as references from which visible contents are taken to fill the hole
of a target frame. Our video inpainting network consists of two stages. The
first stage is an alignment module that uses computed homographies between the
reference frames and the target frame. The visible patches are then aggregated
based on the frame similarity to fill in the target holes roughly. The second
stage is a non-local attention module that matches the generated patches with
known reference patches (in space and time) to refine the previous global
alignment stage. Both stages use a large spatial-temporal window for the
reference frames and thus enable modeling long-range correlations between
distant information and the hole regions. Therefore, even challenging scenes
with large or slowly moving holes can be handled, which can hardly be modeled
by existing flow-based approaches. Our network is also designed with a
recurrent propagation stream to encourage temporal consistency in video
results. Experiments on video object removal demonstrate that our method
inpaints the holes with globally and locally coherent contents.
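A minimal sketch of the alignment idea in the first stage, using classical feature matching to estimate a homography and warp a reference frame onto the target frame so its visible pixels can fill the target hole. ORB features and OpenCV are assumptions made for illustration; the paper's module computes the homographies within the network.

```python
import cv2
import numpy as np

def align_reference(reference, target):
    """Warp `reference` into the coordinate frame of `target` via a homography."""
    orb = cv2.ORB_create()
    k1, d1 = orb.detectAndCompute(reference, None)
    k2, d2 = orb.detectAndCompute(target, None)
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(d1, d2)
    src = np.float32([k1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([k2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC)
    h, w = target.shape[:2]
    return cv2.warpPerspective(reference, H, (w, h))
```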