VORNet: Spatio-temporally Consistent Video Inpainting for Object Removal
Video object removal is a challenging task in video processing that often
requires massive human effort. Given the mask of the foreground object in each
frame, the goal is to complete (inpaint) the object region and generate a video
without the target object. While recently deep learning based methods have
achieved great success on the image inpainting task, they often lead to
inconsistent results between frames when applied to videos. In this work, we
propose a novel learning-based Video Object Removal Network (VORNet) to solve
the video object removal task in a spatio-temporally consistent manner, by
combining the optical flow warping and image-based inpainting model.
Experiments are done on our Synthesized Video Object Removal (SVOR) dataset
based on the YouTube-VOS video segmentation dataset, and both the objective and
subjective evaluation demonstrate that our VORNet generates more spatially and
temporally consistent videos compared with existing methods.
Comment: Accepted to CVPRW 201
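
To make the flow-warping-plus-blending idea concrete, the sketch below warps the previously completed frame with an optical flow field and composites it with a single-image inpainting result under a validity mask. It is a minimal PyTorch illustration with assumed tensor shapes, not VORNet's actual architecture; warp, compose, and valid_mask are hypothetical names.

    import torch
    import torch.nn.functional as F

    def warp(frame, flow):
        # frame: (B, C, H, W); flow: (B, 2, H, W) holding per-pixel (dx, dy)
        b, _, h, w = frame.shape
        ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
        base = torch.stack((xs, ys), dim=0).float().to(frame.device)   # (2, H, W)
        coords = base.unsqueeze(0) + flow                              # follow the flow
        # normalize coordinates to [-1, 1] as required by grid_sample
        gx = 2.0 * coords[:, 0] / (w - 1) - 1.0
        gy = 2.0 * coords[:, 1] / (h - 1) - 1.0
        grid = torch.stack((gx, gy), dim=-1)                           # (B, H, W, 2)
        return F.grid_sample(frame, grid, align_corners=True)

    def compose(prev_completed, flow, image_inpaint, valid_mask):
        # valid_mask is 1 where the warped pixel can be trusted, 0 elsewhere
        warped = warp(prev_completed, flow)
        return valid_mask * warped + (1 - valid_mask) * image_inpaint
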
Deep Video Inpainting
Video inpainting aims to fill spatio-temporal holes with plausible content in
a video. Despite tremendous progress of deep neural networks for image
inpainting, it is challenging to extend these methods to the video domain due
to the additional time dimension. In this work, we propose a novel deep network
architecture for fast video inpainting. Built upon an image-based
encoder-decoder model, our framework is designed to collect and refine
information from neighbor frames and synthesize still-unknown regions. At the
same time, the output is enforced to be temporally consistent by a recurrent
feedback and a temporal memory module. Compared with the state-of-the-art image
inpainting algorithm, our method produces videos that are much more
semantically correct and temporally smooth. In contrast to the prior video
completion method which relies on time-consuming optimization, our method runs
in near real-time while generating competitive video results. Finally, we
apply our framework to the video retargeting task and obtain visually pleasing
results.
Comment: Accepted at CVPR 201
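
The toy module below shows the general shape of such a design: several neighbor frames are encoded, their features aggregated, and the previous output is fed back into the decoder. It is a deliberately tiny stand-in with assumed channel counts (TinyVideoInpainter is an invented name), not the network described in the paper.

    import torch
    import torch.nn as nn

    class TinyVideoInpainter(nn.Module):
        def __init__(self, ch=32):
            super().__init__()
            self.encoder = nn.Sequential(
                nn.Conv2d(4, ch, 3, stride=2, padding=1), nn.ReLU(),   # RGB + mask
                nn.Conv2d(ch, ch, 3, stride=2, padding=1), nn.ReLU(),
            )
            self.decoder = nn.Sequential(
                nn.ConvTranspose2d(ch + 3, ch, 4, stride=2, padding=1), nn.ReLU(),
                nn.ConvTranspose2d(ch, 3, 4, stride=2, padding=1),
            )

        def forward(self, frames, masks, prev_output):
            # frames: (B, T, 3, H, W); masks: (B, T, 1, H, W); prev_output: (B, 3, H, W)
            b, t, _, h, w = frames.shape
            x = torch.cat([frames, masks], dim=2).flatten(0, 1)        # (B*T, 4, H, W)
            feats = self.encoder(x).view(b, t, -1, h // 4, w // 4)
            pooled = feats.max(dim=1).values                           # aggregate neighbor frames
            prev_small = nn.functional.interpolate(prev_output, scale_factor=0.25)
            return self.decoder(torch.cat([pooled, prev_small], dim=1))
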
Frame-Recurrent Video Inpainting by Robust Optical Flow Inference
In this paper, we present a new inpainting framework for recovering missing
regions of video frames. Compared with image inpainting, performing this task
on video presents new challenges, such as how to preserve temporal consistency
and spatial details, and how to handle arbitrary input video sizes and lengths
quickly and efficiently. To this end, we propose a novel deep learning
architecture which incorporates ConvLSTM and optical flow for modeling the
spatio-temporal consistency in videos. It also saves considerable computational
resources, so our method can handle videos with larger frame sizes and
arbitrary length in a streaming fashion in real time. Furthermore, to generate an accurate
optical flow from corrupted frames, we propose a robust flow generation module,
where two sources of flows are fed and a flow blending network is trained to
fuse them. We conduct extensive experiments to evaluate our method in various
scenarios and different datasets, both qualitatively and quantitatively. The
experimental results demonstrate the superiority of our method compared with
state-of-the-art inpainting approaches.
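
A rough sketch of the flow-fusion idea follows: two candidate flows from different sources are concatenated and a small network predicts a per-pixel weight to blend them. FlowBlender and its layer sizes are assumptions for illustration, not the paper's blending network.

    import torch
    import torch.nn as nn

    class FlowBlender(nn.Module):
        def __init__(self):
            super().__init__()
            # takes both candidate flows (2 + 2 channels) and predicts a per-pixel weight
            self.net = nn.Sequential(
                nn.Conv2d(4, 16, 3, padding=1), nn.ReLU(),
                nn.Conv2d(16, 1, 3, padding=1), nn.Sigmoid(),
            )

        def forward(self, flow_a, flow_b):
            w = self.net(torch.cat([flow_a, flow_b], dim=1))  # (B, 1, H, W) in [0, 1]
            return w * flow_a + (1 - w) * flow_b              # blended flow
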
Deep Blind Video Decaptioning by Temporal Aggregation and Recurrence
Blind video decaptioning is a problem of automatically removing text overlays
and inpainting the occluded parts in videos without any input masks. While
recent deep learning based inpainting methods deal with a single image and
mostly assume that the positions of the corrupted pixels are known, we aim at
automatic text removal in video sequences without mask information. In this
paper, we propose a simple yet effective framework for fast blind video
decaptioning. We construct an encoder-decoder model, where the encoder takes
multiple source frames that can provide visible pixels revealed from the scene
dynamics. These hints are aggregated and fed into the decoder. We apply a
residual connection from the input frame to the decoder output to force our
network to focus on the corrupted regions only. Our proposed model ranked
first in the ECCV Chalearn 2018 LAP Inpainting Competition Track 2: Video
decaptioning. In addition, we further improve this strong model by applying
recurrent feedback. The recurrent feedback not only enforces
temporal coherence but also provides strong clues on where the corrupted pixels
are. Both qualitative and quantitative experiments demonstrate that our full
model produces accurate and temporally consistent video results in real time
(50+ fps).
Comment: Accepted at CVPR 201
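
The snippet below illustrates only the residual connection: the network predicts a correction that is added to the input frame, so unchanged regions pass through untouched. For brevity it takes a single frame rather than multiple aggregated source frames; ResidualDecaptioner and the layer sizes are hypothetical.

    import torch
    import torch.nn as nn

    class ResidualDecaptioner(nn.Module):
        def __init__(self, ch=32):
            super().__init__()
            self.body = nn.Sequential(
                nn.Conv2d(3, ch, 3, padding=1), nn.ReLU(),
                nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
                nn.Conv2d(ch, 3, 3, padding=1),
            )

        def forward(self, frame):
            # residual connection: the body only has to model the correction
            return frame + self.body(frame)
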
Multi-View Inpainting for RGB-D Sequence
In this work we propose a novel approach to remove undesired objects from
RGB-D sequences captured with freely moving cameras, which enables static 3D
reconstruction. Our method jointly uses existing information from multiple
frames and generates new information via inpainting techniques. We use balanced
rules to select source frames, a local homography-based image warping method
for alignment, and a Markov random field (MRF) based approach for combining
existing information. For the remaining holes, we employ an exemplar-based
multi-view inpainting method to complete the color image and coherently use it
as guidance to complete the corresponding depth. Experiments show that our
approach is effective at removing the undesired objects and inpainting the
holes.
Comment: 10 page
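
As a hedged illustration of the warping-and-copy step (here with one global homography rather than local homographies), matched keypoints give a homography, the source frame is warped into the target view, and hole pixels are copied over. fill_from_source and its arguments are invented names; the MRF combination and exemplar inpainting stages are not shown.

    import cv2
    import numpy as np

    def fill_from_source(target, target_hole_mask, source, src_pts, dst_pts):
        # src_pts / dst_pts: matched keypoints as float32 arrays of shape (N, 2), N >= 4
        H, _ = cv2.findHomography(src_pts, dst_pts, cv2.RANSAC)
        warped = cv2.warpPerspective(source, H, (target.shape[1], target.shape[0]))
        filled = target.copy()
        filled[target_hole_mask > 0] = warped[target_hole_mask > 0]   # copy into the holes
        return filled
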
Unsupervised Deep Context Prediction for Background Foreground Separation
In many advanced video-based applications, background modeling is a
pre-processing step to eliminate redundant data, for instance in tracking or
video surveillance applications. Over the past years, background subtraction
has usually been based on low-level or hand-crafted features such as raw color
components, gradients, or local binary patterns. The performance of background
subtraction algorithms suffers in the presence of various challenges such as
dynamic backgrounds, photometric variations, camera jitter, and shadows. To
handle these challenges and achieve accurate background modeling, we propose a
unified framework based on image inpainting: an unsupervised visual feature
learning hybrid generative adversarial algorithm based on context prediction.
We also present a solution for random region inpainting that fuses center
region inpainting and random region inpainting with the help of the Poisson
blending technique. Furthermore, we evaluate foreground object detection by
combining our proposed method with morphological operations. A comparison of
our proposed method with 12 state-of-the-art methods shows its stability for
background estimation and foreground detection.
Comment: 17 page
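
The fragment below sketches how two inpainting outputs could be fused with Poisson blending using OpenCV's seamlessClone; fuse_inpaintings and its arguments are illustrative names, and this is not the authors' exact fusion procedure.

    import cv2
    import numpy as np

    def fuse_inpaintings(center_result, random_result, region_mask):
        # center_result, random_result: 8-bit BGR images of the same size
        # region_mask: 8-bit mask of the randomly inpainted region (255 inside)
        ys, xs = np.where(region_mask > 0)
        cx = int((xs.min() + xs.max()) / 2)        # center of the mask's bounding box
        cy = int((ys.min() + ys.max()) / 2)
        return cv2.seamlessClone(random_result, center_result, region_mask,
                                 (cx, cy), cv2.NORMAL_CLONE)
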
Improving Video Generation for Multi-functional Applications
In this paper, we aim to improve the state-of-the-art video generative
adversarial networks (GANs) with a view towards multi-functional applications.
Our improved video GAN model does not separate foreground from background nor
dynamic from static patterns, but learns to generate the entire video clip
conjointly. Our model can thus be trained to generate - and learn from - a
broad set of videos with no restriction. This is achieved by designing a robust
one-stream video generation architecture with an extension of the
state-of-the-art Wasserstein GAN framework that allows for better convergence.
The experimental results show that our improved video GAN model outperforms
state-of-the-art video generative models on multiple challenging datasets.
Furthermore, we demonstrate the superiority of our model by successfully
extending it to three challenging problems: video colorization, video
inpainting, and future prediction. To the best of our knowledge, this is the
first work using GANs to colorize and inpaint video clips.
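
As one common instantiation of a Wasserstein GAN objective for clips shaped (B, C, T, H, W), the sketch below computes a critic loss with a gradient penalty; the paper only states that it extends the Wasserstein GAN framework, so the penalty term and the function names here are assumptions.

    import torch

    def critic_loss(critic, real_clips, fake_clips, gp_weight=10.0):
        # Wasserstein critic loss: push real scores up and fake scores down
        loss = critic(fake_clips).mean() - critic(real_clips).mean()
        # gradient penalty on random interpolations between real and fake clips
        eps = torch.rand(real_clips.size(0), 1, 1, 1, 1, device=real_clips.device)
        interp = (eps * real_clips + (1 - eps) * fake_clips).requires_grad_(True)
        grads = torch.autograd.grad(critic(interp).sum(), interp, create_graph=True)[0]
        penalty = ((grads.flatten(1).norm(2, dim=1) - 1) ** 2).mean()
        return loss + gp_weight * penalty
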
Coordinate-based Texture Inpainting for Pose-Guided Image Generation
We present a new deep learning approach to pose-guided resynthesis of human
photographs. At the heart of the new approach is the estimation of the complete
body surface texture based on a single photograph. Since the input photograph
always observes only a part of the surface, we suggest a new inpainting method
that completes the texture of the human body. Rather than working directly with
colors of texture elements, the inpainting network estimates an appropriate
source location in the input image for each element of the body surface. This
correspondence field between the input image and the texture is then further
warped into the target image coordinate frame based on the desired pose,
effectively establishing the correspondence between the source and the target
view even when the pose change is drastic. The final convolutional network then
uses the established correspondence and all other available information to
synthesize the output image. A fully-convolutional architecture with deformable
skip connections guided by the estimated correspondence field is used. We show
state-of-the-art results for pose-guided image synthesis. Additionally, we
demonstrate the performance of our system for garment transfer and pose-guided
face resynthesis.
Comment: Published in Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition (CVPR). 201
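
The core sampling step can be sketched as follows: instead of predicting colors, a network outputs a coordinate field pointing into the input photograph, and colors are then gathered by bilinear sampling. gather_texture is an illustrative name and only this step is shown, not the full resynthesis pipeline.

    import torch
    import torch.nn.functional as F

    def gather_texture(input_image, coord_field):
        # input_image: (B, 3, H, W); coord_field: (B, 2, Ht, Wt) with values in [-1, 1]
        # giving, for each texture element, a normalized (x, y) source location.
        grid = coord_field.permute(0, 2, 3, 1)                    # (B, Ht, Wt, 2)
        return F.grid_sample(input_image, grid, align_corners=True)
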
On Variational Methods for Motion Compensated Inpainting
We develop in this paper a generic Bayesian framework for the joint
estimation of motion and recovery of missing data in a damaged video sequence.
Using the standard maximum a posteriori to variational formulation rationale, we
derive generic minimum energy formulations for the estimation of a
reconstructed sequence as well as motion recovery. We instantiate these energy
formulations and, from their Euler-Lagrange equations, we propose full
multiresolution algorithms to compute good local minimizers for our
energies and discuss their numerical implementations, focusing on the missing
data recovery part, i.e. inpainting. Experimental results for synthetic as well
as real sequences are presented. Image sequences and extra material are
available at http://image.diku.dk/francois/seqinp.php.
Comment: DIKU Technical report 2009 with some small correction
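
As a hedged illustration of the kind of functional such a framework minimizes (not the paper's exact energy), one can write a joint energy over the reconstructed sequence u and the motion field w, with illustrative regularization weights alpha and beta:

    E(u, w) = \int \big( u(x + w(x,t),\, t+1) - u(x, t) \big)^2 \, dx\, dt
              + \alpha \int \lVert \nabla u \rVert^2 \, dx\, dt
              + \beta \int \lVert \nabla w \rVert^2 \, dx\, dt

Minimizers of an energy of this form are then computed from its Euler-Lagrange equations in a coarse-to-fine multiresolution scheme.
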
Deep Inception Generative Network for Cognitive Image Inpainting
Recent advances in deep learning have shown exciting promise in filling large
holes and have opened a new direction for image inpainting. However, existing
learning-based methods often create artifacts and fallacious textures because
of insufficient cognitive understanding. Previous generative networks are
limited to a single receptive field type and forgo pooling to preserve detail
sharpness, whereas human cognition is constant regardless of the target
attribute. Since multiple receptive fields improve abstract image
characterization and pooling keeps features invariant, we adopt deep inception
learning to promote high-level feature representation and enhance the model's
learning capacity for local patches. Moreover, approaches for generating
diverse mask images are introduced and a random mask dataset is created. We
benchmark our method on ImageNet, the Places2 dataset, and CelebA-HQ.
Experiments on regular, irregular, and custom region completion are all
performed, and free-style image inpainting is also presented. Quantitative
comparisons with previous state-of-the-art methods show that our approach
produces much more natural image completions.
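
To make the multiple-receptive-field-plus-pooling idea concrete, here is a minimal inception-style block with parallel 1x1, 3x3, and 5x5 branches and a pooling branch concatenated along channels; InceptionBlock and its channel sizes are illustrative, not the paper's exact module.

    import torch
    import torch.nn as nn

    class InceptionBlock(nn.Module):
        def __init__(self, in_ch, branch_ch=16):
            super().__init__()
            self.b1 = nn.Conv2d(in_ch, branch_ch, 1)                  # 1x1 receptive field
            self.b3 = nn.Conv2d(in_ch, branch_ch, 3, padding=1)       # 3x3 receptive field
            self.b5 = nn.Conv2d(in_ch, branch_ch, 5, padding=2)       # 5x5 receptive field
            self.pool = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                      nn.Conv2d(in_ch, branch_ch, 1)) # pooling branch
        def forward(self, x):
            return torch.cat([self.b1(x), self.b3(x), self.b5(x), self.pool(x)], dim=1)
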