Pareto Optimized Large Mask Approach for Efficient and Background Humanoid Shape Removal
The purpose of automated video object removal is not only to detect and remove the object of interest automatically, but also to utilize background context to inpaint the foreground area. Video inpainting requires filling spatiotemporal gaps in a video with convincing material, necessitating both temporal and spatial consistency; the inpainted part must seamlessly integrate into the background in a variety of scenes, and it must maintain a consistent appearance in subsequent frames even if its surroundings change noticeably. We introduce a deep learning-based methodology for removing unwanted human-like shapes in videos. The method uses Pareto-optimized Generative Adversarial Network (GAN) technology, which is a novel contribution. The system automatically selects the Region of Interest (ROI) for each humanoid shape and uses a skeleton detection module to determine which humanoid shape to retain. The semantic masks of human-like shapes are created using a semantic-aware, occlusion-robust model with four primary components: feature extraction and local, global, and semantic branches. The global branch encodes occlusion-aware information to make the extracted features resistant to occlusion, while the local branch retrieves fine-grained local characteristics. A modified large-mask inpainting approach is employed to eliminate a person from the image, leveraging Fast Fourier convolutions and utilizing polygonal chains and rectangles with unpredictable aspect ratios. The inpainter network takes the input image and the mask to create an output image excluding the background humanoid shapes. The generator uses an encoder-decoder structure with skip connections to recover spatial information, and dilated convolution and squeeze-and-excitation blocks to make the regions behind the humanoid shapes consistent with their surroundings. The discriminator penalizes dissimilar structure at the patch scale, and the refiner network captures features around the boundaries of each background humanoid shape. Efficiency was assessed using the Learned Perceptual Image Patch Similarity (LPIPS), Fréchet Inception Distance (FID), and Structural Similarity Index Measure (SSIM) metrics, showing promising results on the fully automated background person removal task. The method is evaluated on two video object segmentation datasets, DAVIS (LPIPS 0.02, FID 5.01, SSIM 0.79) and YouTube-VOS (0.03, 6.22, and 0.78, respectively), as well as a database of 66 distinct video sequences of people behind a desk in an office environment (0.02, 4.01, and 0.78, respectively).
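Where the abstract mentions training masks built from polygonal chains and rectangles with unpredictable aspect ratios, a minimal illustrative sketch of such LaMa-style mask generation may help; this is not the authors' code, and the shape counts, brush thicknesses, and size ranges are assumptions.

```python
# Illustrative large-mask generation: random polygonal chains plus rectangles
# with unpredictable aspect ratios, in the spirit of LaMa-style training masks.
import numpy as np
import cv2

def random_large_mask(h, w, rng=None):
    rng = rng or np.random.default_rng()
    mask = np.zeros((h, w), dtype=np.uint8)

    # Polygonal chains: a few random walks drawn with a thick brush.
    for _ in range(rng.integers(1, 4)):
        x, y = int(rng.integers(0, w)), int(rng.integers(0, h))
        thickness = int(rng.integers(10, max(11, min(h, w) // 8)))
        for _ in range(rng.integers(3, 8)):
            angle = rng.uniform(0, 2 * np.pi)
            length = int(rng.integers(20, max(21, min(h, w) // 3)))
            nx = int(np.clip(x + length * np.cos(angle), 0, w - 1))
            ny = int(np.clip(y + length * np.sin(angle), 0, h - 1))
            cv2.line(mask, (x, y), (nx, ny), 255, thickness)
            x, y = nx, ny

    # Rectangles with random aspect ratios.
    for _ in range(rng.integers(1, 3)):
        rw = int(rng.integers(w // 8, w // 2))
        rh = int(rng.integers(h // 8, h // 2))
        x0 = int(rng.integers(0, w - rw))
        y0 = int(rng.integers(0, h - rh))
        cv2.rectangle(mask, (x0, y0), (x0 + rw, y0 + rh), 255, -1)

    return mask  # 255 marks the region to be inpainted
```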
A Novel Inpainting Framework for Virtual View Synthesis
Multi-view imaging has stimulated significant research to enhance the user experience of free viewpoint video, allowing interactive navigation between views and the freedom to select a desired view to watch. This usually involves transmitting both textural and depth information captured from different viewpoints to the receiver, to enable the synthesis of an arbitrary view. In rendering these virtual views, perceptual holes can appear when regions hidden in the original view by a closer object become visible in the virtual view. To provide a high-quality experience, these holes must be filled in a visually plausible way, in a process known as inpainting. This is challenging because the missing information is generally unknown and the hole regions can be large. Recently, depth-based inpainting techniques have been proposed to address this challenge; while these generally perform better than non-depth-assisted methods, they are not very robust and can produce perceptual artefacts.
This thesis presents a new inpainting framework that innovatively exploits depth and textural self-similarity characteristics to construct subjectively enhanced virtual viewpoints. The framework makes three significant contributions to the field: i) the exploitation of view information to jointly inpaint textural and depth hole regions; ii) the introduction of the novel concept of self-similarity characterisation which is combined with relevant depth information; and iii) an advanced self-similarity characterising scheme that automatically determines key spatial transform parameters for effective and flexible inpainting.
The presented inpainting framework has been critically analysed and shown to provide superior performance both perceptually and numerically compared to existing techniques, especially in terms of fewer visual artefacts. It provides a flexible, robust framework for developing new inpainting strategies for the next generation of interactive multi-view technologies.
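To make the disocclusion problem described above concrete, the following is a minimal sketch (not from the thesis) of how holes arise in depth-image-based rendering: a rectified view is forward-warped by per-pixel disparity, and target pixels that receive no source pixel become the hole regions an inpainter must fill. The horizontal-shift camera model and the disparity scaling are simplifying assumptions.

```python
import numpy as np

def forward_warp(texture, depth, baseline_scale=20.0):
    """texture: (H, W, 3) uint8; depth: (H, W) float in (0, 1], larger = closer."""
    h, w = depth.shape
    warped = np.zeros_like(texture)
    filled = np.zeros((h, w), dtype=bool)
    z_buffer = np.zeros((h, w), dtype=np.float32)

    disparity = (baseline_scale * depth).astype(np.int32)  # horizontal shift in pixels
    for y in range(h):
        for x in range(w):
            nx = x + disparity[y, x]
            # Keep the closest surface when several source pixels land on nx.
            if 0 <= nx < w and depth[y, x] > z_buffer[y, nx]:
                warped[y, nx] = texture[y, x]
                z_buffer[y, nx] = depth[y, x]
                filled[y, nx] = True

    holes = ~filled  # disoccluded regions to be inpainted
    return warped, holes
```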
Paste, Inpaint and Harmonize via Denoising: Subject-Driven Image Editing with Pre-Trained Diffusion Model
Text-to-image generative models have attracted rising attention for flexible image editing via user-specified descriptions. However, text descriptions alone are not enough to elaborate the details of subjects, often compromising the subjects' identity or requiring additional per-subject fine-tuning. We introduce a new framework called Paste, Inpaint and Harmonize via Denoising (PhD), which leverages an exemplar image in addition to text descriptions to specify user intentions. In the pasting step, an off-the-shelf segmentation model is employed to identify a user-specified subject within an exemplar image, which is subsequently inserted into a background image to serve as an initialization capturing both scene context and subject identity in one. To guarantee the visual coherence of the generated or edited image, we introduce an inpainting and harmonizing module to guide the pre-trained diffusion model to seamlessly blend the inserted subject into the scene naturally. As we keep the pre-trained diffusion model frozen, we preserve its strong image synthesis ability and text-driven ability, thus achieving high-quality results and flexible editing with diverse texts. In our experiments, we apply PhD to both subject-driven image editing tasks and explore text-driven scene generation given a reference subject. Both quantitative and qualitative comparisons with baseline methods demonstrate that our approach achieves state-of-the-art performance in both tasks. More qualitative results can be found at https://sites.google.com/view/phd-demo-page.
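As a rough sketch of the paste-then-blend idea, the snippet below composites a segmented subject onto a background and then asks an off-the-shelf Stable Diffusion inpainting pipeline to re-synthesize a band around the pasted region. This stand-in is not the paper's inpainting-and-harmonizing module; the checkpoint name, file names, paste position, and prompt are assumptions for illustration.

```python
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

# 1) Paste: composite the segmented subject (RGBA cut-out) onto the background.
background = Image.open("background.png").convert("RGB").resize((512, 512))
subject = Image.open("subject_rgba.png").convert("RGBA").resize((200, 200))
background.paste(subject, (150, 250), mask=subject)  # alpha-composited paste

# 2) Inpaint + harmonize: let a frozen diffusion inpainter re-synthesize a
# band around the pasted subject so it blends with the scene.
blend_mask = Image.open("blend_band_mask.png").convert("L").resize((512, 512))
pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")
result = pipe(
    prompt="a dog sitting on a park bench, natural lighting",
    image=background,
    mask_image=blend_mask,
).images[0]
result.save("harmonized.png")
```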
Structure-Guided Image Completion with Image-level and Object-level Semantic Discriminators
Structure-guided image completion aims to inpaint a local region of an image according to an input guidance map from users. While such a task enables many practical applications for interactive editing, existing methods often struggle to hallucinate realistic object instances in complex natural scenes. Such a limitation is partially due to the lack of semantic-level constraints inside the hole region as well as the lack of a mechanism to enforce realistic object generation. In this work, we propose a learning paradigm that consists of semantic discriminators and object-level discriminators for improving the generation of complex semantics and objects. Specifically, the semantic discriminators leverage pretrained visual features to improve the realism of the generated visual concepts. Moreover, the object-level discriminators take aligned instances as inputs to enforce the realism of individual objects. Our proposed scheme significantly improves the generation quality and achieves state-of-the-art results on various tasks, including segmentation-guided completion, edge-guided manipulation, and panoptically-guided manipulation on the Places2 dataset. Furthermore, our trained model is flexible and can support multiple editing use cases, such as object insertion, replacement, removal, and standard inpainting. In particular, our trained model combined with a novel automatic image completion pipeline achieves state-of-the-art results on the standard inpainting task.
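A minimal sketch of the "semantic discriminator" idea, under the assumption (not taken from the paper) that real/fake scoring is done on top of frozen features from a pretrained backbone so the adversarial signal constrains semantics rather than only low-level texture:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights

class SemanticDiscriminator(nn.Module):
    def __init__(self):
        super().__init__()
        backbone = resnet50(weights=ResNet50_Weights.DEFAULT)
        # Frozen pretrained feature extractor (conv trunk only, no avgpool/fc).
        self.features = nn.Sequential(*list(backbone.children())[:-2]).eval()
        for p in self.features.parameters():
            p.requires_grad = False
        # Lightweight trainable head producing patch-wise real/fake logits.
        self.head = nn.Sequential(
            nn.Conv2d(2048, 256, kernel_size=1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(256, 1, kernel_size=1),
        )

    def forward(self, img):  # img: (N, 3, H, W), ImageNet-normalized
        with torch.no_grad():
            feats = self.features(img)
        return self.head(feats)  # (N, 1, H/32, W/32) logits

# Usage: its GAN loss would typically be combined with a pixel-level discriminator.
logits = SemanticDiscriminator()(torch.randn(2, 3, 256, 256))
```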
3DFill: Reference-guided Image Inpainting by Self-supervised 3D Image Alignment
Most existing image inpainting algorithms are based on a single view and struggle with large holes or holes containing complicated scenes. Some reference-guided algorithms fill the hole by referring to an image from another viewpoint and use 2D image alignment. Because of the camera imaging process, a simple 2D transformation is unlikely to achieve a satisfactory result. In this paper, we propose 3DFill, a simple and efficient method for reference-guided image inpainting. Given a target image with arbitrary hole regions and a reference image from another viewpoint, 3DFill first aligns the two images with a two-stage method, 3D projection followed by 2D transformation, which yields better results than 2D image alignment alone. The 3D projection is an overall alignment between the images, and the 2D transformation is a local alignment focused on the hole region. The entire image alignment process is self-supervised. We then fill the hole in the target image with the contents of the aligned image. Finally, we use a conditional generation network to refine the filled image and obtain the inpainting result. 3DFill achieves state-of-the-art performance on image inpainting across a variety of wide view shifts and has a faster inference speed than other inpainting models.
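An illustrative sketch of the kind of "3D projection" alignment step described above, not the 3DFill implementation: unproject reference pixels with a depth map and camera intrinsics, apply the relative pose, and project into the target view; a local 2D transform (e.g. a homography around the hole) could then refine the result. The camera parameters and the nearest-pixel splatting are simplifying assumptions.

```python
import numpy as np

def reproject(ref_img, ref_depth, K, R, t):
    """ref_img: (H, W, 3); ref_depth: (H, W) metric depth; K: (3, 3) intrinsics;
    R, t: rotation/translation from the reference to the target camera."""
    h, w = ref_depth.shape
    ys, xs = np.mgrid[0:h, 0:w]
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1).reshape(-1, 3).T  # 3 x N

    # Unproject to 3D in the reference frame, then move to the target frame.
    pts_ref = np.linalg.inv(K) @ pix * ref_depth.reshape(1, -1)
    pts_tgt = R @ pts_ref + t.reshape(3, 1)

    # Project into the target image plane.
    proj = K @ pts_tgt
    u = np.round(proj[0] / proj[2]).astype(int)
    v = np.round(proj[1] / proj[2]).astype(int)

    aligned = np.zeros_like(ref_img)
    valid = (proj[2] > 0) & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    aligned[v[valid], u[valid]] = ref_img.reshape(-1, 3)[valid]
    return aligned  # reference content warped into the target viewpoint
```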
Deliverable D2.2 of the PERSEE project: Texture Analysis/Synthesis
Deliverable D2.2 of the ANR PERSEE project. This report was produced within the framework of the ANR PERSEE project (no. ANR-09-BLAN-0170); specifically, it corresponds to deliverable D2.2 of the project. Its title: Texture Analysis/Synthesis.