Paste, Inpaint and Harmonize via Denoising: Subject-Driven Image Editing with Pre-Trained Diffusion Model
Text-to-image generative models have attracted rising attention for flexible image editing via user-specified descriptions. However, text descriptions alone are not enough to elaborate the details of subjects, often compromising the subjects' identity or requiring additional per-subject fine-tuning. We introduce a new framework called Paste, Inpaint and Harmonize via Denoising (PhD), which leverages an exemplar image in addition to text descriptions to specify user intentions. In the pasting step, an off-the-shelf segmentation model is employed to identify a user-specified subject within an exemplar image, which is subsequently inserted into a background image to serve as an initialization that captures both scene context and subject identity. To guarantee the visual coherence of the generated or edited image, we introduce an inpainting and harmonizing module that guides the pre-trained diffusion model to seamlessly blend the inserted subject into the scene. As we keep the pre-trained diffusion model frozen, we preserve its strong image synthesis ability and text-driven controllability, thus achieving high-quality results and flexible editing with diverse texts. In our experiments, we apply PhD to subject-driven image editing tasks and also explore text-driven scene generation given a reference subject. Both quantitative and qualitative comparisons with baseline methods demonstrate that our approach achieves state-of-the-art performance in both tasks. More qualitative results can be found at https://sites.google.com/view/phd-demo-page.
Comment: 10 pages, 12 figures
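The paste-then-inpaint idea lends itself to a compact sketch. The following is a minimal illustration of that reading, not the authors' released implementation: it assumes the subject has already been cut out by a segmentation model (as an RGBA image), and it stands in the off-the-shelf diffusers StableDiffusionInpaintPipeline for the paper's frozen diffusion model with its inpainting and harmonizing module.

```python
import numpy as np
from PIL import Image, ImageFilter
import torch
from diffusers import StableDiffusionInpaintPipeline

def paste_then_blend(background: Image.Image, subject_rgba: Image.Image,
                     box: tuple, prompt: str) -> Image.Image:
    """background: RGB image; subject_rgba: subject cut out by any
    segmentation model (alpha channel = mask); box: (l, t, r, b) region."""
    # 1) Paste: composite the segmented subject onto the background.
    canvas = background.copy()
    subject = subject_rgba.resize((box[2] - box[0], box[3] - box[1]))
    canvas.paste(subject, box[:2], mask=subject.split()[-1])

    # 2) Mask only a thin band around the paste boundary, so the diffusion
    #    model blends the seam without repainting the subject itself.
    full = np.zeros(background.size[::-1], dtype=np.uint8)
    alpha = (np.array(subject.split()[-1]) > 0).astype(np.uint8) * 255
    full[box[1]:box[3], box[0]:box[2]] = alpha
    band = Image.fromarray(full)
    seam = (np.array(band.filter(ImageFilter.MaxFilter(15))).astype(int)
            - np.array(band.filter(ImageFilter.MinFilter(15))))
    seam_mask = Image.fromarray((seam > 0).astype(np.uint8) * 255)

    # 3) Inpaint/harmonize the seam with a frozen, off-the-shelf model.
    pipe = StableDiffusionInpaintPipeline.from_pretrained(
        "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
    ).to("cuda")
    return pipe(prompt=prompt, image=canvas.resize((512, 512)),
                mask_image=seam_mask.resize((512, 512))).images[0]
```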
ImageBrush: Learning Visual In-Context Instructions for Exemplar-Based Image Manipulation
While language-guided image manipulation has made remarkable progress, the challenge of instructing the manipulation process so that it faithfully reflects human intentions persists. An accurate and comprehensive description of a manipulation task using natural language is laborious and sometimes even impossible, primarily due to the inherent uncertainty and ambiguity present in linguistic expressions. Is it feasible to accomplish image manipulation without resorting to external cross-modal language information? If so, the inherent modality gap would be effortlessly eliminated. In this paper, we propose a novel manipulation methodology, dubbed ImageBrush, that learns visual instructions for more accurate image editing. Our key idea is to employ a pair of transformation images as visual instructions, which not only precisely capture human intention but also facilitate accessibility in real-world scenarios. Capturing visual instructions is particularly challenging because it involves extracting the underlying intentions solely from visual demonstrations and then applying this operation to a new image. To address this challenge, we formulate visual instruction learning as a diffusion-based inpainting problem, where the contextual information is fully exploited through an iterative process of generation. A visual prompting encoder is carefully devised to enhance the model's capacity to uncover human intent behind the visual instructions. Extensive experiments show that our method generates engaging manipulation results conforming to the transformations entailed in the demonstrations. Moreover, our model exhibits robust generalization capabilities on various downstream tasks such as pose transfer, image translation and video inpainting.
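One plausible reading of the diffusion-based inpainting formulation is a 2x2 grid: the example pair (A, A') occupies the top row, the query B the bottom-left cell, and the edited result is recovered by inpainting the empty bottom-right cell. A minimal sketch under that assumption follows; the `inpaint` call is a hypothetical stand-in for any diffusion inpainting model, not ImageBrush's released code.

```python
import numpy as np
from PIL import Image

def make_visual_prompt(a: Image.Image, a_edit: Image.Image,
                       b: Image.Image, size: int = 256):
    """Tile the example pair (a, a_edit) and the query b into a 2x2 canvas
    [A, A'; B, ?] and return it with a mask marking the unknown cell."""
    tiles = [im.convert("RGB").resize((size, size)) for im in (a, a_edit, b)]
    canvas = Image.new("RGB", (2 * size, 2 * size))
    canvas.paste(tiles[0], (0, 0))       # A  (top-left)
    canvas.paste(tiles[1], (size, 0))    # A' (top-right)
    canvas.paste(tiles[2], (0, size))    # B  (bottom-left)
    mask = np.zeros((2 * size, 2 * size), dtype=np.uint8)
    mask[size:, size:] = 255             # unknown cell: B' (bottom-right)
    return canvas, Image.fromarray(mask)

# Hypothetical usage with any diffusion inpainting model:
#   canvas, mask = make_visual_prompt(a, a_prime, b)
#   b_prime = inpaint(canvas, mask).crop((256, 256, 512, 512))
```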
Real-World Image Variation by Aligning Diffusion Inversion Chain
Recent diffusion model advancements have enabled high-fidelity images to be generated using text prompts. However, a domain gap exists between generated images and real-world images, which poses a challenge in generating high-quality variations of real-world images. Our investigation uncovers that this domain gap originates from a gap between the latent distributions of different diffusion processes. To address this issue, we propose a novel inference pipeline called Real-world Image Variation by ALignment (RIVAL) that utilizes diffusion models to generate image variations from a single image exemplar. Our pipeline enhances the generation quality of image variations by aligning the image generation process to the source image's inversion chain. Specifically, we demonstrate that step-wise latent distribution alignment is essential for generating high-quality variations. To attain this, we design a cross-image self-attention injection for feature interaction and a step-wise distribution normalization to align the latent features. Incorporating these alignment processes into a diffusion model allows RIVAL to generate high-quality image variations without further parameter optimization. Our experimental results demonstrate that our proposed approach outperforms existing methods with respect to semantic-condition similarity and perceptual quality. Furthermore, this generalized inference pipeline can be easily applied to other diffusion-based generation tasks, such as image-conditioned text-to-image generation and example-based image inpainting.
Comment: 19 pages; Project page: https://rival-diff.github.io/ Code (to be released later): https://github.com/julianjuaner/RIVAL
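One way to read "step-wise distribution normalization" is as moment matching between the generation latent and the inversion-chain latent at the same denoising step, in the spirit of AdaIN. A hedged sketch of that interpretation (not RIVAL's released code):

```python
import torch

def stepwise_distribution_norm(z_gen: torch.Tensor,
                               z_inv: torch.Tensor,
                               eps: float = 1e-5) -> torch.Tensor:
    """Match the per-channel mean/std of the generation latent z_gen to the
    inversion-chain latent z_inv at the same denoising step (AdaIN-style).
    Shapes: (B, C, H, W). This is an interpretation of the abstract, not
    the paper's exact formulation."""
    dims = (2, 3)
    mu_g = z_gen.mean(dim=dims, keepdim=True)
    sd_g = z_gen.std(dim=dims, keepdim=True)
    mu_i = z_inv.mean(dim=dims, keepdim=True)
    sd_i = z_inv.std(dim=dims, keepdim=True)
    return (z_gen - mu_g) / (sd_g + eps) * sd_i + mu_i

# At each sampling step t: z_t = stepwise_distribution_norm(z_t, z_inv_t)
```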
Rerender A Video: Zero-Shot Text-Guided Video-to-Video Translation
Large text-to-image diffusion models have exhibited impressive proficiency in generating high-quality images. However, when applying these models to the video domain, ensuring temporal consistency across video frames remains a formidable challenge. This paper proposes a novel zero-shot text-guided video-to-video translation framework to adapt image models to videos. The framework includes two parts: key frame translation and full video translation. The first part uses an adapted diffusion model to generate key frames, with hierarchical cross-frame constraints applied to enforce coherence in shapes, textures and colors. The second part propagates the key frames to the other frames with temporal-aware patch matching and frame blending. Our framework achieves global style and local texture temporal consistency at a low cost (without re-training or optimization). The adaptation is compatible with existing image diffusion techniques, allowing our framework to take advantage of them, such as customizing a specific subject with LoRA and introducing extra spatial guidance with ControlNet. Extensive experimental results demonstrate the effectiveness of our proposed framework over existing methods in rendering high-quality and temporally-coherent videos.
Comment: Accepted to SIGGRAPH Asia 2023. Project page: https://www.mmlab-ntu.com/project/rerender
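As a toy illustration of the propagation stage's weighting (the paper's temporal-aware patch matching and blending is considerably more involved), each in-between frame can be blended from its two surrounding translated key frames according to temporal distance:

```python
import numpy as np

def propagate_between_keyframes(key_a: np.ndarray, key_b: np.ndarray,
                                n_between: int) -> list:
    """Blend two translated key frames across the frames between them,
    weighted by temporal distance. Illustrative stand-in only: the paper
    propagates with temporal-aware patch matching, not a cross-dissolve."""
    frames = []
    for i in range(1, n_between + 1):
        w = i / (n_between + 1)   # 0 near key_a, 1 near key_b
        blend = ((1 - w) * key_a.astype(np.float32)
                 + w * key_b.astype(np.float32))
        frames.append(blend.astype(key_a.dtype))
    return frames
```

A real implementation would warp patches along motion before blending; the cross-dissolve above only illustrates the distance-based weights.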
A Novel Inpainting Framework for Virtual View Synthesis
Multi-view imaging has stimulated significant research to enhance the user experience of free viewpoint video, allowing interactive navigation between views and the freedom to select a desired view to watch. This usually involves transmitting both textural and depth information captured from different viewpoints to the receiver, to enable the synthesis of an arbitrary view. In rendering these virtual views, perceptual holes can appear when regions hidden in the original view by a closer object become visible in the virtual view. To provide a high-quality experience, these holes must be filled in a visually plausible way, in a process known as inpainting. This is challenging because the missing information is generally unknown and the hole regions can be large. Recently, depth-based inpainting techniques have been proposed to address this challenge; while these generally perform better than non-depth-assisted methods, they are not very robust and can produce perceptual artefacts.
This thesis presents a new inpainting framework that innovatively exploits depth and textural self-similarity characteristics to construct subjectively enhanced virtual viewpoints. The framework makes three significant contributions to the field: i) the exploitation of view information to jointly inpaint textural and depth hole regions; ii) the introduction of the novel concept of self-similarity characterisation which is combined with relevant depth information; and iii) an advanced self-similarity characterising scheme that automatically determines key spatial transform parameters for effective and flexible inpainting.
The presented inpainting framework has been critically analysed and shown to provide superior performance both perceptually and numerically compared to existing techniques, especially in terms of lower visual artefacts. It provides a flexible, robust framework for developing new inpainting strategies for the next generation of interactive multi-view technologies.
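To make the depth-assisted intuition concrete, here is a generic sketch (an illustration of the general principle, not this thesis's self-similarity framework): disocclusion holes are revealed by a closer object, so fill colour should be drawn from background-depth neighbours only.

```python
import numpy as np

def depth_guided_fill(image, depth, hole, iters=500):
    """Iteratively fill hole pixels from their 4-neighbours, but only from
    neighbours at background depth (larger depth value assumed farther).
    Generic illustration of depth-assisted inpainting, not the thesis's
    framework. image: (H, W, 3) float; depth: (H, W); hole: (H, W) bool."""
    out, mask, depth = image.copy(), hole.copy(), depth.copy()
    bg = np.percentile(depth[~hole], 75)          # crude background level
    for _ in range(iters):
        if not mask.any():
            break
        filled_any = False
        for y, x in np.argwhere(mask):
            acc, n = np.zeros(3), 0
            for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                yy, xx = y + dy, x + dx
                if (0 <= yy < mask.shape[0] and 0 <= xx < mask.shape[1]
                        and not mask[yy, xx] and depth[yy, xx] >= bg):
                    acc += out[yy, xx]
                    n += 1
            if n:
                out[y, x] = acc / n               # average background colour
                depth[y, x] = bg                  # filled pixel joins background
                mask[y, x] = False
                filled_any = True
        if not filled_any:                        # no background neighbour left
            break
    return out
```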
MAT: Mask-Aware Transformer for Large Hole Image Inpainting
Recent studies have shown the importance of modeling long-range interactions in the inpainting problem. To achieve this goal, existing approaches exploit either standalone attention techniques or transformers, but usually at low resolution owing to the computational cost. In this paper, we present a novel transformer-based model for large hole inpainting, which unifies the merits of transformers and convolutions to efficiently process high-resolution images. We carefully design each component of our framework to guarantee the high fidelity and diversity of recovered images. Specifically, we customize an inpainting-oriented transformer block, where the attention module aggregates non-local information only from partial valid tokens, indicated by a dynamic mask. Extensive experiments demonstrate the state-of-the-art performance of the new model on multiple benchmark datasets. Code is released at https://github.com/fenglinglwb/MAT.
Comment: Accepted to CVPR 2022 (Oral)
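The mask-aware aggregation described above has a standard realisation: restrict attention to valid tokens by masking invalid keys before the softmax. A generic sketch of that idea (not MAT's exact block, which also updates the mask dynamically):

```python
import torch
import torch.nn.functional as F

def mask_aware_attention(q, k, v, valid):
    """Attention that aggregates information only from valid tokens,
    as indicated by a (possibly dynamic) mask.
    q, k, v: (B, N, D); valid: (B, N) bool, True = valid token."""
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5            # (B, N, N)
    scores = scores.masked_fill(~valid[:, None, :], float("-inf"))
    attn = F.softmax(scores, dim=-1)
    attn = torch.nan_to_num(attn)   # rows with no valid key -> all zeros
    return attn @ v
```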
Methods for 3D Geometry Processing in the Cultural Heritage Domain
This thesis presents methods for 3D geometry processing in the context of cultural heritage applications. After a short overview of the relevant basics of 3D geometry processing, the thesis investigates the digital acquisition of 3D models. A particular challenge in this context is posed, on the one hand, by difficult surface or material properties of the model to be captured. On the other hand, the fully automatic reconstruction of models, even those with surface properties suitable for capture with laser range scanners, is not yet completely solved. This thesis presents two approaches to tackle these challenges. One exploits a thorough capture of the object's appearance and a coarse reconstruction for a concise and realistic object representation, even for objects with problematic surface properties such as reflectivity and transparency. The other concentrates on digitisation via laser range scanners and exploits the 2D colour images that are typically recorded alongside the range images for a fully automatic registration technique.
After reconstruction, the captured models are often still incomplete, exhibiting holes and/or regions of insufficient sampling. In addition, holes are often deliberately introduced into a registered model to remove an undesired or defective surface part. In order to produce a visually appealing model, for instance for visualisation purposes or for prototype or replica production, these holes have to be detected and filled. Although completion is a well-established research field in 2D image processing and many approaches exist for image completion, surface completion in 3D is a fairly new field of research. This thesis presents a hierarchical completion approach that employs and extends successful exemplar-based 2D image processing approaches to 3D, filling detail-equipped surface patches into missing surface regions. In order to identify and construct suitable surface patches, self-similarity and coherence properties of the surface context of the hole are exploited.
In addition to reconstruction and repair, the thesis also investigates methods for modifying captured models via interactive modelling. In this context, modelling is regarded as a creative process, for instance for animation purposes. It is also demonstrated how this creative process can be used to introduce human expertise into the otherwise automatic completion process. This way, reconstructions are feasible even for objects where the data source itself, the object, is incomplete due to corrosion, demolition, or decay.
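As a purely illustrative 2D analogy to the self-similarity search described above (the thesis itself operates on 3D surface patches and determines spatial transform parameters automatically), one can scan a known image region for the patch that best matches a hole's surrounding context over a few candidate rotations:

```python
import numpy as np

def best_selfsimilar_patch(img, ctx_center, patch=15):
    """Find the patch most similar to the context patch at ctx_center,
    searching over four rotations (stand-ins for the spatial transform
    parameters the thesis determines automatically). 2D analogy only.
    img: (H, W) float array; ctx_center: (row, col)."""
    h = patch // 2
    cy, cx = ctx_center
    target = img[cy - h:cy + h + 1, cx - h:cx + h + 1]
    best, best_err = None, np.inf
    H, W = img.shape
    for y in range(h, H - h, 4):               # coarse stride for speed
        for x in range(h, W - h, 4):
            if abs(y - cy) < patch and abs(x - cx) < patch:
                continue                       # skip the query itself
            cand = img[y - h:y + h + 1, x - h:x + h + 1]
            for k in range(4):                 # 0, 90, 180, 270 degrees
                err = np.mean((np.rot90(cand, k) - target) ** 2)
                if err < best_err:
                    best, best_err = (y, x, 90 * k), err
    return best, best_err
```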