
    Paste, Inpaint and Harmonize via Denoising: Subject-Driven Image Editing with Pre-Trained Diffusion Model

    Text-to-image generative models have attracted growing attention for flexible image editing via user-specified descriptions. However, text descriptions alone are not enough to elaborate the details of subjects, often compromising the subjects' identity or requiring additional per-subject fine-tuning. We introduce a new framework called \textit{Paste, Inpaint and Harmonize via Denoising} (PhD), which leverages an exemplar image in addition to text descriptions to specify user intentions. In the pasting step, an off-the-shelf segmentation model is employed to identify a user-specified subject within an exemplar image, which is subsequently inserted into a background image to serve as an initialization that captures both scene context and subject identity. To guarantee the visual coherence of the generated or edited image, we introduce an inpainting and harmonizing module that guides the pre-trained diffusion model to blend the inserted subject into the scene seamlessly. Because we keep the pre-trained diffusion model frozen, we preserve its strong image synthesis and text-driven editing abilities, thus achieving high-quality results and flexible editing with diverse texts. In our experiments, we apply PhD to subject-driven image editing tasks and explore text-driven scene generation given a reference subject. Both quantitative and qualitative comparisons with baseline methods demonstrate that our approach achieves state-of-the-art performance in both tasks. More qualitative results can be found at \url{https://sites.google.com/view/phd-demo-page}. Comment: 10 pages, 12 figures.
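    The pasting step described above can be pictured with a short sketch: composite a segmented subject onto a background and produce the mask of the region that a frozen inpainting diffusion model would then harmonize. This is an illustrative simplification, not the paper's code; the placement, mask layout, and image sizes are assumptions.

```python
# Illustrative sketch of the "paste" initialization (not the authors' code):
# composite a segmented subject onto a background and build the region mask
# that a frozen text-guided inpainting diffusion model would later harmonize.
# Assumes `subject` and `subject_mask` have the same HxW size and fit inside
# the background at `top_left`.
import numpy as np
from PIL import Image

def paste_subject(background: Image.Image,
                  subject: Image.Image,
                  subject_mask: np.ndarray,      # HxW boolean mask of the subject
                  top_left: tuple = (0, 0)):
    bg = np.array(background.convert("RGB")).copy()
    fg = np.array(subject.convert("RGB"))
    m = subject_mask.astype(bool)
    y, x = top_left
    h, w = fg.shape[:2]

    # Hard-paste the subject pixels into the chosen location.
    region = bg[y:y + h, x:x + w].copy()
    region[m] = fg[m]
    bg[y:y + h, x:x + w] = region

    # Mark the pasted rectangle (minus the subject itself) as the area the
    # diffusion model should inpaint and harmonize.
    harmonize = np.zeros(bg.shape[:2], dtype=np.uint8)
    harmonize[y:y + h, x:x + w] = np.where(m, 0, 255).astype(np.uint8)

    return Image.fromarray(bg), Image.fromarray(harmonize)
```

    The resulting composite and mask could then be fed to any off-the-shelf text-guided inpainting pipeline kept frozen; the paper's actual inpainting-and-harmonizing module goes beyond this plain setup.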

    ImageBrush: Learning Visual In-Context Instructions for Exemplar-Based Image Manipulation

    While language-guided image manipulation has made remarkable progress, the challenge of instructing the manipulation process so that it faithfully reflects human intentions persists. An accurate and comprehensive description of a manipulation task in natural language is laborious and sometimes even impossible, primarily due to the inherent uncertainty and ambiguity of linguistic expressions. Is it feasible to accomplish image manipulation without resorting to external cross-modal language information? If so, the inherent modality gap would be eliminated. In this paper, we propose a novel manipulation methodology, dubbed ImageBrush, that learns visual instructions for more accurate image editing. Our key idea is to employ a pair of transformation images as visual instructions, which not only precisely captures human intention but is also more accessible in real-world scenarios. Capturing visual instructions is particularly challenging because it involves extracting the underlying intention solely from visual demonstrations and then applying this operation to a new image. To address this challenge, we formulate visual instruction learning as a diffusion-based inpainting problem, where contextual information is fully exploited through an iterative generation process. A visual prompting encoder is carefully devised to enhance the model's capacity to uncover human intent behind the visual instructions. Extensive experiments show that our method generates engaging manipulation results conforming to the transformations entailed in the demonstrations. Moreover, our model exhibits robust generalization capabilities on various downstream tasks such as pose transfer, image translation, and video inpainting.
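    One natural reading of the "visual instructions plus diffusion-based inpainting" formulation is sketched below: the exemplar pair and the query are arranged on a single canvas and the quadrant holding the edited query is masked out, so the inpainting model must infer the transformation purely from visual context. The 2x2 layout and tile size here are assumptions for illustration, not the paper's exact prompt construction.

```python
# Illustrative sketch: build a visual prompt from an exemplar pair (before/after)
# and a query image, with the region to be generated masked for inpainting.
# Layout and tile size are assumptions, not the paper's exact design.
import numpy as np
from PIL import Image

def build_visual_prompt(ex_before: Image.Image,
                        ex_after: Image.Image,
                        query: Image.Image,
                        size: int = 256):
    tiles = [im.convert("RGB").resize((size, size))
             for im in (ex_before, ex_after, query)]
    canvas = np.zeros((2 * size, 2 * size, 3), dtype=np.uint8)
    canvas[:size, :size] = np.array(tiles[0])   # top-left: exemplar before
    canvas[:size, size:] = np.array(tiles[1])   # top-right: exemplar after
    canvas[size:, :size] = np.array(tiles[2])   # bottom-left: query image

    mask = np.zeros((2 * size, 2 * size), dtype=np.uint8)
    mask[size:, size:] = 255                    # bottom-right: to be generated
    return Image.fromarray(canvas), Image.fromarray(mask)
```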

    Real-World Image Variation by Aligning Diffusion Inversion Chain

    Recent diffusion model advancements have enabled high-fidelity images to be generated using text prompts. However, a domain gap exists between generated images and real-world images, which poses a challenge in generating high-quality variations of real-world images. Our investigation reveals that this domain gap originates from a gap between the latent distributions of the different diffusion processes. To address this issue, we propose a novel inference pipeline called Real-world Image Variation by ALignment (RIVAL) that utilizes diffusion models to generate image variations from a single image exemplar. Our pipeline enhances the generation quality of image variations by aligning the image generation process to the source image's inversion chain. Specifically, we demonstrate that step-wise latent distribution alignment is essential for generating high-quality variations. To attain this, we design a cross-image self-attention injection for feature interaction and a step-wise distribution normalization to align the latent features. Incorporating these alignment processes into a diffusion model allows RIVAL to generate high-quality image variations without further parameter optimization. Our experimental results demonstrate that our proposed approach outperforms existing methods with respect to semantic-condition similarity and perceptual quality. Furthermore, this generalized inference pipeline can easily be applied to other diffusion-based generation tasks, such as image-conditioned text-to-image generation and example-based image inpainting. Comment: 19 pages. Project page: https://rival-diff.github.io/ Code (release later): https://github.com/julianjuaner/RIVAL
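    The step-wise distribution normalization can be sketched as an AdaIN-style alignment of the generation-chain latent to the inversion-chain latent at the same denoising step. The function below is a simplification under that reading (per-channel statistics over spatial dimensions); the cross-image self-attention injection is not shown.

```python
# Sketch of step-wise latent distribution alignment: match the per-channel
# mean/std of the generated latent to the statistics of the inversion-chain
# latent at the same timestep. A simplified illustration, not RIVAL's full code.
import torch

def align_latent(z_gen: torch.Tensor,   # (B, C, H, W) latent of the generation chain
                 z_inv: torch.Tensor,   # (B, C, H, W) latent of the inversion chain
                 eps: float = 1e-6) -> torch.Tensor:
    dims = (2, 3)                                            # spatial dimensions
    mu_g, std_g = z_gen.mean(dims, keepdim=True), z_gen.std(dims, keepdim=True)
    mu_i, std_i = z_inv.mean(dims, keepdim=True), z_inv.std(dims, keepdim=True)
    return (z_gen - mu_g) / (std_g + eps) * std_i + mu_i
```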

    Rerender A Video: Zero-Shot Text-Guided Video-to-Video Translation

    Large text-to-image diffusion models have exhibited impressive proficiency in generating high-quality images. However, when applying these models to the video domain, ensuring temporal consistency across video frames remains a formidable challenge. This paper proposes a novel zero-shot text-guided video-to-video translation framework to adapt image models to videos. The framework comprises two parts: key frame translation and full video translation. The first part uses an adapted diffusion model to generate key frames, with hierarchical cross-frame constraints applied to enforce coherence in shapes, textures, and colors. The second part propagates the key frames to the remaining frames with temporal-aware patch matching and frame blending. Our framework achieves global style and local texture temporal consistency at a low cost (without re-training or optimization). The adaptation is compatible with existing image diffusion techniques, allowing our framework to take advantage of them, such as customizing a specific subject with LoRA and introducing extra spatial guidance with ControlNet. Extensive experimental results demonstrate the effectiveness of our proposed framework over existing methods in rendering high-quality and temporally coherent videos. Comment: Accepted to SIGGRAPH Asia 2023. Project page: https://www.mmlab-ntu.com/project/rerender
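    The propagation part can be pictured with a minimal sketch: an in-between frame is obtained by flow-warping the two surrounding translated key frames and blending them with linear weights. Flow estimation is assumed to come from an external method, and the paper's temporal-aware patch matching is considerably more involved than this plain blend.

```python
# Sketch of propagating two translated key frames to an in-between frame by
# backward flow-warping and linear blending. The flow tensors are assumed to map
# in-between-frame pixels to key-frame pixels (in pixel units).
import torch
import torch.nn.functional as F

def warp(img: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Backward-warp img (B,C,H,W, float) with optical flow (B,2,H,W)."""
    b, _, h, w = img.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(img)     # (2,H,W), (x, y) order
    coords = grid.unsqueeze(0) + flow                       # sampling coordinates
    # normalize coordinates to [-1, 1] as expected by grid_sample
    coords_x = 2 * coords[:, 0] / (w - 1) - 1
    coords_y = 2 * coords[:, 1] / (h - 1) - 1
    grid_n = torch.stack((coords_x, coords_y), dim=-1)      # (B,H,W,2)
    return F.grid_sample(img, grid_n, align_corners=True)

def propagate(key_a, key_b, flow_from_a, flow_from_b, t: float) -> torch.Tensor:
    """Blend the two warped key frames with interpolation weight t in [0, 1]."""
    return (1 - t) * warp(key_a, flow_from_a) + t * warp(key_b, flow_from_b)
```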

    MAT: Mask-Aware Transformer for Large Hole Image Inpainting

    Recent studies have shown the importance of modeling long-range interactions in the inpainting problem. To achieve this goal, existing approaches exploit either standalone attention techniques or transformers, but usually at low resolution because of the computational cost. In this paper, we present a novel transformer-based model for large hole inpainting that unifies the merits of transformers and convolutions to efficiently process high-resolution images. We carefully design each component of our framework to guarantee the high fidelity and diversity of recovered images. Specifically, we customize an inpainting-oriented transformer block, where the attention module aggregates non-local information only from the valid tokens indicated by a dynamic mask. Extensive experiments demonstrate the state-of-the-art performance of the new model on multiple benchmark datasets. Code is released at https://github.com/fenglinglwb/MAT. Comment: Accepted to CVPR 2022 (Oral)
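    The core idea of attending only to valid tokens can be sketched as masked attention: keys belonging to hole tokens are excluded by setting their logits to negative infinity before the softmax. This is a generic illustration; the paper's dynamic mask updating and full transformer block design are omitted.

```python
# Sketch of attention restricted to valid tokens: invalid keys (mask == 0) are
# masked out of the softmax. Not MAT's actual block, just the masking idea.
import torch
import torch.nn.functional as F

def masked_attention(q: torch.Tensor,      # (B, N, D) queries
                     k: torch.Tensor,      # (B, N, D) keys
                     v: torch.Tensor,      # (B, N, D) values
                     valid: torch.Tensor   # (B, N), 1 = valid token, 0 = hole token
                     ) -> torch.Tensor:
    scale = q.shape[-1] ** -0.5
    logits = torch.einsum("bqd,bkd->bqk", q, k) * scale
    logits = logits.masked_fill(valid[:, None, :] == 0, float("-inf"))
    attn = logits.softmax(dim=-1)
    attn = torch.nan_to_num(attn)          # rows with no valid keys yield zeros
    return torch.einsum("bqk,bkd->bqd", attn, v)
```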

    Methods for 3D Geometry Processing in the Cultural Heritage Domain

    This thesis presents methods for 3D geometry processing in the context of cultural heritage applications. After a short overview of the relevant basics of 3D geometry processing, the thesis investigates the digital acquisition of 3D models. One particular challenge here is posed by difficult surface or material properties of the model to be captured, such as reflectivity and transparency. Another is that the fully automatic reconstruction of models, even of those whose surfaces laser range scanners can capture well, is not yet completely solved. This thesis presents two approaches to tackle these challenges. One exploits a thorough capture of the object's appearance together with a coarse reconstruction to obtain a concise and realistic representation even of objects with problematic surface properties such as reflectivity and transparency. The other concentrates on digitisation with laser range scanners and exploits the 2D colour images that are typically recorded alongside the range images for a fully automatic registration technique.

    After reconstruction, the captured models are often still incomplete: they exhibit holes and/or regions of insufficient sampling. In addition, holes are often deliberately introduced into a registered model to remove undesired or defective surface parts. To produce a visually appealing model, for instance for visualisation, prototyping, or replica production, these holes have to be detected and filled. Although completion is a well-established research field in 2D image processing and many approaches exist for image completion, surface completion in 3D is a fairly new field of research. This thesis presents a hierarchical completion approach that adapts and extends successful exemplar-based 2D image completion approaches to 3D and fills missing surface regions with detail-equipped surface patches. To identify and construct suitable surface patches, the self-similarity and coherence properties of the surface context around the hole are exploited.

    Beyond reconstruction and repair, the thesis also investigates methods for modifying captured models via interactive modelling. On the one hand, modelling is regarded as a creative process, for instance for animation purposes. On the other hand, it is also demonstrated how this creative process can be used to introduce human expertise into the otherwise automatic completion process. In this way, reconstructions become feasible even for objects where the data source, the object itself, is incomplete due to corrosion, demolition, or decay.
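    The exemplar-based idea that the thesis lifts from 2D to 3D can be sketched in its 2D form: for a patch centred on a missing pixel, search the known context for the most similar patch (self-similarity, compared only on the known pixels) and copy its pixels into the hole. The greedy toy loop below is unoptimized, assumes the hole lies well inside the image, and says nothing about the hierarchical 3D surface machinery.

```python
# 2D toy sketch of exemplar-based completion via self-similarity. `image` is a
# float HxW array, `hole` a boolean HxW mask (True = missing). Greedy, slow,
# and assumes the hole is at least `patch` pixels away from the image border.
import numpy as np

def exemplar_fill(image: np.ndarray, hole: np.ndarray, patch: int = 9) -> np.ndarray:
    img = image.astype(float).copy()
    missing = hole.copy()
    r = patch // 2

    def patch_at(a, y, x):
        return a[y - r:y + r + 1, x - r:x + r + 1]

    # candidate source patches: fully known and away from the image border
    ys, xs = np.where(~missing)
    sources = [(y, x) for y, x in zip(ys, xs)
               if r <= y < img.shape[0] - r and r <= x < img.shape[1] - r
               and not patch_at(missing, y, x).any()]

    while missing.any():
        # pick the next missing pixel whose patch lies inside the image
        # (row-major order acts as a simple fill front)
        by, bx = next((y, x) for y, x in zip(*np.where(missing))
                      if r <= y < img.shape[0] - r and r <= x < img.shape[1] - r)
        target = patch_at(img, by, bx)
        known = ~patch_at(missing, by, bx)

        # self-similarity search: best source patch under SSD on known pixels
        best = min(sources, key=lambda s: np.sum(
            (patch_at(img, *s)[known] - target[known]) ** 2))

        # copy the missing pixels from the best-matching exemplar patch
        src = patch_at(img, *best)
        patch_at(img, by, bx)[~known] = src[~known]
        patch_at(missing, by, bx)[:] = False

    return img
```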