Paste, Inpaint and Harmonize via Denoising: Subject-Driven Image Editing with Pre-Trained Diffusion Model
Text-to-image generative models have attracted rising attention for flexible image editing via user-specified descriptions. However, text descriptions alone are not enough to elaborate the details of subjects, often compromising the subjects' identity or requiring additional per-subject fine-tuning. We introduce a new framework called Paste, Inpaint and Harmonize via Denoising (PhD), which leverages an exemplar image in addition to text descriptions to specify user intentions. In the pasting step, an off-the-shelf segmentation model is employed to identify a user-specified subject within an exemplar image, which is subsequently inserted into a background image to serve as an initialization that captures both scene context and subject identity. To guarantee the visual coherence of the generated or edited image, we introduce an inpainting and harmonizing module that guides the pre-trained diffusion model to seamlessly blend the inserted subject into the scene. As we keep the pre-trained diffusion model frozen, we preserve its strong image synthesis ability and text-driven controllability, thus achieving high-quality results and flexible editing with diverse texts. In our experiments, we apply PhD to subject-driven image editing tasks and also explore text-driven scene generation given a reference subject. Both quantitative and qualitative comparisons with baseline methods demonstrate that our approach achieves state-of-the-art performance in both tasks. More qualitative results can be found at https://sites.google.com/view/phd-demo-page.
Comment: 10 pages, 12 figures
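The paste-then-inpaint idea lends itself to a compact sketch. The following is a minimal illustration of that reading, not the authors' released implementation: it assumes the subject has already been cut out by a segmentation model (as an RGBA image), and it stands in the off-the-shelf diffusers StableDiffusionInpaintPipeline for the paper's frozen diffusion model with its inpainting and harmonizing module.

```python
import numpy as np
from PIL import Image, ImageFilter
import torch
from diffusers import StableDiffusionInpaintPipeline

def paste_then_blend(background: Image.Image, subject_rgba: Image.Image,
                     box: tuple, prompt: str) -> Image.Image:
    """background: RGB image; subject_rgba: subject cut out by any
    segmentation model (alpha channel = mask); box: (l, t, r, b) region."""
    # 1) Paste: composite the segmented subject onto the background.
    canvas = background.copy()
    subject = subject_rgba.resize((box[2] - box[0], box[3] - box[1]))
    canvas.paste(subject, box[:2], mask=subject.split()[-1])

    # 2) Mask only a thin band around the paste boundary, so the diffusion
    #    model blends the seam without repainting the subject itself.
    full = np.zeros(background.size[::-1], dtype=np.uint8)
    alpha = (np.array(subject.split()[-1]) > 0).astype(np.uint8) * 255
    full[box[1]:box[3], box[0]:box[2]] = alpha
    band = Image.fromarray(full)
    seam = (np.array(band.filter(ImageFilter.MaxFilter(15))).astype(int)
            - np.array(band.filter(ImageFilter.MinFilter(15))))
    seam_mask = Image.fromarray((seam > 0).astype(np.uint8) * 255)

    # 3) Inpaint/harmonize the seam with a frozen, off-the-shelf model.
    pipe = StableDiffusionInpaintPipeline.from_pretrained(
        "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
    ).to("cuda")
    return pipe(prompt=prompt, image=canvas.resize((512, 512)),
                mask_image=seam_mask.resize((512, 512))).images[0]
```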
ImageBrush: Learning Visual In-Context Instructions for Exemplar-Based Image Manipulation
While language-guided image manipulation has made remarkable progress, the challenge of instructing the manipulation process so that it faithfully reflects human intentions persists. An accurate and comprehensive description of a manipulation task using natural language is laborious and sometimes even impossible, primarily due to the inherent uncertainty and ambiguity present in linguistic expressions. Is it feasible to accomplish image manipulation without resorting to external cross-modal language information? If so, the inherent modality gap would be effortlessly eliminated. In this paper, we propose a novel manipulation methodology, dubbed ImageBrush, that learns visual instructions for more accurate image editing. Our key idea is to employ a pair of transformation images as visual instructions, which not only precisely capture human intention but also facilitate accessibility in real-world scenarios. Capturing visual instructions is particularly challenging because it involves extracting the underlying intentions solely from visual demonstrations and then applying this operation to a new image. To address this challenge, we formulate visual instruction learning as a diffusion-based inpainting problem, where the contextual information is fully exploited through an iterative process of generation. A visual prompting encoder is carefully devised to enhance the model's capacity to uncover human intent behind the visual instructions. Extensive experiments show that our method generates engaging manipulation results conforming to the transformations entailed in the demonstrations. Moreover, our model exhibits robust generalization capabilities on various downstream tasks such as pose transfer, image translation and video inpainting.
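One plausible reading of the diffusion-based inpainting formulation is a 2x2 grid: the example pair (A, A') occupies the top row, the query B the bottom-left cell, and the edited result is recovered by inpainting the empty bottom-right cell. A minimal sketch under that assumption follows; the `inpaint` call is a hypothetical stand-in for any diffusion inpainting model, not ImageBrush's released code.

```python
import numpy as np
from PIL import Image

def make_visual_prompt(a: Image.Image, a_edit: Image.Image,
                       b: Image.Image, size: int = 256):
    """Tile the example pair (a, a_edit) and the query b into a 2x2 canvas
    [A, A'; B, ?] and return it with a mask marking the unknown cell."""
    tiles = [im.convert("RGB").resize((size, size)) for im in (a, a_edit, b)]
    canvas = Image.new("RGB", (2 * size, 2 * size))
    canvas.paste(tiles[0], (0, 0))       # A  (top-left)
    canvas.paste(tiles[1], (size, 0))    # A' (top-right)
    canvas.paste(tiles[2], (0, size))    # B  (bottom-left)
    mask = np.zeros((2 * size, 2 * size), dtype=np.uint8)
    mask[size:, size:] = 255             # unknown cell: B' (bottom-right)
    return canvas, Image.fromarray(mask)

# Hypothetical usage with any diffusion inpainting model:
#   canvas, mask = make_visual_prompt(a, a_prime, b)
#   b_prime = inpaint(canvas, mask).crop((256, 256, 512, 512))
```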
Real-World Image Variation by Aligning Diffusion Inversion Chain
Recent diffusion model advancements have enabled high-fidelity images to be generated using text prompts. However, a domain gap exists between generated images and real-world images, which poses a challenge in generating high-quality variations of real-world images. Our investigation uncovers that this domain gap originates from a gap between the latent distributions of different diffusion processes. To address this issue, we propose a novel inference pipeline called Real-world Image Variation by ALignment (RIVAL) that utilizes diffusion models to generate image variations from a single image exemplar. Our pipeline enhances the generation quality of image variations by aligning the image generation process to the source image's inversion chain. Specifically, we demonstrate that step-wise latent distribution alignment is essential for generating high-quality variations. To attain this, we design a cross-image self-attention injection for feature interaction and a step-wise distribution normalization to align the latent features. Incorporating these alignment processes into a diffusion model allows RIVAL to generate high-quality image variations without further parameter optimization. Our experimental results demonstrate that our proposed approach outperforms existing methods with respect to semantic-condition similarity and perceptual quality. Furthermore, this generalized inference pipeline can be easily applied to other diffusion-based generation tasks, such as image-conditioned text-to-image generation and example-based image inpainting.
Comment: 19 pages; Project page: https://rival-diff.github.io/ Code (to be released later): https://github.com/julianjuaner/RIVAL
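One way to read "step-wise distribution normalization" is as moment matching between the generation latent and the inversion-chain latent at the same denoising step, in the spirit of AdaIN. A hedged sketch of that interpretation (not RIVAL's released code):

```python
import torch

def stepwise_distribution_norm(z_gen: torch.Tensor,
                               z_inv: torch.Tensor,
                               eps: float = 1e-5) -> torch.Tensor:
    """Match the per-channel mean/std of the generation latent z_gen to the
    inversion-chain latent z_inv at the same denoising step (AdaIN-style).
    Shapes: (B, C, H, W). This is an interpretation of the abstract, not
    the paper's exact formulation."""
    dims = (2, 3)
    mu_g = z_gen.mean(dim=dims, keepdim=True)
    sd_g = z_gen.std(dim=dims, keepdim=True)
    mu_i = z_inv.mean(dim=dims, keepdim=True)
    sd_i = z_inv.std(dim=dims, keepdim=True)
    return (z_gen - mu_g) / (sd_g + eps) * sd_i + mu_i

# At each sampling step t: z_t = stepwise_distribution_norm(z_t, z_inv_t)
```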
Rerender A Video: Zero-Shot Text-Guided Video-to-Video Translation
Large text-to-image diffusion models have exhibited impressive proficiency in generating high-quality images. However, when applying these models to the video domain, ensuring temporal consistency across video frames remains a formidable challenge. This paper proposes a novel zero-shot text-guided video-to-video translation framework to adapt image models to videos. The framework includes two parts: key frame translation and full video translation. The first part uses an adapted diffusion model to generate key frames, with hierarchical cross-frame constraints applied to enforce coherence in shapes, textures and colors. The second part propagates the key frames to the other frames with temporal-aware patch matching and frame blending. Our framework achieves global style and local texture temporal consistency at a low cost (without re-training or optimization). The adaptation is compatible with existing image diffusion techniques, allowing our framework to take advantage of them, such as customizing a specific subject with LoRA and introducing extra spatial guidance with ControlNet. Extensive experimental results demonstrate the effectiveness of our proposed framework over existing methods in rendering high-quality and temporally-coherent videos.
Comment: Accepted to SIGGRAPH Asia 2023. Project page: https://www.mmlab-ntu.com/project/rerender
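As a toy illustration of the propagation stage's weighting (the paper's temporal-aware patch matching and blending is considerably more involved), each in-between frame can be blended from its two surrounding translated key frames according to temporal distance:

```python
import numpy as np

def propagate_between_keyframes(key_a: np.ndarray, key_b: np.ndarray,
                                n_between: int) -> list:
    """Blend two translated key frames across the frames between them,
    weighted by temporal distance. Illustrative stand-in only: the paper
    propagates with temporal-aware patch matching, not a cross-dissolve."""
    frames = []
    for i in range(1, n_between + 1):
        w = i / (n_between + 1)   # 0 near key_a, 1 near key_b
        blend = ((1 - w) * key_a.astype(np.float32)
                 + w * key_b.astype(np.float32))
        frames.append(blend.astype(key_a.dtype))
    return frames
```

A real implementation would warp patches along motion before blending; the cross-dissolve above only illustrates the distance-based weights.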
A Novel Inpainting Framework for Virtual View Synthesis
Multi-view imaging has stimulated significant research to enhance the user experience of free viewpoint video, allowing interactive navigation between views and the freedom to select a desired view to watch. This usually involves transmitting both textural and depth information captured from different viewpoints to the receiver, to enable the synthesis of an arbitrary view. In rendering these virtual views, perceptual holes can appear when regions hidden in the original view by a closer object become visible in the virtual view. To provide a high-quality experience, these holes must be filled in a visually plausible way, in a process known as inpainting. This is challenging because the missing information is generally unknown and the hole regions can be large. Recently, depth-based inpainting techniques have been proposed to address this challenge; while these generally perform better than non-depth-assisted methods, they are not very robust and can produce perceptual artefacts.
This thesis presents a new inpainting framework that innovatively exploits depth and textural self-similarity characteristics to construct subjectively enhanced virtual viewpoints. The framework makes three significant contributions to the field: i) the exploitation of view information to jointly inpaint textural and depth hole regions; ii) the introduction of the novel concept of self-similarity characterisation which is combined with relevant depth information; and iii) an advanced self-similarity characterising scheme that automatically determines key spatial transform parameters for effective and flexible inpainting.
The presented inpainting framework has been critically analysed and shown to provide superior performance both perceptually and numerically compared to existing techniques, especially in terms of lower visual artefacts. It provides a flexible, robust framework for developing new inpainting strategies for the next generation of interactive multi-view technologies.
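To make the depth-assisted intuition concrete, here is a generic sketch (an illustration of the general principle, not this thesis's self-similarity framework): disocclusion holes are revealed by a closer object, so fill colour should be drawn from background-depth neighbours only.

```python
import numpy as np

def depth_guided_fill(image, depth, hole, iters=500):
    """Iteratively fill hole pixels from their 4-neighbours, but only from
    neighbours at background depth (larger depth value assumed farther).
    Generic illustration of depth-assisted inpainting, not the thesis's
    framework. image: (H, W, 3) float; depth: (H, W); hole: (H, W) bool."""
    out, mask, depth = image.copy(), hole.copy(), depth.copy()
    bg = np.percentile(depth[~hole], 75)          # crude background level
    for _ in range(iters):
        if not mask.any():
            break
        filled_any = False
        for y, x in np.argwhere(mask):
            acc, n = np.zeros(3), 0
            for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                yy, xx = y + dy, x + dx
                if (0 <= yy < mask.shape[0] and 0 <= xx < mask.shape[1]
                        and not mask[yy, xx] and depth[yy, xx] >= bg):
                    acc += out[yy, xx]
                    n += 1
            if n:
                out[y, x] = acc / n               # average background colour
                depth[y, x] = bg                  # filled pixel joins background
                mask[y, x] = False
                filled_any = True
        if not filled_any:                        # no background neighbour left
            break
    return out
```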
MAT: Mask-Aware Transformer for Large Hole Image Inpainting
Recent studies have shown the importance of modeling long-range interactions in the inpainting problem. To achieve this goal, existing approaches exploit either standalone attention techniques or transformers, but usually at low resolution owing to the computational cost. In this paper, we present a novel transformer-based model for large hole inpainting, which unifies the merits of transformers and convolutions to efficiently process high-resolution images. We carefully design each component of our framework to guarantee the high fidelity and diversity of recovered images. Specifically, we customize an inpainting-oriented transformer block, where the attention module aggregates non-local information only from partial valid tokens, indicated by a dynamic mask. Extensive experiments demonstrate the state-of-the-art performance of the new model on multiple benchmark datasets. Code is released at https://github.com/fenglinglwb/MAT.
Comment: Accepted to CVPR 2022 (Oral)
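The mask-aware aggregation described above has a standard realisation: restrict attention to valid tokens by masking invalid keys before the softmax. A generic sketch of that idea (not MAT's exact block, which also updates the mask dynamically):

```python
import torch
import torch.nn.functional as F

def mask_aware_attention(q, k, v, valid):
    """Attention that aggregates information only from valid tokens,
    as indicated by a (possibly dynamic) mask.
    q, k, v: (B, N, D); valid: (B, N) bool, True = valid token."""
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5            # (B, N, N)
    scores = scores.masked_fill(~valid[:, None, :], float("-inf"))
    attn = F.softmax(scores, dim=-1)
    attn = torch.nan_to_num(attn)   # rows with no valid key -> all zeros
    return attn @ v
```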
Methods for 3D Geometry Processing in the Cultural Heritage Domain
This thesis presents methods for 3D geometry processing in the context of cultural heritage applications. After a short overview of the relevant basics of 3D geometry processing, the thesis investigates the digital acquisition of 3D models. A particular challenge in this context is posed, on the one hand, by difficult surface or material properties of the model to be captured. On the other hand, the fully automatic reconstruction of models, even those with surface properties suitable for capture with laser range scanners, is not yet completely solved. This thesis presents two approaches to tackle these challenges. One exploits a thorough capture of the object's appearance and a coarse reconstruction for a concise and realistic object representation, even for objects with problematic surface properties such as reflectivity and transparency. The other concentrates on digitisation via laser range scanners and exploits the 2D colour images that are typically recorded alongside the range images for a fully automatic registration technique.
After reconstruction, the captured models are often still incomplete, exhibiting holes and/or regions of insufficient sampling. In addition, holes are often deliberately introduced into a registered model to remove an undesired or defective surface part. In order to produce a visually appealing model, for instance for visualisation purposes or for prototype or replica production, these holes have to be detected and filled. Although completion is a well-established research field in 2D image processing and many approaches exist for image completion, surface completion in 3D is a fairly new field of research. This thesis presents a hierarchical completion approach that employs and extends successful exemplar-based 2D image processing approaches to 3D, filling detail-equipped surface patches into missing surface regions. In order to identify and construct suitable surface patches, self-similarity and coherence properties of the surface context of the hole are exploited.
In addition to reconstruction and repair, the thesis also investigates methods for modifying captured models via interactive modelling. In this context, modelling is regarded as a creative process, for instance for animation purposes. It is also demonstrated how this creative process can be used to introduce human expertise into the otherwise automatic completion process. This way, reconstructions are feasible even for objects where the data source itself, the object, is incomplete due to corrosion, demolition, or decay.
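As a purely illustrative 2D analogy to the self-similarity search described above (the thesis itself operates on 3D surface patches and determines spatial transform parameters automatically), one can scan a known image region for the patch that best matches a hole's surrounding context over a few candidate rotations:

```python
import numpy as np

def best_selfsimilar_patch(img, ctx_center, patch=15):
    """Find the patch most similar to the context patch at ctx_center,
    searching over four rotations (stand-ins for the spatial transform
    parameters the thesis determines automatically). 2D analogy only.
    img: (H, W) float array; ctx_center: (row, col)."""
    h = patch // 2
    cy, cx = ctx_center
    target = img[cy - h:cy + h + 1, cx - h:cx + h + 1]
    best, best_err = None, np.inf
    H, W = img.shape
    for y in range(h, H - h, 4):               # coarse stride for speed
        for x in range(h, W - h, 4):
            if abs(y - cy) < patch and abs(x - cx) < patch:
                continue                       # skip the query itself
            cand = img[y - h:y + h + 1, x - h:x + h + 1]
            for k in range(4):                 # 0, 90, 180, 270 degrees
                err = np.mean((np.rot90(cand, k) - target) ** 2)
                if err < best_err:
                    best, best_err = (y, x, 90 * k), err
    return best, best_err
```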