PFB-Diff: Progressive Feature Blending Diffusion for Text-driven Image Editing
Diffusion models have showcased their remarkable capability to synthesize
diverse and high-quality images, sparking interest in their application for
real image editing. However, existing diffusion-based approaches for local
image editing often suffer from undesired artifacts: the pixel-level blending
of noised target images with diffusion latent variables lacks the semantics
needed to maintain image consistency. To address these
issues, we propose PFB-Diff, a Progressive Feature Blending method for
Diffusion-based image editing. Unlike previous methods, PFB-Diff seamlessly
integrates text-guided generated content into the target image through
multi-level feature blending. The rich semantics encoded in deep features and
the progressive blending scheme from high to low levels ensure semantic
coherence and high quality in edited images. Additionally, we introduce an
attention masking mechanism in the cross-attention layers to confine the impact
of specific words to desired regions, further improving the performance of
background editing. PFB-Diff can effectively address various editing tasks,
including object/background replacement and object attribute editing. Our
method demonstrates its superior performance in terms of image fidelity,
editing accuracy, efficiency, and faithfulness to the original image, without
the need for fine-tuning or training. Comment: 18 pages, 15 figures
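The attention-masking idea described in this abstract can be sketched in a few lines: selected text tokens are prevented from influencing pixels outside the desired edit region by masking the cross-attention scores. The function name, shapes, and masks below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def masked_cross_attention(queries, keys, values, token_mask, region_mask):
    """Cross-attention in which masked text tokens may only influence
    pixels inside the edit region (a simplified, illustrative sketch).

    queries: (n_pixels, d) pixel features
    keys, values: (n_tokens, d) / (n_tokens, dv) text-token features
    token_mask: (n_tokens,) bool - tokens confined to the edit region
    region_mask: (n_pixels,) bool - pixels inside the edit region
    """
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)          # (n_pixels, n_tokens)
    # For pixels outside the region, confined tokens get -inf so the
    # softmax assigns them zero weight.
    blocked = np.outer(~region_mask, token_mask)    # (n_pixels, n_tokens)
    scores[blocked] = -np.inf
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ values
```

With this masking, a word like "zebra" in the prompt can only reshape pixels inside its target mask, which is why background regions stay untouched.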
Perceptually Meaningful Image Editing: Depth
We introduce the concept of perceptually meaningful image editing and present two techniques for manipulating the apparent depth of objects in an image. The user loads an image, selects an object and specifies whether the object should appear closer or further away. The system automatically determines target values for the object and/or background that achieve the desired depth change. These depth editing operations, based on techniques used by traditional artists, manipulate either the luminance or color temperature of different regions of the image. By performing blending in the gradient domain and reconstruction with a Poisson solver, the appearance of false edges is minimized. The results of a preliminary user study, designed to evaluate the effectiveness of these techniques, are also presented.
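Gradient-domain blending as used above can be sketched as solving a Poisson equation: inside the edited region, pixel values are recomputed so their gradients match a target gradient field, while values outside the region are held fixed. The simple Jacobi solver below is an illustrative stand-in, not the paper's solver.

```python
import numpy as np

def poisson_blend(base, grad_x, grad_y, mask, iters=2000):
    """Recompute the masked region so its gradients match (grad_x, grad_y),
    keeping base values fixed outside the mask. Solves the discrete
    Poisson equation with plain Jacobi iteration (a sketch; real systems
    use faster solvers such as multigrid or sparse direct methods)."""
    out = base.astype(float).copy()
    # Divergence of the desired gradient field (forward differences).
    div = (grad_x - np.roll(grad_x, 1, axis=1)
           + grad_y - np.roll(grad_y, 1, axis=0))
    for _ in range(iters):
        neighbors = (np.roll(out, 1, 0) + np.roll(out, -1, 0)
                     + np.roll(out, 1, 1) + np.roll(out, -1, 1))
        # Discrete Poisson: sum(neighbors) - 4f = div  =>  solve for f.
        out[mask] = (neighbors - div)[mask] / 4.0
    return out
```

Because only gradients are prescribed inside the region, the reconstruction smoothly absorbs any mismatch at the region boundary, which is what suppresses false edges.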
Collaborative telemedicine for interactive multiuser segmentation of volumetric medical images
Telemedicine has evolved rapidly in recent years to enable unprecedented access to digital medical data, such as with networked image distribution/sharing and online (distant) collaborative diagnosis, largely due to the advances in telecommunication and multimedia technologies. However, interactive collaboration systems which control editing of an object among multiple users are often limited to a simple "locking" mechanism based on a conventional client/server architecture, where only one user edits the object, which is located on a specific server, while all other users become viewers. Such systems fail to meet the needs of modern-day telemedicine applications that demand simultaneous editing of medical data distributed across diverse local sites. In this study, we introduce a novel system for telemedicine applications, with its application to interactive segmentation of volumetric medical images. We innovate by proposing a collaborative mechanism with a scalable data-sharing architecture which lets users interactively edit a single shared image distributed across local sites, thus enabling collaborative editing for, e.g., collaborative diagnosis, teaching, and training. We demonstrate our collaborative telemedicine mechanism with a prototype image editing system developed and evaluated with a user case study. Our result suggests that the ability for collaborative editing in a telemedicine context can be of great benefit and holds promising potential for further research.
PlenoPatch: patch-based plenoptic image manipulation
Patch-based image synthesis methods have been successfully applied for various editing tasks on still images, videos and stereo pairs. In this work we extend patch-based synthesis to plenoptic images captured by consumer-level lenslet-based devices for interactive, efficient light field editing. In our method the light field is represented as a set of images captured from different viewpoints. We decompose the central view into different depth layers, and present it to the user for specifying the editing goals. Given an editing task, our method performs patch-based image synthesis on all affected layers of the central view, and then propagates the edits to all other views. Interaction is done through a conventional 2D image editing user interface that is familiar to novice users. Our method correctly handles object boundary occlusion with semi-transparency, and thus can generate more realistic results than previous methods. We demonstrate compelling results on a wide range of applications such as hole-filling, object reshuffling and resizing, changing object depth, light field upscaling and parallax magnification.
MagicProp: Diffusion-based Video Editing via Motion-aware Appearance Propagation
This paper addresses the issue of modifying the visual appearance of videos
while preserving their motion. A novel framework, named MagicProp, is proposed,
which disentangles the video editing process into two stages: appearance
editing and motion-aware appearance propagation. In the first stage, MagicProp
selects a single frame from the input video and applies image-editing
techniques to modify the content and/or style of the frame. The flexibility of
these techniques enables the editing of arbitrary regions within the frame. In
the second stage, MagicProp employs the edited frame as an appearance reference
and generates the remaining frames using an autoregressive rendering approach.
To achieve this, a diffusion-based conditional generation model, called
PropDPM, is developed, which synthesizes the target frame by conditioning on
the reference appearance, the target motion, and its previous appearance. The
autoregressive editing approach ensures temporal consistency in the resulting
videos. Overall, MagicProp combines the flexibility of image-editing techniques
with the superior temporal consistency of autoregressive modeling, enabling
flexible editing of object types and aesthetic styles in arbitrary regions of
input videos while maintaining good temporal consistency across frames.
Extensive experiments in various video editing scenarios demonstrate the
effectiveness of MagicProp.
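The autoregressive propagation stage described above reduces to a simple loop: each output frame is rendered from the previously generated frame, the target motion, and the edited reference appearance. The sketch below is illustrative; `render` is a stand-in for the PropDPM conditional generation model.

```python
def propagate(edited_ref, motions, render):
    """Autoregressive appearance propagation sketch.

    edited_ref: the single edited frame used as the appearance reference
    motions: per-frame target-motion conditions for the remaining frames
    render: stand-in for PropDPM - maps (previous frame, motion,
            reference appearance) to the next frame
    """
    frames = [edited_ref]
    for motion in motions:
        # Each frame conditions on its predecessor, which is what
        # enforces temporal consistency across the output video.
        frames.append(render(frames[-1], motion, edited_ref))
    return frames[1:]
```

Conditioning every step on both the previous frame and the fixed reference is the key design choice: the previous frame carries temporal continuity, while the reference keeps the edited appearance from drifting over long videos.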
SlotDiffusion: Object-Centric Generative Modeling with Diffusion Models
Object-centric learning aims to represent visual data with a set of object
entities (a.k.a. slots), providing structured representations that enable
systematic generalization. Leveraging advanced architectures like Transformers,
recent approaches have made significant progress in unsupervised object
discovery. In addition, slot-based representations hold great potential for
generative modeling, such as controllable image generation and object
manipulation in image editing. However, current slot-based methods often
produce blurry images and distorted objects, exhibiting poor generative
modeling capabilities. In this paper, we focus on improving slot-to-image
decoding, a crucial aspect for high-quality visual generation. We introduce
SlotDiffusion -- an object-centric Latent Diffusion Model (LDM) designed for
both image and video data. Thanks to the powerful modeling capacity of LDMs,
SlotDiffusion surpasses previous slot models in unsupervised object
segmentation and visual generation across six datasets. Furthermore, our
learned object features can be utilized by existing object-centric dynamics
models, improving video prediction quality and downstream temporal reasoning
tasks. Finally, we demonstrate the scalability of SlotDiffusion to
unconstrained real-world datasets such as PASCAL VOC and COCO, when integrated
with self-supervised pre-trained image encoders. Comment: Project page: https://slotdiffusion.github.io/ . An earlier version of this work appeared at the ICLR 2023 Workshop on Neurosymbolic Generative Models: https://nesygems.github.io/assets/pdf/papers/SlotDiffusion.pd
A generative framework for image-based editing of material appearance using perceptual attributes
Single-image appearance editing is a challenging task, traditionally requiring the estimation of additional scene properties such as geometry or illumination. Moreover, the exact interaction of light, shape and material reflectance that elicits a given perceptual impression is still not well understood. We present an image-based editing method that allows to modify the material appearance of an object by increasing or decreasing high-level perceptual attributes, using a single image as input. Our framework relies on a two-step generative network, where the first step drives the change in appearance and the second produces an image with high-frequency details. For training, we augment an existing material appearance dataset with perceptual judgements of high-level attributes, collected through crowd-sourced experiments, and build upon training strategies that circumvent the cumbersome need for original-edited image pairs. We demonstrate the editing capabilities of our framework on a variety of inputs, both synthetic and real, using two common perceptual attributes (Glossy and Metallic), and validate the perception of appearance in our edited images through a user study
InstructEdit: Improving Automatic Masks for Diffusion-based Image Editing With User Instructions
Recent works have explored text-guided image editing using diffusion models
and generated edited images based on text prompts. However, the models struggle
to accurately locate the regions to be edited and faithfully perform precise
edits. In this work, we propose a framework termed InstructEdit that can do
fine-grained editing based on user instructions. Our proposed framework has
three components: language processor, segmenter, and image editor. The first
component, the language processor, processes the user instruction using a large
language model. The goal of this processing is to parse the user instruction
and output prompts for the segmenter and captions for the image editor. We
adopt ChatGPT and optionally BLIP2 for this step. The second component, the
segmenter, uses the segmentation prompt provided by the language processor. We
employ a state-of-the-art segmentation framework Grounded Segment Anything to
automatically generate a high-quality mask based on the segmentation prompt.
The third component, the image editor, uses the captions from the language
processor and the masks from the segmenter to compute the edited image. We
adopt Stable Diffusion and the mask-guided generation from DiffEdit for this
purpose. Experiments show that our method outperforms previous editing methods
in fine-grained editing applications where the input image contains a complex
object or multiple objects. We improve the mask quality over DiffEdit and thus
improve the quality of edited images. We also show that our framework can
accept multiple forms of user instructions as input. We provide the code at
https://github.com/QianWangX/InstructEdit. Comment: Project page: https://qianwangx.github.io/InstructEdit
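The three-component pipeline described in this abstract can be wired up as below. The parser, segmenter, and editor are trivial stand-ins for illustration only, not the real ChatGPT / Grounded Segment Anything / Stable Diffusion calls; the instruction format and all function names are assumptions.

```python
def parse_instruction(instruction):
    """Stand-in for the language processor: split an instruction of the
    assumed form 'change <object> to <target>' into a segmentation
    prompt and an editing caption (real system: ChatGPT, optionally
    with BLIP2)."""
    _, obj, _, target = instruction.split(" ", 3)
    return {"segment_prompt": obj, "caption": target}

def segment(image, prompt):
    """Stand-in for the segmenter: mark pixels whose label matches the
    prompt (real system: Grounded Segment Anything producing a
    high-quality mask)."""
    return [[pix == prompt for pix in row] for row in image]

def edit(image, mask, caption):
    """Stand-in for the image editor: rewrite only masked pixels (real
    system: Stable Diffusion with DiffEdit-style mask guidance)."""
    return [[caption if m else pix for pix, m in zip(row, mrow)]
            for row, mrow in zip(image, mask)]

def instruct_edit(image, instruction):
    """End-to-end pipeline: language processor -> segmenter -> editor."""
    plan = parse_instruction(instruction)
    mask = segment(image, plan["segment_prompt"])
    return edit(image, mask, plan["caption"])
```

The point of the decomposition is that mask quality is decided by the segmenter, not the diffusion model, so improving the mask directly improves the final edit.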