Diffusion models have showcased their remarkable capability to synthesize
diverse and high-quality images, sparking interest in their application for
real image editing. However, existing diffusion-based approaches for local
image editing often suffer from undesired artifacts because they blend the
noised target image with diffusion latent variables at the pixel level, a
representation that lacks the semantics needed to maintain image
consistency. To address these
issues, we propose PFB-Diff, a Progressive Feature Blending method for
Diffusion-based image editing. Unlike previous methods, PFB-Diff seamlessly
integrates text-guided generated content into the target image through
multi-level feature blending. The rich semantics encoded in deep features and
the progressive blending scheme from high to low levels ensure semantic
coherence and high quality in edited images. Additionally, we introduce an
attention masking mechanism in the cross-attention layers to confine the impact
of specific words to desired regions, further improving the performance of
background editing. PFB-Diff can effectively address various editing tasks,
including object/background replacement and object attribute editing. Our
method demonstrates superior performance in terms of image fidelity,
editing accuracy, efficiency, and faithfulness to the original image, without
the need for fine-tuning or training.
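The two mechanisms named in the abstract, mask-guided feature blending and attention masking in the cross-attention layers, can be illustrated with a short sketch. The PyTorch-style code below is a minimal sketch under assumed shapes and interfaces; the helper names (`blend_features`, `masked_cross_attention`), the tensor layouts, and the masking convention are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F


def blend_features(gen_feat, tgt_feat, mask):
    """Mask-guided feature blending (illustrative sketch).

    gen_feat, tgt_feat: (B, C, H, W) decoder features from the text-guided
        branch and from the noised target image, respectively.
    mask: (B, 1, H0, W0) binary edit mask, 1 inside the region to edit.
    """
    # Resize the mask to the current feature resolution.
    m = F.interpolate(mask, size=gen_feat.shape[-2:], mode="nearest")
    # Keep generated content inside the mask, target features outside.
    return m * gen_feat + (1.0 - m) * tgt_feat


def masked_cross_attention(q, k, v, word_mask, region_mask):
    """Cross-attention with a spatial attention mask (illustrative sketch).

    q: (B, N, D) image queries (N = H*W spatial positions).
    k, v: (B, T, D) text keys/values for T prompt tokens.
    word_mask: (B, T) bool, True for the words whose influence is confined.
    region_mask: (B, N) bool, True inside the region those words may affect.
    """
    scores = q @ k.transpose(-1, -2) / (q.shape[-1] ** 0.5)  # (B, N, T)
    # Forbid the selected words from attending to positions outside the region;
    # other tokens stay unmasked, so every row keeps finite scores.
    forbid = (~region_mask).unsqueeze(-1) & word_mask.unsqueeze(1)
    scores = scores.masked_fill(forbid, float("-inf"))
    return scores.softmax(dim=-1) @ v
```

In the method described by the abstract, such blending would be applied progressively across several decoder levels, from deep (low-resolution) to shallow (high-resolution) features, at each denoising step, rather than once at the pixel level.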