By comparing the original and target prompts in an image editing task, we can
obtain numerous editing pairs, each comprising an object and its corresponding
editing target. To allow editability while maintaining fidelity to the input
image, existing editing methods typically apply a fixed number of inversion
steps that project the whole input image to a noisier latent representation,
followed by a denoising process guided by the target prompt.
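This fixed-step paradigm can be sketched as follows, in the style of HuggingFace
diffusers; the model, schedulers, and prompt embeddings are assumed given, and
the helper is an illustrative stand-in rather than any particular method's
implementation.

```python
# Minimal sketch of fixed-step inversion-then-denoising (assumes a diffusers
# UNet2DConditionModel and DDIM/DDIM-inverse schedulers with timesteps set).
import torch

@torch.no_grad()
def edit_fixed_steps(latent, src_emb, tgt_emb, unet, inv_sched, den_sched, k):
    z = latent
    # Inversion: project the whole image toward noise under the source prompt.
    for t in inv_sched.timesteps[:k]:                    # low noise -> high noise
        eps = unet(z, t, encoder_hidden_states=src_emb).sample
        z = inv_sched.step(eps, t, z).prev_sample
    # Denoising: return from the noisy latent, guided by the target prompt.
    for t in den_sched.timesteps[-k:]:                   # high noise -> low noise
        eps = unet(z, t, encoder_hidden_states=tgt_emb).sample
        z = den_sched.step(eps, t, z).prev_sample
    return z  # the same k is used for every editing pair in the image
```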
However, we find that the optimal number of inversion steps for achieving an
ideal editing result varies significantly across editing pairs, owing to their
differing editing difficulties. Therefore, methods in the current literature,
which rely on a fixed number of inversion steps, produce sub-optimal generation
quality, especially when handling multiple editing pairs in a natural image. To
this end, we propose a
new image editing paradigm, dubbed Object-aware Inversion and Reassembly (OIR),
to enable object-level fine-grained editing. Specifically, we design a new
search metric that determines the optimal number of inversion steps for each
editing pair by jointly considering the editability of the target and the
fidelity of the non-editing region. When editing an image, we apply this metric
to find the optimal inversion step for each editing pair, as sketched below.
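One possible reading of this search: edit each pair at several candidate
inversion depths, score every candidate by combining target editability with
non-editing-region fidelity, and keep the best depth. The helpers
`clip_similarity` and `region_fidelity` are hypothetical stand-ins (e.g., CLIP
text-image similarity inside the object mask and a perceptual distance outside
it); the paper defines the actual metric.

```python
# Hedged sketch of per-pair optimal-step search; `edit_fn` runs an
# invert-then-denoise pipeline such as the one above, and the two scoring
# functions are hypothetical placeholders for editability and fidelity.
from dataclasses import dataclass
import numpy as np

@dataclass
class EditingPair:
    target_prompt: str     # e.g. "a chocolate cake"
    mask: np.ndarray       # boolean H x W mask of the object to edit

def search_optimal_step(image, pair, candidate_steps, edit_fn,
                        clip_similarity, region_fidelity):
    best_step, best_score = None, float("-inf")
    for k in candidate_steps:
        edited = edit_fn(image, pair, k)   # invert k steps, denoise with target prompt
        editability = clip_similarity(edited, pair.target_prompt, pair.mask)
        fidelity = region_fidelity(edited, image, ~pair.mask)
        score = editability + fidelity     # joint criterion; weighting omitted
        if score > best_score:
            best_step, best_score = k, score
    return best_step
```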
We then edit these editing pairs separately to avoid concept mismatch.
Subsequently, we propose an additional reassembly step that seamlessly
integrates the respective editing results with the non-editing region to obtain
the final edited image.
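The reassembly intent can be illustrated with naive pixel-space compositing,
shown below purely as a sketch under assumed array shapes; the paper's actual
reassembly step need not reduce to pasting pixels.

```python
# Naive compositing sketch of reassembly (assumes HxWxC float images and
# boolean HxW masks); illustrative only.
import numpy as np

def reassemble(original, edits):
    """`edits`: list of (edited_image, mask), one entry per editing pair."""
    out = original.copy()            # non-editing region comes from the input image
    for edited, mask in edits:
        out[mask] = edited[mask]     # each object region from its own optimal edit
    return out
```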
To systematically evaluate the effectiveness of our method, we collect two
datasets for benchmarking single- and multi-object editing, respectively.
Experiments demonstrate that our method achieves superior performance in
editing object shapes, colors, materials, categories, etc., especially in
multi-object editing scenarios.

Project Page: https://aim-uofa.github.io/OIR-Diffusion