Text-guided diffusion models have revolutionized image generation and
editing, offering exceptional realism and diversity. Specifically, in the
context of diffusion-based editing, where a source image is edited according to
a target prompt, the process begins by inverting the source image into a noisy
latent vector through the diffusion model. This vector is
subsequently fed into separate source and target diffusion branches for
editing. The accuracy of this inversion process significantly impacts the final
editing outcome, influencing both essential content preservation of the source
image and edit fidelity according to the target prompt. Prior inversion
techniques aimed to find a single solution that serves both the source and target
diffusion branches. However, our theoretical and empirical analyses reveal that
disentangling these branches leads to a distinct separation of responsibilities
for preserving essential content and ensuring edit fidelity. Building on this
insight, we introduce "Direct Inversion," a novel technique achieving optimal
performance of both branches with just three lines of code. To assess image
editing performance, we present PIE-Bench, an editing benchmark with 700 images
showcasing diverse scenes and editing types, accompanied by versatile
annotations and comprehensive evaluation metrics. Compared to state-of-the-art
optimization-based inversion techniques, our solution not only yields superior
performance across 8 editing methods but also achieves nearly an
order-of-magnitude speed-up.
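
As a rough illustration of the disentangled-branch idea described above, the
following minimal sketch shows how the source branch can be rectified back onto
a saved inversion trajectory while the target branch is left free to follow the
target prompt. It assumes a precomputed DDIM-inversion trajectory
inv_latents = [z_T, ..., z_0] and a placeholder per-step denoiser ddim_step;
all names are hypothetical rather than the released implementation, and the
cross-branch feature sharing performed by the underlying editing method is
omitted.

    import torch

    def edit_with_disentangled_branches(inv_latents, ddim_step, src_emb, tgt_emb):
        # inv_latents: list [z_T, ..., z_0] of latents saved during DDIM inversion
        z_src = inv_latents[0].clone()  # both branches start from the inverted z_T
        z_tgt = inv_latents[0].clone()
        for i in range(len(inv_latents) - 1):
            z_src = ddim_step(z_src, src_emb, i)  # source branch, source prompt
            z_tgt = ddim_step(z_tgt, tgt_emb, i)  # target branch, target prompt
            # The "three-line" modification: snap only the source branch back
            # onto the inversion trajectory so essential content is preserved;
            # the target branch is untouched, preserving edit fidelity.
            delta = inv_latents[i + 1] - z_src
            z_src = z_src + delta
        return z_tgt

In this sketch, content preservation is handled entirely by rectifying the
source branch, so no per-step optimization of a shared solution is needed,
which is where the claimed speed-up would come from under these assumptions.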