Recent text-to-image (T2I) diffusion models have achieved remarkable progress
in generating high-quality images given text-prompts as input. However, these
models fail to convey appropriate spatial composition specified by a layout
instruction. In this work, we probe into zero-shot grounded T2I generation with
diffusion models, that is, generating images corresponding to the input layout
information without training auxiliary modules or finetuning diffusion models.
We propose a Region and Boundary (R&B) aware cross-attention guidance approach
that gradually modulates the attention maps of diffusion model during
generative process, and assists the model to synthesize images (1) with high
fidelity, (2) highly compatible with textual input, and (3) interpreting layout
instructions accurately. Specifically, we leverage the discrete sampling to
bridge the gap between consecutive attention maps and discrete layout
constraints, and design a region-aware loss to refine the generative layout
during diffusion process. We further propose a boundary-aware loss to
strengthen object discriminability within the corresponding regions.
Experimental results show that our method outperforms existing state-of-the-art
zero-shot grounded T2I generation methods by a large margin both qualitatively
and quantitatively on several benchmarks.Comment: Preprint. Under review. Project page:
https://sagileo.github.io/Region-and-Boundar