Despite their exceptional generative abilities, large text-to-image diffusion
models, much like skilled but careless artists, often struggle with accurately
depicting visual relationships between objects. This issue, as we uncover
through careful analysis, arises from a misaligned text encoder that struggles
to interpret specific relationships and differentiate the logical order of
associated objects. To resolve this, we introduce a novel task termed Relation
Rectification, aiming to refine the model to accurately represent a given
relationship it initially fails to generate. To address this task, we propose a
solution that utilizes a Heterogeneous Graph Convolutional Network
(HGCN). It models the directional relationships between relation terms and
corresponding objects within the input prompts. Specifically, we optimize the
HGCN on a pair of prompts with identical relational words but reversed object
orders, supplemented by a few reference images. The lightweight HGCN adjusts
the text embeddings generated by the text encoder, ensuring that the textual
relation is accurately reflected in the embedding space. Crucially, our
method keeps the parameters of the text encoder and diffusion model unchanged,
preserving the model's robust performance on unrelated descriptions. We
validate our approach on a newly curated dataset of diverse relational data,
demonstrating both quantitative and qualitative enhancements in generating
images with precise visual relations. Project page:
https://wuyinwei-hah.github.io/rrnet.github.io/
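
The following is a minimal sketch, not the authors' implementation, of the general idea the abstract describes: a small graph module with direction-specific edge transforms that predicts additive offsets for the subject, relation, and object token embeddings, while the text encoder and diffusion model themselves stay frozen. All class names, dimensions, edge types, and the graph construction here are illustrative assumptions.

```python
# Toy sketch of a lightweight heterogeneous graph layer that adjusts frozen
# text-encoder token embeddings. Names and shapes are assumptions for
# illustration only; they are not the paper's actual code.

import torch
import torch.nn as nn


class ToyHeteroGCNLayer(nn.Module):
    """One message-passing step with a separate weight per directed edge type."""

    def __init__(self, dim, edge_types):
        super().__init__()
        # One linear transform per directed edge type, e.g. "subj_to_rel".
        self.edge_mlps = nn.ModuleDict({t: nn.Linear(dim, dim) for t in edge_types})
        self.update = nn.Linear(dim, dim)

    def forward(self, node_feats, edges):
        # node_feats: (num_nodes, dim); edges: list of (src, dst, edge_type).
        messages = torch.zeros_like(node_feats)
        for src, dst, etype in edges:
            messages[dst] = messages[dst] + self.edge_mlps[etype](node_feats[src])
        return torch.tanh(self.update(node_feats + messages))


class RelationAdjuster(nn.Module):
    """Maps (subject, relation, object) token embeddings to additive offsets."""

    def __init__(self, dim=768):
        super().__init__()
        edge_types = ["subj_to_rel", "rel_to_obj", "obj_to_rel", "rel_to_subj"]
        self.gcn = ToyHeteroGCNLayer(dim, edge_types)
        self.out = nn.Linear(dim, dim)

    def forward(self, subj, rel, obj):
        nodes = torch.stack([subj, rel, obj])  # (3, dim), one node per token
        edges = [(0, 1, "subj_to_rel"), (1, 2, "rel_to_obj"),
                 (2, 1, "obj_to_rel"), (1, 0, "rel_to_subj")]
        h = self.gcn(nodes, edges)
        return self.out(h)  # per-node adjustment to add to frozen embeddings


if __name__ == "__main__":
    dim = 768
    adjuster = RelationAdjuster(dim)
    # Stand-ins for the frozen text encoder's subject/relation/object embeddings.
    subj_emb, rel_emb, obj_emb = (torch.randn(dim) for _ in range(3))
    offsets = adjuster(subj_emb, rel_emb, obj_emb)
    adjusted = torch.stack([subj_emb, rel_emb, obj_emb]) + offsets
    print(adjusted.shape)  # torch.Size([3, 768])
```

In a training setup along these lines, only the adjuster would be optimized, e.g. on prompt pairs with the same relation word but swapped object order plus a few reference images, so that the frozen encoder and diffusion model are untouched and unrelated prompts behave exactly as before.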