In this paper, we present a novel generative task: joint scene graph-image
generation. While previous works have explored image generation conditioned on
scene graphs or layouts, our task is distinctive and important because it
involves generating scene graphs themselves unconditionally from noise,
enabling efficient and interpretable control over image generation. Our task is
challenging, requiring the generation of plausible scene graphs with
heterogeneous attributes for nodes (objects) and edges (relations among
objects), including continuous object bounding boxes and discrete object and
relation categories. We introduce a novel diffusion model, DiffuseSG, that
jointly models the adjacency matrix along with heterogeneous node and edge
attributes. We explore various types of encodings for the categorical data,
relaxing it into a continuous space. With a graph transformer as the
denoiser, DiffuseSG successively denoises the scene graph representation in a
continuous space and discretizes the final representation to generate the clean
scene graph. Additionally, we introduce an IoU regularization to enhance the
empirical performance. Our model significantly outperforms existing methods in
scene graph generation on the Visual Genome and COCO-Stuff datasets, on both
standard and newly introduced metrics that better capture the problem
complexity. Moreover, we demonstrate the additional benefits of our model in
two downstream applications: 1) excelling in a series of scene graph completion
tasks, and 2) improving scene graph detection models by using extra training
samples generated by DiffuseSG.
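
To make the described pipeline concrete, the sketch below illustrates one way the ideas in this abstract could be realized: scene graphs relaxed into continuous node and edge tensors, a denoiser trained to predict the clean signal with an additional IoU term on the bounding boxes, and a final discretization step that recovers categorical labels. The denoiser interface, tensor layouts, noise schedule, and loss weight are hypothetical placeholders, not DiffuseSG's actual implementation.

```python
import torch


def box_iou(pred, target, eps=1e-6):
    """IoU between axis-aligned boxes in normalized (x1, y1, x2, y2) format."""
    ix1 = torch.maximum(pred[..., 0], target[..., 0])
    iy1 = torch.maximum(pred[..., 1], target[..., 1])
    ix2 = torch.minimum(pred[..., 2], target[..., 2])
    iy2 = torch.minimum(pred[..., 3], target[..., 3])
    inter = (ix2 - ix1).clamp(min=0) * (iy2 - iy1).clamp(min=0)
    area_p = (pred[..., 2] - pred[..., 0]).clamp(min=0) * (pred[..., 3] - pred[..., 1]).clamp(min=0)
    area_t = (target[..., 2] - target[..., 0]).clamp(min=0) * (target[..., 3] - target[..., 1]).clamp(min=0)
    return inter / (area_p + area_t - inter + eps)


def training_loss(denoiser, x0_nodes, x0_edges, t, alpha_bar):
    """Denoising (predicted-x0) loss plus an IoU regularizer on the boxes.

    x0_nodes : [B, N, 4 + C_obj]  boxes concatenated with relaxed object labels
    x0_edges : [B, N, N, C_rel]   relaxed relation labels (class 0 = no edge)
    alpha_bar: [B]                cumulative noise-schedule coefficients at step t
    """
    eps_n, eps_e = torch.randn_like(x0_nodes), torch.randn_like(x0_edges)
    a_n = alpha_bar.view(-1, 1, 1)
    a_e = alpha_bar.view(-1, 1, 1, 1)
    xt_nodes = a_n.sqrt() * x0_nodes + (1 - a_n).sqrt() * eps_n
    xt_edges = a_e.sqrt() * x0_edges + (1 - a_e).sqrt() * eps_e

    # Hypothetical graph-transformer denoiser predicting the clean signal.
    pred_nodes, pred_edges = denoiser(xt_nodes, xt_edges, t)

    recon = (pred_nodes - x0_nodes).pow(2).mean() + (pred_edges - x0_edges).pow(2).mean()
    iou_reg = (1.0 - box_iou(pred_nodes[..., :4], x0_nodes[..., :4])).mean()
    return recon + 0.1 * iou_reg  # 0.1 is a placeholder weight


def discretize(final_nodes, final_edges):
    """Map the final continuous sample back to a discrete scene graph."""
    boxes = final_nodes[..., :4].clamp(0.0, 1.0)      # continuous bounding boxes
    obj_labels = final_nodes[..., 4:].argmax(dim=-1)  # discrete object categories
    rel_labels = final_edges.argmax(dim=-1)           # discrete relations / adjacency
    return boxes, obj_labels, rel_labels
```

At sampling time, the same denoiser would be applied iteratively starting from pure noise, with discretize invoked only on the final continuous sample to produce the clean scene graph.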