Recent advances in conditional generative image models have enabled
impressive results. On the one hand, text-based conditional models have
achieved remarkable generation quality by leveraging large-scale datasets of
image-text pairs. To enable fine-grained controllability, however, text-based
models require long prompts, whose details may be ignored by the model. On the
other hand, layout-based conditional models have also witnessed significant
advances. These models rely on bounding boxes or segmentation maps for precise
spatial conditioning in combination with coarse semantic labels. The semantic
labels, however, cannot be used to express detailed appearance characteristics.
In this paper, we approach fine-grained scene controllability through image
collages, which allow a rich visual description of the desired scene as well as
the appearance and location of the objects therein, without the need for class
or attribute labels. We introduce "mixing and matching scenes" (M&Ms), an
approach that consists of an adversarially trained generative image model which
is conditioned on appearance features and spatial positions of the different
elements in a collage, and integrates these into a coherent image. We train our
model on the OpenImages (OI) dataset and evaluate it on collages derived from
the OI and MS-COCO datasets. Our experiments on the OI dataset show that M&Ms
outperforms baselines in terms of fine-grained scene controllability while
being very competitive in terms of image quality and sample diversity. On the
MS-COCO dataset, we highlight the generalization ability of our model by
outperforming DALL-E in terms of the zero-shot FID metric, despite using two
orders of magnitude fewer parameters and data. Collage-based generative models
have the potential to advance content creation in an efficient and effective way,
as they are intuitive to use and yield high-quality generations.