Controllable Image Generation via Collage Representations
Recent advances in conditional generative image models have enabled
impressive results. On the one hand, text-based conditional models have
achieved remarkable generation quality by leveraging large-scale datasets of
image-text pairs. To enable fine-grained controllability, however, text-based
models require long prompts, whose details may be ignored by the model. On the
other hand, layout-based conditional models have also witnessed significant
advances. These models rely on bounding boxes or segmentation maps for precise
spatial conditioning in combination with coarse semantic labels. The semantic
labels, however, cannot be used to express detailed appearance characteristics.
In this paper, we approach fine-grained scene controllability through image
collages, which allow a rich visual description of the desired scene as well as
the appearance and location of the objects therein, without the need for class
or attribute labels. We introduce "mixing and matching scenes" (M&Ms), an
approach that consists of an adversarially trained generative image model which
is conditioned on appearance features and spatial positions of the different
elements in a collage, and integrates these into a coherent image. We train our
model on the OpenImages (OI) dataset and evaluate it on collages derived from
OI and MS-COCO datasets. Our experiments on the OI dataset show that M&Ms
outperforms baselines in terms of fine-grained scene controllability while
being very competitive in terms of image quality and sample diversity. On the
MS-COCO dataset, we highlight the generalization ability of our model by
outperforming DALL-E in terms of the zero-shot FID metric, despite using two
orders of magnitude fewer parameters and training data. Collage-based generative
models have the potential to advance content creation efficiently and
effectively, as they are intuitive to use and yield high-quality generations.
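To make the collage-conditioning idea concrete, the sketch below shows one plausible way to turn a set of collage elements (appearance embeddings plus bounding boxes) into a spatial conditioning map for a generator. All names, shapes, and the painting scheme are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

def build_collage_condition(appearance, boxes, size=64):
    """appearance: (N, D) element embeddings; boxes: (N, 4) in [0, 1] as (x0, y0, x1, y1).
    Paints each element's embedding into its box region of a spatial conditioning map."""
    n, d = appearance.shape
    cond = torch.zeros(d, size, size)
    for emb, (x0, y0, x1, y1) in zip(appearance, boxes):
        xs, xe = int(x0 * size), max(int(x1 * size), int(x0 * size) + 1)
        ys, ye = int(y0 * size), max(int(y1 * size), int(y0 * size) + 1)
        cond[:, ys:ye, xs:xe] = emb.view(d, 1, 1)  # broadcast embedding over the region
    return cond

class CollageGenerator(nn.Module):
    """Toy generator that fuses the collage condition with a noise code (hypothetical)."""
    def __init__(self, d=256, z_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(d + z_dim, 128, 3, padding=1), nn.ReLU(),
            nn.Conv2d(128, 3, 3, padding=1), nn.Tanh(),
        )

    def forward(self, cond, z):
        h, w = cond.shape[-2:]
        z_map = z.view(1, -1, 1, 1).expand(1, -1, h, w)   # tile noise spatially
        return self.net(torch.cat([cond.unsqueeze(0), z_map], dim=1))
```

In an adversarial setup such as the one the abstract describes, a discriminator would additionally judge whether the generated image respects both the appearance and the placement of the collage elements.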
Enhancing Person Synthesis in Complex Scenes via Intrinsic and Contextual Structure Modeling
The Generative Adversarial Network (GAN) and its variations have enabled high-quality image generation. However, generating plausible persons in complex scenes (such as MS-COCO images) remains challenging. We propose a novel structure-based and context-aware approach to enhance person synthesis in complex scenes. The method can successfully predict the person pose and face structures while respecting the weak layout-based context, and then leverages these structures to refine the person appearance. Our method involves three parts. First, a memory-based model is used to encode intrinsic person structures, including pose and face keypoints. Second, a context-aware model infers the conditional person structures from the layout context. Third, structure-guided person appearance refiners further enhance the final image generation. Our experiments present convincing person generation results in layout-to-image tasks on a challenging dataset. Person-related evaluations demonstrate that our method achieves state-of-the-art performance, especially on person accuracy and face detection metrics.
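The three-part pipeline described above can be pictured roughly as follows: retrieve an intrinsic keypoint structure from a learned memory, adapt it to the layout context, and then refine appearance guided by the predicted structure. This is a minimal sketch under assumed module names and shapes; the keypoint-to-heatmap rendering function is hypothetical and left to the caller.

```python
import torch
import torch.nn as nn

class PersonSynthesisPipeline(nn.Module):
    def __init__(self, n_keypoints=17, mem_size=512, feat=128):
        super().__init__()
        # 1) memory bank of prototypical pose/face keypoint configurations
        self.memory = nn.Parameter(torch.randn(mem_size, n_keypoints * 2))
        self.query = nn.Linear(feat, mem_size)
        # 2) context-aware structure predictor conditioned on the person's layout box
        self.context = nn.Sequential(nn.Linear(n_keypoints * 2 + 4, feat),
                                     nn.ReLU(), nn.Linear(feat, n_keypoints * 2))
        # 3) structure-guided appearance refiner over an initial coarse image
        self.refiner = nn.Sequential(nn.Conv2d(3 + n_keypoints, 64, 3, padding=1),
                                     nn.ReLU(), nn.Conv2d(64, 3, 3, padding=1))

    def forward(self, layout_feat, person_box, coarse_img, heatmap_fn):
        attn = torch.softmax(self.query(layout_feat), dim=-1)          # (B, mem)
        intrinsic = attn @ self.memory                                  # (B, K*2) retrieved structure
        structure = self.context(torch.cat([intrinsic, person_box], -1))
        heatmaps = heatmap_fn(structure, coarse_img.shape[-2:])         # (B, K, H, W), user-supplied renderer
        return self.refiner(torch.cat([coarse_img, heatmaps], dim=1))   # structure-guided refinement
```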
Auto-regressive Image Synthesis with Integrated Quantization
Deep generative models have achieved conspicuous progress in realistic image
synthesis with multifarious conditional inputs, while generating diverse yet
high-fidelity images remains a grand challenge in conditional image generation.
This paper presents a versatile framework for conditional image generation
which incorporates the inductive bias of CNNs and powerful sequence modeling of
auto-regression that naturally leads to diverse image generation. Instead of
independently quantizing the features of multiple domains as in prior research,
we design an integrated quantization scheme with a variational regularizer that
mingles the feature discretization in multiple domains, and markedly boosts the
auto-regressive modeling performance. Notably, the variational regularizer
enables regularization of feature distributions in otherwise incomparable latent
spaces by penalizing the intra-domain variations of the distributions. In
addition, we design a Gumbel sampling strategy that incorporates distribution uncertainty
into the auto-regressive training procedure. The Gumbel sampling substantially
mitigates the exposure bias that often causes misalignment between the training
and inference stages and severely impairs inference performance. Extensive
experiments over multiple conditional image generation tasks show that our
method achieves superior diverse image generation performance qualitatively and
quantitatively as compared with the state of the art.
Comment: Accepted to ECCV 2022 as an oral presentation.
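As a rough illustration of the Gumbel sampling idea mentioned above, the sketch below draws discrete codes stochastically from the model's logits during training and occasionally substitutes them for the ground-truth codes, so the auto-regressive model also conditions on its own predictions. The temperature, mixing probability, and function names are assumptions, not the paper's exact procedure.

```python
import torch
import torch.nn.functional as F

def gumbel_sample_codes(logits, codebook, tau=1.0):
    """logits: (B, T, K) per-position code scores; codebook: (K, D) code embeddings.
    Draws a (differentiable) hard sample per position and returns its embedding."""
    soft_onehot = F.gumbel_softmax(logits, tau=tau, hard=True)  # (B, T, K)
    return soft_onehot @ codebook                               # (B, T, D)

def next_step_inputs(gt_codes, sampled_codes, mix_prob=0.25):
    """Randomly replace some ground-truth code embeddings with sampled ones,
    exposing the model to its own outputs and mitigating exposure bias."""
    mask = torch.rand(gt_codes.shape[:2], device=gt_codes.device) < mix_prob
    return torch.where(mask.unsqueeze(-1), sampled_codes, gt_codes)
```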
DiffBlender: Scalable and Composable Multimodal Text-to-Image Diffusion Models
The recent progress in diffusion-based text-to-image generation models has
significantly expanded generative capabilities via conditioning the text
descriptions. However, since relying solely on text prompts is still
restrictive for fine-grained customization, we aim to extend the boundaries of
conditional generation to incorporate diverse types of modalities, e.g.,
sketch, box, and style embedding, simultaneously. We thus design a multimodal
text-to-image diffusion model, coined as DiffBlender, that achieves the
aforementioned goal in a single model by training only a few small
hypernetworks. DiffBlender facilitates a convenient scaling of input
modalities, without altering the parameters of an existing large-scale
generative model to retain its well-established knowledge. Furthermore, our
study sets new standards for multimodal generation by conducting quantitative
and qualitative comparisons with existing approaches. By diversifying the
channels of conditioning modalities, DiffBlender faithfully reflects the
provided information or, in its absence, produces imaginative generations.
Comment: 18 pages, 16 figures, and 3 tables.
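The adapter idea described in the abstract, keeping a large pretrained backbone frozen while training only small per-modality modules, might look like the following sketch. The module names, shapes, and residual fusion scheme are assumptions chosen for clarity, not DiffBlender's actual architecture.

```python
import torch
import torch.nn as nn

class ModalityAdapter(nn.Module):
    """Tiny trainable network mapping one conditioning modality (e.g., sketch,
    box, or style embedding) to a feature residual."""
    def __init__(self, cond_dim, feat_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(cond_dim, hidden), nn.SiLU(),
                                 nn.Linear(hidden, feat_dim))

    def forward(self, cond):
        return self.net(cond)

def attach_adapters(frozen_backbone, adapters):
    """Freeze the pretrained backbone; only the adapter parameters stay trainable."""
    for p in frozen_backbone.parameters():
        p.requires_grad_(False)
    return nn.ModuleList(adapters)

def conditioned_features(backbone_feat, conds, adapters):
    """Add each available modality's residual to the backbone features; absent
    modalities are simply skipped, leaving the frozen model's behavior intact."""
    out = backbone_feat
    for cond, adapter in zip(conds, adapters):
        out = out + adapter(cond)
    return out
```

This kind of design keeps the pretrained model's knowledge untouched and makes adding a new conditioning modality a matter of training one more small adapter.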