Controllable Image Generation via Collage Representations
Recent advances in conditional generative image models have enabled
impressive results. On the one hand, text-based conditional models have
achieved remarkable generation quality, by leveraging large-scale datasets of
image-text pairs. To enable fine-grained controllability, however, text-based
models require long prompts, whose details may be ignored by the model. On the
other hand, layout-based conditional models have also witnessed significant
advances. These models rely on bounding boxes or segmentation maps for precise
spatial conditioning in combination with coarse semantic labels. The semantic
labels, however, cannot be used to express detailed appearance characteristics.
In this paper, we approach fine-grained scene controllability through image
collages, which allow a rich visual description of the desired scene as well as
the appearance and location of the objects therein, without the need for
class or attribute labels. We introduce "mixing and matching scenes" (M&Ms), an
approach that consists of an adversarially trained generative image model which
is conditioned on appearance features and spatial positions of the different
elements in a collage, and integrates these into a coherent image. We train our
model on the OpenImages (OI) dataset and evaluate it on collages derived from
OI and MS-COCO datasets. Our experiments on the OI dataset show that M&Ms
outperforms baselines in terms of fine-grained scene controllability while
being very competitive in terms of image quality and sample diversity. On the
MS-COCO dataset, we highlight the generalization ability of our model by
outperforming DALL-E in terms of the zero-shot FID metric, despite using two
orders of magnitude fewer parameters and less data. Collage-based generative
models have the potential to advance content creation efficiently and
effectively, as they are intuitive to use and yield high-quality generations.
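To make the collage conditioning concrete, here is a minimal, hypothetical
sketch of how collage elements could be encoded as appearance-plus-position
conditioning tokens. The encoder architecture, tensor shapes, and the
CollageConditioner name are illustrative assumptions, not the authors' actual
model.

```python
# Hypothetical sketch of collage-based conditioning in the spirit of M&Ms.
# All shapes and layer choices are assumptions for illustration.
import torch
import torch.nn as nn

class CollageConditioner(nn.Module):
    """Encodes each collage element as one conditioning token that combines
    its appearance feature with its spatial position."""
    def __init__(self, feat_dim=256):
        super().__init__()
        # Appearance encoder for 64x64 RGB crops (assumed crop size).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, feat_dim),
        )
        # Projects normalized (x, y, w, h) box coordinates.
        self.pos_proj = nn.Linear(4, feat_dim)

    def forward(self, crops, boxes):
        # crops: (num_elements, 3, 64, 64); boxes: (num_elements, 4) in [0, 1]
        return self.encoder(crops) + self.pos_proj(boxes)

cond = CollageConditioner()
tokens = cond(torch.randn(5, 3, 64, 64), torch.rand(5, 4))
print(tokens.shape)  # torch.Size([5, 256])
```

A generator such as the adversarially trained model described above would then
consume these tokens to place and blend the collage elements into a coherent
image.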
Controllable Text-to-Image Generation with GPT-4
Current text-to-image generation models often struggle to follow textual
instructions, especially those requiring spatial reasoning. On the other
hand, Large Language Models (LLMs), such as GPT-4, have shown remarkable
precision in generating code snippets for sketching out text inputs
graphically, e.g., via TikZ. In this work, we introduce Control-GPT to guide
the diffusion-based text-to-image pipelines with programmatic sketches
generated by GPT-4, enhancing their abilities for instruction following.
Control-GPT works by querying GPT-4 to write TikZ code, and the generated
sketches are used as references alongside the text instructions for diffusion
models (e.g., ControlNet) to generate photo-realistic images. One major
challenge to training our pipeline is the lack of a dataset containing aligned
text, images, and sketches. We address the issue by converting instance masks
in existing datasets into polygons to mimic the sketches used at test time. As
a result, Control-GPT greatly boosts the controllability of image generation.
It establishes a new state of the art in spatial arrangement and object
positioning and enhances users' control over object positions, sizes, etc.,
nearly doubling the accuracy of prior models. As a first attempt, our work
shows the potential of employing LLMs to enhance performance on computer
vision tasks.
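The described pipeline can be outlined in a few calls: query GPT-4 for TikZ
code, render it to an image, and pass the rendered sketch to a
ControlNet-conditioned diffusion model. The sketch below is a rough
approximation under stated assumptions (the prompt wording, the scribble
ControlNet checkpoint, and the render_tikz helper are illustrative, and the
paper's trained pipeline differs); it needs an OpenAI API key plus pdflatex
and poppler for rendering.

```python
# Illustrative Control-GPT-style pipeline, not the authors' exact code.
import pathlib
import subprocess
import tempfile

from openai import OpenAI
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
from diffusers.utils import load_image

def tikz_from_gpt4(prompt: str) -> str:
    """Ask GPT-4 for a standalone TikZ document sketching the prompt."""
    client = OpenAI()  # requires the OPENAI_API_KEY environment variable
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user",
                   "content": "Write a standalone TikZ document that sketches: "
                              f"{prompt}. Reply with LaTeX code only."}],
    )
    return resp.choices[0].message.content

def render_tikz(tikz_code: str) -> str:
    """Compile TikZ to a PNG via pdflatex and pdftoppm; returns the PNG path."""
    tmp = pathlib.Path(tempfile.mkdtemp())
    (tmp / "sketch.tex").write_text(tikz_code)
    subprocess.run(["pdflatex", "-output-directory", str(tmp), "sketch.tex"],
                   check=True)
    subprocess.run(["pdftoppm", "-png", "-r", "128",
                    str(tmp / "sketch.pdf"), str(tmp / "sketch")], check=True)
    return str(tmp / "sketch-1.png")

prompt = "a red ball to the left of a blue cube on a table"
sketch_path = render_tikz(tikz_from_gpt4(prompt))

# Use the rendered sketch as the ControlNet conditioning image.
controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-scribble")
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet)
image = pipe(prompt, image=load_image(sketch_path)).images[0]
image.save("output.png")
```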
ALR-GAN: Adaptive Layout Refinement for Text-to-Image Synthesis
We propose a novel Text-to-Image Generation Network, Adaptive Layout
Refinement Generative Adversarial Network (ALR-GAN), to adaptively refine the
layout of synthesized images without any auxiliary information. The ALR-GAN
includes an Adaptive Layout Refinement (ALR) module and a Layout Visual
Refinement (LVR) loss. The ALR module aligns the layout structure (which refers
to locations of objects and background) of a synthesized image with that of its
corresponding real image. Within the ALR module, we propose an Adaptive Layout
Refinement (ALR) loss that balances the matching of hard and easy features for
more efficient layout-structure matching. Based on the refined layout
structure, the LVR loss further refines the visual representation within the
layout area. Experimental results on two widely-used datasets show that ALR-GAN
performs competitively on the text-to-image generation task.
Comment: Accepted by TM
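As an illustration of balancing hard and easy features during layout matching,
here is a minimal, hypothetical loss in the spirit of the ALR loss; the
softmax reweighting of per-location errors is an assumption, not the paper's
exact formulation.

```python
# Hedged sketch: reweight per-location feature-matching errors so that hard
# (large-error) locations contribute more to the layout-matching loss.
import torch

def adaptive_layout_loss(fake_feats, real_feats, temperature=1.0):
    # fake_feats, real_feats: (B, C, H, W) layout feature maps.
    err = (fake_feats - real_feats).pow(2).mean(dim=1)            # (B, H, W)
    # Weights are detached so they only modulate, not receive, gradients.
    weights = torch.softmax(err.detach().flatten(1) / temperature, dim=1)
    return (weights * err.flatten(1)).sum(dim=1).mean()

loss = adaptive_layout_loss(torch.randn(2, 64, 16, 16),
                            torch.randn(2, 64, 16, 16))
```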
DiffBlender: Scalable and Composable Multimodal Text-to-Image Diffusion Models
The recent progress in diffusion-based text-to-image generation models has
significantly expanded generative capabilities via conditioning on text
descriptions. However, since relying solely on text prompts is still
restrictive for fine-grained customization, we aim to extend the boundaries of
conditional generation to incorporate diverse types of modalities, e.g.,
sketch, box, and style embedding, simultaneously. We thus design a multimodal
text-to-image diffusion model, coined as DiffBlender, that achieves the
aforementioned goal in a single model by training only a few small
hypernetworks. DiffBlender facilitates convenient scaling of input modalities
without altering the parameters of the existing large-scale generative model,
thereby retaining its well-established knowledge. Furthermore, our
study sets new standards for multimodal generation by conducting quantitative
and qualitative comparisons with existing approaches. By diversifying the
channels of conditioning modalities, DiffBlender faithfully reflects the
provided information or, in its absence, produces imaginative generations.
Comment: 18 pages, 16 figures, and 3 tables
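The core idea of training only a few small hypernetworks around a frozen
backbone might be sketched as follows; the adapter design, the injection
point, and the ModalityAdapter name are illustrative assumptions rather than
DiffBlender's actual modules.

```python
# Hypothetical sketch: small trainable adapters on top of a frozen backbone.
import torch
import torch.nn as nn

class ModalityAdapter(nn.Module):
    """Maps one conditioning modality (e.g., a box or style embedding) to a
    residual added onto a frozen backbone's hidden states."""
    def __init__(self, cond_dim, hidden_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(cond_dim, hidden_dim), nn.SiLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )

    def forward(self, hidden, cond):
        return hidden + self.net(cond)

backbone = nn.Linear(64, 64)          # stand-in for a frozen pretrained block
for p in backbone.parameters():
    p.requires_grad_(False)           # keep the pretrained knowledge intact

adapters = nn.ModuleDict({
    "sketch": ModalityAdapter(32, 64),
    "box": ModalityAdapter(4, 64),
})  # only these small modules are trained

h = backbone(torch.randn(2, 64))
h = adapters["box"](h, torch.rand(2, 4))
```

Because the backbone stays frozen, supporting an additional conditioning
modality amounts to registering one more small adapter.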
Auto-regressive Image Synthesis with Integrated Quantization
Deep generative models have achieved conspicuous progress in realistic image
synthesis with multifarious conditional inputs, while generating diverse yet
high-fidelity images remains a grand challenge in conditional image generation.
This paper presents a versatile framework for conditional image generation
which incorporates the inductive bias of CNNs and powerful sequence modeling of
auto-regression that naturally leads to diverse image generation. Instead of
independently quantizing the features of multiple domains as in prior research,
we design an integrated quantization scheme with a variational regularizer that
mingles the feature discretization in multiple domains, and markedly boosts the
auto-regressive modeling performance. Notably, the variational regularizer
makes it possible to regularize feature distributions in incomparable latent
spaces by penalizing the intra-domain variations of the distributions. In
addition, we design a Gumbel sampling strategy that incorporates distribution
uncertainty into the auto-regressive training procedure. The Gumbel sampling
substantially mitigates the exposure bias that often causes misalignment
between the training
and inference stages and severely impairs the inference performance. Extensive
experiments over multiple conditional image generation tasks show that our
method achieves superior diverse image generation performance, both
qualitatively and quantitatively, compared with the state of the art.
Comment: Accepted to ECCV 2022 as Oral Presentation
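For intuition, here is a minimal sketch of Gumbel sampling over codebook
assignments; mapping negative distances to logits and the temperature choice
are assumptions, and the paper's integrated quantization scheme additionally
spans multiple domains with a variational regularizer.

```python
# Hedged sketch: stochastic, differentiable codebook assignment via
# Gumbel-softmax, injecting distribution uncertainty into training.
import torch
import torch.nn.functional as F

def gumbel_quantize(features, codebook, tau=1.0):
    # features: (N, D); codebook: (K, D)
    logits = -torch.cdist(features, codebook)   # nearer codes get higher logits
    one_hot = F.gumbel_softmax(logits, tau=tau, hard=True)  # straight-through
    return one_hot @ codebook, one_hot.argmax(dim=1)

quantized, indices = gumbel_quantize(torch.randn(8, 16), torch.randn(512, 16))
```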
Diagnostic Benchmark and Iterative Inpainting for Layout-Guided Image Generation
Spatial control is a core capability in controllable image generation.
Advancements in layout-guided image generation have shown promising results on
in-distribution (ID) datasets with similar spatial configurations. However, it
is unclear how these models perform when facing out-of-distribution (OOD)
samples with arbitrary, unseen layouts. In this paper, we propose LayoutBench,
a diagnostic benchmark for layout-guided image generation that examines four
categories of spatial control skills: number, position, size, and shape. We
benchmark two recent representative layout-guided image generation methods and
observe that good ID layout control may not generalize well to arbitrary
layouts in the wild (e.g., objects at the boundary). Next, we propose
IterInpaint, a new baseline that generates foreground and background regions in
a step-by-step manner via inpainting, demonstrating stronger generalizability
than existing models on OOD layouts in LayoutBench. We perform quantitative and
qualitative evaluation and fine-grained analysis on the four LayoutBench skills
to pinpoint the weaknesses of existing models. Lastly, we show comprehensive
ablation studies on IterInpaint, including training task ratio, crop&paste vs.
repaint, and generation order. Project website: https://layoutbench.github.io
Comment: 22 pages
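The step-by-step generation idea can be illustrated with a generic inpainting
pipeline: given a layout of (label, box) pairs, each region is inpainted in
turn. This is a control-flow sketch only; the actual IterInpaint model,
training recipe, and checkpoints are described on the project website.

```python
# Illustrative iterative inpainting loop, not the IterInpaint model itself.
from PIL import Image, ImageDraw
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting")

layout = [("a red sports car", (50, 200, 300, 400)),
          ("a traffic light", (350, 40, 420, 220))]   # (label, box) pairs

image = Image.new("RGB", (512, 512), "gray")          # blank starting canvas
for label, box in layout:
    mask = Image.new("L", image.size, 0)
    ImageDraw.Draw(mask).rectangle(box, fill=255)     # inpaint only this box
    image = pipe(prompt=label, image=image, mask_image=mask).images[0]
# A final pass could inpaint everything outside the union of boxes to
# synthesize the background.
image.save("scene.png")
```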
Controlling Style and Semantics in Weakly-Supervised Image Generation
We propose a weakly-supervised approach for conditional image generation of
complex scenes where a user has fine control over objects appearing in the
scene. We exploit sparse semantic maps to control object shapes and classes, as
well as textual descriptions or attributes to control both local and global
style. In order to condition our model on textual descriptions, we introduce a
semantic attention module whose computational cost is independent of the image
resolution. To further augment the controllability of the scene, we propose a
two-step generation scheme that decomposes background and foreground. The label
maps used to train our model are produced by a large-vocabulary object
detector, which enables access to unlabeled data and provides structured
instance information. In such a setting, we report better FID scores compared
to fully-supervised settings where the model is trained on ground-truth
semantic maps. We also showcase the ability of our model to manipulate a scene
on complex datasets such as COCO and Visual Genome.
Comment: European Conference on Computer Vision (ECCV) 2020, Spotlight. Code
at https://github.com/dariopavllo/style-semantic
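To see why such a semantic attention module can cost the same at any image
resolution, consider attention in which text tokens interact with a fixed
number of per-object embeddings rather than with pixels; the module below is
a hedged sketch with assumed shapes, not the paper's implementation.

```python
# Hypothetical sketch: attention over per-object embeddings, so the cost
# scales with the number of objects and words, not with H x W.
import torch
import torch.nn as nn

class SemanticAttention(nn.Module):
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, object_feats, text_tokens):
        # object_feats: (B, num_objects, dim); text_tokens: (B, num_words, dim)
        out, _ = self.attn(object_feats, text_tokens, text_tokens)
        return object_feats + out

attn = SemanticAttention(128)
styled = attn(torch.randn(2, 8, 128), torch.randn(2, 12, 128))
```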