Controlling Style and Semantics in Weakly-Supervised Image Generation
We propose a weakly-supervised approach for conditional image generation of
complex scenes where a user has fine control over objects appearing in the
scene. We exploit sparse semantic maps to control object shapes and classes, as
well as textual descriptions or attributes to control both local and global
style. In order to condition our model on textual descriptions, we introduce a
semantic attention module whose computational cost is independent of the image
resolution. To further augment the controllability of the scene, we propose a
two-step generation scheme that decomposes background and foreground. The label
maps used to train our model are produced by a large-vocabulary object
detector, which enables access to unlabeled data and provides structured
instance information. In such a setting, we report better FID scores compared
to fully-supervised settings where the model is trained on ground-truth
semantic maps. We also showcase the ability of our model to manipulate a scene
on complex datasets such as COCO and Visual Genome.
Comment: European Conference on Computer Vision (ECCV) 2020, Spotlight. Code at https://github.com/dariopavllo/style-semantic
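The abstract describes the semantic attention module only at a high level. As a rough illustration of how attention cost can be made independent of image resolution (a hypothetical sketch, not the paper's actual architecture), the module below attends from a fixed set of per-region embeddings to the text tokens, then broadcasts each region's style vector to its pixels through the label map, so the attention itself never scales with the pixel grid:

```python
import torch
from torch import nn

class SemanticAttention(nn.Module):
    """Illustrative attention block whose cost depends on the number of
    semantic regions R and text tokens T, not on image resolution."""

    def __init__(self, region_dim: int, text_dim: int, dim: int = 256):
        super().__init__()
        self.q = nn.Linear(region_dim, dim)
        self.k = nn.Linear(text_dim, dim)
        self.v = nn.Linear(text_dim, dim)
        self.scale = dim ** -0.5

    def forward(self, region_emb, text_emb, label_map):
        # region_emb: (B, R, region_dim), one embedding per object region
        # text_emb:   (B, T, text_dim), token embeddings of the description
        # label_map:  (B, H, W) long, per-pixel region index in [0, R)
        q = self.q(region_emb)                       # (B, R, dim)
        k = self.k(text_emb)                         # (B, T, dim)
        v = self.v(text_emb)                         # (B, T, dim)
        attn = (q @ k.transpose(1, 2)) * self.scale  # (B, R, T): no H*W term
        style = attn.softmax(dim=-1) @ v             # per-region style vectors
        # Scatter each region's style to its pixels (a gather, not attention).
        b, h, w = label_map.shape
        idx = label_map.view(b, h * w, 1).expand(-1, -1, style.size(-1))
        pixels = torch.gather(style, 1, idx).view(b, h, w, -1)
        return pixels.permute(0, 3, 1, 2)            # (B, dim, H, W)
```

Only the final gather touches the full pixel grid, and it is a constant-time lookup per pixel; the quadratic attention term stays at O(R * T) regardless of resolution.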
Visual Programming for Text-to-Image Generation and Evaluation
As large language models have demonstrated impressive performance in many
domains, recent works have adopted language models (LMs) as controllers of
visual modules for vision-and-language tasks. While existing work focuses on
equipping LMs with visual understanding, we propose two novel
interpretable/explainable visual programming frameworks for text-to-image (T2I)
generation and evaluation. First, we introduce VPGen, an interpretable
step-by-step T2I generation framework that decomposes T2I generation into three
steps: object/count generation, layout generation, and image generation. We
employ an LM to handle the first two steps (object/count generation and layout
generation), by finetuning it on text-layout pairs. Our step-by-step T2I
generation framework provides stronger spatial control than end-to-end models,
the dominant approach for this task. Furthermore, we leverage the world
knowledge of pretrained LMs, overcoming the limitation of previous
layout-guided T2I works that can only handle predefined object classes. We
demonstrate that our VPGen achieves better control over object counts, spatial
relations, and scales than state-of-the-art T2I generation models.
Second, we introduce VPEval, an interpretable and explainable evaluation
framework for T2I generation based on visual programming. Unlike previous T2I
evaluations with a single scoring model that is accurate in some skills but
unreliable in others, VPEval produces evaluation programs that invoke a set of
visual modules that are experts in different skills, and also provides
visual+textual explanations of the evaluation results. Our analysis shows
VPEval correlates better with human judgments on both skill-specific and
open-ended prompts than widely used single-model evaluations. We hope our
work encourages future progress on interpretable/explainable generation and
evaluation for T2I models. Website: https://vp-t2i.github.io
Comment: 18 pages; Project website: https://vp-t2i.github.io
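VPGen's three-step decomposition is straightforward to picture as a pipeline. The sketch below is a hypothetical reconstruction, not the released interface: `lm`, `layout_to_image`, the prompt templates, and the box text format are all stand-ins for whatever the finetuned LM and layout-conditioned generator actually expect:

```python
from dataclasses import dataclass

@dataclass
class Box:
    name: str
    x0: float
    y0: float
    x1: float
    y1: float  # normalized [0, 1] image coordinates

def parse_boxes(text: str) -> list[Box]:
    # Assumed LM output format: one "name x0 y0 x1 y1" per line.
    boxes = []
    for line in text.strip().splitlines():
        name, *coords = line.split()
        boxes.append(Box(name, *map(float, coords)))
    return boxes

def generate(prompt: str, lm, layout_to_image):
    """Three-step T2I in the spirit of VPGen."""
    # Step 1: object/count generation -- the LM decides what to draw.
    objects = lm(f"Objects and counts for: {prompt}")
    # Step 2: layout generation -- the LM places a box for each object.
    boxes = parse_boxes(lm(f"Boxes for '{objects}' given: {prompt}"))
    # Step 3: image generation -- a layout-conditioned model renders the scene.
    return layout_to_image(prompt, boxes)
```

Because steps 1 and 2 produce plain text, the intermediate plan is inspectable and editable, which is where the interpretability and spatial control claimed above come from.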
Training-Free Structured Diffusion Guidance for Compositional Text-to-Image Synthesis
Large-scale diffusion models have achieved state-of-the-art results on
text-to-image synthesis (T2I) tasks. Despite their ability to generate
high-quality and creative images, we observe that attribute binding and
compositional capabilities remain major challenges, especially when multiple
objects are involved. In this work, we improve the compositional skills of T2I
models, specifically aiming for more accurate attribute binding and better
image composition. To do this, we incorporate linguistic structures into the
diffusion guidance process, exploiting the controllability afforded by
manipulating cross-attention layers in diffusion-based T2I
models. We observe that keys and values in cross-attention layers have strong
semantic meanings associated with object layouts and content. Therefore, we can
better preserve the compositional semantics in the generated image by
manipulating the cross-attention representations based on linguistic insights.
Built upon Stable Diffusion, a state-of-the-art T2I model, our structured
cross-attention design is efficient and requires no additional training
samples. We achieve better compositional performance in both qualitative and
quantitative evaluations, leading to a 5-8% advantage in head-to-head user
comparison studies. Lastly, we conduct an
in-depth analysis to reveal potential causes of incorrect image compositions
and justify the properties of cross-attention layers in the generation process
- …
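As a rough sketch of the cross-attention manipulation the abstract describes: assuming noun phrases have been extracted from the prompt and encoded separately, the function below (illustrative names; the span format and the averaging step are assumptions, not the paper's confirmed details) swaps each phrase's value vectors into the attention computation and averages the resulting outputs, so each object's attributes are drawn from its own phrase encoding:

```python
import torch

def structured_cross_attention(q, k, v_full, v_phrases, spans):
    # q:         (B, P, d) pixel queries
    # k:         (B, T, d) keys from the full-prompt text encoding
    # v_full:    (B, T, d) values from the full-prompt encoding
    # v_phrases: list of (B, T, d) values, one per separately encoded phrase
    # spans:     list of (start, end) token ranges, one per phrase
    attn = torch.softmax(q @ k.transpose(1, 2) / k.size(-1) ** 0.5, dim=-1)
    outputs = [attn @ v_full]
    for v_p, (s, e) in zip(v_phrases, spans):
        v = v_full.clone()
        v[:, s:e] = v_p[:, s:e]  # use the phrase's own values for its tokens
        outputs.append(attn @ v)
    # Average over the full-prompt output and each phrase-substituted output.
    return torch.stack(outputs).mean(dim=0)
```

Keeping the attention weights fixed while only the values change is what makes this training-free: the layout (where the model attends) is untouched, while the content each token contributes is bound more tightly to its phrase.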