Geometry Aligned Variational Transformer for Image-conditioned Layout Generation
Layout generation is a novel task in computer vision that combines the
challenges of object localization and aesthetic appraisal, and is widely used
in the design of advertisements, posters, and slides. An accurate and pleasant
layout
should consider both the intra-domain relationship within layout elements and
the inter-domain relationship between layout elements and the image. However,
most previous methods simply focus on image-content-agnostic layout generation,
without leveraging the complex visual information from the image. To this end,
we explore a novel paradigm entitled image-conditioned layout generation, which
aims to add text overlays to an image in a semantically coherent manner.
Specifically, we propose an Image-Conditioned Variational Transformer (ICVT)
that autoregressively generates various layouts in an image. First, a
self-attention mechanism is adopted to model the contextual relationship within
layout elements, while a cross-attention mechanism is used to fuse the visual
information of conditional images. Subsequently, we take these attention blocks
as the building blocks of a conditional variational autoencoder (CVAE), which
demonstrates appealing diversity. Second, to alleviate the gap between the
layout element domain and the visual domain, we design a Geometry Alignment
module, in which the geometric information of the image is aligned with the
layout representation. In addition, we construct a large-scale advertisement
poster layout design dataset with fine-grained layout and saliency map
annotations.
Experimental results show that our model can adaptively generate layouts in the
non-intrusive area of the image, resulting in a harmonious layout design.
Comment: To be published in ACM MM 202
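The following is a minimal sketch, not the authors' released code, of the attention pattern the abstract describes: a self-attention layer models the intra-domain relationship among layout element tokens, and a cross-attention layer fuses inter-domain information from the conditional image; module names and dimensions are illustrative assumptions.

import torch
import torch.nn as nn

class LayoutDecoderBlock(nn.Module):
    """Self-attention over layout tokens + cross-attention to image features."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2, self.norm3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, layout_tokens, image_tokens):
        # intra-domain relationship: layout element tokens attend to each other
        h = self.norm1(layout_tokens)
        layout_tokens = layout_tokens + self.self_attn(h, h, h)[0]
        # inter-domain relationship: layout tokens attend to conditional image features
        layout_tokens = layout_tokens + self.cross_attn(self.norm2(layout_tokens),
                                                        image_tokens, image_tokens)[0]
        return layout_tokens + self.ff(self.norm3(layout_tokens))

# e.g. 10 layout element tokens conditioned on 196 image patch tokens
block = LayoutDecoderBlock()
out = block(torch.randn(2, 10, 256), torch.randn(2, 196, 256))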
Street-View Image Generation from a Bird's-Eye View Layout
Bird's-Eye View (BEV) Perception has received increasing attention in recent
years as it provides a concise and unified spatial representation across views
and benefits a diverse set of downstream driving applications. While the focus
has been placed on discriminative tasks such as BEV segmentation, the dual
generative task of creating street-view images from a BEV layout has rarely
been explored. The ability to generate realistic street-view images that align
with a given HD map and traffic layout is critical for visualizing complex
traffic scenarios and developing robust perception models for autonomous
driving. In this paper, we propose BEVGen, a conditional generative model that
synthesizes a set of realistic and spatially consistent surrounding images that
match the BEV layout of a traffic scenario. BEVGen incorporates a novel
cross-view transformation and spatial attention design that learns the
relationship between cameras and map views to ensure their consistency. Our
model can accurately render road and lane lines, as well as generate traffic
scenes under different weather conditions and times of day. The code will be
made publicly available.
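As a rough illustration only (an assumption-laden sketch, not BEVGen's actual architecture), cross-view conditioning of this kind can be written as camera-view tokens querying a shared set of BEV layout tokens, so that every generated surrounding view is conditioned on the same map.

import torch
import torch.nn as nn

class CrossViewAttention(nn.Module):
    """Every camera token queries the shared BEV layout representation."""
    def __init__(self, dim=256, heads=8, num_cams=6):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # hypothetical learned embedding distinguishing the surrounding cameras
        self.cam_embed = nn.Embedding(num_cams, dim)

    def forward(self, cam_tokens, bev_tokens):
        # cam_tokens: (batch, num_cams, tokens_per_view, dim)
        b, n, t, d = cam_tokens.shape
        cam_tokens = cam_tokens + self.cam_embed.weight[None, :, None, :]
        q = cam_tokens.reshape(b, n * t, d)
        out, _ = self.attn(q, bev_tokens, bev_tokens)
        return out.reshape(b, n, t, d)

views = torch.randn(2, 6, 64, 256)   # 6 surrounding views, 64 tokens each
bev = torch.randn(2, 400, 256)       # rasterized BEV layout tokens
fused = CrossViewAttention()(views, bev)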
Unifying Vision, Text, and Layout for Universal Document Processing
We propose Universal Document Processing (UDOP), a foundation Document AI
model which unifies text, image, and layout modalities together with varied
task formats, including document understanding and generation. UDOP leverages
the spatial correlation between textual content and the document image to model
image, text, and layout modalities with one uniform representation. With a
novel Vision-Text-Layout Transformer, UDOP unifies pretraining and multi-domain
downstream tasks into a prompt-based sequence generation scheme. UDOP is
pretrained on large-scale unlabeled document corpora with innovative
self-supervised objectives, as well as on diverse labeled data. UDOP also learns to
generate document images from text and layout modalities via masked image
reconstruction. To the best of our knowledge, this is the first time in the
field of document AI that one model simultaneously achieves high-quality neural
document editing and content customization. Our method sets the
state-of-the-art on 8 Document AI tasks, e.g., document understanding and QA,
across diverse data domains like finance reports, academic papers, and
websites. UDOP ranks first on the leaderboard of the Document Understanding
Benchmark.
Comment: CVPR 202
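A minimal sketch of the prompt-based sequence-generation idea, using hypothetical token names (e.g. <loc_...>) rather than UDOP's actual vocabulary: OCR words are interleaved with discretized bounding-box (layout) tokens behind a task prompt, so that understanding and generation tasks share one sequence format.

def quantize_box(box, image_w, image_h, bins=500):
    """Map a pixel-space box (x0, y0, x1, y1) to discrete layout tokens."""
    x0, y0, x1, y1 = box
    scale = lambda v, s: min(bins - 1, int(v / s * bins))
    return [f"<loc_{scale(x0, image_w)}>", f"<loc_{scale(y0, image_h)}>",
            f"<loc_{scale(x1, image_w)}>", f"<loc_{scale(y1, image_h)}>"]

def build_sequence(task_prompt, words, boxes, image_w, image_h):
    """Interleave each OCR word with its layout tokens behind a task prompt."""
    tokens = [task_prompt]
    for word, box in zip(words, boxes):
        tokens.append(word)
        tokens.extend(quantize_box(box, image_w, image_h))
    return " ".join(tokens)

seq = build_sequence("question answering. What is the total?",
                     ["Total:", "$42.00"],
                     [(50, 700, 120, 720), (130, 700, 210, 720)],
                     image_w=800, image_h=1000)
print(seq)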
ProSpect: Expanded Conditioning for the Personalization of Attribute-aware Image Generation
Personalizing generative models offers a way to guide image generation with
user-provided references. Current personalization methods can invert an object
or concept into the textual conditioning space and compose new natural
sentences for text-to-image diffusion models. However, representing and editing
specific visual attributes like material, style, layout, etc. remains a
challenge, leading to a lack of disentanglement and editability. To address
this, we propose a novel approach that leverages the step-by-step generation
process of diffusion models, which generate images from low- to high-frequency
information, providing a new perspective on representing, generating, and
editing images. We develop Prompt Spectrum Space P*, an expanded textual
conditioning space, and a new image representation method called ProSpect.
ProSpect represents an image as a collection of inverted textual token
embeddings encoded from per-stage prompts, where each prompt corresponds to a
specific generation stage (i.e., a group of consecutive steps) of the diffusion
model. Experimental results demonstrate that P* and ProSpect offer stronger
disentanglement and controllability compared to existing methods. We apply
ProSpect in various personalized attribute-aware image generation applications,
such as image/text-guided material/style/layout transfer/editing, achieving
previously unattainable results with a single image input without fine-tuning
the diffusion models.
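A hedged sketch of the per-stage conditioning described above, with the stage count and names chosen purely for illustration: the denoising timesteps are partitioned into consecutive stages, each with its own learnable prompt embedding, so early (low-frequency) and late (high-frequency) steps can be conditioned independently.

import torch

num_steps = 1000     # total diffusion timesteps (assumed)
num_stages = 10      # number of generation stages (assumed)
embed_dim = 768
# one learnable token embedding per generation stage (the "prompt spectrum")
stage_prompts = torch.nn.Parameter(torch.randn(num_stages, embed_dim))

def prompt_for_timestep(t):
    """Pick the stage-specific prompt embedding for diffusion timestep t."""
    stage = min(num_stages - 1, t * num_stages // num_steps)
    return stage_prompts[stage]

# early (noisy) steps shape layout and low frequencies, late steps refine detail
low_freq_prompt = prompt_for_timestep(950)
high_freq_prompt = prompt_for_timestep(20)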
LAW-Diffusion: Complex Scene Generation by Diffusion with Layouts
Thanks to the rapid development of diffusion models, unprecedented progress
has been witnessed in image synthesis. Prior works mostly rely on pre-trained
linguistic models, but text alone is often too abstract to properly specify all
the spatial properties of an image, e.g., the layout configuration of a scene,
leading to sub-optimal results in complex scene generation. In this paper,
we achieve accurate complex scene generation by proposing a semantically
controllable Layout-AWare diffusion model, termed LAW-Diffusion. Distinct from
the previous Layout-to-Image generation (L2I) methods that only explore
category-aware relationships, LAW-Diffusion introduces a spatial dependency
parser to encode the location-aware semantic coherence across objects as a
layout embedding and produces a scene with perceptually harmonious object
styles and contextual relations. To be specific, we delicately instantiate each
object's regional semantics as an object region map and leverage a
location-aware cross-object attention module to capture the spatial
dependencies among those disentangled representations. We further propose an
adaptive guidance schedule for our layout guidance to mitigate the trade-off
between the regional semantic alignment and the texture fidelity of generated
objects. Moreover, LAW-Diffusion allows for instance reconfiguration while
maintaining the other regions in a synthesized image by introducing a
layout-aware latent grafting mechanism to recompose its local regional
semantics. To better verify the plausibility of generated scenes, we propose a
new evaluation metric for the L2I task, dubbed Scene Relation Score (SRS), to
measure how well the images preserve the rational and harmonious relations among
contextual objects. Comprehensive experiments demonstrate that our
LAW-Diffusion yields state-of-the-art generative performance, especially in
terms of coherent object relations.
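To make the region-map idea concrete, here is an illustrative toy rasterizer (an assumption, not the paper's implementation) that paints each object's category embedding into its bounding box, producing the kind of per-object region map a spatial dependency parser could consume.

import torch

def region_maps(boxes, class_ids, class_embed, size=64):
    """boxes: list of (x0, y0, x1, y1) in [0, 1]; returns (N, dim, size, size)."""
    maps = []
    for (x0, y0, x1, y1), cid in zip(boxes, class_ids):
        m = torch.zeros(class_embed.embedding_dim, size, size)
        xs, xe = int(x0 * size), max(int(x0 * size) + 1, int(x1 * size))
        ys, ye = int(y0 * size), max(int(y0 * size) + 1, int(y1 * size))
        # paint the object's category embedding into its box region
        m[:, ys:ye, xs:xe] = class_embed(torch.tensor(cid))[:, None, None]
        maps.append(m)
    return torch.stack(maps)

embed = torch.nn.Embedding(num_embeddings=80, embedding_dim=32)
maps = region_maps([(0.1, 0.2, 0.5, 0.8), (0.6, 0.1, 0.9, 0.4)], [3, 17], embed)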
End-to-End Optimization of Scene Layout
We propose an end-to-end variational generative model for scene layout
synthesis conditioned on scene graphs. Unlike unconditional scene layout
generation, we use scene graphs as an abstract but general representation to
guide the synthesis of diverse scene layouts that satisfy relationships
included in the scene graph. This gives rise to more flexible control over the
synthesis process, allowing various forms of inputs such as scene layouts
extracted from sentences or inferred from a single color image. Using our
conditional layout synthesizer, we can generate various layouts that share the
same structure as the input example. In addition to this conditional generation
design, we also integrate a differentiable rendering module that enables layout
refinement using only 2D projections of the scene. Given a depth and a
semantics map, the differentiable rendering module enables optimizing over the
synthesized layout to fit the given input in an analysis-by-synthesis fashion.
Experiments suggest that our model achieves higher accuracy and diversity in
conditional scene synthesis and allows exemplar-based scene generation from
various input forms.
Comment: CVPR 2020 (Oral). Project page: http://3dsln.csail.mit.edu
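The analysis-by-synthesis refinement can be sketched as a plain gradient loop: given a differentiable projection of the layout (here a hypothetical toy render_depth(), standing in for the paper's differentiable rendering module), the layout parameters are optimized until the rendered 2D map matches the observed one.

import torch

def render_depth(boxes, size=32):
    """Toy soft rasterizer: each box adds an inverse-depth blob to the map."""
    grid_y, grid_x = torch.meshgrid(torch.linspace(0, 1, size),
                                    torch.linspace(0, 1, size), indexing="ij")
    depth = torch.zeros(size, size)
    for cx, cy, s, z in boxes:   # center x/y, scale, and depth per object
        dist = ((grid_x - cx) ** 2 + (grid_y - cy) ** 2).sqrt()
        depth = depth + torch.sigmoid((s - dist) * 50) / z.clamp(min=0.1)
    return depth

boxes = torch.tensor([[0.3, 0.3, 0.10, 2.0], [0.7, 0.6, 0.15, 4.0]], requires_grad=True)
target = render_depth(torch.tensor([[0.4, 0.3, 0.10, 2.0], [0.7, 0.5, 0.15, 3.0]]))
opt = torch.optim.Adam([boxes], lr=0.01)
for _ in range(100):                       # fit the layout to the 2D observation
    opt.zero_grad()
    loss = ((render_depth(boxes) - target) ** 2).mean()
    loss.backward()
    opt.step()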