Sketch-Guided Text-to-Image Diffusion Models
Text-to-Image models have introduced a remarkable leap in the evolution of
machine learning, demonstrating high-quality synthesis of images from a given
text prompt. However, these powerful pretrained models still lack control
handles that can guide spatial properties of the synthesized images. In this
work, we introduce a universal approach to guide a pretrained text-to-image
diffusion model, with a spatial map from another domain (e.g., sketch) during
inference time. Unlike previous works, our method does not require training a
dedicated model or a specialized encoder for the task. Our key idea is to train
a Latent Guidance Predictor (LGP) - a small, per-pixel, Multi-Layer Perceptron
(MLP) that maps latent features of noisy images to spatial maps, where the deep
features are extracted from the core Denoising Diffusion Probabilistic Model
(DDPM) network. The LGP is trained only on a few thousand images and
constitutes a differentiable guiding map predictor, over which the loss is
computed and propagated back to push the intermediate images to agree with the
spatial map. The per-pixel training offers flexibility and locality which
allows the technique to perform well on out-of-domain sketches, including
free-hand style drawings. We take a particular focus on the sketch-to-image
translation task, revealing a robust and expressive way to generate images that
follow the guidance of a sketch of arbitrary style or domain. Project page:
sketch-guided-diffusion.github.io
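
As a rough illustration only (not the authors' released code), the per-pixel predictor and guidance loss described above could look like the following PyTorch sketch; the feature dimensions, layer sizes, and function names are assumptions:

import torch
import torch.nn as nn

class LatentGuidancePredictor(nn.Module):
    # Small per-pixel MLP: maps a stack of per-pixel denoiser features to an edge value.
    def __init__(self, feature_dim: int, hidden_dim: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feature_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, C, H, W) intermediate DDPM activations, upsampled to a common
        # resolution; every pixel is treated independently.
        b, c, h, w = feats.shape
        per_pixel = feats.permute(0, 2, 3, 1).reshape(-1, c)  # (B*H*W, C)
        return self.mlp(per_pixel).reshape(b, 1, h, w)        # predicted spatial map

def sketch_guidance_loss(lgp: LatentGuidancePredictor,
                         feats: torch.Tensor,
                         target_sketch: torch.Tensor) -> torch.Tensor:
    # L2 distance between the predicted spatial map and the target sketch; its
    # gradient w.r.t. the noisy latent can be used to steer each sampling step.
    return ((lgp(feats) - target_sketch) ** 2).mean()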
P+: Extended Textual Conditioning in Text-to-Image Generation
We introduce an Extended Textual Conditioning space in text-to-image models,
referred to as P+. This space consists of multiple textual conditions,
derived from per-layer prompts, each corresponding to a layer of the denoising
U-net of the diffusion model.
We show that the extended space provides greater disentanglement and control
over image synthesis. We further introduce Extended Textual Inversion (XTI),
where the images are inverted into P+ and represented by per-layer tokens.
We show that XTI is more expressive and precise, and converges faster than
the original Textual Inversion (TI) space. The extended inversion method does
not involve any noticeable trade-off between reconstruction and editability and
induces more regular inversions.
We conduct a series of extensive experiments to analyze and understand the
properties of the new space, and to showcase the effectiveness of our method
for personalizing text-to-image models. Furthermore, we utilize the unique
properties of this space to achieve previously unattainable results in
object-style mixing using text-to-image models. Project page:
https://prompt-plus.github.io
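
To make the idea of per-layer prompts concrete, here is a hypothetical sketch (not the authors' implementation): each cross-attention layer of the denoising U-Net gets its own text embedding instead of one shared embedding. The layer names follow a diffusers-style convention, and the tokenizer/text-encoder interface is an assumed CLIP-style one:

import torch

def encode_prompt(prompt: str, tokenizer, text_encoder) -> torch.Tensor:
    # Standard single-prompt encoding with a CLIP-style tokenizer and text encoder.
    tokens = tokenizer(prompt, padding="max_length",
                       truncation=True, return_tensors="pt")
    return text_encoder(tokens.input_ids)[0]  # (1, seq_len, dim)

def build_per_layer_conditions(layer_prompts: dict, tokenizer, text_encoder) -> dict:
    # One embedding per U-Net cross-attention layer instead of a single shared one.
    return {layer: encode_prompt(p, tokenizer, text_encoder)
            for layer, p in layer_prompts.items()}

# Example: coarse layers receive a shape/layout prompt, fine layers an appearance prompt.
layer_prompts = {
    "down_blocks.1": "a photo of a teapot",
    "mid_block":     "a photo of a teapot",
    "up_blocks.2":   "a teapot made of blue stained glass",
}
# At sampling time, a hook on each cross-attention module would look up its layer
# name here and use the matching embedding as its attention context.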
Style Aligned Image Generation via Shared Attention
Large-scale Text-to-Image (T2I) models have rapidly gained prominence across
creative fields, generating visually compelling outputs from textual prompts.
However, controlling these models to ensure consistent style remains
challenging, with existing methods necessitating fine-tuning and manual
intervention to disentangle content and style. In this paper, we introduce
StyleAligned, a novel technique designed to establish style alignment among a
series of generated images. By employing minimal 'attention sharing' during the
diffusion process, our method maintains style consistency across images within
T2I models. This approach allows for the creation of style-consistent images
using a reference style through a straightforward inversion operation. Our
method's evaluation across diverse styles and text prompts demonstrates
high-quality synthesis and fidelity, underscoring its efficacy in achieving
consistent style across various inputs. Project page: style-aligned-gen.github.io
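
A minimal sketch of what such attention sharing could look like inside a self-attention layer is given below; the tensor layout and the choice to simply concatenate the reference keys and values are assumptions, not the paper's exact formulation:

import torch
import torch.nn.functional as F

def shared_self_attention(q, k, v, k_ref, v_ref):
    # q, k, v: (B, heads, tokens, dim) for the images being generated.
    # k_ref, v_ref: (1, heads, tokens, dim) taken from the reference (style) image.
    b = q.shape[0]
    # Every image also attends to the reference keys/values, so style statistics
    # are shared across the whole batch during the diffusion process.
    k_shared = torch.cat([k, k_ref.expand(b, -1, -1, -1)], dim=2)
    v_shared = torch.cat([v, v_ref.expand(b, -1, -1, -1)], dim=2)
    return F.scaled_dot_product_attention(q, k_shared, v_shared)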
AnyLens: A Generative Diffusion Model with Any Rendering Lens
State-of-the-art diffusion models can generate highly realistic images based
on various conditioning signals such as text, segmentation, and depth. However, an
essential and often overlooked aspect is the specific camera geometry used during
image capture and the influence of the optical system on the final scene
appearance. This study introduces a framework that
intimately integrates a text-to-image diffusion model with the particular lens
geometry used in image rendering. Our method is based on per-pixel coordinate
conditioning, enabling control over the rendering geometry. Notably,
we demonstrate the manipulation of curvature properties, achieving diverse
visual effects, such as fish-eye, panoramic views, and spherical texturing
using a single diffusion model.
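
As an illustrative sketch of per-pixel coordinate conditioning (the specific radial lens model, resolution, and the way the coordinates enter the network are assumptions), one could build a warped coordinate map like this:

import torch

def lens_coordinate_map(h: int, w: int, distortion: float = 0.5) -> torch.Tensor:
    # Normalized pixel grid in [-1, 1], warped by a simple radial lens model.
    ys = torch.linspace(-1.0, 1.0, h)
    xs = torch.linspace(-1.0, 1.0, w)
    y, x = torch.meshgrid(ys, xs, indexing="ij")
    r2 = x ** 2 + y ** 2
    scale = 1.0 + distortion * r2                       # distortion=0 recovers a pinhole grid
    return torch.stack([x * scale, y * scale], dim=0)   # (2, H, W)

# The warped coordinate channels would be concatenated to the denoiser's input so
# that the same model can render different lens geometries at sampling time.
coords = lens_coordinate_map(64, 64, distortion=0.8)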
PALP: Prompt Aligned Personalization of Text-to-Image Models
Content creators often aim to create personalized images using personal
subjects that go beyond the capabilities of conventional text-to-image models.
Additionally, they may want the resulting image to encompass a specific
location, style, ambiance, and more. Existing personalization methods may
compromise personalization ability or alignment with complex textual prompts.
This trade-off can impede the fulfillment of user prompts and subject fidelity.
We propose a new approach focusing on personalization methods for a
single prompt to address this issue. We term our approach prompt-aligned
personalization. While this may seem restrictive, our method excels in
improving text alignment, enabling the creation of images with complex and
intricate prompts, which may pose a challenge for current techniques. In
particular, our method keeps the personalized model aligned with a target
prompt using an additional score distillation sampling term. We demonstrate the
versatility of our method in multi- and single-shot settings and further show
that it can compose multiple subjects or use inspiration from reference images,
such as artworks. We compare our approach quantitatively and qualitatively with
existing baselines and state-of-the-art techniques. Project page: https://prompt-aligned.github.io
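
The abstract mentions an additional score distillation sampling term that keeps the personalized model aligned with the target prompt. As an illustration of a generic SDS-style term only (not the paper's exact loss), with a placeholder frozen denoiser and a diffusers-style scheduler assumed:

import torch

def sds_alignment_loss(frozen_unet, scheduler, pred_x0, t, target_prompt_emb, w=1.0):
    # Score-distillation surrogate: noise the current prediction, let a frozen
    # pretrained denoiser score it under the target prompt, and build a loss whose
    # gradient w.r.t. pred_x0 equals w * (eps_pred - noise).
    noise = torch.randn_like(pred_x0)
    noisy = scheduler.add_noise(pred_x0, noise, t)   # diffusers-style scheduler API
    with torch.no_grad():
        eps_pred = frozen_unet(noisy, t, target_prompt_emb)
    grad = w * (eps_pred - noise)
    return (grad.detach() * pred_x0).sum()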