ConceptLab: Creative Concept Generation using VLM-Guided Diffusion Prior Constraints
Recent text-to-image generative models have enabled us to transform our words
into vibrant, captivating imagery. The surge of personalization techniques that
has followed has also allowed us to imagine unique concepts in new scenes.
However, an intriguing question remains: How can we generate a new, imaginary
concept that has never been seen before? In this paper, we present the task of
creative text-to-image generation, where we seek to generate new members of a
broad category (e.g., generating a pet that differs from all existing pets). We
leverage the under-studied Diffusion Prior models and show that the creative
generation problem can be formulated as an optimization process over the output
space of the diffusion prior, resulting in a set of "prior constraints". To
keep our generated concept from converging into existing members, we
incorporate a question-answering Vision-Language Model (VLM) that adaptively
adds new constraints to the optimization problem, encouraging the model to
discover increasingly unique creations. Finally, we show that our prior
constraints can also serve as a strong mixing mechanism allowing us to create
hybrids between generated concepts, introducing even more flexibility into the
creative process.
Project page: https://kfirgoldberg.github.io/ConceptLab
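The abstract describes the "prior constraints" only at a high level; below is a minimal PyTorch sketch of what an optimization of this kind can look like. The text encoder and diffusion prior are stand-in linear layers, and all names, dimensions, and loss weights are illustrative assumptions, not the paper's implementation.

```python
# Sketch: optimize a concept token so the (frozen) prior's output stays close to a
# broad category ("a pet") while moving away from existing members ("a cat", ...).
import torch
import torch.nn.functional as F

D_TXT, D_IMG = 768, 1024                       # assumed embedding widths
text_enc = torch.nn.Linear(D_TXT, D_TXT)       # stand-in for the frozen text encoder
prior = torch.nn.Linear(D_TXT, D_IMG)          # stand-in for the frozen diffusion prior
for m in (text_enc, prior):
    m.requires_grad_(False)

def prior_embed(tok: torch.Tensor) -> torch.Tensor:
    """Map a token embedding to the prior's predicted image embedding."""
    return F.normalize(prior(text_enc(tok)), dim=-1)

v_star = torch.randn(D_TXT, requires_grad=True)      # the learned concept token
positive = torch.randn(D_TXT)                        # e.g. embedding of "a pet"
negatives = [torch.randn(D_TXT) for _ in range(3)]   # e.g. "a cat", "a dog", "a hamster"

opt = torch.optim.Adam([v_star], lr=1e-3)
for step in range(200):
    e = prior_embed(v_star)
    pos_loss = 1 - F.cosine_similarity(e, prior_embed(positive), dim=-1)
    neg_loss = torch.stack(
        [F.cosine_similarity(e, prior_embed(n), dim=-1) for n in negatives]).mean()
    loss = pos_loss + 0.5 * neg_loss            # weighting is illustrative
    opt.zero_grad(); loss.backward(); opt.step()
    # In the paper, a question-answering VLM periodically names what the current
    # concept resembles; that answer is added to `negatives` as a new constraint.
```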
A Neural Space-Time Representation for Text-to-Image Personalization
A key aspect of text-to-image personalization methods is the manner in which
the target concept is represented within the generative process. This choice
greatly affects the visual fidelity, downstream editability, and disk space
needed to store the learned concept. In this paper, we explore a new
text-conditioning space that is dependent on both the denoising process
timestep (time) and the denoising U-Net layers (space) and showcase its
compelling properties. A single concept in the space-time representation is
composed of hundreds of vectors, one for each combination of time and space,
making this space challenging to optimize directly. Instead, we propose to
implicitly represent a concept in this space by optimizing a small neural
mapper that receives the current time and space parameters and outputs the
matching token embedding. In doing so, the entire personalized concept is
represented by the parameters of the learned mapper, resulting in a compact,
yet expressive, representation. Similarly to other personalization methods, the
output of our neural mapper resides in the input space of the text encoder. We
observe that one can significantly improve the convergence and visual fidelity
of the concept by introducing a textual bypass, where our neural mapper
additionally outputs a residual that is added to the output of the text
encoder. Finally, we show how one can impose an importance-based ordering over
our implicit representation, providing users control over the reconstruction
and editability of the learned concept using a single trained model. We
demonstrate the effectiveness of our approach over a range of concepts and
prompts, showing our method's ability to generate high-quality and controllable
compositions without fine-tuning any parameters of the generative model itself.
Project page: https://neuraltextualinversion.github.io/NeTI
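To make the space-time mapper concrete, here is a minimal PyTorch sketch of a mapper that takes a (denoising timestep, U-Net layer index) pair and returns a token embedding plus a textual-bypass residual. The architecture, widths, and names are assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn

class SpaceTimeMapper(nn.Module):
    """Maps (timestep, U-Net layer) to a token embedding and a bypass residual."""
    def __init__(self, d_token: int = 768, d_hidden: int = 128,
                 n_layers: int = 16, n_timesteps: int = 1000):
        super().__init__()
        self.t_embed = nn.Embedding(n_timesteps, d_hidden)   # time conditioning
        self.l_embed = nn.Embedding(n_layers, d_hidden)      # space (U-Net layer)
        self.net = nn.Sequential(
            nn.Linear(2 * d_hidden, d_hidden), nn.SiLU(),
            nn.Linear(d_hidden, 2 * d_token),                 # token + bypass residual
        )

    def forward(self, t: torch.Tensor, layer: torch.Tensor):
        h = torch.cat([self.t_embed(t), self.l_embed(layer)], dim=-1)
        token, bypass = self.net(h).chunk(2, dim=-1)
        return token, bypass

mapper = SpaceTimeMapper()
t = torch.randint(0, 1000, (4,))        # current denoising timesteps
layer = torch.randint(0, 16, (4,))      # which cross-attention layer is querying
token_emb, bypass_res = mapper(t, layer)
# `token_emb` stands in for the placeholder word before the text encoder; `bypass_res`
# is added to the text encoder's output, which the paper reports aids convergence.
```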
An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion
Text-to-image models offer unprecedented freedom to guide creation through
natural language. Yet, it is unclear how such freedom can be exercised to
generate images of specific unique concepts, modify their appearance, or
compose them in new roles and novel scenes. In other words, we ask: how can we
use language-guided models to turn our cat into a painting, or imagine a new
product based on our favorite toy? Here we present a simple approach that
allows such creative freedom. Using only 3-5 images of a user-provided concept,
like an object or a style, we learn to represent it through new "words" in the
embedding space of a frozen text-to-image model. These "words" can be composed
into natural language sentences, guiding personalized creation in an intuitive
way. Notably, we find evidence that a single word embedding is sufficient for
capturing unique and varied concepts. We compare our approach to a wide range
of baselines, and demonstrate that it can more faithfully portray the concepts
across a range of applications and tasks.
Our code, data and new words will be available at:
https://textual-inversion.github.io
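For intuition, the sketch below shows the overall shape of a textual-inversion training loop: only the new word embedding receives gradient updates, while stand-in modules replace the frozen text encoder and denoising network. The shapes, toy conditioning, and omitted noise schedule are illustrative assumptions rather than the released code.

```python
import torch
import torch.nn.functional as F

D_TOKEN = 768
v_star = torch.randn(D_TOKEN, requires_grad=True)   # embedding of the new "word" S*

# Stand-ins for the frozen generator components (assumed shapes).
text_encoder = torch.nn.Linear(D_TOKEN, D_TOKEN).requires_grad_(False)
unet = torch.nn.Conv2d(4, 4, 3, padding=1).requires_grad_(False)   # toy denoiser

opt = torch.optim.AdamW([v_star], lr=5e-3)
for step in range(100):
    latents = torch.randn(1, 4, 64, 64)        # latent of one of the 3-5 user images
    noise = torch.randn_like(latents)
    noisy = latents + noise                    # real noise schedule omitted
    cond = text_encoder(v_star)                # conditioning from "a photo of S*"
    pred = unet(noisy) + cond[:4].view(1, 4, 1, 1)   # toy conditioning injection
    loss = F.mse_loss(pred, noise)             # standard denoising objective
    opt.zero_grad(); loss.backward(); opt.step()
```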
MyVLM: Personalizing VLMs for User-Specific Queries
Recent large-scale vision-language models (VLMs) have demonstrated remarkable
capabilities in understanding and generating textual descriptions for visual
content. However, these models lack an understanding of user-specific concepts.
In this work, we take a first step toward the personalization of VLMs, enabling
them to learn and reason over user-provided concepts. For example, we explore
whether these models can learn to recognize you in an image and communicate
what you are doing, tailoring the model to reflect your personal experiences
and relationships. To effectively recognize a variety of user-specific
concepts, we augment the VLM with external concept heads that function as
toggles for the model, enabling the VLM to identify the presence of specific
target concepts in a given image. Having recognized the concept, we learn a new
concept embedding in the intermediate feature space of the VLM. This embedding
is tasked with guiding the language model to naturally integrate the target
concept in its generated response. We apply our technique to BLIP-2 and LLaVA
for personalized image captioning and further show its applicability for
personalized visual question-answering. Our experiments demonstrate our ability
to generalize to unseen images of learned concepts while preserving the model
behavior on unrelated inputs.
Project page: https://snap-research.github.io/MyVLM
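As a rough illustration of the "concept head plus concept embedding" idea, the PyTorch sketch below gates a learned embedding into the tokens fed to the language model whenever a small classifier detects the concept. Dimensions, module names, and the injection point are assumptions, not the actual BLIP-2/LLaVA hooks used in the paper.

```python
import torch
import torch.nn as nn

D_VIS, D_LLM = 1024, 4096   # assumed vision-feature / language-model widths

class ConceptHead(nn.Module):
    """Binary 'toggle': does this image contain the user's concept?"""
    def __init__(self, d_vis: int = D_VIS):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(d_vis, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, pooled_visual: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.classifier(pooled_visual))

head = ConceptHead()
concept_embedding = nn.Parameter(torch.randn(1, 1, D_LLM))   # trained to steer captions

pooled = torch.randn(2, D_VIS)               # pooled visual features for 2 images
visual_tokens = torch.randn(2, 32, D_LLM)    # tokens fed to the language model
present = (head(pooled) > 0.5).squeeze(-1)   # concept recognized per image?

# When the head fires, the learned embedding is appended to the visual tokens so the
# language model can mention the concept; otherwise the input is left untouched.
augmented = [
    torch.cat([visual_tokens[i:i + 1], concept_embedding], dim=1) if present[i]
    else visual_tokens[i:i + 1]
    for i in range(visual_tokens.size(0))
]
```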