53 research outputs found
Break-A-Scene: Extracting Multiple Concepts from a Single Image
Text-to-image model personalization aims to introduce a user-provided concept
to the model, allowing its synthesis in diverse contexts. However, current
methods primarily focus on the case of learning a single concept from multiple
images with variations in backgrounds and poses, and struggle when adapted to a
different scenario. In this work, we introduce the task of textual scene
decomposition: given a single image of a scene that may contain several
concepts, we aim to extract a distinct text token for each concept, enabling
fine-grained control over the generated scenes. To this end, we propose
augmenting the input image with masks that indicate the presence of target
concepts. These masks can be provided by the user or generated automatically by
a pre-trained segmentation model. We then present a novel two-phase
customization process that optimizes a set of dedicated textual embeddings
(handles), as well as the model weights, striking a delicate balance between
accurately capturing the concepts and avoiding overfitting. We employ a masked
diffusion loss to enable handles to generate their assigned concepts,
complemented by a novel loss on cross-attention maps to prevent entanglement.
We also introduce union-sampling, a training strategy aimed at improving the
ability to combine multiple concepts in generated images. We quantitatively
compare our method against several baselines using automatic metrics, and
further affirm the results with a user study. Finally, we
showcase several applications of our method. Project page is available at:
https://omriavrahami.com/break-a-scene/
Comment: SIGGRAPH Asia 2023. Video:
https://www.youtube.com/watch?v=-9EA-Bhizg
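The masked diffusion loss described above can be sketched in a few lines; this is a minimal illustration with hypothetical array inputs (the function name, normalization, and interface are assumptions, not the authors' implementation):

```python
import numpy as np

def masked_diffusion_loss(noise_pred, noise_true, mask):
    """Squared error between predicted and ground-truth noise, restricted
    to the region covered by the concept mask (1 inside, 0 outside).
    All arguments are arrays of the same shape."""
    sq_err = ((noise_pred - noise_true) ** 2) * mask
    # Normalize by the masked area so the loss magnitude does not
    # scale with the size of the concept region.
    return sq_err.sum() / max(mask.sum(), 1)
```

Restricting the loss to the mask is what lets each handle specialize on its own concept instead of reconstructing the entire scene.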
PALP: Prompt Aligned Personalization of Text-to-Image Models
Content creators often aim to create personalized images using personal
subjects that go beyond the capabilities of conventional text-to-image models.
Additionally, they may want the resulting image to encompass a specific
location, style, ambiance, and more. Existing personalization methods may
compromise personalization ability or the alignment to complex textual prompts.
This trade-off can impede the fulfillment of user prompts and subject fidelity.
We propose a new approach focusing on personalization methods for a
\emph{single} prompt to address this issue. We term our approach prompt-aligned
personalization. While this may seem restrictive, our method excels in
improving text alignment, enabling the creation of images with complex and
intricate prompts, which may pose a challenge for current techniques. In
particular, our method keeps the personalized model aligned with a target
prompt using an additional score distillation sampling term. We demonstrate the
versatility of our method in multi- and single-shot settings and further show
that it can compose multiple subjects or use inspiration from reference images,
such as artworks. We compare our approach quantitatively and qualitatively with
existing baselines and state-of-the-art techniques.
Comment: Project page available at https://prompt-aligned.github.io
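The additional score distillation sampling term mentioned above can be sketched roughly as follows; the denoiser interface, noise schedule, and weighting here are illustrative assumptions, not PALP's actual implementation:

```python
import numpy as np

def sds_gradient(denoiser, x, t, prompt_emb, noise, weight):
    """One Monte Carlo sample of a score-distillation-style gradient that
    pulls x toward the target prompt. denoiser(x_t, t, cond) returns
    predicted noise; all names are illustrative."""
    alpha = 1.0 - t                                   # placeholder schedule
    x_t = np.sqrt(alpha) * x + np.sqrt(1.0 - alpha) * noise
    eps_hat = denoiser(x_t, t, prompt_emb)
    # SDS drops the denoiser Jacobian: grad ~ w(t) * (eps_hat - eps)
    return weight * (eps_hat - noise)
```

Adding such a term during personalization keeps the fine-tuned model's score close to that of the target prompt, which is the mechanism the abstract credits for preserving text alignment.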
Predicting Human Interruptibility with Sensors
A person seeking someone else’s attention is normally able to quickly assess how interruptible they are. This assessment allows for behavior we perceive as natural, socially appropriate, or simply polite. On the other hand, today’s computer systems are almost entirely oblivious to the human world they operate in, and typically have no way to take into account the interruptibility of the user. This paper presents a Wizard of Oz study exploring whether, and how, robust sensor-based predictions of interruptibility might be constructed, which sensors might be most useful to such predictions, and how simple such sensors might be. The study simulates a range of possible sensors through human coding of audio and video recordings. Experience sampling is used to simultaneously collect randomly distributed self-reports of interruptibility. Based on these simulated sensors, we construct statistical models predicting human interruptibility and compare their predictions with the collected self-report data. The results of these models, although covering a demographically limited sample, are very promising, with the overall accuracy of several models reaching about 78%. Additionally, a model tuned to avoiding unwanted interruptions does so for 90% of its predictions, while retaining 75% overall accuracy.
QnA: Augmenting an Instant Messaging Client to Balance User Responsiveness and Performance
The growing use of Instant Messaging for social and work-related communication has created a situation where incoming messages often become a distraction to users while they are performing important tasks. Staying on task at the expense of responsiveness to IM buddies may portray the users as impolite or even rude. Constantly attending to IM, on the other hand, may prevent users from performing tasks efficiently, leaving them frustrated. In this paper we present a tool that augments a commercial IM client by automatically increasing the salience of incoming messages that may deserve immediate attention, helping users decide whether or not to stay on task.
- …