Diffusion models have recently revolutionized the field of text-to-image
generation. Their unique way of fusing text and image information underpins
their remarkable capability of generating highly text-aligned images. Viewed
from another perspective, these generative models implicitly encode precise
correlations between words and pixels. In this work, we propose a simple but
effective method that exploits the attention mechanism in the denoising
network of text-to-image diffusion models. Without any re-training or
inference-time optimization, the semantic grounding of phrases can be
attained directly. We
evaluate our method on Pascal VOC 2012 and Microsoft COCO 2014 under the
weakly-supervised semantic segmentation setting, where it outperforms prior
methods. In addition, we find that the acquired word-pixel correlation
generalizes to the learned text embeddings of customized generation methods
with only minor modifications. To validate
our discovery, we introduce a new practical task called "personalized referring
image segmentation" with a new dataset. Experiments in various situations
demonstrate the advantages of our method compared to strong baselines on this
task. In summary, our work reveals a novel way to extract the rich multi-modal
knowledge hidden in diffusion models for segmentation
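The core idea of reading word-pixel correlation out of cross-attention can be sketched with toy tensors. This is a minimal illustration only, not the paper's implementation: all shapes, the random stand-in features, and the chosen token indices are assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative shapes: a 16x16 latent grid, 8 text tokens, 64-dim features.
H, W, T, d = 16, 16, 8, 64

# In a denoising U-Net's cross-attention, queries come from image features
# and keys from text token embeddings; here both are random stand-ins.
Q = rng.standard_normal((H * W, d))   # one query per spatial location
K = rng.standard_normal((T, d))       # one key per text token

# Cross-attention: each pixel distributes probability mass over text tokens.
logits = Q @ K.T / np.sqrt(d)
attn = np.exp(logits - logits.max(axis=-1, keepdims=True))
attn /= attn.sum(axis=-1, keepdims=True)          # shape (H*W, T)

# Word-pixel correlation for one phrase: average the attention columns of
# its tokens, reshape to the spatial grid, and threshold into a mask.
phrase_tokens = [2, 3]                            # hypothetical token ids
heatmap = attn[:, phrase_tokens].mean(axis=-1).reshape(H, W)
mask = heatmap > heatmap.mean()                   # crude binarization

print(attn.shape, heatmap.shape, mask.dtype)
```

In an actual diffusion model, the attention maps would be gathered from the denoising network's cross-attention layers and aggregated across heads, layers, and timesteps before thresholding.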