Diffusion models have recently revolutionized the field of text-to-image
generation. Their unique way of fusing text and image information underpins
their remarkable capability of generating highly text-aligned images. Viewed
from another perspective, these generative models implicitly encode precise
correlations between words and pixels. In this work, we propose a simple but
effective method that exploits the attention mechanism in the denoising
network of text-to-image diffusion models. Without any re-training or
inference-time optimization, the semantic grounding of phrases can be
attained directly. We
evaluate our method on Pascal VOC 2012 and Microsoft COCO 2014 under the
weakly-supervised semantic segmentation setting, where it outperforms prior
methods. In addition, we find that the acquired word-pixel correlation
generalizes to the learned text embeddings of customized generation methods
with only minor modifications. To validate
our discovery, we introduce a new practical task called "personalized referring
image segmentation" with a new dataset. Experiments in various situations
demonstrate the advantages of our method compared to strong baselines on this
task. In summary, our work reveals a novel way to extract the rich multi-modal
knowledge hidden in diffusion models for segmentation
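The core idea of reading word-pixel correlation out of cross-attention can be sketched with toy tensors. This is a minimal illustration only, not the paper's implementation: all shapes, the random stand-in features, and the chosen token indices are assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative shapes: a 16x16 latent grid, 8 text tokens, 64-dim features.
H, W, T, d = 16, 16, 8, 64

# In a denoising U-Net's cross-attention, queries come from image features
# and keys from text token embeddings; here both are random stand-ins.
Q = rng.standard_normal((H * W, d))   # one query per spatial location
K = rng.standard_normal((T, d))       # one key per text token

# Cross-attention: each pixel distributes probability mass over text tokens.
logits = Q @ K.T / np.sqrt(d)
attn = np.exp(logits - logits.max(axis=-1, keepdims=True))
attn /= attn.sum(axis=-1, keepdims=True)          # shape (H*W, T)

# Word-pixel correlation for one phrase: average the attention columns of
# its tokens, reshape to the spatial grid, and threshold into a mask.
phrase_tokens = [2, 3]                            # hypothetical token ids
heatmap = attn[:, phrase_tokens].mean(axis=-1).reshape(H, W)
mask = heatmap > heatmap.mean()                   # crude binarization

print(attn.shape, heatmap.shape, mask.dtype)
```

In an actual diffusion model, the attention maps would be gathered from the denoising network's cross-attention layers and aggregated across heads, layers, and timesteps before thresholding.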