Representing wild sounds as images is an important but challenging task due
to the lack of paired sound-image datasets and the significant
differences in the characteristics of the two modalities. Previous studies
have focused on generating images from sounds in limited categories or from music. In
this paper, we propose a novel approach to generate images from in-the-wild
sounds. First, we convert sound into text using audio captioning. Second, we
propose audio attention and sentence attention to represent the rich
characteristics of sound and visualize the sound. Lastly, we propose direct
sound optimization with CLIPscore and AudioCLIP, and generate images with a
diffusion-based model. Experiments show that our model generates
high-quality images from wild sounds and outperforms baselines in both
quantitative and qualitative evaluations on wild audio datasets.

Comment: Accepted to ICCV 202