Attention on Attention for Text to Image Synthesis using Mode-Seeking Loss Function

Abstract

Text-to-image synthesis is a burgeoning field that has emerged in the research community over the last few years. Generative Adversarial Networks (GANs) form the backbone of modern work in this area, as many architectures are built around them. This thesis is an attempt to develop a new technique and architecture that competes with recent state-of-the-art models. The research is conducted using two approaches: a modification of the loss function and a change in architecture. In the first approach, we sample two noise vectors to generate two different output images. The model is trained with a mode-seeking loss function that maximizes the ratio of the L2 norm of the difference between the two images to the L2 norm of the difference between the two noise vectors. The second approach strengthens the attention between the text and the image by introducing the Attention on Attention (AoA) module into the AttnGAN network. We find that a combination of the two approaches produces good-quality images that attend better to their text descriptions. Fréchet Inception Distance (FID) and Inception Score (IS) are used to evaluate the results. The datasets used in this research are Microsoft COCO and CUB Birds. A comparative study of the obtained results against past state-of-the-art models is conducted and presented in this thesis.
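
For concreteness, below is a minimal PyTorch sketch of the two ingredients the abstract describes: a mode-seeking regularizer (in the spirit of Mao et al., 2019, using the L2 distances stated above) and an Attention on Attention gate (following Huang et al., 2019). The names mode_seeking_loss and AttentionOnAttention, and the generator G in the usage comments, are illustrative assumptions, not the thesis's actual code.

    import torch
    import torch.nn as nn

    def mode_seeking_loss(fake1, fake2, z1, z2, eps=1e-5):
        """Mode-seeking regularizer: the ratio of the L2 norm of the
        difference between the two generated images to the L2 norm of
        the difference between their noise vectors. The generator
        maximizes this ratio, so its reciprocal is returned here to be
        minimized alongside the usual adversarial loss."""
        image_dist = torch.norm(fake1 - fake2, p=2)
        noise_dist = torch.norm(z1 - z2, p=2)
        return 1.0 / (image_dist / (noise_dist + eps) + eps)

    class AttentionOnAttention(nn.Module):
        """Attention on Attention (AoA) gate: from the concatenation of
        the query and the attended vector, compute an information vector
        and a sigmoid gate, and return their element-wise product. The
        gate suppresses attention results irrelevant to the query."""
        def __init__(self, dim):
            super().__init__()
            self.info = nn.Linear(2 * dim, dim)  # candidate information vector
            self.gate = nn.Linear(2 * dim, dim)  # element-wise attention gate

        def forward(self, query, attended):
            x = torch.cat([query, attended], dim=-1)
            return torch.sigmoid(self.gate(x)) * self.info(x)

    # Illustrative usage (generator G and text_emb are assumed, not defined):
    # z1, z2 = torch.randn(16, 100), torch.randn(16, 100)
    # fake1, fake2 = G(z1, text_emb), G(z2, text_emb)
    # loss = adv_loss + lambda_ms * mode_seeking_loss(fake1, fake2, z1, z2)

The reciprocal form makes the term compatible with gradient descent: pushing it down pushes the image-to-noise distance ratio up, which penalizes the generator for collapsing distinct noise vectors onto near-identical images.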
