DTGAN: Dual Attention Generative Adversarial Networks for Text-to-Image Generation
Most existing text-to-image generation methods adopt a multi-stage modular
architecture, which has three significant problems: 1) Training multiple
networks increases the runtime and affects the convergence and stability of
the generative model; 2) These approaches ignore the quality of early-stage
generator images; 3) Many discriminators need to be trained. To this end, we
propose the Dual Attention Generative Adversarial Network (DTGAN), which can
synthesize high-quality and semantically consistent images employing only a
single generator/discriminator pair. The proposed model introduces
channel-aware and pixel-aware attention modules that can guide the generator to
focus on text-relevant channels and pixels based on the global sentence vector
and to fine-tune original feature maps using attention weights. Also,
Conditional Adaptive Instance-Layer Normalization (CAdaILN) is presented to
help our attention modules flexibly control the amount of change in shape and
texture induced by the input natural-language description. Furthermore, a new type of
visual loss is utilized to enhance the image resolution by ensuring vivid shape
and perceptually uniform color distributions of generated images. Experimental
results on benchmark datasets demonstrate the superiority of our proposed
method compared to the state-of-the-art models with a multi-stage framework.
Visualization of the attention maps shows that the channel-aware attention
module is able to localize the discriminative regions, while the pixel-aware
attention module is able to capture the global visual content for the
generation of an image.
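
For concreteness, the following is a minimal PyTorch sketch of how sentence-conditioned channel-aware attention, pixel-aware attention, and a CAdaILN layer could be structured; the module interfaces, dimensions, and the CAdaILN blending scheme are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn

class ChannelAwareAttention(nn.Module):
    """Reweights feature-map channels by their relevance to the sentence."""
    def __init__(self, num_channels, sent_dim):
        super().__init__()
        self.fc = nn.Linear(sent_dim, num_channels)

    def forward(self, feat, sent):
        # feat: (B, C, H, W), sent: (B, D)
        alpha = torch.sigmoid(self.fc(sent))              # (B, C) channel weights
        return feat * alpha.unsqueeze(-1).unsqueeze(-1)   # rescale each channel

class PixelAwareAttention(nn.Module):
    """Scores every spatial location against the sentence vector."""
    def __init__(self, num_channels, sent_dim):
        super().__init__()
        self.proj = nn.Conv2d(num_channels, sent_dim, kernel_size=1)

    def forward(self, feat, sent):
        key = self.proj(feat)                             # (B, D, H, W)
        score = (key * sent.unsqueeze(-1).unsqueeze(-1)).sum(1, keepdim=True)
        return feat * torch.sigmoid(score)                # per-pixel weights

class CAdaILN(nn.Module):
    """Blend of instance and layer normalization whose affine parameters
    are predicted from the sentence vector (a guess at CAdaILN's form)."""
    def __init__(self, num_channels, sent_dim, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.rho = nn.Parameter(torch.full((1, num_channels, 1, 1), 0.9))
        self.gamma = nn.Linear(sent_dim, num_channels)
        self.beta = nn.Linear(sent_dim, num_channels)

    def forward(self, x, sent):
        x_in = (x - x.mean((2, 3), keepdim=True)) / torch.sqrt(
            x.var((2, 3), keepdim=True) + self.eps)       # instance-normalized
        x_ln = (x - x.mean((1, 2, 3), keepdim=True)) / torch.sqrt(
            x.var((1, 2, 3), keepdim=True) + self.eps)    # layer-normalized
        rho = self.rho.clamp(0, 1)
        out = rho * x_in + (1 - rho) * x_ln
        g = self.gamma(sent).unsqueeze(-1).unsqueeze(-1)
        b = self.beta(sent).unsqueeze(-1).unsqueeze(-1)
        return out * g + b

# usage on random tensors (shapes are illustrative)
feat = torch.randn(2, 64, 32, 32)   # generator feature maps
sent = torch.randn(2, 256)          # global sentence embedding
feat = ChannelAwareAttention(64, 256)(feat, sent)
feat = PixelAwareAttention(64, 256)(feat, sent)
feat = CAdaILN(64, 256)(feat, sent)
print(feat.shape)                   # torch.Size([2, 64, 32, 32])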
AMC: Attention guided Multi-modal Correlation Learning for Image Search
Given a user's query, traditional image search systems rank images according
to their relevance to a single modality (e.g., image content or surrounding
text). Nowadays, an increasing number of images on the Internet are available
with associated metadata in rich modalities (e.g., titles, keywords, tags,
etc.), which can be exploited for a better similarity measure with queries. In
this paper, we leverage visual and textual modalities for image search by
learning their correlation with the input query. According to the intent of
the query, an attention mechanism can be introduced to adaptively balance the
importance of different modalities. We propose a novel Attention guided
Multi-modal Correlation (AMC) learning method, which consists of a jointly learned hierarchy
of intra- and inter-attention networks. Conditioned on the query's intent,
intra-attention networks (i.e., a visual intra-attention network and a language
intra-attention network) attend to informative parts within each modality; a
multi-modal inter-attention network promotes the importance of the most
query-relevant modalities. In experiments, we evaluate AMC models on the search
logs from two real-world image search engines and show a significant boost in
the ranking of user-clicked images in search results. Additionally, we extend
AMC models to the caption ranking task on the COCO dataset and achieve
competitive results compared with recent state-of-the-art methods.
Comment: CVPR 2017
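
As a rough illustration of the jointly learned intra-/inter-attention hierarchy, here is a minimal PyTorch sketch; the module names, hidden sizes, and the cosine ranking score are assumptions for exposition, not the authors' released code.

import torch
import torch.nn as nn
import torch.nn.functional as F

class IntraAttention(nn.Module):
    """Query-conditioned attention over the items of a single modality
    (image regions for the visual branch, meta-data words for the
    language branch)."""
    def __init__(self, item_dim, query_dim, hidden=128):
        super().__init__()
        self.item_proj = nn.Linear(item_dim, hidden)
        self.query_proj = nn.Linear(query_dim, hidden)
        self.score = nn.Linear(hidden, 1)

    def forward(self, items, query):
        # items: (B, N, item_dim), query: (B, query_dim)
        h = torch.tanh(self.item_proj(items) + self.query_proj(query).unsqueeze(1))
        w = F.softmax(self.score(h), dim=1)        # attention over the N items
        return (w * items).sum(dim=1)              # pooled modality embedding

class InterAttention(nn.Module):
    """Query-conditioned weights that balance the modality embeddings."""
    def __init__(self, query_dim, num_modalities=2):
        super().__init__()
        self.gate = nn.Linear(query_dim, num_modalities)

    def forward(self, modality_embeds, query):
        # modality_embeds: (B, M, E), query: (B, query_dim)
        w = F.softmax(self.gate(query), dim=-1)    # (B, M) modality weights
        return (w.unsqueeze(-1) * modality_embeds).sum(dim=1)

# usage on random tensors; all dimensions are illustrative
regions = torch.randn(2, 49, 512)   # image region features
words = torch.randn(2, 12, 300)     # meta-data word embeddings
query = torch.randn(2, 256)         # query embedding

vis = nn.Linear(512, 128)(IntraAttention(512, 256)(regions, query))
txt = nn.Linear(300, 128)(IntraAttention(300, 256)(words, query))
fused = InterAttention(256)(torch.stack([vis, txt], dim=1), query)
score = F.cosine_similarity(nn.Linear(256, 128)(query), fused)  # ranking score
print(score.shape)                  # torch.Size([2])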