Weakly Supervised Attention Learning for Textual Phrases Grounding
Grounding textual phrases in visual content is a meaningful yet challenging
problem with various potential applications such as image-text inference or
text-driven multimedia interaction. Most existing methods rely on
supervised learning, which requires pixel-level ground truth
during training. However, such fine-grained annotation is
time-consuming to produce and severely narrows the scope for more general applications. In
this extended abstract, we explore methods to flexibly localize image regions
from a top-down signal (in the form of a one-hot label or natural language) with
a weakly supervised attention learning mechanism. Our model uses two types of
modules: a backbone module that captures visual features, and an
attentive module that generates attention maps based on regularized bilinear pooling. We
train the model end to end by encouraging
the spatial attention map to shift toward and focus on the region whose
visual features best match the top-down signal. We demonstrate
preliminary yet promising results on a testbed synthesized from
multi-label MNIST data.
Comment: 4 pages, 3 figures
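
The attentive module described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `bilinear_attention`, the projection matrix `weight`, the feature sizes, and the plain softmax normalization are all assumptions made for illustration, and the regularization the abstract mentions is omitted because its form is not given here. The sketch scores each spatial location of a backbone feature map against a top-down query (here a one-hot label) through a bilinear form and normalizes the scores into a spatial attention map.

```python
import numpy as np

def bilinear_attention(features, query, weight):
    """Hypothetical bilinear attention over a spatial feature map.

    features: (H, W, Dv) visual features from some backbone (assumed shape)
    query:    (Dq,) top-down signal, e.g. a one-hot label
    weight:   (Dv, Dq) bilinear interaction matrix (assumed, would be learned)
    """
    h, w, dv = features.shape
    flat = features.reshape(h * w, dv)

    # Bilinear score per location: v_i^T W q
    scores = flat @ weight @ query

    # Softmax over spatial positions (shifted by the max for stability)
    scores = scores - scores.max()
    attn = np.exp(scores)
    attn /= attn.sum()

    # Attention-weighted sum of visual features
    attended = attn @ flat
    return attn.reshape(h, w), attended

# Toy inputs: random features and a one-hot query for class 3
rng = np.random.default_rng(0)
features = rng.normal(size=(7, 7, 16))
query = np.zeros(10)
query[3] = 1.0
weight = rng.normal(size=(16, 10)) * 0.1

attn_map, attended = bilinear_attention(features, query, weight)
```

In a weakly supervised setup of the kind the abstract describes, a classification loss on the attended feature could drive `weight` so that the attention map concentrates on the region matching the query, without any pixel-level labels; that training loop is not shown here.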