Contrastive Language-Image Pretrained Models are Zero-Shot Human Scanpath Predictors
Understanding the mechanisms underlying human attention is a fundamental
challenge for both vision science and artificial intelligence. While numerous
computational models of free-viewing have been proposed, less is known about
the mechanisms underlying task-driven image exploration. To address this gap,
we present CapMIT1003, a database of captions and click-contingent image
explorations collected during captioning tasks. CapMIT1003 is based on the same
stimuli as the well-known MIT1003 benchmark, for which eye-tracking data under
free-viewing conditions is available, offering a promising opportunity to study
human attention under both tasks concurrently. We make
this dataset publicly available to facilitate future research in this field. In
addition, we introduce NevaClip, a novel zero-shot method for predicting visual
scanpaths that combines contrastive language-image pretrained (CLIP) models
with biologically-inspired neural visual attention (NeVA) algorithms. NevaClip
simulates human scanpaths by aligning the representation of the foveated visual
stimulus and the representation of the associated caption, employing
gradient-driven visual exploration to generate scanpaths. Our experimental
results demonstrate that NevaClip outperforms existing unsupervised
computational models of human visual attention in terms of scanpath
plausibility, for both captioning and free-viewing tasks. Furthermore, we show
that conditioning NevaClip with incorrect or misleading captions leads to
random behavior, highlighting the significant impact of caption guidance in the
decision-making process. These findings contribute to a better understanding of
mechanisms that guide human attention and pave the way for more sophisticated
computational approaches to scanpath prediction that can integrate direct
top-down guidance from downstream tasks.
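To make the gradient-driven idea concrete, here is a minimal, self-contained sketch (not the authors' implementation) of CLIP-guided scanpath simulation: the image is blurred everywhere except around the fixations made so far, and the next fixation's coordinates are found by gradient ascent on the cosine similarity between the foveated image's CLIP embedding and the caption's CLIP embedding. It assumes PyTorch, torchvision, and the openai/CLIP package; all function names, the Gaussian foveation, and the hyperparameters are illustrative assumptions.

```python
# Sketch only: CLIP-guided, gradient-driven scanpath simulation in the spirit
# of the approach described above. Assumes `pip install torch torchvision` and
# the openai/CLIP package (github.com/openai/CLIP); names are illustrative.
import torch
import torch.nn.functional as F
import clip
from torchvision.transforms import GaussianBlur

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
model = model.float()  # keep fp32 so gradients flow to the image without dtype mismatches


def foveate(image, fixations, sigma=0.15):
    """Blend sharp and blurred versions of the image with a differentiable
    Gaussian mask built from all fixations made so far (normalized x, y)."""
    _, _, h, w = image.shape
    ys = torch.linspace(0, 1, h, device=image.device).view(1, 1, h, 1)
    xs = torch.linspace(0, 1, w, device=image.device).view(1, 1, 1, w)
    mask = torch.zeros(1, 1, h, w, device=image.device)
    for fix in fixations:
        g = torch.exp(-((xs - fix[0]) ** 2 + (ys - fix[1]) ** 2) / (2 * sigma ** 2))
        mask = torch.maximum(mask, g)
    blurred = GaussianBlur(kernel_size=21, sigma=5.0)(image)
    return mask * image + (1 - mask) * blurred


def predict_scanpath(image, caption, n_fixations=5, steps=30, lr=0.05):
    """Greedily pick fixations that maximize the CLIP similarity between the
    caption and the progressively de-blurred (foveated) image."""
    with torch.no_grad():
        text_emb = F.normalize(
            model.encode_text(clip.tokenize([caption]).to(device)), dim=-1)
    past = []
    for _ in range(n_fixations):
        fix = torch.tensor([0.5, 0.5], device=device, requires_grad=True)  # start at centre
        opt = torch.optim.Adam([fix], lr=lr)
        for _ in range(steps):
            img_emb = F.normalize(
                model.encode_image(foveate(image, past + [fix])), dim=-1)
            loss = -(img_emb * text_emb).sum()  # negative cosine similarity
            opt.zero_grad()
            loss.backward()
            opt.step()
            with torch.no_grad():
                fix.clamp_(0.0, 1.0)
        past.append(fix.detach())
    return [f.cpu().tolist() for f in past]
```

A call would look roughly like `predict_scanpath(preprocess(Image.open("photo.jpg")).unsqueeze(0).to(device), "a dog catching a frisbee")`. Accumulating past fixations in the foveation mask is what nudges later fixations toward caption-relevant regions that are still blurred.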
Show, Recall, and Tell: Image Captioning with Recall Mechanism
Generating natural and accurate descriptions in image captioning has always
been a challenge. In this paper, we propose a novel recall mechanism to
imitate the way humans conduct captioning. Our recall mechanism consists of
three parts: a recall unit, a semantic guide (SG), and a recalled-word slot
(RWS). The recall unit is a text-retrieval module designed to retrieve recalled
words for images. SG and RWS are designed to make the best use of the recalled
words: the SG branch generates a recalled context that guides the
caption-generation process, while the RWS branch is responsible for copying
recalled words into the caption. Inspired by the pointing mechanism in text
summarization, we adopt a soft switch to balance the generated-word
probabilities between SG and RWS. In the CIDEr optimization step, we also
introduce an individual recalled-word reward (WR) to boost training. Our
proposed method (SG+RWS+WR) achieves BLEU-4 / CIDEr / SPICE scores of
36.6 / 116.9 / 21.3 with cross-entropy loss and 38.7 / 129.1 / 22.4 with CIDEr
optimization on the MSCOCO Karpathy test split, surpassing the results of other
state-of-the-art methods. Comment: Published in AAAI 2020
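To illustrate the soft-switch idea, below is a minimal PyTorch sketch (an assumption-based illustration, not the paper's code) of a single decoder step that blends a generated-word distribution (the SG path) with a copy distribution over retrieved recalled words (the RWS path) via a learned gate, in the spirit of pointer-generator networks. The module name, tensor shapes, and gating inputs are hypothetical.

```python
# Sketch only: a pointer-generator-style soft switch between generating from
# the vocabulary (SG) and copying retrieved "recalled" words (RWS).
import torch
import torch.nn as nn
import torch.nn.functional as F


class SoftSwitchDecoderStep(nn.Module):
    def __init__(self, hidden_size, vocab_size):
        super().__init__()
        self.gate = nn.Linear(hidden_size * 2, 1)        # computes p_gen from state + recalled context
        self.vocab_proj = nn.Linear(hidden_size, vocab_size)

    def forward(self, decoder_state, recalled_context, recalled_word_ids, recalled_attn):
        """
        decoder_state:     (B, H) current decoder hidden state
        recalled_context:  (B, H) attention-weighted summary of recalled words (SG)
        recalled_word_ids: (B, K) vocabulary ids of the K recalled words
        recalled_attn:     (B, K) attention weights over the recalled words (RWS)
        """
        # generated-word distribution, informed by the recalled context (SG branch)
        p_vocab = F.softmax(self.vocab_proj(decoder_state + recalled_context), dim=-1)
        # soft switch: probability of generating vs. copying a recalled word
        p_gen = torch.sigmoid(self.gate(torch.cat([decoder_state, recalled_context], dim=-1)))
        # copy distribution: scatter attention mass onto the recalled words' vocab ids (RWS branch)
        p_copy = torch.zeros_like(p_vocab).scatter_add(1, recalled_word_ids, recalled_attn)
        # final word distribution is a convex combination of the two paths
        return p_gen * p_vocab + (1 - p_gen) * p_copy
```

The design choice mirrors the abstract's description: the gate lets the decoder fall back on ordinary generation when none of the recalled words fits, while still allowing direct copying when one does.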