Most studies in the computational modeling of visual attention address
task-free viewing of images. Free-viewing saliency, however, covers only a
limited range of everyday scenarios: most visual activities are goal-oriented
and demand substantial top-down control of attention, and visual search in
particular requires far more of it than free viewing. In this paper, we
present two approaches to model visual attention and distraction of observers
during visual search. Our first approach adapts a lightweight free-viewing
saliency model to predict the eye-fixation density maps of human observers
over the pixels of search images, using a two-stream convolutional
encoder-decoder network trained and evaluated on the COCO-Search18 dataset.
This method predicts
which locations are more distracting when searching for a particular target.
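A minimal TensorFlow/Keras sketch of such a two-stream network is shown below; the layer widths, the way the target category is injected, and the loss are illustrative assumptions, not the exact architecture used in the paper:

    import tensorflow as tf
    from tensorflow.keras import layers, Model

    def build_two_stream_saliency_model(img_shape=(320, 512, 3), num_targets=18):
        # Image stream: a small convolutional encoder over the search image.
        image_in = layers.Input(shape=img_shape, name="search_image")
        x = image_in
        for filters in (64, 128, 256):
            x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
            x = layers.MaxPooling2D(2)(x)

        # Target stream: a one-hot target category, embedded and tiled so it
        # can be fused with the image features at every spatial location.
        target_in = layers.Input(shape=(num_targets,), name="target_category")
        t = layers.Dense(256, activation="relu")(target_in)
        t = layers.Reshape((1, 1, 256))(t)
        t = layers.UpSampling2D(size=(img_shape[0] // 8, img_shape[1] // 8))(t)

        # Fuse both streams and decode back to a per-pixel density map.
        h = layers.Concatenate()([x, t])
        for filters in (256, 128, 64):
            h = layers.UpSampling2D(2)(h)
            h = layers.Conv2D(filters, 3, padding="same", activation="relu")(h)
        out = layers.Conv2D(1, 1, activation="sigmoid", name="fixation_density")(h)
        return Model([image_in, target_in], out)

    model = build_two_stream_saliency_model()
    model.compile(optimizer="adam", loss=tf.keras.losses.KLDivergence())

After three 2x pooling stages the 320x512 input becomes a 40x64 feature map, which is why the target embedding is tiled to that size before fusion; the decoder then mirrors the encoder so the predicted density map matches the input resolution.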
Our network achieves good results on standard saliency metrics (AUC-Judd=0.95,
AUC-Borji=0.85, sAUC=0.84, NSS=4.64, KLD=0.93, CC=0.72, SIM=0.54, and IG=2.59).
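For context on these numbers, NSS (normalized scanpath saliency), for example, is the average of the z-scored predicted map at the pixels observers actually fixated; an illustrative NumPy implementation (not the authors' evaluation code) is:

    import numpy as np

    def nss(saliency_map, fixation_map):
        # Z-score the predicted map, then average it over fixated pixels
        # (fixation_map is a binary map of human fixation locations).
        s = (saliency_map - saliency_map.mean()) / (saliency_map.std() + 1e-8)
        return float(s[fixation_map.astype(bool)].mean())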
Our second approach is object-based and predicts the distractor and target
objects during visual search, where distractors are all objects, other than
the target, that observers fixate on while searching. This method uses a
Mask R-CNN segmentation network pre-trained on MS-COCO and fine-tuned on the
COCO-Search18 dataset. We release our segmentation annotations of targets and distractors in
COCO-Search18 for three target categories: bottle, bowl, and car. The average
scores over the three categories are F1-score=0.64, mAP(IoU=0.5)=0.57, and
mAR(IoU=0.5)=0.73. Our implementation code in TensorFlow is publicly available
at https://github.com/ManooshSamiei/Distraction-Visual-Search.
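As a sketch of this fine-tuning step (assuming the widely used Matterport Mask_RCNN package; the paper's actual implementation, label scheme, hyperparameters, and dataset wrappers may all differ):

    from mrcnn.config import Config
    from mrcnn import model as modellib

    class SearchConfig(Config):
        # Assumed label scheme: background + target + distractor.
        NAME = "coco_search18"
        NUM_CLASSES = 1 + 2
        IMAGES_PER_GPU = 2
        STEPS_PER_EPOCH = 500

    def finetune(dataset_train, dataset_val):
        # dataset_train / dataset_val are mrcnn.utils.Dataset subclasses that
        # load the released target/distractor masks (implementation omitted).
        config = SearchConfig()
        model = modellib.MaskRCNN(mode="training", config=config,
                                  model_dir="./logs")
        # Start from MS-COCO weights; the class-specific head layers are
        # excluded (and thus reinitialized) because the class count changes.
        model.load_weights("mask_rcnn_coco.h5", by_name=True,
                           exclude=["mrcnn_class_logits", "mrcnn_bbox_fc",
                                    "mrcnn_bbox", "mrcnn_mask"])
        # Train only the heads; the backbone keeps its MS-COCO features.
        model.train(dataset_train, dataset_val,
                    learning_rate=config.LEARNING_RATE,
                    epochs=30, layers="heads")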
Comment: 33 pages, 24 figures, 12 tables; this is a pre-print manuscript currently under review at the Journal of Vision.