The prediction of human gaze behavior is important for building
human-computer interactive systems that can anticipate a user's attention.
Computer vision models have been developed to predict the fixations made by
people as they search for target objects. But what about when the image has no
target? Equally important is knowing how people search when they cannot find a
target, and when they decide to stop searching. In this paper, we propose the first
data-driven computational model that addresses the search-termination problem
and predicts the scanpath of search fixations made by people searching for
targets that do not appear in images. We model visual search as an imitation
learning problem and represent the internal knowledge that the viewer acquires
through fixations using a novel state representation that we call Foveated
Feature Maps (FFMs). FFMs integrate a simulated foveated retina into a
pretrained ConvNet that produces an in-network feature pyramid, all with
minimal computational overhead. Our method integrates FFMs as the state
representation in inverse reinforcement learning. Experimentally, we improve
the state of the art in predicting human target-absent search behavior on the
COCO-Search18 dataset.

Comment: Accepted to ECCV 2022
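To make the Foveated Feature Maps idea concrete, the sketch below blends a multi-scale feature pyramid into a single map whose effective resolution decays with eccentricity from the current fixation: fine features near the fovea, coarse features in the periphery. This is an illustrative sketch, not the authors' implementation; the linear eccentricity-to-level mapping, the `sigma` decay parameter, and the function name are assumptions.

```python
import numpy as np

def foveated_feature_maps(pyramid, fixation, sigma=0.25):
    """Blend a feature pyramid by eccentricity from a fixation point.

    pyramid  : list of (H, W, C) arrays, fine -> coarse, all at the
               same spatial size (e.g. upsampled ConvNet pyramid levels)
    fixation : (row, col) in [0, 1] normalized image coordinates
    sigma    : assumed parameter controlling how fast resolution decays
    """
    H, W, _ = pyramid[0].shape
    rows = np.linspace(0, 1, H)[:, None]
    cols = np.linspace(0, 1, W)[None, :]
    # Eccentricity: distance of each spatial location from the fixation.
    ecc = np.sqrt((rows - fixation[0]) ** 2 + (cols - fixation[1]) ** 2)
    # Fractional pyramid level grows with eccentricity (0 = finest).
    level = np.clip(ecc / sigma * (len(pyramid) - 1), 0, len(pyramid) - 1)
    lo = np.floor(level).astype(int)
    hi = np.minimum(lo + 1, len(pyramid) - 1)
    frac = (level - lo)[..., None]
    stack = np.stack(pyramid)                       # (L, H, W, C)
    rr, cc = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    fine, coarse = stack[lo, rr, cc], stack[hi, rr, cc]
    # Linearly interpolate between adjacent pyramid levels per location.
    return (1 - frac) * fine + frac * coarse
```

Because the blend is a per-location lookup into an already-computed in-network pyramid, updating the state after each new fixation costs only an interpolation pass, which is consistent with the minimal overhead the abstract describes.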