
    Using Deep Features to Predict Where People Look

    When free-viewing scenes, the first few fixations of human observers are driven in part by bottom-up attention. We seek to characterize this process by extracting all information from images that can be used to predict fixation densities (Kuemmerer et al., PNAS, 2015). If we ignore time and observer identity, the average amount of information is slightly larger than 2 bits per image for the MIT1003 dataset; the minimum is 0.3 bits and the maximum 5.2 bits. Before the rise of deep neural networks, the best models were able to capture about one third of this information on average. We developed new saliency algorithms based on high-performing convolutional neural networks such as AlexNet or VGG-19, which have been shown to provide generally useful representations of natural images. Using a transfer-learning paradigm, we first developed DeepGaze I, based on AlexNet, which captures 56% of the total information. Subsequently, we developed DeepGaze II, based on VGG-19, which captures 88% and is state-of-the-art on the MIT300 benchmark dataset. We will show best-case and worst-case examples as well as feature-selection methods to visualize which structures in the image are critical for predicting fixation densities.
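    The bits-per-image figures above come from comparing the log-likelihood of a model's fixation density against a baseline density. The following is a minimal sketch of that idea with a hypothetical helper and illustrative values, not the authors' evaluation code; it reports information gain in bits per fixation for a toy density.

```python
import numpy as np

def information_gain(model_density, baseline_density, fixations):
    # Average log-likelihood ratio, in bits per fixation, of the model density
    # over a baseline density, evaluated at the observed fixation locations.
    # model_density and baseline_density are 2-D arrays that each sum to 1;
    # fixations is a list of (row, col) pixel coordinates.
    rows, cols = zip(*fixations)
    log_ratio = np.log2(model_density[rows, cols]) - np.log2(baseline_density[rows, cols])
    return float(log_ratio.mean())

# Toy example: a random "model" density versus a uniform baseline.
h, w = 24, 32
baseline = np.full((h, w), 1.0 / (h * w))
model = np.random.rand(h, w)
model /= model.sum()
fixations = [(5, 10), (12, 20), (7, 15)]
print(f"{information_gain(model, baseline, fixations):.3f} bits per fixation")
```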

    Evaluating Models of Scanpath Prediction


    Deep Gaze I: Boosting Saliency Prediction with Feature Maps Trained on ImageNet

    Recent results suggest that state-of-the-art saliency models perform far from optimal in predicting fixations. This shortfall in performance has been attributed to an inability to model the influence of high-level image features such as objects. Recent seminal advances in applying deep neural networks to tasks like object recognition suggest that such networks are able to capture this kind of structure. However, the enormous amount of training data necessary to train these networks makes them difficult to apply directly to saliency prediction. We present a novel way of reusing existing neural networks that have been pretrained on the task of object recognition in models of fixation prediction. Using the well-known network of Krizhevsky et al. (2012), we obtain a new saliency model that significantly outperforms all state-of-the-art models on the MIT Saliency Benchmark. We show that the structure of this network allows new insights into the psychophysics of fixation selection and potentially its neural implementation. To train our network, we build on recent work on the modeling of saliency as point processes.
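    The following is a minimal sketch of the transfer-learning idea described above, using torchvision's AlexNet as a stand-in for the Krizhevsky et al. (2012) network and a single learned 1x1-convolution readout over frozen features. It is a hypothetical simplification, not the published DeepGaze I architecture or training code.

```python
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models

class ReadoutSaliency(nn.Module):
    # Freeze a network pretrained on object recognition and learn only a
    # linear readout that maps its feature maps to a fixation density.
    def __init__(self):
        super().__init__()
        backbone = models.alexnet(weights=models.AlexNet_Weights.DEFAULT).features
        for p in backbone.parameters():
            p.requires_grad = False               # keep pretrained features fixed
        self.backbone = backbone
        self.readout = nn.Conv2d(256, 1, kernel_size=1)  # linear channel combination

    def forward(self, image):
        feats = self.backbone(image)              # (B, 256, H', W') feature maps
        logits = self.readout(feats)              # (B, 1, H', W') saliency logits
        b, _, h, w = logits.shape
        # A softmax over all spatial positions turns the map into a density,
        # so observed fixations can be scored with a point-process log-likelihood.
        return F.log_softmax(logits.view(b, -1), dim=1).view(b, 1, h, w)

# Training sketch: minimize the negative log-density at fixated pixels, e.g.
#   loss = -log_density[batch_idx, 0, fix_rows, fix_cols].mean()
```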

    Predicting Fixations From Deep and Low-Level Features

    Learning what properties of an image are associated with human gaze placement is important both for understanding how biological systems explore the environment and for computer vision applications. Recent advances in deep learning for the first time enable us to explain a significant portion of the information expressed in the spatial fixation structure. Our saliency model DeepGaze II uses the VGG network (trained on object recognition in the ImageNet challenge) to convert an image into a high-dimensional feature space, which is then read out by a second, very simple network to yield a density prediction. DeepGaze II is currently the best-performing model for predicting fixations during free viewing of still images (MIT Saliency Benchmark, AUC and sAUC). By retraining on other datasets, we can explore how the features driving fixations change over different tasks or over presentation time. Additionally, the modular architecture of DeepGaze II allows us to quantify how predictive certain features are for fixations. We demonstrate this by replacing the VGG network with very simple isotropic mean-luminance-contrast features and obtain a network that outperforms all saliency models that predate the use of pretrained deep networks (including models with high-level features such as Judd or eDN). Using DeepGaze and the Mean-Luminance-Contrast model (MLC), we can separate how much low-level and high-level features contribute to fixation selection in different situations.
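    As an illustration of how a simple feature stage can be swapped in for VGG within such a modular readout architecture, the following sketch computes isotropic mean-luminance and local-contrast maps at several scales. The scales and implementation details are illustrative assumptions, not the published MLC parameters.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def mean_luminance_contrast(image_gray, sigmas=(1, 2, 4, 8, 16)):
    # Isotropic mean-luminance and local-contrast maps at several scales:
    # for each sigma, one Gaussian-smoothed luminance map and one local
    # standard-deviation (contrast) map.
    feature_maps = []
    for sigma in sigmas:
        local_mean = gaussian_filter(image_gray, sigma)
        local_sq_mean = gaussian_filter(image_gray ** 2, sigma)
        local_var = np.clip(local_sq_mean - local_mean ** 2, 0.0, None)
        feature_maps.append(local_mean)
        feature_maps.append(np.sqrt(local_var))
    return np.stack(feature_maps)     # shape: (2 * len(sigmas), H, W)

# In the modular setup described above, these maps would be fed to the same
# readout network in place of the VGG feature maps.
image = np.random.rand(240, 320)
print(mean_luminance_contrast(image).shape)   # (10, 240, 320)
```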