10 research outputs found

    Computer models of saliency alone fail to predict subjective visual attention to landmarks during observed navigation

    This study aimed to understand whether computer models of saliency could explain landmark saliency. An online survey was conducted in which participants were asked to watch videos from a spatial navigation video game (Sea Hero Quest). Participants were asked to pay attention to the environments within which the boat was moving and to rate the perceived saliency of each landmark. In addition, state-of-the-art computer saliency models were used to objectively quantify landmark saliency. No significant relationship was found between the objective and subjective saliency measures. This indicates that, during passive observation of an environment being navigated, current automated models of saliency fail to predict subjective reports of visual attention to landmarks
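
    A minimal sketch of how such a comparison can be run, assuming hypothetical per-landmark scores (the study's actual data and saliency models are not reproduced here), using a Spearman rank correlation between subjective ratings and model-derived saliency:

```python
# Hypothetical comparison of subjective landmark ratings with model-derived
# saliency scores; the numbers below are illustrative, not the study's data.
import numpy as np
from scipy.stats import spearmanr

# Mean subjective saliency rating per landmark (e.g. from a survey scale).
subjective = np.array([5.2, 3.1, 6.4, 2.8, 4.0, 5.9])

# Mean saliency-map activation inside each landmark's image region,
# as produced by some automated saliency model.
objective = np.array([0.41, 0.37, 0.44, 0.35, 0.40, 0.39])

rho, p_value = spearmanr(subjective, objective)
print(f"Spearman rho = {rho:.2f}, p = {p_value:.3f}")
# A non-significant result would mirror the reported finding that model
# saliency does not track subjective attention to landmarks.
```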

    Audio-visual saliency prediction for 360° video via deep learning

    The interest in virtual reality (VR) has rapidly grown in recent years, and it is now widely available to consumers in different forms. This technology provides an unprecedented level of immersion, creating many new possibilities that could change the way people experience digital content. Understanding how users behave and interact with virtual experiences could be decisive for many different applications, such as designing better virtual experiences, advanced compression techniques, or medical diagnosis. One of the most critical areas in the study of human behaviour is visual attention. It refers to the qualities that different items have which make them stand out and attract our attention. Despite significant advances in this field in recent years, saliency prediction remains a very challenging problem due to the many factors that affect the behaviour of the observer, such as stimuli sources of different types or users having different backgrounds and emotional states. On top of that, saliency prediction for VR content is even more difficult, as this form of media presents additional challenges such as distortions, users having control of the camera, or different stimuli possibly being located outside the current view of the observer. This work proposes a novel saliency prediction solution for 360° video based on deep learning. Deep learning has been proven to obtain outstanding results in many different image and video tasks, including saliency prediction. Although most works in this field focus solely on visual information, the proposed model incorporates both visual and directional audio information with the objective of obtaining more accurate predictions. It uses a series of convolutional neural networks (CNNs) specially designed for VR content, and it is able to learn spatio-temporal visual and auditory features by using three-dimensional convolutions. It is the first solution to make use of directional audio without the need for a hand-crafted attention modelling technique. The proposed model is evaluated using a publicly available dataset. The results show that it outperforms previous state-of-the-art work in both quantitative and qualitative analysis. Additionally, various ablation studies are presented, supporting the decisions made during the design phase of the model.
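
    An illustrative PyTorch sketch of the general idea (spatio-temporal 3D convolutions over a visual stream and a directional-audio stream, fused into one saliency map); the channel sizes and fusion scheme are assumptions for demonstration, not the thesis's actual architecture:

```python
# Illustrative sketch of spatio-temporal audio-visual fusion with 3D
# convolutions; layer sizes and the fusion scheme are assumptions only.
import torch
import torch.nn as nn

class AVSaliencySketch(nn.Module):
    def __init__(self):
        super().__init__()
        # Visual branch: an RGB clip of shape (B, 3, T, H, W).
        self.visual = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv3d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
        )
        # Audio branch: a directional-audio energy map aligned with the
        # equirectangular frame, shape (B, 1, T, H, W).
        self.audio = nn.Sequential(
            nn.Conv3d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv3d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
        )
        # Late fusion by channel concatenation, then a 1x1x1 prediction head.
        self.head = nn.Conv3d(64 + 32, 1, kernel_size=1)

    def forward(self, rgb, audio_map):
        fused = torch.cat([self.visual(rgb), self.audio(audio_map)], dim=1)
        return torch.sigmoid(self.head(fused)).mean(dim=2)  # average over time

# Example: a 16-frame clip at a low equirectangular resolution.
model = AVSaliencySketch()
out = model(torch.randn(1, 3, 16, 64, 128), torch.randn(1, 1, 16, 64, 128))
print(out.shape)  # torch.Size([1, 1, 64, 128])
```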

    Learning Gaze Transitions from Depth to Improve Video Saliency Estimation

    © 2017 IEEE. In this paper, we introduce a novel Depth-Aware Video Saliency approach to predict human focus of attention when viewing videos that contain a depth map (RGBD) on a 2D screen. Saliency estimation in this scenario is highly important since, in the near future, 3D video content will be easily acquired yet hard to display. Despite considerable progress in 3D display technologies, most are still expensive and require special glasses for viewing, so RGBD content is primarily viewed on 2D screens, removing the depth channel from the final viewing experience. We train a generative convolutional neural network that predicts the 2D viewing saliency map for a given frame using the RGBD pixel values and previous fixation estimates in the video. To evaluate the performance of our approach, we present a new comprehensive database of 2D viewing eye-fixation ground truth for RGBD videos. Our experiments indicate that it is beneficial to integrate depth into video saliency estimates for content that is viewed on a 2D display. We demonstrate that our approach outperforms state-of-the-art methods for video saliency, achieving a 15% relative improvement
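
    A minimal sketch of the input layout the abstract describes (RGB, depth, and a previous fixation estimate stacked as input channels of a convolutional predictor); the small network below is a placeholder assumption, not the paper's generative model:

```python
# Illustrative channel stacking for depth-aware video saliency: RGB frame,
# depth map, and previous fixation estimate form a 5-channel input.
import torch
import torch.nn as nn

class RGBDSaliencySketch(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(5, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, kernel_size=1),
        )

    def forward(self, rgb, depth, prev_fixation):
        # rgb: (B, 3, H, W), depth: (B, 1, H, W), prev_fixation: (B, 1, H, W)
        x = torch.cat([rgb, depth, prev_fixation], dim=1)
        return torch.sigmoid(self.net(x))

model = RGBDSaliencySketch()
saliency = model(torch.randn(1, 3, 120, 160),
                 torch.randn(1, 1, 120, 160),
                 torch.randn(1, 1, 120, 160))
print(saliency.shape)  # torch.Size([1, 1, 120, 160])
```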

    Visual saliency prediction based on deep learning

    The Human Visual System (HVS) has the ability to focus on specific parts of a scene rather than on the whole image. Human eye movement is also one of the primary functions used in our daily lives that helps us understand our surroundings. This phenomenon is one of the most active research topics in the computer vision and neuroscience fields. The outcomes achieved by neural network methods in a variety of tasks have highlighted their ability to predict visual saliency. In particular, deep learning models have been used for visual saliency prediction. In this thesis, a deep learning method based on a transfer learning strategy is proposed (Chapter 2), wherein visual features are extracted from raw images by the convolutional layers to predict visual saliency (i.e., a saliency map). Specifically, the proposed model uses the VGG-16 network (a pre-trained CNN model) for semantic segmentation. The proposed model is applied to several datasets, including TORONTO, MIT300, MIT1003, and DUT-OMRON, to illustrate its efficiency. The results of the proposed model are then quantitatively and qualitatively compared to classic and state-of-the-art deep learning models.

In Chapter 3, I specifically investigate the performance of five state-of-the-art deep neural networks (VGG-16, ResNet-50, Xception, InceptionResNet-v2, and MobileNet-v2) for the task of visual saliency prediction. The five deep learning models were trained on the SALICON dataset and used to predict visual saliency maps on four standard datasets, namely TORONTO, MIT300, MIT1003, and DUT-OMRON. The results indicate that the ResNet-50 model outperforms the other four and provides a visual saliency map that is very close to human performance.

In Chapter 4, a novel deep learning model based on a Fully Convolutional Network (FCN) architecture is proposed. The proposed model is trained in an end-to-end style and designed to predict visual saliency. The model is based on an encoder-decoder structure and includes two types of modules. The first has three stages of inception modules to improve multi-scale derivation and enhance contextual information. The second includes one stage of a residual module to provide a more accurate recovery of information and to simplify optimization. The entire model is trained from scratch to extract distinguishing features, and a data augmentation technique is used to create variations in the images. The proposed model is evaluated using several benchmark datasets, including MIT300, MIT1003, TORONTO, and DUT-OMRON. The quantitative and qualitative analyses demonstrate that the proposed model achieves superior performance for predicting visual saliency.

In Chapter 5, I study the possibility of using deep learning techniques for Salient Object Detection (SOD) because this work is slightly related to the problem of visual saliency prediction. Therefore, in this work, the capabilities of ten well-known pre-trained models for semantic segmentation, including FCNs, VGGs, ResNets, MobileNet-v2, Xception, and InceptionResNet-v2, are investigated. These models were trained on the ImageNet dataset, fine-tuned on the MSRA-10K dataset, and evaluated using other public datasets, such as ECSSD, MSRA-B, DUTS, and THUR15k. The results illustrate the superiority of ResNet50 and ResNet18, which have Mean Absolute Errors (MAE) of approximately 0.93 and 0.92, respectively, compared to other well-known FCN models.

Finally, conclusions are drawn, and possible future work is discussed in Chapter 6
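
    A rough sketch of the Chapter 2 transfer-learning idea, assuming a frozen ImageNet-pretrained VGG-16 backbone and a small, hypothetical decoder head (the thesis's actual decoder and training setup are not reproduced here):

```python
# Transfer-learning sketch: reuse VGG-16 convolutional features as a frozen
# encoder and learn a small decoder that upsamples to a saliency map.
import torch
import torch.nn as nn
from torchvision import models

class VGGSaliencySketch(nn.Module):
    def __init__(self):
        super().__init__()
        # Downloads ImageNet weights (torchvision >= 0.13 syntax).
        vgg = models.vgg16(weights="IMAGENET1K_V1")
        self.encoder = vgg.features                  # convolutional backbone
        for p in self.encoder.parameters():          # keep pre-trained weights fixed
            p.requires_grad = False
        # Hypothetical decoder: project to one channel, then upsample x32.
        self.decoder = nn.Sequential(
            nn.Conv2d(512, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 1, kernel_size=1),
            nn.Upsample(scale_factor=32, mode="bilinear", align_corners=False),
        )

    def forward(self, x):
        return torch.sigmoid(self.decoder(self.encoder(x)))

model = VGGSaliencySketch().eval()
with torch.no_grad():
    smap = model(torch.randn(1, 3, 224, 224))
print(smap.shape)  # torch.Size([1, 1, 224, 224])
```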

    The relationship between wayfinding performance, spatial layout and landmarks in virtual environments

    Environmental factors, including landmarks, that affect people’s wayfinding performance in unfamiliar environments have been discussed in a great number of studies. However, there is still no consensus on the factors that shape people’s performance or on what makes a landmark preferable during wayfinding. Hence, this study aims to understand the impact of different spatial layouts, environmental conditions and landmarks on people’s wayfinding performance, and the factors that make landmarks salient. Sea Hero Quest (SHQ), an online game that has been played by more than 4.3 million people from 2016 to date, is selected as a case study to investigate the impact of different environments and other factors, in particular landmarks. Forty-five wayfinding levels of SHQ are analysed and compared using Geographic Information System (GIS) and Space syntax axial, segment and visibility graph analyses. A cluster analysis is conducted to examine the relationship between levels. Varying conditions associated with landmarks, weather and maps are taken into consideration. In order to investigate the process of selecting landmarks, visual, structural (whether landmarks are global or local) and cognitive saliency are analysed using web-based surveys, saliency algorithms and the visibility of landmarks. Results of this study show that the complexity of layouts plays a major role in wayfinding; as the complexity of the layout increases, so does the time taken to complete the wayfinding task. Similarly, the weather condition has an effect; as the weather becomes foggy and visibility decreases, the time taken to complete the wayfinding task increases. It is discovered that landmarks that are visible for more than 25% of a journey can be defined as global landmarks, whereas the rest can be defined as local landmarks. Findings also show that landmarks that are visually salient (objects with a unique colour and size) and structurally salient (objects that are closer to people) are registered more by people in unfamiliar environments. This study contributes to the existing literature by exploring the factors that affect people’s wayfinding performance using the largest dataset in the field (thus providing more accurate results), by focusing on 45 different layouts (while current research studies mostly focus on one or two), by proposing a threshold to distinguish global and local landmarks, and by analysing visual, structural and cognitive saliency through various measures
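
    As a concrete illustration of the 25% visibility threshold described above, a small sketch (with made-up journey data) that labels each landmark as global or local from the fraction of the journey during which it is visible:

```python
# Illustrative application of the 25% visibility threshold: a landmark
# visible for more than a quarter of the journey is labelled "global",
# otherwise "local".  The journey data below is made up.
def classify_landmark(visible_frames: int, total_frames: int) -> str:
    """Classify a landmark from the fraction of the journey it is visible."""
    visibility = visible_frames / total_frames
    return "global" if visibility > 0.25 else "local"

journey = {"lighthouse": 310, "buoy": 45, "rock arch": 120}  # frames visible
total = 900  # total frames in the recorded journey

for name, frames in journey.items():
    print(name, classify_landmark(frames, total))
# lighthouse -> global (0.34), buoy -> local (0.05), rock arch -> local (0.13)
```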