
    Actions in the Eye: Dynamic Gaze Datasets and Learnt Saliency Models for Visual Recognition

    Systems based on bag-of-words models built from image features collected at the maxima of sparse interest point operators have been used successfully for both visual object and action recognition tasks. While the sparse, interest-point based approach to recognition is not inconsistent with visual processing in biological systems that operate in `saccade and fixate' regimes, the methodology and emphasis in the human and the computer vision communities remain sharply distinct. Here, we make three contributions aiming to bridge this gap. First, we complement existing state-of-the-art large-scale dynamic computer vision annotated datasets like Hollywood-2 and UCF Sports with human eye movements collected under the ecological constraints of the visual action recognition task. To our knowledge these are the first large human eye-tracking datasets to be collected and made publicly available for video (vision.imar.ro/eyetracking; 497,107 frames, each viewed by 16 subjects), unique in terms of their (a) large scale and computer vision relevance, (b) dynamic, video stimuli, and (c) task control, as opposed to free viewing. Second, we introduce novel sequential consistency and alignment measures, which underline the remarkable stability of patterns of visual search among subjects. Third, we leverage the significant amount of collected data to pursue studies and build automatic, end-to-end trainable computer vision systems based on human eye movements. Our studies not only shed light on the differences between computer vision spatio-temporal interest point sampling strategies and human fixations, and on their impact on visual recognition performance, but also demonstrate that human fixations can be accurately predicted and, when used in an end-to-end automatic system that leverages advanced computer vision practice, can lead to state-of-the-art results.
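
    Working with such eye-tracking datasets typically starts by converting the raw per-frame fixation coordinates into dense ground-truth maps that can be compared with interest-point sampling or used for training. The sketch below illustrates one common way to do this; the fixation list format and the Gaussian width are assumptions made for illustration, not the consistency measures or pipeline used in the paper.

```python
# A minimal sketch (not the authors' pipeline): turning raw per-frame fixation
# coordinates from an eye-tracking dataset into a ground-truth fixation density
# map by accumulating fixation hits and smoothing with a Gaussian kernel.
# The fixation list format and the sigma value are assumptions for illustration.
import numpy as np
from scipy.ndimage import gaussian_filter

def fixation_density_map(fixations, height, width, sigma=25.0):
    """fixations: iterable of (x, y) pixel coordinates from all subjects
    for one video frame; returns a density map normalized to [0, 1]."""
    fmap = np.zeros((height, width), dtype=np.float32)
    for x, y in fixations:
        xi, yi = int(round(x)), int(round(y))
        if 0 <= yi < height and 0 <= xi < width:
            fmap[yi, xi] += 1.0          # accumulate binary fixation hits
    fmap = gaussian_filter(fmap, sigma)  # approximate foveal spread
    if fmap.max() > 0:
        fmap /= fmap.max()
    return fmap

# Example: a few subjects' fixations on a 480x640 frame (coordinates are made up).
frame_map = fixation_density_map([(320, 240), (315, 250), (500, 100)], 480, 640)
```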

    Predicting Video Saliency with Object-to-Motion CNN and Two-layer Convolutional LSTM

    Over the past few years, deep neural networks (DNNs) have exhibited great success in predicting the saliency of images. However, few works apply DNNs to predict the saliency of generic videos. In this paper, we propose a novel DNN-based video saliency prediction method. Specifically, we establish a large-scale eye-tracking database of videos (LEDOV), which provides sufficient data to train DNN models for predicting video saliency. Through statistical analysis of our LEDOV database, we find that human attention is normally attracted by objects, particularly moving objects or the moving parts of objects. Accordingly, we propose an object-to-motion convolutional neural network (OM-CNN) to learn spatio-temporal features for predicting intra-frame saliency by exploring both objectness and object motion information. We further find from our database that there exists a temporal correlation of human attention, with a smooth saliency transition across video frames. Therefore, we develop a two-layer convolutional long short-term memory (2C-LSTM) network in our DNN-based method, using the extracted features of OM-CNN as its input. Consequently, inter-frame saliency maps of videos can be generated, which account for the transition of attention across video frames. Finally, the experimental results show that our method advances the state of the art in video saliency prediction. Comment: Jiang, Lai and Xu, Mai and Liu, Tie and Qiao, Minglang and Wang, Zulin; DeepVS: A Deep Learning Based Video Saliency Prediction Approach; The European Conference on Computer Vision (ECCV); September 2018
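
    To make the recurrent part of such an architecture concrete, the following is a minimal sketch of a two-layer convolutional LSTM that turns a sequence of per-frame feature maps into per-frame saliency maps. It is not the authors' 2C-LSTM implementation: the channel sizes, kernel size, and the random tensor standing in for OM-CNN features are assumptions.

```python
# A minimal two-layer convolutional LSTM sketch in the spirit of the 2C-LSTM
# described above. NOT the authors' implementation: channel sizes, kernel size,
# and the dummy "features" standing in for OM-CNN output are assumptions.
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        # One convolution produces all four gates from [input, hidden].
        self.conv = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)
        self.hid_ch = hid_ch

    def forward(self, x, state):
        h, c = state
        gates = self.conv(torch.cat([x, h], dim=1))
        i, f, o, g = torch.chunk(gates, 4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        c = f * c + i * torch.tanh(g)
        h = o * torch.tanh(c)
        return h, c

class TwoLayerConvLSTM(nn.Module):
    def __init__(self, feat_ch=64, hid_ch=32):
        super().__init__()
        self.cell1 = ConvLSTMCell(feat_ch, hid_ch)
        self.cell2 = ConvLSTMCell(hid_ch, hid_ch)
        self.head = nn.Conv2d(hid_ch, 1, 1)      # 1-channel saliency map

    def forward(self, feats):                     # feats: (T, B, C, H, W)
        T, B, _, H, W = feats.shape
        def zeros():
            return torch.zeros(B, self.cell1.hid_ch, H, W, device=feats.device)
        s1, s2 = (zeros(), zeros()), (zeros(), zeros())
        maps = []
        for t in range(T):
            h1, c1 = self.cell1(feats[t], s1)
            s1 = (h1, c1)
            h2, c2 = self.cell2(h1, s2)
            s2 = (h2, c2)
            maps.append(torch.sigmoid(self.head(h2)))
        return torch.stack(maps)                  # (T, B, 1, H, W) saliency maps

# Usage with dummy spatio-temporal features standing in for OM-CNN output:
sal = TwoLayerConvLSTM()(torch.randn(5, 2, 64, 28, 28))
```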

    Learning Gaze Transitions from Depth to Improve Video Saliency Estimation

    In this paper we introduce a novel Depth-Aware Video Saliency approach to predict human focus of attention when viewing RGBD videos on regular 2D screens. We train a generative convolutional neural network which predicts a saliency map for a frame, given the fixation map of the previous frame. Saliency estimation in this scenario is highly important since, in the near future, 3D video content will be easily acquired yet hard to display. This can be explained, on the one hand, by the dramatic improvement of 3D-capable acquisition equipment; on the other hand, despite the considerable progress in 3D display technologies, most 3D displays are still expensive and require wearing special glasses. To evaluate the performance of our approach, we present a new comprehensive database of eye-fixation ground truth for RGBD videos. Our experiments indicate that integrating depth into video saliency calculation is beneficial. We demonstrate that our approach outperforms state-of-the-art methods for video saliency, achieving a 15% relative improvement.
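
    The frame-to-frame recurrence described above (predicting the current saliency map from the current RGBD frame and the previous frame's fixation map) can be sketched as follows. The tiny network is a placeholder for the paper's generative CNN; channel counts and layer sizes are assumptions.

```python
# Minimal sketch of the recurrent inference pattern: a CNN receives the current
# RGBD frame concatenated with the previous frame's saliency / fixation map and
# predicts the current saliency map. The network is a placeholder, not the
# paper's generative architecture; channel counts and sizes are assumptions.
import torch
import torch.nn as nn

class DepthAwareSaliencyNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Input: 3 (RGB) + 1 (depth) + 1 (previous saliency map) = 5 channels.
        self.net = nn.Sequential(
            nn.Conv2d(5, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 1), nn.Sigmoid(),
        )

    def forward(self, rgbd, prev_map):
        return self.net(torch.cat([rgbd, prev_map], dim=1))

model = DepthAwareSaliencyNet()
frames = torch.rand(10, 1, 4, 120, 160)   # dummy RGBD video: (T, B, 4, H, W)
prev = torch.zeros(1, 1, 120, 160)        # no prior attention at t = 0
saliency = []
for t in range(frames.shape[0]):
    prev = model(frames[t], prev)         # feed the prediction back in
    saliency.append(prev)
```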

    Computational models: Bottom-up and top-down aspects

    Computational models of visual attention have become popular over the past decade, we believe primarily for two reasons: first, models make testable predictions that can be explored by experimentalists as well as theoreticians; second, models have practical and technological applications of interest to the applied science and engineering communities. In this chapter, we take a critical look at recent attention modeling efforts. We focus on computational models of attention as defined by Tsotsos \& Rothenstein \shortcite{Tsotsos_Rothenstein11}: models which can process any visual stimulus (typically, an image or video clip), which can possibly also be given some task definition, and which make predictions that can be compared to human or animal behavioral or physiological responses elicited by the same stimulus and task. Thus, we here place less emphasis on abstract models, phenomenological models, purely data-driven fitting or extrapolation models, or models specifically designed for a single task or for a restricted class of stimuli. For theoretical models, we refer the reader to a number of previous reviews that address attention theories and models more generally \cite{Itti_Koch01nrn,Paletta_etal05,Frintrop_etal10,Rothenstein_Tsotsos08,Gottlieb_Balan10,Toet11,Borji_Itti12pami}.

    Bottom-up Attention, Models of

    In this review, we examine recent progress in saliency prediction and propose several avenues for future research. In spite of tremendous efforts and huge progress, there is still room for improvement in terms of finer-grained analysis of deep saliency models, evaluation measures, datasets, annotation methods, cognitive studies, and new applications. This chapter will appear in the Encyclopedia of Computational Neuroscience. Comment: arXiv admin note: substantial text overlap with arXiv:1810.0371

    Spatio-Temporal Saliency Networks for Dynamic Saliency Prediction

    Computational saliency models for still images have gained significant popularity in recent years. Saliency prediction from videos, on the other hand, has received relatively little interest from the community. Motivated by this, in this work we study the use of deep learning for dynamic saliency prediction and propose the so-called spatio-temporal saliency networks. The key to our models is the architecture of two-stream networks, in which we investigate different fusion mechanisms to integrate spatial and temporal information. We evaluate our models on the DIEM and UCF-Sports datasets and present highly competitive results against the existing state-of-the-art models. We also carry out experiments on a number of still images from the MIT300 dataset by exploiting optical flow maps predicted from these images. Our results show that considering inherent motion information in this way can be helpful for static saliency estimation.
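
    As a rough illustration of the two-stream idea, the sketch below feeds an RGB frame to a spatial stream and its optical flow field to a temporal stream, then fuses them with concatenation followed by a 1x1 convolution. The paper investigates several fusion mechanisms and uses much deeper backbones; everything below is an assumption made for illustration.

```python
# A minimal two-stream saliency sketch with a simple convolutional fusion
# layer. The shallow streams and the choice of fusion (concatenation + 1x1
# convolution) are assumptions, not the paper's exact architecture.
import torch
import torch.nn as nn

def stream(in_ch):
    # One small encoder per stream; real models would use deeper backbones.
    return nn.Sequential(
        nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU(),
        nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
    )

class TwoStreamSaliency(nn.Module):
    def __init__(self):
        super().__init__()
        self.spatial = stream(3)    # RGB appearance
        self.temporal = stream(2)   # optical flow (dx, dy)
        self.fuse = nn.Conv2d(64, 1, 1)

    def forward(self, rgb, flow):
        feats = torch.cat([self.spatial(rgb), self.temporal(flow)], dim=1)
        return torch.sigmoid(self.fuse(feats))

# Dummy inputs: a batch of frames and their optical flow fields.
sal = TwoStreamSaliency()(torch.rand(2, 3, 112, 112), torch.rand(2, 2, 112, 112))
```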

    Dynamical optical flow of saliency maps for predicting visual attention

    Saliency maps are used to understand human attention and visual fixation. However, while very well established for static images, there is no general agreement on how to compute a saliency map of dynamic scenes. In this paper we propose a mathematically rigorous approach to this problem, which incorporates static saliency maps of each video frame into the calculation of the optical flow. Taking static saliency maps into account when calculating the optical flow allows for overcoming the aperture problem. Our approach is able to explain human fixation behavior in situations which pose challenges to standard approaches, such as when a fixated object disappears behind an occlusion and reappears after several frames. In addition, we quantitatively compare our model against alternative solutions using a large eye-tracking data set. Together, our results suggest that assessing optical flow information across a series of saliency maps gives a highly accurate and useful account of human overt attention in dynamic scenes.
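
    A simple way to experiment with the idea of optical flow computed across saliency maps is sketched below: dense Farneback flow is estimated between consecutive static saliency maps and used to warp the previous frame's attention into the current frame. This is only an illustration under assumed conventions (float maps in [0, 1], ad hoc flow parameters), not the paper's mathematically rigorous formulation.

```python
# Illustration only: estimate dense optical flow between consecutive static
# saliency maps and use it to carry attention from one frame to the next.
# Saliency maps are assumed to be float arrays in [0, 1]; the Farneback
# parameters are ad hoc values chosen for the example.
import cv2
import numpy as np

def propagate_attention(prev_sal, cur_sal):
    """Warp the previous frame's saliency map into the current frame using
    dense optical flow estimated between the two saliency maps."""
    prev8 = (prev_sal * 255).astype(np.uint8)
    cur8 = (cur_sal * 255).astype(np.uint8)
    # Flow from the current map back to the previous one (backward warping).
    flow = cv2.calcOpticalFlowFarneback(cur8, prev8, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    h, w = cur_sal.shape
    gx, gy = np.meshgrid(np.arange(w), np.arange(h))
    map_x = (gx + flow[..., 0]).astype(np.float32)
    map_y = (gy + flow[..., 1]).astype(np.float32)
    warped = cv2.remap(prev_sal.astype(np.float32), map_x, map_y,
                       cv2.INTER_LINEAR)
    # Blend the propagated attention with the current static saliency map.
    return 0.5 * warped + 0.5 * cur_sal

prev_sal = np.random.rand(120, 160)
cur_sal = np.random.rand(120, 160)
dynamic_sal = propagate_attention(prev_sal, cur_sal)
```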

    Saliency Prediction in the Deep Learning Era: Successes, Limitations, and Future Challenges

    Visual saliency models have enjoyed a big leap in performance in recent years, thanks to advances in deep learning and large-scale annotated data. Despite enormous effort and huge breakthroughs, however, models still fall short of human-level accuracy. In this work, I explore the landscape of the field with an emphasis on new deep saliency models, benchmarks, and datasets. A large number of image and video saliency models are reviewed and compared over two image benchmarks and two large-scale video datasets. Further, I identify factors that contribute to the gap between models and humans and discuss remaining issues that need to be addressed to build the next generation of more powerful saliency models. Some specific questions that are addressed include: in what ways current models fail, how to remedy them, what can be learned from cognitive studies of attention, how explicit saliency judgments relate to fixations, how to conduct fair model comparison, and what the emerging applications of saliency models are.

    Computational models of attention

    This chapter reviews recent computational models of visual attention. We begin with models for the bottom-up or stimulus-driven guidance of attention to salient visual items, which we examine in seven broad categories. We then examine more complex models that address the top-down or goal-oriented guidance of attention towards items that are more relevant to the task at hand.

    Benchmark 3D eye-tracking dataset for visual saliency prediction on stereoscopic 3D video

    Visual Attention Models (VAMs) predict the regions of an image or video that are most likely to attract human attention. Although saliency detection is well explored for 2D image and video content, only a few attempts have been made to design 3D saliency prediction models. Newly proposed 3D visual attention models have to be validated over large-scale video saliency datasets that also contain eye-tracking information. Several eye-tracking datasets are publicly available for 2D image and video content; in the case of 3D, however, the research community still needs large-scale video saliency datasets for validating different 3D-VAMs. In this paper, we introduce a large-scale dataset containing eye-tracking data collected from 24 subjects who participated in a free-viewing test on 61 stereoscopic 3D videos (and their 2D versions). We evaluate the performance of existing saliency detection methods over the proposed dataset. In addition, we have created an online benchmark for validating the performance of existing 2D and 3D visual attention models and facilitating the addition of new VAMs. Our benchmark currently contains 50 different VAMs.
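
    Benchmarks like this one typically score a model's saliency map against recorded fixations with standard metrics. Below is a minimal sketch of the Normalized Scanpath Saliency (NSS) metric under an assumed fixation format (a list of (x, y) pixel positions); it is not the benchmark's own evaluation code.

```python
# Minimal sketch of the Normalized Scanpath Saliency (NSS) metric, one of the
# standard saliency evaluation measures. The fixation format (list of (x, y)
# pixel positions) is an assumption for illustration.
import numpy as np

def nss(saliency_map, fixations):
    """Mean of the standardized saliency values at human fixation points."""
    s = (saliency_map - saliency_map.mean()) / (saliency_map.std() + 1e-8)
    return float(np.mean([s[int(y), int(x)] for x, y in fixations]))

# Dummy example: a random prediction scored against three fixations.
pred = np.random.rand(480, 640)
score = nss(pred, [(320, 240), (100, 50), (600, 400)])
```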