83 research outputs found
Simple vs complex temporal recurrences for video saliency prediction
This paper investigates extending an existing neural network architecture for static saliency prediction with two types of recurrences that integrate information from the temporal domain. The first modification is the addition of a ConvLSTM within the architecture, while the second is a conceptually simple exponential moving average of an internal convolutional state. We use weights pre-trained on the SALICON dataset and fine-tune our model on DHF1K. Our results show that both modifications achieve state-of-the-art performance and produce similar saliency maps. Source code is available at https://git.io/fjPiB
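The second, simpler recurrence described above is an exponential moving average (EMA) of an internal convolutional state. A minimal NumPy sketch of the idea, assuming a fixed smoothing factor `alpha` (in the actual model this runs over internal feature maps of the network, and the decay value may be chosen differently):

```python
import numpy as np

def ema_recurrence(features, alpha=0.1):
    """Exponential moving average over a sequence of feature maps.

    `features`: iterable of (H, W, C) arrays, e.g. activations from a
    static saliency backbone; `alpha` is an illustrative smoothing
    factor, not the paper's exact value.
    """
    state = None
    outputs = []
    for x in features:
        # Blend the new frame's features into the running state.
        state = x if state is None else alpha * x + (1 - alpha) * state
        outputs.append(state)
    return outputs
```

Unlike a ConvLSTM, this recurrence adds no learnable parameters beyond the (optional) decay, which is what makes the comparison in the paper interesting.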
Unified Image and Video Saliency Modeling
Visual saliency modeling for images and videos is treated as two independent
tasks in recent computer vision literature. While image saliency modeling is a
well-studied problem and progress on benchmarks like SALICON and MIT300 is
slowing, video saliency models have shown rapid gains on the recent DHF1K
benchmark. Here, we take a step back and ask: Can image and video saliency
modeling be approached via a unified model, with mutual benefit? We identify
different sources of domain shift between image and video saliency data and
between different video saliency datasets as a key challenge for effective
joint modeling. To address this, we propose four novel domain adaptation
techniques - Domain-Adaptive Priors, Domain-Adaptive Fusion, Domain-Adaptive
Smoothing and Bypass-RNN - in addition to an improved formulation of learned
Gaussian priors. We integrate these techniques into a simple and lightweight
encoder-RNN-decoder-style network, UNISAL, and train it jointly with image and
video saliency data. We evaluate our method on the video saliency datasets
DHF1K, Hollywood-2 and UCF-Sports, and the image saliency datasets SALICON and
MIT300. With one set of parameters, UNISAL achieves state-of-the-art
performance on all video saliency datasets and is on par with the
state-of-the-art for image saliency datasets, despite a faster runtime and a 5-
to 20-fold smaller model size compared to all competing deep methods. We provide
retrospective analyses and ablation studies which confirm the importance of the
domain shift modeling. The code is available at
https://github.com/rdroste/unisal
Comment: Presented at the European Conference on Computer Vision (ECCV) 2020.
R. Droste and J. Jiao contributed equally to this work. v3: Updated Fig. 5a)
and added new MIT300 benchmark results to supp. material.
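Among the techniques listed in the abstract, the learned Gaussian priors are easy to sketch: a spatial prior map rendered from a few parameters, which in UNISAL would be learned and kept separate per dataset/domain (domain-adaptive priors). The parameter values below are illustrative assumptions, not the paper's:

```python
import numpy as np

def gaussian_prior(h, w, mu_x=0.5, mu_y=0.5, sigma_x=0.25, sigma_y=0.25):
    """Render a 2-D Gaussian prior map over an h x w grid.

    Parameters are in normalized [0, 1] image coordinates. A model can
    keep one learned (mu, sigma) set per domain and add or multiply the
    resulting map into its saliency decoder output.
    """
    ys, xs = np.mgrid[0:h, 0:w]
    xs = (xs + 0.5) / w  # pixel-center coordinates in [0, 1]
    ys = (ys + 0.5) / h
    g = np.exp(-((xs - mu_x) ** 2 / (2 * sigma_x ** 2)
                 + (ys - mu_y) ** 2 / (2 * sigma_y ** 2)))
    return g / g.sum()  # normalize to a probability map
```

The default center position reflects the well-known center bias of human fixations on standard image and video datasets.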
Problems with Saliency Maps
Despite the popularity that saliency models have gained in the computer vision community, they are most often conceived, exploited and benchmarked without taking heed of a number of problems and subtle issues they bring about. When saliency maps are used as proxies for the likelihood of fixating a location in a viewed scene, one such issue is the temporal dimension of visual attention deployment. Through a simple simulation, it is shown how neglecting this dimension leads to results that, at best, cast shadows on the predictive performance of a model and its assessment via benchmarking procedures.
WinDB: HMD-free and Distortion-free Panoptic Video Fixation Learning
To date, the widely-adopted way to perform fixation collection in panoptic
video is based on a head-mounted display (HMD), where participants' fixations
are collected while wearing an HMD to explore the given panoptic scene freely.
However, this widely-used data collection method is insufficient for training
deep models to accurately predict which regions in a given panoptic scene are
most important when it contains intermittent salient events. The main reason is
there always exist "blind zooms" when using HMD to collect fixations since the
participants cannot keep spinning their heads to explore the entire panoptic
scene all the time. Consequently, the collected fixations tend to be trapped in
some local views, leaving the remaining areas to be the "blind zooms".
Therefore, fixation data collected using HMD-based methods that accumulate
local views cannot accurately represent the overall global importance of
complex panoramic scenes. This paper introduces the auxiliary Window with
Dynamic Blurring (WinDB) fixation collection approach for panoptic video, which
requires no HMD and is free of blind zooms. Thus, the collected fixations can
faithfully reflect the region-wise importance of the scene. Using our WinDB
approach, we have released a new PanopticVideo-300 dataset, containing 300
panoptic clips covering over 225 categories. In addition, we present a simple
baseline design that takes full advantage of PanopticVideo-300 to handle the
fixation shifting problem induced by its blind-zoom-free attribute.
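The core idea of an auxiliary window with dynamic blurring can be sketched as follows: keep a clear window around the current gaze point and blur the rest of the panoramic frame, so a viewer on a flat display is guided like an HMD wearer yet can still reach every region. The box blur and the circular window below are simplifying assumptions, not the authors' implementation:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def windb_frame(frame, cx, cy, radius, blur_k=9):
    """Keep a clear circular window around gaze point (cx, cy) and
    blur the remainder of a single-channel panoramic frame.

    `blur_k` is an odd box-blur kernel size; all values here are
    illustrative, not the parameters used in the WinDB paper.
    """
    h, w = frame.shape
    pad = blur_k // 2
    padded = np.pad(frame, pad, mode='edge')
    # Box blur: mean over each blur_k x blur_k neighborhood.
    blurred = sliding_window_view(padded, (blur_k, blur_k)).mean(axis=(-1, -2))
    ys, xs = np.mgrid[0:h, 0:w]
    # Circular clear window around the gaze point.
    mask = (ys - cy) ** 2 + (xs - cx) ** 2 <= radius ** 2
    return np.where(mask, frame, blurred)
```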
How to look next? A data-driven approach for scanpath prediction
By and large, current visual attention models mostly rely, when considering static stimuli, on the following procedure. Given an image, a saliency map is computed, which, in turn, might serve to predict a sequence of gaze shifts, namely a scanpath instantiating the dynamics of visual attention deployment. The temporal pattern of attention unfolding is thus confined to the scanpath generation stage, whilst salience is conceived as a static map, at best conflating a number of factors (bottom-up information, top-down cues, spatial biases, etc.). In this note we propose a novel sequential scheme consisting of three processing stages that rely on a center-bias model, a context/layout model, and an object-based model, respectively. Each stage contributes, at different times, to the sequential sampling of the final scanpath. We compare the method against classic scanpath generation that exploits a state-of-the-art static saliency model. Results show that accounting for the structure of the temporal unfolding leads to gaze dynamics close to human gaze behaviour.
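The three-stage sequential sampling can be sketched as mixing the three stage maps with time-varying weights: early fixations lean on the center bias, later ones on the context/layout and object maps. The weighting schedule below is an illustrative assumption, not the authors' exact formulation:

```python
import numpy as np

def sample_scanpath(center_map, layout_map, object_map, n_fix=5, rng=None):
    """Sample a scanpath of n_fix fixations from three stage maps.

    Each map is an (H, W) non-negative array; the quadratic weighting
    schedule over the scanpath is a hypothetical choice for this sketch.
    """
    rng = np.random.default_rng(rng)
    maps = np.stack([m / m.sum() for m in (center_map, layout_map, object_map)])
    h, w = center_map.shape
    path = []
    for t in range(n_fix):
        s = t / max(n_fix - 1, 1)  # progress 0 -> 1 over the scanpath
        # Early: center bias; middle: context/layout; late: objects.
        weights = np.array([(1 - s) ** 2, 2 * s * (1 - s), s ** 2])
        prob = np.tensordot(weights, maps, axes=1).ravel()
        prob /= prob.sum()
        idx = rng.choice(h * w, p=prob)
        path.append((idx // w, idx % w))
    return path
```

Sampling from a probability map, rather than taking argmax, is what gives the generated scanpaths the stochastic variability observed in human gaze data.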
- …