Learning visual saliency by combining feature maps in a nonlinear manner using AdaBoost
To predict where subjects look under natural viewing conditions, biologically inspired saliency models decompose visual input into a set of feature maps across spatial scales. The outputs of these feature maps are summed to yield the final saliency map. We studied the integration of bottom-up feature maps across multiple spatial scales using eye movement data from four recent eye-tracking datasets. We use AdaBoost as the central computational module, which handles feature selection, thresholding, weight assignment, and integration in a principled, nonlinear learning framework. By combining the outputs of the feature maps via a series of nonlinear classifiers, the new model consistently predicts eye movements better than any of its competitors.
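The abstract's core idea can be illustrated with a minimal sketch: treat per-pixel responses of several feature maps as a feature vector, label pixels as fixated or not, and let AdaBoost learn a nonlinear combination. The feature values below are synthetic stand-ins, not the paper's biologically inspired multi-scale maps.

```python
# Hedged sketch: nonlinear combination of saliency feature-map responses
# with AdaBoost. Feature values are synthetic; the actual model uses
# multi-scale feature maps and eye-tracking ground truth.
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

rng = np.random.default_rng(0)
n = 400
# Each row: responses of 3 hypothetical feature maps (e.g. intensity,
# color, orientation) at one pixel; label 1 = fixated, 0 = not fixated.
fixated = rng.normal(loc=1.0, scale=0.5, size=(n, 3))
background = rng.normal(loc=0.0, scale=0.5, size=(n, 3))
X = np.vstack([fixated, background])
y = np.array([1] * n + [0] * n)

# AdaBoost over decision stumps performs feature selection, thresholding,
# and nonlinear weighting within one framework, as the abstract describes.
clf = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X, y)
train_acc = clf.score(X, y)
```

The learned ensemble replaces the classical unweighted sum of feature maps with a sequence of thresholded, weighted decisions.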
Surgical Phase Recognition of Short Video Shots Based on Temporal Modeling of Deep Features
Recognizing the phases of a laparoscopic surgery (LS) operation from its video constitutes a fundamental step for efficient content representation, indexing, and retrieval in surgical video databases. In the literature, most techniques focus on phase segmentation of the entire LS video using hand-crafted visual features, instrument usage signals, and, more recently, convolutional neural networks (CNNs). In this paper we address the problem of phase recognition of short video shots (10 s) of the operation, without utilizing information about the preceding/forthcoming video frames, their phase labels, or the instruments used. We investigate four state-of-the-art CNN architectures (AlexNet, VGG19, GoogLeNet, and ResNet101) for feature extraction via transfer learning. Visual saliency was employed for selecting the most informative region of the image as input to the CNN. Video shot representation was based on two temporal pooling mechanisms. Most importantly, we investigate the role of 'elapsed time' (from the beginning of the operation), and we show that including this feature increases performance dramatically (from 69% to 75% mean accuracy). Finally, a long short-term memory (LSTM) network was trained for video shot classification based on the fusion of CNN features with 'elapsed time', increasing the accuracy to 86%. Our results highlight the prominent role of visual saliency, long-range temporal recursion, and 'elapsed time' (a feature so far ignored) for surgical phase recognition.
Comment: 6 pages, 4 figures, 6 tables
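Two of the ingredients named above, temporal pooling of per-frame CNN features over a shot and fusion with 'elapsed time', can be sketched as follows. The 512-d "CNN features" are random stand-ins, and the specific pooling and normalization choices here are assumptions, not the paper's exact pipeline.

```python
# Hedged sketch: temporal pooling of per-frame CNN features for a 10 s
# shot, then fusion with the shot's normalized elapsed time. Frame
# features are random placeholders for real CNN activations.
import numpy as np

rng = np.random.default_rng(1)
frames = rng.random((25, 512))   # e.g. 25 frames x 512-d CNN features

# Two simple temporal pooling mechanisms (illustrative choices):
mean_pooled = frames.mean(axis=0)
max_pooled = frames.max(axis=0)

# Fuse the pooled descriptor with elapsed time, scaled to [0, 1] by a
# hypothetical nominal operation duration of 60 minutes.
elapsed_time = 14.0 / 60.0       # hypothetical: 14 min into the operation
shot_descriptor = np.concatenate([mean_pooled, max_pooled, [elapsed_time]])
```

A classifier (or, as in the paper, an LSTM over a sequence of such descriptors) would then consume `shot_descriptor`.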
A Software Retina for Egocentric & Robotic Vision Applications on Mobile Platforms
We present work in progress to develop a low-cost, highly integrated camera sensor for egocentric and robotic vision. Our underlying approach is to address current limitations of image analysis by deep convolutional neural networks, such as the requirement to learn simple scale and rotation transformations, which contribute to the large computational demands of training and the opaqueness of the learned structure, by applying structural constraints based on known properties of the human visual system. We propose to apply a version of the retino-cortical transform to reduce the dimensionality of the input image space by a factor of ×100, and to map the image spatially so that rotations and scale changes become spatial shifts. By reducing the input image size, and therefore the learning requirements, accordingly, we aim to develop a compact and lightweight egocentric and robot vision sensor using a smartphone as the target platform.
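The retino-cortical transform referred to above can be approximated by a log-polar resampling: pixels are sampled on rings whose radius grows exponentially with eccentricity, so a rotation of the input becomes a shift along the angular axis of the output. The grid below is a minimal sketch with illustrative sizes; the paper's software retina uses a far denser tessellation to reach its ×100 reduction.

```python
# Hedged sketch of a log-polar (retino-cortical) resampling. Grid sizes
# are illustrative assumptions, not the paper's actual retina layout.
import numpy as np

def log_polar_sample(img, n_rings=16, n_wedges=32):
    """Sample img on an exponentially spaced ring/wedge grid."""
    h, w = img.shape
    cy, cx = h / 2, w / 2
    r_max = min(cy, cx) - 1          # keep all samples inside the image
    out = np.zeros((n_rings, n_wedges))
    for i in range(n_rings):
        # Exponentially growing radius: dense sampling at the fovea,
        # sparse sampling in the periphery.
        r = r_max ** ((i + 1) / n_rings)
        for j in range(n_wedges):
            theta = 2 * np.pi * j / n_wedges
            y = int(cy + r * np.sin(theta))
            x = int(cx + r * np.cos(theta))
            out[i, j] = img[y, x]
    return out

# 64x64 input (4096 pixels) -> 16x32 output (512 samples); a denser real
# retina would target a much larger reduction.
retina_out = log_polar_sample(np.ones((64, 64)))
```

Rotating the input image corresponds, up to sampling error, to rolling `retina_out` along its wedge axis, which is the shift-equivariance property the abstract exploits.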
3D Visual saliency: an independent perceptual measure or a derivative of 2D image saliency?
While 3D visual saliency aims to predict the regional importance of 3D surfaces in agreement with human visual perception and has been well researched in computer vision and graphics, the latest work with eye-tracking experiments shows that state-of-the-art 3D visual saliency methods remain poor at predicting human fixations. Cues emerging prominently from these experiments suggest that 3D visual saliency might be associated with 2D image saliency. This paper proposes a framework that combines a Generative Adversarial Network and a Conditional Random Field for learning the visual saliency of both a single 3D object and a scene composed of multiple 3D objects with image saliency ground truth, to 1) investigate whether 3D visual saliency is an independent perceptual measure or just a derivative of image saliency and 2) provide a weakly supervised method for more accurately predicting 3D visual saliency. Through extensive experiments, we not only demonstrate that our method significantly outperforms the state-of-the-art approaches, but also manage to answer the interesting and worthy question posed in the title of this paper.
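The weakly supervised data flow above (2D image saliency supervising 3D surface saliency) can be illustrated with a much simpler stand-in: sample an image saliency map at projected vertex positions, then smooth over mesh neighbours. This sketch replaces the paper's GAN with direct sampling and its CRF with one neighbour-averaging pass; every name below is a toy assumption.

```python
# Hedged, heavily simplified stand-in for 2D-to-3D saliency transfer.
# The paper uses a GAN + CRF; this only illustrates the data flow.
import numpy as np

rng = np.random.default_rng(2)
saliency_2d = rng.random((32, 32))         # hypothetical image saliency map

# Toy "mesh": vertices already expressed in normalized image coordinates,
# with a simple chain connectivity standing in for mesh edges.
verts_uv = rng.random((100, 2))
edges = [(i, i + 1) for i in range(99)]

# 1) Sample the 2D saliency at each projected vertex.
px = np.clip((verts_uv * 32).astype(int), 0, 31)
vert_sal = saliency_2d[px[:, 0], px[:, 1]]

# 2) One pass of neighbour averaging as a crude CRF-style smoother,
# encouraging nearby vertices to share similar saliency.
smoothed = vert_sal.copy()
for i, j in edges:
    avg = 0.5 * (vert_sal[i] + vert_sal[j])
    smoothed[i] = 0.5 * (smoothed[i] + avg)
    smoothed[j] = 0.5 * (smoothed[j] + avg)
```

The question the paper asks is precisely whether a pipeline of this shape, driven only by 2D saliency supervision, can match perception-derived 3D saliency.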
Visual Task Classification using Classic Machine Learning and CNNs
Our eyes actively perform tasks including, but not limited to, searching, comparing, and counting. This includes tasks performed in front of a computer, whether trivial activities like reading email or video gaming, or more serious activities like drone management or flight simulation. Understanding what type of visual task is being performed is important for developing intelligent user interfaces. In this work, we investigated standard machine learning and deep learning methods to identify the task type from eye-tracking data, including both raw numerical data and visual representations of the user's gaze scan paths and pupil size. To this end, we experimented with computer vision algorithms such as Convolutional Neural Networks (CNNs) and compared the results to classic machine learning algorithms. We found that classic machine learning methods classify tasks involving minimal visual search with high accuracy, while CNN-based techniques do better in situations that involve visual search.
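The "classic machine learning on raw numerical data" branch described above can be sketched as hand-crafted scan-path features fed to a standard classifier. The gaze traces, feature choices, and two-task setup below are all synthetic assumptions for illustration.

```python
# Hedged sketch: classifying visual-task type from gaze traces with
# hand-crafted features and a standard classifier. All data is synthetic.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(3)

def gaze_features(trace):
    # Simple scan-path descriptors: mean step (saccade) length,
    # total path length, and spatial spread of the gaze points.
    steps = np.linalg.norm(np.diff(trace, axis=0), axis=1)
    return [steps.mean(), steps.sum(), trace.std()]

# Toy regimes: "search"-like traces with large jumps vs. "reading"-like
# traces with small steady steps (a deliberate simplification).
search = [np.cumsum(rng.normal(0, 5.0, (50, 2)), axis=0) for _ in range(40)]
reading = [np.cumsum(rng.normal(0, 0.5, (50, 2)), axis=0) for _ in range(40)]
X = np.array([gaze_features(t) for t in search + reading])
y = np.array([1] * 40 + [0] * 40)

clf = RandomForestClassifier(n_estimators=30, random_state=0).fit(X, y)
acc = clf.score(X, y)
```

A CNN-based variant would instead rasterize the scan path into an image and learn features directly, which is the comparison the abstract reports.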
Low-level spatiochromatic grouping for saliency estimation
We propose a saliency model termed SIM (saliency by induction mechanisms), which is based on a low-level spatiochromatic model that has successfully predicted chromatic induction phenomena. In so doing, we hypothesize that the low-level visual mechanisms that enhance or suppress image detail are also responsible for making some image regions more salient. Moreover, SIM adds geometrical grouplets to enhance complex low-level features such as corners, and to suppress relatively simpler features such as edges. Since our model has been fitted on psychophysical chromatic induction data, it is largely nonparametric. SIM outperforms state-of-the-art methods in predicting eye fixations on two datasets and using two metrics.
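A much simpler cousin of SIM's enhance/suppress mechanisms is a center-surround contrast: image detail that stands out from its local surround is marked salient. The sketch below uses naive box filters of two sizes as a rough difference-of-Gaussians; the filter sizes are illustrative assumptions, not SIM's psychophysically fitted parameters.

```python
# Hedged sketch: low-level center-surround saliency via the difference of
# a fine and a coarse local average. Filter sizes are illustrative only.
import numpy as np

def box_blur(img, k):
    # Naive (2k+1)x(2k+1) box filter with edge padding; O(h*w*k^2),
    # which is fine for this toy example.
    pad = np.pad(img, k, mode='edge')
    out = np.zeros_like(img)
    h, w = img.shape
    for y in range(h):
        for x in range(w):
            out[y, x] = pad[y:y + 2 * k + 1, x:x + 2 * k + 1].mean()
    return out

img = np.zeros((32, 32))
img[14:18, 14:18] = 1.0          # small bright patch on a dark field

# Fine minus coarse local average ~ difference-of-Gaussians contrast.
saliency = np.abs(box_blur(img, 1) - box_blur(img, 4))
peak = np.unravel_index(saliency.argmax(), saliency.shape)
```

SIM goes well beyond this by operating spatiochromatically and adding grouplets that favor corners over plain edges, but the underlying enhance-what-differs principle is the same.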