Recognition of Activities from Eye Gaze and Egocentric Video
This paper presents a framework for recognition of human activity from
egocentric video and eye tracking data obtained from a head-mounted eye
tracker. Three channels of information (eye movement, ego-motion, and visual
features) are combined for the classification of activities. Image features
were extracted using a pre-trained convolutional neural network. Eye movement
and ego-motion are quantized, and their windowed histograms are used as
features. The combination of all features yields better classification accuracy
than any individual feature.
Comment: 7 pages, 9 figures
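A minimal sketch (not the authors' code) of the kind of feature construction the abstract describes: per-frame gaze and ego-motion displacements are quantized into direction bins, histogrammed over fixed windows, and concatenated with pre-extracted CNN image features. The window length, bin count, and the random inputs are assumptions for illustration.

```python
# Quantize motion directions, build windowed histograms, and fuse with CNN features.
import numpy as np

def windowed_motion_histogram(dx, dy, n_bins=8, win=30):
    """Quantize per-frame motion directions into n_bins and histogram them per window."""
    angles = np.arctan2(dy, dx)                          # per-frame motion direction
    bins = ((angles + np.pi) / (2 * np.pi) * n_bins).astype(int) % n_bins
    feats = []
    for start in range(0, len(bins) - win + 1, win):
        hist = np.bincount(bins[start:start + win], minlength=n_bins)
        feats.append(hist / win)                         # normalized histogram
    return np.array(feats)

# Hypothetical inputs: per-frame gaze displacements, ego-motion, and pooled CNN features.
gaze_hist = windowed_motion_histogram(np.random.randn(300), np.random.randn(300))
ego_hist  = windowed_motion_histogram(np.random.randn(300), np.random.randn(300))
cnn_feats = np.random.randn(gaze_hist.shape[0], 512)     # e.g. pooled network activations
combined  = np.hstack([gaze_hist, ego_hist, cnn_feats])  # per-window feature vector
```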
Relevance Prediction from Eye-movements Using Semi-interpretable Convolutional Neural Networks
We propose an image-classification method to predict the perceived-relevance
of text documents from eye-movements. An eye-tracking study was conducted where
participants read short news articles, and rated them as relevant or irrelevant
for answering a trigger question. We encode participants' eye-movement
scanpaths as images, and then train a convolutional neural network classifier
using these scanpath images. The trained classifier is used to predict
participants' perceived-relevance of news articles from the corresponding
scanpath images. This method is content-independent, as the classifier does not
require knowledge of the screen-content, or the user's information-task. Even
with little data, the image classifier can predict perceived-relevance with up
to 80% accuracy. When compared to similar eye-tracking studies from the
literature, this scanpath image classification method outperforms previously
reported metrics by appreciable margins. We also attempt to interpret how the
image classifier differentiates between scanpaths on relevant and irrelevant
documents.
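A minimal sketch of the scanpath-to-image encoding described above; the exact rendering the authors use is not specified here, so the disc-radius scaling, colours, and screen size below are assumptions.

```python
# Render a scanpath (fixations + saccade path) as an image suitable for a CNN classifier.
import numpy as np
from PIL import Image, ImageDraw

def scanpath_to_image(fixations, screen=(1280, 1024), out_size=(224, 224)):
    """fixations: list of (x, y, duration_ms) tuples in screen coordinates."""
    img = Image.new("RGB", screen, "white")
    draw = ImageDraw.Draw(img)
    pts = [(x, y) for x, y, _ in fixations]
    if len(pts) > 1:
        draw.line(pts, fill="gray", width=3)             # saccade path between fixations
    for x, y, dur in fixations:
        r = 5 + dur / 50.0                                # radius grows with duration (assumed scale)
        draw.ellipse([x - r, y - r, x + r, y + r], outline="red", width=2)
    return img.resize(out_size)                           # resized to the CNN input size

example = scanpath_to_image([(100, 200, 250), (400, 300, 180), (800, 500, 400)])
```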
A Collaborative Computer Aided Diagnosis (C-CAD) System with Eye-Tracking, Sparse Attentional Model, and Deep Learning
There are at least two categories of errors in radiology screening that can
lead to suboptimal diagnostic decisions and interventions: (i) human fallibility
and (ii) complexity of visual search. Computer-aided diagnostic (CAD) tools are
developed to help radiologists to compensate for some of these errors. However,
despite their significant improvements over conventional screening strategies,
most CAD systems do not go beyond their use as second-opinion tools because they
produce a high number of false positives that human interpreters must then
correct. In parallel with efforts in computerized analysis of radiology scans,
several researchers have examined behaviors of radiologists while screening
medical images to better understand how and why they miss tumors, how they
interact with the information in an image, and how they search for unknown
pathology in the images. Eye-tracking tools have been instrumental in exploring
answers to these fundamental questions. In this paper, we aim to develop a
paradigm-shifting CAD system, called collaborative CAD (C-CAD), that unifies both
of the above-mentioned research lines: CAD and eye-tracking. We design an
eye-tracking interface providing radiologists with a real radiology reading
room experience. Then, we propose a novel algorithm that unifies eye-tracking
data and a CAD system. Specifically, we present a new graph-based clustering
and sparsification algorithm that transforms eye-tracking data (gaze) into a
signal model for interpreting gaze patterns quantitatively and qualitatively. The
proposed C-CAD collaborates with radiologists via eye-tracking technology and
helps them to improve diagnostic decisions. The C-CAD learns radiologists'
search efficiency by processing their gaze patterns. To do this, the C-CAD uses
a deep learning algorithm in a newly designed multi-task learning platform to
segment and diagnose cancers simultaneously.
Comment: Submitted to Medical Image Analysis Journal (MedIA)
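A minimal sketch of one way to build and sparsify a gaze graph, in the spirit of the graph-based clustering the abstract mentions; this is not the paper's algorithm, and the node count and edge-keep fraction are assumptions.

```python
# Cluster gaze points into nodes, count temporal transitions as weighted edges,
# then sparsify by keeping only the strongest edges.
import numpy as np
from sklearn.cluster import KMeans

def gaze_graph(gaze_xy, n_nodes=20, keep_fraction=0.2):
    labels = KMeans(n_clusters=n_nodes, n_init=10).fit_predict(gaze_xy)
    W = np.zeros((n_nodes, n_nodes))
    for a, b in zip(labels[:-1], labels[1:]):             # consecutive gaze samples
        if a != b:
            W[a, b] += 1                                   # transition count as edge weight
    thresh = np.quantile(W[W > 0], 1 - keep_fraction)      # threshold for sparsification
    return np.where(W >= thresh, W, 0)                     # sparsified adjacency matrix

adj = gaze_graph(np.random.rand(5000, 2))                  # hypothetical gaze trace
```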
Deep Pictorial Gaze Estimation
Estimating human gaze from natural eye images only is a challenging task.
Gaze direction can be defined by the pupil center and the eyeball center, where
the latter is unobservable in 2D images. Hence, achieving highly accurate gaze
estimates is an ill-posed problem. In this paper, we introduce a novel deep
neural network architecture specifically designed for the task of gaze
estimation from single eye input. Instead of directly regressing two angles for
the pitch and yaw of the eyeball, we regress to an intermediate pictorial
representation which in turn simplifies the task of 3D gaze direction
estimation. Our quantitative and qualitative results show that our approach
achieves higher accuracy than the state of the art and is robust to variation
in gaze, head pose, and image quality.
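The paper's exact pictorial encoding is not reproduced here; the sketch below only illustrates the general idea of supervising a network with a spatial map derived from pitch and yaw rather than with the raw angles. The grid size, angle ranges, and Gaussian width are assumptions.

```python
# Convert a (pitch, yaw) gaze direction into a 2D heatmap target for pixel-wise regression.
import numpy as np

def angles_to_heatmap(pitch, yaw, size=64, sigma=3.0,
                      pitch_range=np.pi / 2, yaw_range=np.pi / 2):
    cx = (yaw / yaw_range + 1) / 2 * (size - 1)        # map yaw to an x coordinate
    cy = (pitch / pitch_range + 1) / 2 * (size - 1)    # map pitch to a y coordinate
    ys, xs = np.mgrid[0:size, 0:size]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))

heatmap = angles_to_heatmap(pitch=0.1, yaw=-0.2)       # intermediate pictorial target
```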
GlobeNet: Convolutional Neural Networks for Typhoon Eye Tracking from Remote Sensing Imagery
Advances in remote sensing technologies have made it possible to use
high-resolution visual data for weather observation and forecasting tasks. We
propose the use of multi-layer neural networks for understanding complex
atmospheric dynamics based on multichannel satellite images. The capability of
our model was evaluated using a linear regression task for predicting the
coordinates of a single typhoon. A specific combination of models and different
activation policies enabled us to obtain an interesting prediction result in
the northeastern hemisphere (ENH).
Comment: Under review as a workshop paper at CI 201
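A minimal sketch of the stated task, not GlobeNet itself: a small convolutional network with a linear head that regresses a single typhoon-eye coordinate from a multichannel satellite image tensor. The channel count and layer sizes are assumptions.

```python
# Regress (latitude, longitude) of a typhoon eye from a multichannel satellite image.
import torch
import torch.nn as nn

class CoordRegressor(nn.Module):
    def __init__(self, in_channels=4):                  # e.g. 4 satellite bands (assumed)
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(64, 2)                    # linear regression to (lat, lon)

    def forward(self, x):
        return self.head(self.features(x).flatten(1))

pred = CoordRegressor()(torch.randn(1, 4, 256, 256))    # -> tensor of shape (1, 2)
```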
Saliency Driven Object recognition in egocentric videos with deep CNN
The problem of object recognition in natural scenes has recently been
successfully addressed with Deep Convolutional Neural Networks, yielding a
significant breakthrough in recognition scores. The computational efficiency of
Deep CNNs as a function of their depth allows for their use in real-time
applications. One of the key issues here is to reduce the number of windows
selected from images to be submitted to a Deep CNN. This is usually solved by
preliminary segmentation and selection of specific windows with outstanding
"objectness" or other indicators of the possible location of objects. In this
paper, we propose a Deep CNN approach and a general framework for recognition
of objects in a real-time scenario and from an egocentric perspective. Here the
window of interest is built on the basis of a visual attention map computed
over gaze fixations measured by a glasses-worn eye-tracker. The application of
this setup is an interactive, user-friendly environment for upper-limb
amputees. Vision has to help the subject control a worn neuro-prosthesis when
only a small amount of muscle remains and EMG control becomes ineffective. The
recognition results on a specifically recorded corpus of 151 videos with simple
geometrical objects show an mAP of 64.6% and a computational time at
generalization lower than the duration of a visual fixation on the object of
interest.
Comment: 20 pages, 8 figures, 3 tables, Submitted to the Journal of Computer
Vision and Image Understanding
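A minimal sketch of the gaze-driven window selection described above, assuming the window is simply centred on the measured fixation; the attention-map weighting and window size used by the authors are not reproduced here.

```python
# Crop a single window of interest around the gaze fixation and resize it for a CNN.
import numpy as np
from PIL import Image

def crop_window_at_fixation(frame, fixation_xy, win=224):
    """frame: HxWx3 uint8 array; fixation_xy: (x, y) gaze point in pixels."""
    h, w = frame.shape[:2]
    x = int(np.clip(fixation_xy[0] - win // 2, 0, w - win))
    y = int(np.clip(fixation_xy[1] - win // 2, 0, h - win))
    window = frame[y:y + win, x:x + win]                 # one window, not an exhaustive scan
    return Image.fromarray(window).resize((224, 224))    # ready for the Deep CNN

patch = crop_window_at_fixation(np.zeros((480, 640, 3), np.uint8), (320, 240))
```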
WW-Nets: Dual Neural Networks for Object Detection
We propose a new deep convolutional neural network framework that uses object
location knowledge implicit in network connection weights to guide selective
attention in object detection tasks. Our approach is called What-Where Nets
(WW-Nets), and it is inspired by the structure of human visual pathways. In the
brain, vision incorporates two separate streams, one in the temporal lobe and
the other in the parietal lobe, called the ventral stream and the dorsal
stream, respectively. The ventral pathway from primary visual cortex is
dominated by "what" information, while the dorsal pathway is dominated by
"where" information. Inspired by this structure, we have proposed an object
detection framework involving the integration of a "What Network" and a "Where
Network". The aim of the What Network is to provide selective attention to the
relevant parts of the input image. The Where Network uses this information to
locate and classify objects of interest. In this paper, we compare this
approach to state-of-the-art algorithms on the PASCAL VOC 2007 and 2012 and
COCO object detection challenge datasets. Also, we compare our approach to
human "ground-truth" attention. We report the results of an eye-tracking
experiment on human subjects using images from PASCAL VOC 2007, and we
demonstrate interesting relationships between human overt attention and
information processing in our WW-Nets. Finally, we provide evidence that our
proposed method performs favorably in comparison to other object detection
approaches, often by a large margin. The code and the eye-tracking ground-truth
dataset can be found at: https://github.com/mkebrahimpour.
Comment: 8 pages, 3 figures
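A minimal sketch of the what/where coupling as described in the abstract, not the released WW-Nets code: a "What" branch produces a spatial attention map that multiplicatively gates the feature maps of a "Where" branch before a detection head. All layer sizes are assumptions.

```python
# A "What" network gates a "Where" network's features with a learned attention map.
import torch
import torch.nn as nn

class WhatWhereSketch(nn.Module):
    def __init__(self):
        super().__init__()
        self.what = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                                  nn.Conv2d(16, 1, 1), nn.Sigmoid())    # attention map
        self.where = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU())

    def forward(self, img):
        attention = self.what(img)                       # B x 1 x H x W, values in [0, 1]
        gated = self.where(img) * attention              # selective attention on features
        return gated                                     # would feed a detection head

features = WhatWhereSketch()(torch.randn(1, 3, 224, 224))
```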
DeepFix: A Fully Convolutional Neural Network for predicting Human Eye Fixations
Understanding and predicting the human visual attentional mechanism is an
active area of research in the fields of neuroscience and computer vision. In
this work, we propose DeepFix, a first-of-its-kind fully convolutional neural
network for accurate saliency prediction. Unlike classical works which
characterize the saliency map using various hand-crafted features, our model
automatically learns features in a hierarchical fashion and predicts the
saliency map in an end-to-end manner. DeepFix is designed to capture semantics at
multiple scales while taking global context into account using network layers
with very large receptive fields. Generally, fully convolutional nets are
spatially invariant, which prevents them from modeling location-dependent
patterns (e.g., centre bias). Our network overcomes this limitation by
incorporating a novel Location Biased Convolutional layer. We evaluate our
model on two challenging eye-fixation datasets, MIT300 and CAT2000, and show
that it outperforms other recent approaches by a significant margin.
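A minimal sketch of a location-biased convolution in the spirit described above; the exact maps and layer placement used in DeepFix are not reproduced, and the blob shape, map count, and spatial size below are assumptions.

```python
# Concatenate fixed location maps to the incoming features so the following
# convolution can learn location-dependent responses such as a centre bias.
import torch
import torch.nn as nn

class LocationBiasedConv(nn.Module):
    def __init__(self, in_ch, out_ch, n_bias_maps=2, size=56):
        super().__init__()
        ys, xs = torch.meshgrid(torch.linspace(-1, 1, size),
                                torch.linspace(-1, 1, size), indexing="ij")
        centre = torch.exp(-(xs ** 2 + ys ** 2) / 0.5)          # centre-bias blob
        self.register_buffer("bias_maps", torch.stack([centre, 1 - centre]))
        self.conv = nn.Conv2d(in_ch + n_bias_maps, out_ch, 3, padding=1)

    def forward(self, x):
        maps = self.bias_maps.expand(x.size(0), -1, -1, -1)     # B x 2 x H x W
        return self.conv(torch.cat([x, maps], dim=1))           # location-aware convolution

out = LocationBiasedConv(64, 64)(torch.randn(1, 64, 56, 56))
```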
Top-Down Saliency Detection Driven by Visual Classification
This paper presents an approach for top-down saliency detection guided by
visual classification tasks. We first learn how to compute visual saliency when
a specific visual task has to be accomplished, as opposed to most
state-of-the-art methods which assess saliency merely through bottom-up
principles. Afterwards, we investigate if and to what extent visual saliency
can support visual classification in nontrivial cases. To achieve this, we
propose SalClassNet, a CNN framework consisting of two networks jointly
trained: a) the first one computing top-down saliency maps from input images,
and b) the second one exploiting the computed saliency maps for visual
classification. To test our approach, we collected a dataset of eye-gaze maps,
using a Tobii T60 eye tracker, by asking several subjects to look at images
from the Stanford Dogs dataset, with the objective of distinguishing dog
breeds. Performance analysis on our dataset and other saliency benchmarking
datasets, such as POET, showed that SalClassNet outperforms state-of-the-art
saliency detectors, such as SalNet and SALICON. Finally, we analyzed the
performance of SalClassNet in a fine-grained recognition task and found that it
generalizes better than existing visual classifiers. The achieved results thus
demonstrate that 1) conditioning saliency detectors on object classes reaches
state-of-the-art performance, and 2) explicitly providing top-down saliency
maps to visual classifiers enhances classification accuracy.
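A minimal sketch of the two-network coupling described above, not the released SalClassNet: a saliency network predicts a top-down map from the image, and a classifier consumes the image stacked with that map; both outputs can be supervised jointly. The layer sizes are assumptions, and the 120 classes correspond to the Stanford Dogs breeds.

```python
# Jointly trainable saliency predictor and saliency-conditioned classifier.
import torch
import torch.nn as nn

class SalClassSketch(nn.Module):
    def __init__(self, n_classes=120):                   # 120 Stanford Dogs breeds
        super().__init__()
        self.saliency = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                                      nn.Conv2d(16, 1, 1), nn.Sigmoid())
        self.classifier = nn.Sequential(nn.Conv2d(4, 32, 3, stride=2, padding=1),
                                        nn.ReLU(), nn.AdaptiveAvgPool2d(1),
                                        nn.Flatten(), nn.Linear(32, n_classes))

    def forward(self, img):
        sal = self.saliency(img)                          # top-down saliency map
        logits = self.classifier(torch.cat([img, sal], dim=1))
        return sal, logits                                # both supervised during training

sal, logits = SalClassSketch()(torch.randn(1, 3, 224, 224))
```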
Eyemotion: Classifying facial expressions in VR using eye-tracking cameras
One of the main challenges of social interaction in virtual reality settings
is that head-mounted displays occlude a large portion of the face, blocking
facial expressions and thereby restricting social engagement cues among users.
Hence, auxiliary means of sensing and conveying these expressions are needed.
We present an algorithm to automatically infer expressions by analyzing only a
partially occluded face while the user is engaged in a virtual reality
experience. Specifically, we show that images of the user's eyes captured from
an IR gaze-tracking camera within a VR headset are sufficient to infer a select
subset of facial expressions without the use of any fixed external camera.
Using these inferences, we can generate dynamic avatars in real-time which
function as an expressive surrogate for the user. We propose a novel data
collection pipeline as well as a novel approach for increasing CNN accuracy via
personalization. Our results show a mean accuracy of 74% (F1 of 0.73) among 5
`emotive' expressions and a mean accuracy of 70% (F1 of 0.68) among 10
distinct facial action units, outperforming human raters.
Comment: Uploaded Supplementary PDF. Fixed author affiliation. Corrected typo
in personalization accuracy
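A minimal sketch of one simple form of personalization consistent with the description, not necessarily the paper's exact scheme: each incoming IR eye image has the user's own neutral-calibration mean subtracted before being passed to the expression CNN, removing user-specific appearance. The image size and calibration set are assumptions.

```python
# Personalize eye images by subtracting a per-user neutral-calibration mean image.
import numpy as np

def personalize(eye_image, user_neutral_images):
    """eye_image: HxW float array; user_neutral_images: list of HxW arrays from calibration."""
    neutral_mean = np.mean(user_neutral_images, axis=0)
    return eye_image - neutral_mean                      # input to the expression CNN

calibration = [np.random.rand(128, 128) for _ in range(10)]   # hypothetical neutral frames
personalized = personalize(np.random.rand(128, 128), calibration)
```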