Learning to Find Eye Region Landmarks for Remote Gaze Estimation in Unconstrained Settings
Conventional feature-based and model-based gaze estimation methods have
proven to perform well in settings with controlled illumination and specialized
cameras. In unconstrained real-world settings, however, such methods are
surpassed by recent appearance-based methods due to difficulties in modeling
factors such as illumination changes and other visual artifacts. We present a
novel learning-based method for eye region landmark localization that enables
conventional methods to be competitive with the latest appearance-based methods.
Despite having been trained exclusively on synthetic data, our method exceeds
the state of the art for iris localization and eye shape registration on
real-world imagery. We then use the detected landmarks as input to iterative
model-fitting and lightweight learning-based gaze estimation methods. Our
approach outperforms existing model-fitting and appearance-based methods in the
context of person-independent and personalized gaze estimation.
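As a rough illustration of how detected eye-region landmarks can drive a model-based estimator, the sketch below recovers a gaze direction from a spherical-eyeball model; the function name, the pixel-space inputs, and the sign conventions are our assumptions, not the paper's implementation.

```python
import numpy as np

def gaze_from_landmarks(iris_center, eyeball_center, eyeball_radius):
    """Estimate a 3D gaze direction from 2D eye-region landmarks.

    Treats the eyeball as a sphere with a known projected radius; the
    iris center's offset from the eyeball center gives the pitch and yaw
    of the gaze vector (all inputs in pixels; image y points down).
    """
    dx, dy = np.asarray(iris_center) - np.asarray(eyeball_center)
    # Gaze angles from the spherical eyeball model (clamped for safety).
    theta = np.arcsin(np.clip(dy / eyeball_radius, -1.0, 1.0))               # pitch
    phi = np.arcsin(np.clip(dx / (eyeball_radius * np.cos(theta)), -1.0, 1.0))  # yaw
    # Unit gaze vector in camera coordinates (negative z into the scene).
    return np.array([np.cos(theta) * np.sin(phi),
                     np.sin(theta),
                     -np.cos(theta) * np.cos(phi)])

# Example: iris shifted 3 px right and 1 px up of the eyeball center.
print(gaze_from_landmarks((103.0, 49.0), (100.0, 50.0), eyeball_radius=12.0))
```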
Object Referring in Videos with Language and Human Gaze
We investigate the problem of object referring (OR), i.e., localizing a target
object in a visual scene given a language description. Humans perceive
the world more as continuous video snippets than as static images, and describe
objects not only by their appearance, but also by their spatio-temporal context
and motion features. Humans also gaze at the object when they issue a referring
expression. Existing work on OR mostly focuses on static images, which
fall short in providing many such cues. This paper addresses OR in videos with
language and human gaze. To that end, we present a new video dataset for OR,
with 30,000 objects over 5,000 stereo video sequences annotated for their
descriptions and gaze. We further propose a novel network model for OR in
videos, by integrating appearance, motion, gaze, and spatio-temporal context
into one network. Experimental results show that our method effectively
utilizes motion cues, human gaze, and spatio-temporal context. Our method
outperforms previous OR methods. For the dataset and code, please refer to
https://people.ee.ethz.ch/~arunv/ORGaze.html. (Accepted to CVPR 2018, 10 pages, 6 figures.)
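To make the fusion idea concrete, here is a minimal late-fusion scorer in the spirit of the described network; the feature dimensions, the module name, and the MLP head are illustrative assumptions rather than the paper's architecture.

```python
import torch
import torch.nn as nn

class MultiCueScorer(nn.Module):
    """Score candidate object tracks against a language query by fusing
    appearance, motion, gaze, and spatio-temporal context features."""

    def __init__(self, app_dim=512, motion_dim=128, gaze_dim=64,
                 ctx_dim=128, lang_dim=300, hidden=256):
        super().__init__()
        fused = app_dim + motion_dim + gaze_dim + ctx_dim + lang_dim
        self.head = nn.Sequential(
            nn.Linear(fused, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))  # one matching score per candidate

    def forward(self, app, motion, gaze, ctx, lang):
        # Late fusion: concatenate all cues with the language embedding.
        x = torch.cat([app, motion, gaze, ctx, lang], dim=-1)
        return self.head(x).squeeze(-1)

# Example: score 8 candidate tracks against a sentence embedding
# (repeated per candidate). The highest score picks the referred object.
scorer = MultiCueScorer()
scores = scorer(torch.randn(8, 512), torch.randn(8, 128), torch.randn(8, 64),
                torch.randn(8, 128), torch.randn(8, 300))
print(scores.shape)  # torch.Size([8])
```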
Mobile localization: approach and applications
Localization is critical to a number of wireless network applications. In many situations GPS is not suitable. This dissertation (i) develops novel localization schemes for wireless networks by explicitly incorporating mobility information and (ii) applies localization to physical analytics, i.e., understanding shoppers' behavior within retail spaces by leveraging inertial sensors, Wi-Fi, and vision enabled by smart glasses.

More specifically, we first focus on multi-hop mobile networks, analyze real mobility traces, and observe that they exhibit temporal stability and low-rank structure. Motivated by these observations, we develop novel localization algorithms to effectively capture and also adapt to different degrees of these properties. Using extensive simulations and testbed experiments, we demonstrate the accuracy and robustness of our new schemes.

Second, we focus on localizing a single mobile node, which may not be connected with multiple nodes (e.g., without network connectivity or only connected with an access point). We propose trajectory-based localization using Wi-Fi or magnetic field measurements. We show that these measurements have the potential to uniquely identify a trajectory. We then develop a novel approach that leverages multi-level wavelet coefficients to first identify the trajectory and then localize to a point on the trajectory. We show that this approach is highly accurate and power efficient using indoor and outdoor experiments.

Finally, localization is a critical step in enabling many applications; an important one is physical analytics. Physical analytics has the potential to provide deep insight into shoppers' interests and activities and therefore better advertisements, recommendations, and a better shopping experience. To enable physical analytics, we build the ThirdEye system, which first achieves zero-effort localization by leveraging emergent devices like Google Glass to build AutoLayout, which fuses video, Wi-Fi, and inertial sensor data to simultaneously localize shoppers while also constructing and updating the product layout in a virtual coordinate space. Further, ThirdEye comprises a range of schemes that use a combination of vision and inertial sensing to study mobile users' behavior while shopping, namely walking, dwelling, gazing, and reaching out. We show the effectiveness of ThirdEye through an evaluation in two large retail stores in the United States.
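As a simplified sketch of the trajectory-identification step, the code below compresses a sensor trace into coarse multi-level wavelet coefficients (via PyWavelets) and matches it against stored trajectories by nearest neighbour; the function names, the db4 wavelet choice, and the assumption that traces are resampled to a common length are ours, not the dissertation's.

```python
import numpy as np
import pywt

def wavelet_signature(trace, wavelet="db4", level=3):
    """Compress a 1-D sensor trace (Wi-Fi RSSI or magnetic-field magnitude
    along a path) into coarse multi-level wavelet coefficients.
    Traces are assumed resampled to a common length beforehand."""
    coeffs = pywt.wavedec(np.asarray(trace, dtype=float), wavelet, level=level)
    # Keep the approximation and coarsest detail bands as the fingerprint;
    # fine-scale bands mostly carry noise.
    return np.concatenate(coeffs[:2])

def identify_trajectory(query, database):
    """Return the name of the stored trajectory whose wavelet signature
    is nearest (L2) to the query trace's signature."""
    q = wavelet_signature(query)
    return min(database,
               key=lambda name: np.linalg.norm(wavelet_signature(database[name]) - q))
```

Once the trajectory is identified, the same coefficients can be compared over a sliding window to localize the node to a point along it.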
The Stare-In-The-Crowd Effect: Phenomenology, Psychophysiology, And Relations To Psychopathology
The eyes are a valuable source of information for a range of social processes. The stare-in-the-crowd effect describes the ability to detect self-directed gaze. Impairment in gaze detection mechanisms, such as the stare-in-the-crowd effect, has implications for social interactions and the development of social relationships. Given the frequency with which humans utilize gaze detection in interactions, there is a need to better characterize the stare-in-the-crowd effect. This study utilized a previously validated dynamic visual paradigm to capture the stare-in-the-crowd effect. We compared typically developing (TD) young adults and young adults with Autism Spectrum Disorder (ASD) on multiple measures of psychophysiology, including eye tracking and heart rate monitoring. Four conditions of visual stimuli were presented: averted gaze, mutual gaze, catching another staring, and getting caught staring. Eye tracking outcomes and arousal (pupil size and heart rate variability) were compared by diagnosis (TD or ASD) and condition (averted, mutual, catching another staring, getting caught staring) using repeated-measures ANOVA. A significant interaction of diagnosis and condition was found for interest-area (IA) dwell time, IA fixation count, and IA second fixation duration. Hierarchical regression was used to assess how dimensional behavioral measures predicted eye tracking outcomes and arousal; only two models with advanced theory of mind as a predictor were significant. Overall, we demonstrated that individuals with ASD do respond differently to various gaze conditions, in patterns similar to those of TD individuals but to a lesser extent. This offers potential targets for social interventions to capitalize on this present but underdeveloped response to gaze. Implications and future directions are discussed.
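For readers who want to reproduce this kind of analysis, a mixed-design ANOVA (between-subjects diagnosis, within-subjects condition) can be run with the pingouin package; the column names and data file below are hypothetical, and the original study may well have used different software.

```python
import pandas as pd
import pingouin as pg

# Hypothetical tidy data: one row per participant per condition, with
#   subject   - participant id
#   diagnosis - "TD" or "ASD" (between-subjects factor)
#   condition - "averted", "mutual", "catch_other", "get_caught" (within)
#   dwell     - interest-area dwell time for that condition
df = pd.read_csv("gaze_outcomes.csv")

# Mixed-design ANOVA: does the diagnosis x condition interaction
# predict dwell time?
aov = pg.mixed_anova(data=df, dv="dwell", within="condition",
                     subject="subject", between="diagnosis")
print(aov.round(3))
```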
Customer Gaze Estimation in Retail Using Deep Learning
At present, intelligent computing applications are widely used in different domains, including retail stores. The analysis of customer behaviour has become crucial for the benefit of both customers and retailers. In this regard, remote gaze estimation using deep learning has shown promising results in analyzing customer behaviour in retail due to its scalability, robustness, low cost, and uninterrupted nature. This study presents a three-stage, three-attention-based deep convolutional neural network for remote gaze estimation in retail using image data. In the first stage, we design a mechanism to estimate the 3D gaze of the subject using image data and monocular depth estimation. The second stage presents a novel three-attention mechanism to estimate the gaze in the wild from field-of-view, depth-range, and object-channel attentions. The third stage generates the gaze saliency heatmap from the output attention map of the second stage. We train and evaluate the proposed model using the benchmark GOO-Real dataset and compare results with baseline models. Further, we adapt our model to real retail environments by introducing a novel Retail Gaze dataset. Extensive experiments demonstrate that our approach significantly improves remote gaze target estimation performance on the GOO-Real and Retail Gaze datasets.
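As an illustrative sketch of how a gaze saliency heatmap can be produced from a 3D gaze estimate and a monocular depth map, the snippet below back-projects each pixel and scores it by its angular distance to the gaze ray; the function signature and the Gaussian-cone scoring are our assumptions, not the paper's attention mechanism.

```python
import numpy as np

def gaze_saliency_heatmap(depth, gaze_origin, gaze_dir, K, sigma_deg=10.0):
    """Render a gaze saliency heatmap from a 3D gaze ray and a depth map.

    Each pixel is back-projected to 3D using the depth map and camera
    intrinsics K, then scored by the angle between (point - gaze_origin)
    and the gaze direction: a soft cone around the line of sight.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    # Back-project pixels to camera-space 3D points.
    x = (u - K[0, 2]) * depth / K[0, 0]
    y = (v - K[1, 2]) * depth / K[1, 1]
    pts = np.stack([x, y, depth], axis=-1) - gaze_origin
    pts /= np.linalg.norm(pts, axis=-1, keepdims=True) + 1e-8
    g = gaze_dir / np.linalg.norm(gaze_dir)
    ang = np.degrees(np.arccos(np.clip(pts @ g, -1.0, 1.0)))
    return np.exp(-0.5 * (ang / sigma_deg) ** 2)  # Gaussian falloff in angle
```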
The scene superiority effect: object recognition in the context of natural scenes
Four experiments investigate the effect of background scene semantics on object recognition. Although past research has found that semantically consistent scene backgrounds can facilitate recognition of a target object, these claims have been challenged as the result of post-perceptual response bias rather than the perceptual processes of object recognition itself. The current study takes advantage of a paradigm from linguistic processing known as the Word Superiority Effect. Humans can better discriminate letters (e.g., D vs. K) in the context of a word (WORD vs. WORK) than in a non-word context (e.g., WROD vs. WROK), even when the context is non-predictive of the target identity. We apply this paradigm to objects in natural scenes, having subjects discriminate between objects presented in scene contexts. Because the target objects were equally semantically consistent with any given scene and could appear in either semantically consistent or inconsistent contexts with equal probability, response bias could not lead to an apparent improvement in object recognition. The current study found a benefit to object recognition from semantically consistent backgrounds, and the effect appeared to be modulated by awareness of background scene semantics.
Efficient human annotation schemes for training object class detectors
A central task in computer vision is detecting object classes such as cars and horses
in complex scenes. Training an object class detector typically requires a large set of
images labeled with tight bounding boxes around every object instance. Obtaining
such data requires human annotation, which is very expensive and time-consuming.
Alternatively, researchers have tried to train models in a weakly supervised setting (i.e.,
given only image-level labels), which is much cheaper but leads to weaker detectors.
In this thesis, we propose new and efficient human annotation schemes for training
object class detectors that bypass the need for drawing bounding boxes and reduce the
annotation cost while still obtaining high quality object detectors.
First, we propose to train object class detectors from eye tracking data. Instead
of drawing tight bounding boxes, the annotators only need to look at the image and
find the target object. We track the eye movements of annotators while they perform
this visual search task and we propose a technique for deriving object bounding boxes
from these eye fixations. To validate our idea, we augment an existing object detection
dataset with eye tracking data.
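A minimal sketch of one way to derive a box from fixations, assuming fixations cluster on the target object: cluster the fixation points, keep the largest cluster, and pad its bounding box. The thesis learns this mapping from data; the DBSCAN heuristic and padding factor below are our simplification.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def box_from_fixations(fixations, eps=40.0, min_samples=3, pad=0.15):
    """Derive a rough object bounding box from eye fixations (x, y) in pixels.

    Clusters fixations with DBSCAN, keeps the largest cluster (assumed to
    lie on the target object), and returns its padded bounding box.
    """
    pts = np.asarray(fixations, dtype=float)
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(pts)
    valid = labels[labels >= 0]
    if valid.size == 0:                    # no dense cluster: use all points
        cluster = pts
    else:
        cluster = pts[labels == np.bincount(valid).argmax()]
    x0, y0 = cluster.min(axis=0)
    x1, y1 = cluster.max(axis=0)
    mx, my = pad * (x1 - x0), pad * (y1 - y0)  # fixations undershoot extents
    return (x0 - mx, y0 - my, x1 + mx, y1 + my)
```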
Second, we propose a scheme for training object class detectors, which only requires
annotators to verify bounding-boxes produced automatically by the learning
algorithm. Our scheme introduces human verification as a new step into a standard
weakly supervised framework, which typically iterates between re-training object detectors
and re-localizing objects in the training images. We use the verification signal
to improve both re-training and re-localization.
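The loop below sketches where the verification step sits in such a framework; retrain, relocalize, and verify are placeholders for the detector-specific pieces, passed in as callables rather than taken from the thesis.

```python
from typing import Callable, Dict, List, Tuple

Box = Tuple[float, float, float, float]

def train_with_verification(
    images: List[str],
    retrain: Callable[[Dict[str, Box]], object],
    relocalize: Callable[[object, str], Box],
    verify: Callable[[str, Box], bool],
    rounds: int = 5,
):
    """Weakly supervised loop augmented with human yes/no verification.

    Each round: re-localize objects with the current detector, ask the
    annotator to verify the proposed boxes, keep accepted boxes as ground
    truth, and retrain on everything accepted so far.
    """
    accepted: Dict[str, Box] = {}
    detector = retrain(accepted)
    for _ in range(rounds):
        for img in images:
            if img in accepted:
                continue
            box = relocalize(detector, img)   # current best localization
            if verify(img, box):              # human confirms the box is tight
                accepted[img] = box
        detector = retrain(accepted)          # verified boxes strengthen training
    return detector, accepted
```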
Third, we propose another scheme where annotators are asked to click on the center
of an imaginary bounding box, which tightly encloses the object. We then incorporate
these clicks into a weakly supervised object localization technique, to jointly localize
object bounding boxes over all training images. Both our center-clicking and human
verification schemes deliver detectors performing almost as well as those trained in a
fully supervised setting.
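One simple way to exploit such clicks, shown below, is to re-rank candidate boxes with a Gaussian prior on the distance between each box center and the click; the thesis integrates clicks into a joint weakly supervised localization objective, so treat this as an illustrative simplification.

```python
import numpy as np

def rescore_with_center_click(proposals, scores, click, sigma=20.0):
    """Combine detector scores with a center-click prior.

    proposals: (N, 4) boxes as (x0, y0, x1, y1); scores: (N,) detector
    scores; click: the annotator's click on the imagined box center.
    Boxes whose centers are far from the click are down-weighted.
    """
    boxes = np.asarray(proposals, dtype=float)
    centers = np.stack([(boxes[:, 0] + boxes[:, 2]) / 2,
                        (boxes[:, 1] + boxes[:, 3]) / 2], axis=1)
    d = np.linalg.norm(centers - np.asarray(click, dtype=float), axis=1)
    prior = np.exp(-0.5 * (d / sigma) ** 2)   # Gaussian click prior
    return np.asarray(scores) * prior          # re-ranked localization scores
```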
Finally, we propose extreme clicking. We ask the annotator to click on four physical
points on the object: the top, bottom, left- and right-most points. This task is more
natural than the traditional way of drawing boxes and these points are easy to find. Our
experiments show that annotating objects with extreme clicking is 5× faster than the traditional way of drawing boxes and that it leads to boxes of the same quality as the original ground-truth boxes drawn the traditional way. Moreover, we use the resulting extreme points to obtain more accurate segmentations than those derived from bounding boxes.
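Deriving the box from the four clicks is immediate, since the extreme points directly bound the object; the helper below is a trivial sketch of that step.

```python
def box_from_extreme_clicks(top, bottom, left, right):
    """Turn four extreme-point clicks into a tight bounding box.

    Each click is an (x, y) pixel; the box spans the left/right x's and
    the top/bottom y's. Because the clicks lie on the object boundary,
    no padding is needed, unlike fixation- or center-based schemes.
    """
    return (left[0], top[1], right[0], bottom[1])

# Example: clicks on an object's topmost, bottommost, left- and right-most points.
print(box_from_extreme_clicks((320, 110), (300, 260), (210, 180), (430, 190)))
```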