14,659 research outputs found
Fusion of gaze with hierarchical image segmentation for robust object detection
Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2004.Includes bibliographical references (p. 41).We present Flycatcher, a prototype system illustrating the idea of gaze- based image processing in the context of object segmentation for wearable photography. The prototype includes a wearable eye tracking device that captures real-time eyetraces of a user, and a wearable video camera that captures first-person perspective images of the user's visual environment. The system combines the deliberate eyetraces of the user with hierarchical image segmentation applied to scene images to achieve reliable object segmentation. In evaluations with certain classes of real-world images, fusion of gaze and image segmentation information led to higher object detection accuracy than either signal alone. Flycatcher may be integrated with assistive communication devices, enabling individuals with severe motor impairments to use eye control to communicate about objects in their environment. The system also represents a promising step toward an eye-driven interface for "copy and paste" visual memory augmentation in wearable computing applications.by Jeffrey M. Bartelma.M.Eng
Unsupervised Segmentation of Action Segments in Egocentric Videos using Gaze
Unsupervised segmentation of action segments in egocentric videos is a
desirable feature in tasks such as activity recognition and content-based video
retrieval. Reducing the search space into a finite set of action segments
facilitates a faster and less noisy matching. However, there exist a
substantial gap in machine understanding of natural temporal cuts during a
continuous human activity. This work reports on a novel gaze-based approach for
segmenting action segments in videos captured using an egocentric camera. Gaze
is used to locate the region-of-interest inside a frame. By tracking two simple
motion-based parameters inside successive regions-of-interest, we discover a
finite set of temporal cuts. We present several results using combinations (of
the two parameters) on a dataset, i.e., BRISGAZE-ACTIONS. The dataset contains
egocentric videos depicting several daily-living activities. The quality of the
temporal cuts is further improved by implementing two entropy measures.Comment: To appear in 2017 IEEE International Conference On Signal and Image
Processing Application
Going Deeper into First-Person Activity Recognition
We bring together ideas from recent work on feature design for egocentric
action recognition under one framework by exploring the use of deep
convolutional neural networks (CNN). Recent work has shown that features such
as hand appearance, object attributes, local hand motion and camera ego-motion
are important for characterizing first-person actions. To integrate these ideas
under one framework, we propose a twin stream network architecture, where one
stream analyzes appearance information and the other stream analyzes motion
information. Our appearance stream encodes prior knowledge of the egocentric
paradigm by explicitly training the network to segment hands and localize
objects. By visualizing certain neuron activation of our network, we show that
our proposed architecture naturally learns features that capture object
attributes and hand-object configurations. Our extensive experiments on
benchmark egocentric action datasets show that our deep architecture enables
recognition rates that significantly outperform state-of-the-art techniques --
an average increase in accuracy over all datasets. Furthermore, by
learning to recognize objects, actions and activities jointly, the performance
of individual recognition tasks also increase by (actions) and
(objects). We also include the results of extensive ablative analysis to
highlight the importance of network design decisions.
Expected exponential loss for gaze-based video and volume ground truth annotation
Many recent machine learning approaches used in medical imaging are highly
reliant on large amounts of image and ground truth data. In the context of
object segmentation, pixel-wise annotations are extremely expensive to collect,
especially in video and 3D volumes. To reduce this annotation burden, we
propose a novel framework to allow annotators to simply observe the object to
segment and record where they have looked at with a \$200 eye gaze tracker. Our
method then estimates pixel-wise probabilities for the presence of the object
throughout the sequence from which we train a classifier in semi-supervised
setting using a novel Expected Exponential loss function. We show that our
framework provides superior performances on a wide range of medical image
settings compared to existing strategies and that our method can be combined
with current crowd-sourcing paradigms as well.Comment: 9 pages, 5 figues, MICCAI 2017 - LABELS Worksho
On the Distribution of Salient Objects in Web Images and its Influence on Salient Object Detection
It has become apparent that a Gaussian center bias can serve as an important
prior for visual saliency detection, which has been demonstrated for predicting
human eye fixations and salient object detection. Tseng et al. have shown that
the photographer's tendency to place interesting objects in the center is a
likely cause for the center bias of eye fixations. We investigate the influence
of the photographer's center bias on salient object detection, extending our
previous work. We show that the centroid locations of salient objects in
photographs of Achanta and Liu's data set in fact correlate strongly with a
Gaussian model. This is an important insight, because it provides an empirical
motivation and justification for the integration of such a center bias in
salient object detection algorithms and helps to understand why Gaussian models
are so effective. To assess the influence of the center bias on salient object
detection, we integrate an explicit Gaussian center bias model into two
state-of-the-art salient object detection algorithms. This way, first, we
quantify the influence of the Gaussian center bias on pixel- and segment-based
salient object detection. Second, we improve the performance in terms of F1
score, Fb score, area under the recall-precision curve, area under the receiver
operating characteristic curve, and hit-rate on the well-known data set by
Achanta and Liu. Third, by debiasing Cheng et al.'s region contrast model, we
exemplarily demonstrate that implicit center biases are partially responsible
for the outstanding performance of state-of-the-art algorithms. Last but not
least, as a result of debiasing Cheng et al.'s algorithm, we introduce a
non-biased salient object detection method, which is of interest for
applications in which the image data is not likely to have a photographer's
center bias (e.g., image data of surveillance cameras or autonomous robots)
A computer vision model for visual-object-based attention and eye movements
This is the post-print version of the final paper published in Computer Vision and Image Understanding. The published article is available from the link below. Changes resulting from the publishing process, such as peer review, editing, corrections, structural formatting, and other quality control mechanisms may not be reflected in this document. Changes may have been made to this work since it was submitted for publication. Copyright @ 2008 Elsevier B.V.This paper presents a new computational framework for modelling visual-object-based attention and attention-driven eye movements within an integrated system in a biologically inspired approach. Attention operates at multiple levels of visual selection by space, feature, object and group depending on the nature of targets and visual tasks. Attentional shifts and gaze shifts are constructed upon their common process circuits and control mechanisms but also separated from their different function roles, working together to fulfil flexible visual selection tasks in complicated visual environments. The framework integrates the important aspects of human visual attention and eye movements resulting in sophisticated performance in complicated natural scenes. The proposed approach aims at exploring a useful visual selection system for computer vision, especially for usage in cluttered natural visual environments.National Natural Science of Founda-
tion of Chin
Enabling Depth-driven Visual Attention on the iCub Humanoid Robot: Instructions for Use and New Perspectives
The importance of depth perception in the interactions that humans have
within their nearby space is a well established fact. Consequently, it is also
well known that the possibility of exploiting good stereo information would
ease and, in many cases, enable, a large variety of attentional and interactive
behaviors on humanoid robotic platforms. However, the difficulty of computing
real-time and robust binocular disparity maps from moving stereo cameras often
prevents from relying on this kind of cue to visually guide robots' attention
and actions in real-world scenarios. The contribution of this paper is
two-fold: first, we show that the Efficient Large-scale Stereo Matching
algorithm (ELAS) by A. Geiger et al. 2010 for computation of the disparity map
is well suited to be used on a humanoid robotic platform as the iCub robot;
second, we show how, provided with a fast and reliable stereo system,
implementing relatively challenging visual behaviors in natural settings can
require much less effort. As a case of study we consider the common situation
where the robot is asked to focus the attention on one object close in the
scene, showing how a simple but effective disparity-based segmentation solves
the problem in this case. Indeed this example paves the way to a variety of
other similar applications
- …