Unsupervised Segmentation of Action Segments in Egocentric Videos using Gaze
Unsupervised segmentation of action segments in egocentric videos is a
desirable feature in tasks such as activity recognition and content-based video
retrieval. Reducing the search space into a finite set of action segments
facilitates faster and less noisy matching. However, there exists a
substantial gap in machine understanding of natural temporal cuts during a
continuous human activity. This work reports on a novel gaze-based approach for
segmenting action segments in videos captured using an egocentric camera. Gaze
is used to locate the region-of-interest inside a frame. By tracking two simple
motion-based parameters inside successive regions-of-interest, we discover a
finite set of temporal cuts. We present several results using combinations (of
the two parameters) on a dataset, i.e., BRISGAZE-ACTIONS. The dataset contains
egocentric videos depicting several daily-living activities. The quality of the
temporal cuts is further improved by implementing two entropy measures.
Comment: To appear in 2017 IEEE International Conference On Signal and Image
Processing Application
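The idea of tracking motion inside successive gaze-centred regions to find temporal cuts can be illustrated with a minimal sketch. The abstract does not name the two motion parameters, so everything here is an assumption made for illustration: the functions `roi_motion` and `temporal_cuts`, the use of mean absolute frame difference as the motion measure, the square window, and the fixed threshold are all hypothetical and are not the authors' method.

```python
import numpy as np

def roi_motion(frames, gazes, half=1):
    """Mean absolute frame difference inside a square window centred on the gaze.

    frames: list of 2-D grayscale arrays; gazes: list of (row, col) per frame.
    Returns one score per consecutive frame pair (hypothetical motion measure).
    """
    scores = []
    for prev, cur, (r, c) in zip(frames[:-1], frames[1:], gazes[1:]):
        r0, r1 = max(r - half, 0), r + half + 1
        c0, c1 = max(c - half, 0), c + half + 1
        diff = np.abs(cur[r0:r1, c0:c1].astype(float)
                      - prev[r0:r1, c0:c1].astype(float))
        scores.append(diff.mean())
    return np.array(scores)

def temporal_cuts(scores, threshold):
    """Frame indices where the ROI motion score rises above the threshold."""
    above = scores > threshold
    # k marks a rising edge between scores[k] and scores[k + 1]
    rising = np.flatnonzero(above[1:] & ~above[:-1])
    # scores[k + 1] compares frames k + 1 and k + 2, so the cut is at frame k + 2
    return (rising + 2).tolist()
```

Applied to a clip whose content changes abruptly at one frame while the gaze stays fixed, the sketch reports a single cut at that frame. A real system would likely need smoothing and a data-driven threshold.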
Egocentric Vision-based Future Vehicle Localization for Intelligent Driving Assistance Systems
Predicting the future location of vehicles is essential for safety-critical
applications such as advanced driver assistance systems (ADAS) and autonomous
driving. This paper introduces a novel approach to simultaneously predict both
the location and scale of target vehicles in the first-person (egocentric) view
of an ego-vehicle. We present a multi-stream recurrent neural network (RNN)
encoder-decoder model that separately captures both object location and scale
and pixel-level observations for future vehicle localization. We show that
incorporating dense optical flow improves prediction results significantly
since it captures information about motion as well as appearance change. We
also find that explicitly modeling future motion of the ego-vehicle improves
the prediction accuracy, which could be especially beneficial in intelligent
and automated vehicles that have motion planning capability. To evaluate the
performance of our approach, we present a new dataset of first-person videos
collected from a variety of scenarios at road intersections, which are
particularly challenging moments for prediction because vehicle trajectories
are diverse and dynamic.
Comment: To appear on ICRA 201
Early Recognition of Human Activities from First-Person Videos Using Onset Representations
In this paper, we propose a methodology for early recognition of human
activities from videos taken with a first-person viewpoint. Early recognition,
which is also known as activity prediction, is the ability to infer an ongoing
activity at its early stage. We present an algorithm to perform recognition of
activities targeted at the camera from streaming videos, enabling the system to
predict intended activities of the interacting person and avoid harmful events
before they actually happen. We introduce the novel concept of 'onset' that
efficiently summarizes pre-activity observations, and design an approach to
consider event history in addition to ongoing video observation for early
first-person recognition of activities. We propose to represent onset using
cascade histograms of time series gradients, and we describe a novel
algorithmic setup to take advantage of onset for early recognition of
activities. The experimental results clearly illustrate that the proposed
concept of onset enables better/earlier recognition of human activities from
first-person videos.
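As a rough illustration of what a cascade histogram of time series gradients might look like, the sketch below counts rising and falling steps of a one-dimensional feature signal over nested windows that all end at the most recent observation. This is only one plausible reading of the phrase: the abstract does not specify the windowing scheme or the histogram bins, and the functions `gradient_histogram` and `cascade_histogram`, the two-bin layout, and the halving of window lengths are assumptions made for illustration.

```python
import numpy as np

def gradient_histogram(series):
    """Two-bin histogram: counts of rising and falling steps in a 1-D series."""
    grad = np.diff(series)
    return [int((grad > 0).sum()), int((grad < 0).sum())]

def cascade_histogram(series, levels=3):
    """Concatenate gradient histograms over nested windows ending at 'now'.

    Level 0 covers the whole series; each further level halves the window
    length (assumed scheme), keeping at least two samples per window.
    """
    series = np.asarray(series, dtype=float)
    feats = []
    for level in range(levels):
        length = max(len(series) // (2 ** level), 2)
        feats.extend(gradient_histogram(series[-length:]))
    return feats
```

For the series `[0, 1, 2, 1, 0, 1, 2, 3]` with three levels, the windows cover the last 8, 4, and 2 samples, so recent trends are represented at finer resolution than older ones.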
Learning Robot Activities from First-Person Human Videos Using Convolutional Future Regression
We design a new approach that allows robot learning of new activities from
unlabeled human example videos. Given videos of humans executing the same
activity from a human's viewpoint (i.e., first-person videos), our objective is
to make the robot learn the temporal structure of the activity as its future
regression network, and learn to transfer such a model for its own motor
execution. We present a new deep learning model: we extend the state-of-the-art
convolutional object detection network for the representation/estimation of
human hands in training videos, and newly introduce the concept of using a
fully convolutional network to regress (i.e., predict) the intermediate scene
representation corresponding to the future frame (e.g., 1-2 seconds later).
Combining these allows direct prediction of future locations of human hands and
objects, which enables the robot to infer the motor control plan using our
manipulation network. We experimentally confirm that our approach makes
learning of robot activities from unlabeled human interaction videos possible,
and demonstrate that our robot is able to execute the learned collaborative
activities in real-time directly based on its camera input.