2,911 research outputs found
Egocentric Vision-based Future Vehicle Localization for Intelligent Driving Assistance Systems
Predicting the future location of vehicles is essential for safety-critical
applications such as advanced driver assistance systems (ADAS) and autonomous
driving. This paper introduces a novel approach to simultaneously predict both
the location and scale of target vehicles in the first-person (egocentric) view
of an ego-vehicle. We present a multi-stream recurrent neural network (RNN)
encoder-decoder model that separately captures both object location and scale
and pixel-level observations for future vehicle localization. We show that
incorporating dense optical flow improves prediction results significantly
since it captures information about motion as well as appearance change. We
also find that explicitly modeling future motion of the ego-vehicle improves
the prediction accuracy, which could be especially beneficial in intelligent
and automated vehicles that have motion planning capability. To evaluate the
performance of our approach, we present a new dataset of first-person videos
collected from a variety of scenarios at road intersections, which are
particularly challenging moments for prediction because vehicle trajectories
are diverse and dynamic.Comment: To appear on ICRA 201
Future Person Localization in First-Person Videos
We present a new task that predicts future locations of people observed in
first-person videos. Consider a first-person video stream continuously recorded
by a wearable camera. Given a short clip of a person that is extracted from the
complete stream, we aim to predict that person's location in future frames. To
facilitate this future person localization ability, we make the following three
key observations: a) First-person videos typically involve significant
ego-motion which greatly affects the location of the target person in future
frames; b) Scales of the target person act as a salient cue to estimate a
perspective effect in first-person videos; c) First-person videos often capture
people up-close, making it easier to leverage target poses (e.g., where they
look) for predicting their future locations. We incorporate these three
observations into a prediction framework with a multi-stream
convolution-deconvolution architecture. Experimental results reveal our method
to be effective on our new dataset as well as on a public social interaction
dataset.Comment: Accepted to CVPR 201
- …