68,172 research outputs found
Egocentric Vision-based Future Vehicle Localization for Intelligent Driving Assistance Systems
Predicting the future location of vehicles is essential for safety-critical
applications such as advanced driver assistance systems (ADAS) and autonomous
driving. This paper introduces a novel approach to simultaneously predict both
the location and scale of target vehicles in the first-person (egocentric) view
of an ego-vehicle. We present a multi-stream recurrent neural network (RNN)
encoder-decoder model that separately captures both object location and scale
and pixel-level observations for future vehicle localization. We show that
incorporating dense optical flow improves prediction results significantly
since it captures information about motion as well as appearance change. We
also find that explicitly modeling future motion of the ego-vehicle improves
the prediction accuracy, which could be especially beneficial in intelligent
and automated vehicles that have motion planning capability. To evaluate the
performance of our approach, we present a new dataset of first-person videos
collected from a variety of scenarios at road intersections, which are
particularly challenging moments for prediction because vehicle trajectories
are diverse and dynamic.Comment: To appear on ICRA 201
Deep Detection of People and their Mobility Aids for a Hospital Robot
Robots operating in populated environments encounter many different types of
people, some of whom might have an advanced need for cautious interaction,
because of physical impairments or their advanced age. Robots therefore need to
recognize such advanced demands to provide appropriate assistance, guidance or
other forms of support. In this paper, we propose a depth-based perception
pipeline that estimates the position and velocity of people in the environment
and categorizes them according to the mobility aids they use: pedestrian,
person in wheelchair, person in a wheelchair with a person pushing them, person
with crutches and person using a walker. We present a fast region proposal
method that feeds a Region-based Convolutional Network (Fast R-CNN). With this,
we speed up the object detection process by a factor of seven compared to a
dense sliding window approach. We furthermore propose a probabilistic position,
velocity and class estimator to smooth the CNN's detections and account for
occlusions and misclassifications. In addition, we introduce a new hospital
dataset with over 17,000 annotated RGB-D images. Extensive experiments confirm
that our pipeline successfully keeps track of people and their mobility aids,
even in challenging situations with multiple people from different categories
and frequent occlusions. Videos of our experiments and the dataset are
available at http://www2.informatik.uni-freiburg.de/~kollmitz/MobilityAidsComment: 7 pages, ECMR 2017, dataset and videos:
http://www2.informatik.uni-freiburg.de/~kollmitz/MobilityAids
Multi-View Face Recognition From Single RGBD Models of the Faces
This work takes important steps towards solving the following problem of current interest: Assuming that each individual in a population can be modeled by a single frontal RGBD face image, is it possible to carry out face recognition for such a population using multiple 2D images captured from arbitrary viewpoints? Although the general problem as stated above is extremely challenging, it encompasses subproblems that can be addressed today. The subproblems addressed in this work relate to: (1) Generating a large set of viewpoint dependent face images from a single RGBD frontal image for each individual; (2) using hierarchical approaches based on view-partitioned subspaces to represent the training data; and (3) based on these hierarchical approaches, using a weighted voting algorithm to integrate the evidence collected from multiple images of the same face as recorded from different viewpoints. We evaluate our methods on three datasets: a dataset of 10 people that we created and two publicly available datasets which include a total of 48 people. In addition to providing important insights into the nature of this problem, our results show that we are able to successfully recognize faces with accuracies of 95% or higher, outperforming existing state-of-the-art face recognition approaches based on deep convolutional neural networks
Deep Poselets for Human Detection
We address the problem of detecting people in natural scenes using a part
approach based on poselets. We propose a bootstrapping method that allows us to
collect millions of weakly labeled examples for each poselet type. We use these
examples to train a Convolutional Neural Net to discriminate different poselet
types and separate them from the background class. We then use the trained CNN
as a way to represent poselet patches with a Pose Discriminative Feature (PDF)
vector -- a compact 256-dimensional feature vector that is effective at
discriminating pose from appearance. We train the poselet model on top of PDF
features and combine them with object-level CNNs for detection and bounding box
prediction. The resulting model leads to state-of-the-art performance for human
detection on the PASCAL datasets
Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition
Existing deep convolutional neural networks (CNNs) require a fixed-size
(e.g., 224x224) input image. This requirement is "artificial" and may reduce
the recognition accuracy for the images or sub-images of an arbitrary
size/scale. In this work, we equip the networks with another pooling strategy,
"spatial pyramid pooling", to eliminate the above requirement. The new network
structure, called SPP-net, can generate a fixed-length representation
regardless of image size/scale. Pyramid pooling is also robust to object
deformations. With these advantages, SPP-net should in general improve all
CNN-based image classification methods. On the ImageNet 2012 dataset, we
demonstrate that SPP-net boosts the accuracy of a variety of CNN architectures
despite their different designs. On the Pascal VOC 2007 and Caltech101
datasets, SPP-net achieves state-of-the-art classification results using a
single full-image representation and no fine-tuning.
The power of SPP-net is also significant in object detection. Using SPP-net,
we compute the feature maps from the entire image only once, and then pool
features in arbitrary regions (sub-images) to generate fixed-length
representations for training the detectors. This method avoids repeatedly
computing the convolutional features. In processing test images, our method is
24-102x faster than the R-CNN method, while achieving better or comparable
accuracy on Pascal VOC 2007.
In ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2014, our
methods rank #2 in object detection and #3 in image classification among all 38
teams. This manuscript also introduces the improvement made for this
competition.Comment: This manuscript is the accepted version for IEEE Transactions on
Pattern Analysis and Machine Intelligence (TPAMI) 2015. See Changelo
Seed, Expand and Constrain: Three Principles for Weakly-Supervised Image Segmentation
We introduce a new loss function for the weakly-supervised training of
semantic image segmentation models based on three guiding principles: to seed
with weak localization cues, to expand objects based on the information about
which classes can occur in an image, and to constrain the segmentations to
coincide with object boundaries. We show experimentally that training a deep
convolutional neural network using the proposed loss function leads to
substantially better segmentations than previous state-of-the-art methods on
the challenging PASCAL VOC 2012 dataset. We furthermore give insight into the
working mechanism of our method by a detailed experimental study that
illustrates how the segmentation quality is affected by each term of the
proposed loss function as well as their combinations.Comment: ECCV 201
- …