Audiovisual data fusion for successive speakers tracking
In this paper, a human speaker tracking method using audio and video data is presented. It is applied to conversation tracking with a robot. Audiovisual data fusion is performed in a two-step process. Detection is performed independently in each modality: face detection based on skin color in the video data, and sound source localization based on the time difference of arrival in the audio data. The results of these detection processes are then fused using an adapted Bayesian filter to detect the speaker. The robot is able to detect the face of the talking person and to detect a new speaker in a conversation.
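As a rough illustration of this kind of fusion (a minimal sketch, not the authors' implementation), the snippet below combines a face-detection measurement and a TDOA-based sound-source measurement in a discrete Bayesian filter over candidate azimuth directions. The function names, grid resolution, and noise parameters are all illustrative assumptions.

```python
import numpy as np

AZIMUTHS = np.linspace(-90, 90, 181)  # candidate speaker directions, degrees

def gaussian_likelihood(measured_deg, sigma_deg):
    """Likelihood of each candidate azimuth given one detection."""
    return np.exp(-0.5 * ((AZIMUTHS - measured_deg) / sigma_deg) ** 2)

def bayes_update(prior, face_deg=None, sound_deg=None):
    """One fusion step: predict (diffuse), then update with both modalities."""
    # Prediction: blur the prior to account for possible speaker motion.
    kernel = gaussian_likelihood(0.0, 5.0)
    kernel /= kernel.sum()
    posterior = np.convolve(prior, kernel, mode="same")
    # Update: multiply in whichever detections are available this frame.
    if face_deg is not None:
        posterior *= gaussian_likelihood(face_deg, 10.0)   # skin-color face detector
    if sound_deg is not None:
        posterior *= gaussian_likelihood(sound_deg, 15.0)  # TDOA sound localization
    return posterior / posterior.sum()

belief = np.full_like(AZIMUTHS, 1.0 / len(AZIMUTHS))  # uniform prior
belief = bayes_update(belief, face_deg=20.0, sound_deg=25.0)
print("estimated speaker azimuth:", AZIMUTHS[np.argmax(belief)])
```

Because each modality only multiplies in a likelihood when a detection is present, the filter degrades gracefully when the face is occluded or the speaker is silent, which is the main appeal of fusing the two cues.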
Multi-View Deep Learning for Consistent Semantic Mapping with RGB-D Cameras
Visual scene understanding is an important capability that enables robots to act purposefully in their environment. In this paper, we propose a novel approach to object-class segmentation from multiple RGB-D views using deep learning. We train a deep neural network, in a semi-supervised way, to predict object-class semantics that are consistent across several viewpoints. At test time, the semantic predictions of our network can be fused more consistently into semantic keyframe maps than the predictions of a network trained on individual views. We base our network architecture on a recent single-view deep learning approach to RGB and depth fusion for semantic object-class segmentation and enhance it with multi-scale loss minimization. We obtain the camera trajectory using RGB-D SLAM and warp the predictions of RGB-D images into ground-truth annotated frames in order to enforce multi-view consistency during training. At test time, predictions from multiple views are fused into keyframes. We propose and analyze several methods for enforcing multi-view consistency during training and testing. We evaluate the benefit of multi-view consistency training and demonstrate that pooling of deep features and fusion over multiple views outperform single-view baselines on the NYUDv2 benchmark for semantic segmentation. Our end-to-end trained network achieves state-of-the-art performance on the NYUDv2 dataset in both single-view segmentation and multi-view semantic fusion.
Comment: the 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2017)
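The keyframe fusion step lends itself to a short illustration. The sketch below is hypothetical (the warping via the RGB-D SLAM trajectory is stubbed out): it averages per-view class probabilities that have already been warped into a common keyframe, masking out pixels not visible in a given view.

```python
import numpy as np

def fuse_into_keyframe(warped_probs, valid_masks):
    """Fuse per-view semantic predictions warped into a common keyframe.

    warped_probs: list of (H, W, C) arrays of class probabilities per view
    valid_masks:  list of (H, W) boolean arrays (pixel visible in that view)
    returns:      (H, W) array of fused class labels
    """
    h, w, c = warped_probs[0].shape
    acc = np.zeros((h, w, c))
    count = np.zeros((h, w, 1))
    for probs, mask in zip(warped_probs, valid_masks):
        acc += probs * mask[..., None]   # only count pixels the view actually sees
        count += mask[..., None]
    fused = acc / np.maximum(count, 1)   # average over the views seeing each pixel
    return fused.argmax(axis=-1)
```

Averaging is just one of several fusion rules one could plug in here; the paper proposes and compares several consistency mechanisms, so this should be read as the general shape of the operation rather than the specific variant evaluated.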
Data fusion in ubiquitous networked robot systems for urban services
There is a clear trend in the use of robots to accomplish services that can help humans. In this paper, robots acting in urban environments are considered for the task of person guiding. Nowadays, it is common to have ubiquitous sensors integrated within buildings, such as camera networks, and wireless communications like 3G or WiFi. Such infrastructure can be directly used by robotic platforms. The paper shows how combining the information from the robots and the sensors allows tracking failures to be overcome, making the system more robust under occlusion, clutter, and lighting changes. The paper describes the algorithms for tracking with a set of fixed surveillance cameras and the algorithms for position tracking using the signal strength received by a wireless sensor network (WSN). Moreover, an algorithm to obtain estimates of the positions of people from cameras on board the robots is described. The estimates from all these sources are then combined using a decentralized data fusion algorithm to increase performance. This scheme is scalable and can handle communication latencies and failures. We present results of the system operating in real time in a large outdoor environment, including 22 nonoverlapping cameras, a WSN, and several robots.
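The abstract does not name the specific decentralized fusion rule. Covariance intersection is one standard choice for combining estimates whose cross-correlations are unknown, which fits the decentralized, scalable setting described here; the sketch below is an illustrative assumption, not the paper's algorithm, and the example values are made up.

```python
import numpy as np

def covariance_intersection(x1, P1, x2, P2, omega=0.5):
    """Fuse two estimates with unknown cross-correlation (covariance intersection).

    x1, x2: state means (n,); P1, P2: covariances (n, n)
    omega:  weight in [0, 1]; often chosen to minimize the trace or det of P
    """
    P1_inv, P2_inv = np.linalg.inv(P1), np.linalg.inv(P2)
    P = np.linalg.inv(omega * P1_inv + (1 - omega) * P2_inv)
    x = P @ (omega * P1_inv @ x1 + (1 - omega) * P2_inv @ x2)
    return x, P

# e.g., fuse a camera-network track of a person with a coarser WSN
# signal-strength estimate of the same person (positions in meters):
x_cam, P_cam = np.array([3.0, 1.2]), np.diag([0.2, 0.2])
x_wsn, P_wsn = np.array([3.4, 1.0]), np.diag([1.0, 1.0])
x_fused, P_fused = covariance_intersection(x_cam, P_cam, x_wsn, P_wsn)
```

A rule of this form never assumes independence between the sources, which is what keeps the fused covariance consistent when nodes exchange already-fused estimates over a network with latencies.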
Dynamic Occupancy Grid Prediction for Urban Autonomous Driving: A Deep Learning Approach with Fully Automatic Labeling
Long-term situation prediction plays a crucial role in the development of intelligent vehicles. A major challenge still to overcome is the prediction of complex downtown scenarios with multiple road users, e.g., pedestrians, bikes, and motor vehicles, interacting with each other. This contribution tackles this challenge by combining a Bayesian filtering technique for environment representation with machine learning as a long-term predictor. More specifically, a dynamic occupancy grid map is used as input to a deep convolutional neural network. This yields the advantage of using spatially distributed velocity estimates from a single time step for prediction, rather than a raw data sequence, alleviating common problems in dealing with input time series from multiple sensors. Furthermore, convolutional neural networks have the inherent characteristic of using context information, enabling the implicit modeling of road-user interaction. Pixel-wise balancing is applied in the loss function to counteract the extreme imbalance between static and dynamic cells. A major advantage is the unsupervised-learning character of the approach, owing to fully automatic label generation. The presented algorithm is trained and evaluated on multiple hours of recorded sensor data and compared to Monte-Carlo simulation …
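The pixel-wise balancing mentioned above can be illustrated with a short sketch: weight each cell's cross-entropy term by the inverse frequency of its class so the rare dynamic cells are not dominated by the far more numerous static ones. The function name and the exact weighting scheme are assumptions for illustration, not the paper's loss.

```python
import numpy as np

def balanced_pixel_ce(probs, labels, dynamic_label=1, eps=1e-7):
    """Pixel-wise inverse-frequency-weighted cross-entropy over a grid.

    probs:  (H, W, C) predicted class probabilities per grid cell
    labels: (H, W)    integer class labels from the automatic labeling
    """
    h, w, _ = probs.shape
    # Cross-entropy of the true class at each cell.
    p_true = np.take_along_axis(probs, labels[..., None], axis=-1)[..., 0]
    ce = -np.log(p_true + eps)
    # Inverse-frequency weights: rare dynamic cells count more per pixel.
    n_dyn = max(int((labels == dynamic_label).sum()), 1)
    n_stat = max(int((labels != dynamic_label).sum()), 1)
    weights = np.where(labels == dynamic_label, (h * w) / n_dyn, (h * w) / n_stat)
    return (weights * ce).mean()
```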