Image Feature Information Extraction for Interest Point Detection: A Comprehensive Review
Interest point detection is one of the most fundamental and critical problems
in computer vision and image processing. In this paper, we carry out a
comprehensive review on image feature information (IFI) extraction techniques
for interest point detection. To systematically introduce how the existing
interest point detection methods extract IFI from an input image, we propose a
taxonomy of the IFI extraction techniques for interest point detection.
According to this taxonomy, we discuss different types of IFI extraction
techniques for interest point detection. Furthermore, we identify the main
unresolved issues in existing IFI extraction techniques, as well as interest
point detection methods that have not been reviewed before. Popular datasets
and evaluation standards are presented, and the performance of eighteen
state-of-the-art approaches is evaluated and discussed. Finally, future
research directions for IFI extraction techniques for interest point detection
are outlined.
Robustness of multimodal 3D object detection using a deep learning approach for autonomous vehicles
In this thesis, we study the robustness of a multimodal 3D object detection model in the context of autonomous vehicles. Self-driving cars need to accurately detect and localize pedestrians and other vehicles in their 3D surroundings to drive safely. Robustness is one of the most critical properties of an algorithm for self-driving 3D perception. We therefore propose a method to evaluate a 3D object detector's robustness. To this end, we trained a representative multimodal 3D object detector on three different datasets and then evaluated the trained models on datasets constructed specifically to assess their robustness under diverse weather and lighting conditions. Our method uses two different approaches to build these evaluation datasets: in one, we use artificially corrupted images; in the other, we use real images captured in extreme weather and lighting conditions. To detect objects such as cars and pedestrians in traffic scenes, the multimodal model relies on images and 3D point clouds; multimodal approaches for 3D object detection exploit different sensors, such as cameras and range detectors, to detect objects of interest in the surrounding environment. We leveraged three well-known autonomous driving datasets, namely KITTI, nuScenes, and Waymo, and conducted extensive experiments to investigate the proposed method, providing quantitative and qualitative results. We observed that the proposed method can measure the robustness of the model effectively.
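The "artificially corrupted images" approach can be illustrated with a minimal sketch. The abstract does not specify the corruption functions, so the fog and noise models below (and their names and parameters) are illustrative assumptions, not the thesis's implementation:

```python
import numpy as np

def corrupt_with_fog(image, fog_strength=0.5, fog_color=200.0):
    """Blend a uniform haze into an image to simulate fog.
    fog_strength in [0, 1]: 0 leaves the image unchanged,
    1 replaces it entirely with the fog color."""
    image = image.astype(np.float64)
    foggy = (1.0 - fog_strength) * image + fog_strength * fog_color
    return np.clip(foggy, 0, 255).astype(np.uint8)

def corrupt_with_noise(image, sigma=25.0, seed=0):
    """Add zero-mean Gaussian noise to simulate sensor degradation."""
    rng = np.random.default_rng(seed)
    noisy = image.astype(np.float64) + rng.normal(0.0, sigma, image.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

# Toy 4x4 grayscale "image": a pixel of value 100 blended halfway
# toward the fog color 200 becomes 150.
img = np.full((4, 4), 100, dtype=np.uint8)
print(corrupt_with_fog(img, 0.5)[0, 0])  # 150
```

A robustness evaluation would then compare detection metrics on clean versus corrupted copies of the same test set, sweeping the corruption severity.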
3D Ground Truth Generation Using Pre-Trained Deep Neural Networks
Training 3D object detectors on publicly available data has been limited to small datasets
due to the large amount of effort required to generate annotations. The difficulty of labeling
in 3D using 2.5D sensors, such as LIDAR, is attributed to the high spatial reasoning skills
required to deal with occlusion and partial viewpoints. Additionally, the current methods
to label 3D objects are cognitively demanding due to frequent task switching. Reducing
both task complexity and the amount of task switching done by annotators is key to
reducing the effort and time required to generate 3D bounding box annotations. We
therefore seek to reduce the burden on the annotators by leveraging existing 3D object
detectors using deep neural networks.
This work introduces a novel ground truth generation method that combines human
supervision with pre-trained neural networks to generate per-instance 3D point cloud
segmentation, 3D bounding boxes, and class annotations. The annotators provide object
anchor clicks, which act as seeds for generating 3D instance segmentation results. The
points belonging to each instance are then used to regress object centroids, bounding box
dimensions, and object orientation. The deep neural network model used to generate the
segmentation masks and bounding box parameters is based on the PointNet architecture.
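The click-to-box pipeline can be sketched with a simple geometric stand-in. The radius-based point selection below replaces the learned PointNet segmentation, and the box fit is axis-aligned (the real model also regresses orientation); all function names here are illustrative:

```python
import numpy as np

def points_near_click(points, click_xyz, radius=2.0):
    """Stand-in for the learned instance segmentation: keep the
    points within `radius` metres of the annotator's anchor click."""
    d = np.linalg.norm(points - click_xyz, axis=1)
    return points[d <= radius]

def fit_box(instance_points):
    """Fit an axis-aligned 3D box (centroid + dimensions) to the
    segmented points; a learned model would also predict orientation."""
    lo, hi = instance_points.min(axis=0), instance_points.max(axis=0)
    return (lo + hi) / 2.0, hi - lo

# Toy point cloud: two points spanning a 2x1x1 m "car" plus one
# far-away outlier that the click-seeded selection discards.
cloud = np.array([[0.0, 0.0, 0.0],
                  [2.0, 1.0, 1.0],
                  [50.0, 0.0, 0.0]])
inst = points_near_click(cloud, np.array([1.0, 0.5, 0.5]), radius=2.0)
centroid, dims = fit_box(inst)
print(centroid.tolist(), dims.tolist())  # [1.0, 0.5, 0.5] [2.0, 1.0, 1.0]
```

The appeal of this interaction model is that the annotator supplies only one click per object, and all spatial reasoning (extent, centroid, heading) is delegated to the network.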
We develop our approach using the KITTI dataset to analyze the quality of the
generated ground truth. The neural network model is trained on the KITTI training
split, and the 3D bounding box outputs are generated using annotation clicks collected
from the validation split. The validation split of the KITTI detection dataset contains
3,712 frames of point cloud and image scenes, and labeling it with the proposed method
took 16.35 hours. Based on these results, our approach is 19 times faster than the latest
published 3D object annotation scheme. Additionally, we found that annotators spent
less time per object as the number of objects in a scene increased, making the approach
very efficient for multi-object labeling. Furthermore, the quality of the generated 3D bounding boxes,
using the labeling method, is compared against the KITTI ground truth. It is shown that
the model performs on par with the current state-of-the-art 3D detectors and the labeling
procedure does not negatively impact the output quality of the bounding boxes. Lastly, the
proposed scheme is applied to previously unseen data from the Autonomoose self-driving
vehicle to demonstrate the generalization capabilities of the network.
A PhD Dissertation on Road Topology Classification for Autonomous Driving
Road topology classification is a crucial point if we want to develop complete and safe
autonomous driving systems. It is logical to think that a thorough understanding of
the environment surrounding the ego-vehicle, as it happens when a human being is a
decision-maker at the wheel, is an indispensable condition if we want to advance in the
achievement of level 4 or 5 autonomous vehicles. If the driver, either an autonomous
system or a human being, does not have access to the information of the environment,
the decrease in safety is critical, and the accident is almost instantaneous, i.e., when a
driver falls asleep at the wheel.
Throughout this doctoral thesis, we present two deep learning systems that will help
an autonomous driving system understand the environment in which it is at that instant.
The first one, 3D-Deep and its optimization 3D-Deepest, is a new network architecture
for semantic road segmentation in which data sources of different types are integrated.
Road segmentation is vital in an autonomous vehicle since it is the medium on which
it should drive in 99.9% of cases. The second is an urban intersection classification
system using different approaches based on metric learning, temporal integration, and
synthetic image generation. Safety is a crucial point in any autonomous system, and
even more so in a driving system. Intersections are among the places within cities where
safety is most critical: cars follow intersecting trajectories and can therefore collide,
and most intersections are used by pedestrians to cross the road regardless of whether
crosswalks exist, which alarmingly increases the risk of collisions and of pedestrians
being struck.
Combining both systems substantially improves the understanding of the environment
and can be considered to increase safety, paving the way in the research towards a fully
autonomous vehicle.
Deep reinforcement learning for multi-modal embodied navigation
This work focuses on an Outdoor Micro-Navigation (OMN) task in which the goal is to
navigate to a specified street address using multiple modalities including images, scene-text,
and GPS. This task is a significant challenge to many Blind and Visually Impaired (BVI)
people, which we demonstrate through interviews and market research. To investigate the
feasibility of solving this task with Deep Reinforcement Learning (DRL), we first introduce
two partially observable grid-worlds, Grid-Street and Grid City, containing houses, street
numbers, and navigable regions. In these environments, we train an agent to find specific
houses using local observations under a variety of training procedures. We parameterize
our agent with a neural network and train using reinforcement learning methods. Next, we
introduce the Sidewalk Environment for Visual Navigation (SEVN), which contains panoramic
images with labels for house numbers, doors, and street name signs, and formulations for
several navigation tasks. In SEVN, we train another neural network model using Proximal
Policy Optimization (PPO) to fuse multi-modal observations in the form of variable resolution
images, visible text, and simulated GPS data, and to use this representation to navigate to
goal doors. Our best model used all available modalities and was able to navigate to over 100
goals with an 85% success rate. We found that models with access to only a subset of these
modalities performed significantly worse, supporting the need for a multi-modal approach to
the OMN task. We hope that this thesis provides a foundation for further research into the
creation of agents that assist members of the BVI community in navigating safely.
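The multi-modal fusion described in this abstract can be sketched at the level of the observation the policy consumes. The layout below (flattened image features, a scene-text embedding, and the GPS offset to the goal, concatenated into one vector) is a hypothetical illustration; the thesis's actual feature extractors and dimensions are not specified here:

```python
import numpy as np

def fuse_observation(image_feat, text_feat, gps_xy, goal_xy):
    """Concatenate per-modality features into the flat observation
    vector a policy network would consume. Hypothetical layout:
    image features, scene-text features, then the GPS offset to goal."""
    gps_offset = (np.asarray(goal_xy, dtype=np.float64)
                  - np.asarray(gps_xy, dtype=np.float64))
    return np.concatenate([np.ravel(image_feat),
                           np.ravel(text_feat),
                           gps_offset])

img = np.zeros(8)           # stand-in for CNN image features
txt = np.array([1.0, 0.0])  # stand-in for embedded visible text
obs = fuse_observation(img, txt, gps_xy=(3.0, 4.0), goal_xy=(6.0, 8.0))
print(obs.shape[0], obs[-2:].tolist())  # 12 [3.0, 4.0]
```

In the full system, such a fused vector (or per-modality encoders feeding a shared trunk) would be the input to the PPO-trained policy; ablating a modality then amounts to zeroing or dropping its slice, which is how the subset-of-modalities comparison can be run.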