    3D Object Detection for Autonomous Driving: A Survey

    Autonomous driving is regarded as one of the most promising remedies to shield human beings from severe crashes. To this end, 3D object detection serves as the core basis of such a perception system, especially for path planning, motion prediction, collision avoidance, etc. Generally, stereo or monocular images with corresponding 3D point clouds are already the standard setup for 3D object detection, and point clouds are increasingly prevalent because they provide accurate depth information. Despite existing efforts, 3D object detection on point clouds is still in its infancy due to the inherent sparseness and irregularity of point clouds, the misalignment between the camera view and the LiDAR bird's-eye view for modality synergies, occlusions and scale variations at long distances, etc. Recently, profound progress has been made in 3D object detection, with a large body of literature addressing this vision task. As such, we present a comprehensive review of the latest progress in this field, covering all the main topics including sensors, fundamentals, and the recent state-of-the-art detection methods with their pros and cons. Furthermore, we introduce metrics and provide quantitative comparisons on popular public datasets. Avenues for future work are identified after an in-depth analysis of the surveyed works. Finally, we conclude this paper. Comment: 3D object detection, Autonomous driving, Point cloud

    Building an Aerial-Ground Robotics System for Precision Farming: An Adaptable Solution

    The application of autonomous robots in agriculture is gaining increasing popularity thanks to the high impact it may have on food security, sustainability, resource use efficiency, reduction of chemical treatments, and the optimization of human effort and yield. With this vision, the Flourish research project aimed to develop an adaptable robotic solution for precision farming that combines the aerial survey capabilities of small autonomous unmanned aerial vehicles (UAVs) with targeted intervention performed by multi-purpose unmanned ground vehicles (UGVs). This paper presents an overview of the scientific and technological advances and outcomes obtained in the project. We introduce multi-spectral perception algorithms and aerial and ground-based systems developed for monitoring crop density, weed pressure, and crop nitrogen nutrition status, and for accurately classifying and locating weeds. We then introduce the navigation and mapping systems tailored to our robots in the agricultural environment, as well as the modules for collaborative mapping. We finally present the ground intervention hardware, software solutions, and interfaces we implemented and tested in different field conditions and with different crops. We describe a real use case in which a UAV collaborates with a UGV to monitor the field and to perform selective spraying without human intervention. Comment: Published in IEEE Robotics & Automation Magazine, vol. 28, no. 3, pp. 29-49, Sept. 202

    Image Understands Point Cloud: Weakly Supervised 3D Semantic Segmentation via Association Learning

    Weakly supervised point cloud semantic segmentation methods, which require 1% or fewer labels while aiming to achieve almost the same performance as fully supervised approaches, have recently attracted extensive research attention. A typical solution in this framework is to use self-training or pseudo labeling to mine supervision from the point cloud itself, but this ignores the critical information from images. In fact, cameras are widely available in LiDAR scenarios, and this complementary information appears to be highly important for 3D applications. In this paper, we propose a novel cross-modality weakly supervised method for 3D segmentation that incorporates complementary information from unlabeled images. We design a dual-branch network equipped with an active labeling strategy to maximize the power of a tiny fraction of labels and to directly realize 2D-to-3D knowledge transfer. We then establish a cross-modal self-training framework from an Expectation-Maximization (EM) perspective, which iterates between pseudo label estimation and parameter updating. In the M-step, we propose cross-modal association learning to mine complementary supervision from images by reinforcing the cycle-consistency between 3D points and 2D superpixels. In the E-step, a pseudo label self-rectification mechanism is derived to filter noisy labels, thus providing more accurate labels for the networks to be fully trained. Extensive experimental results demonstrate that our method even outperforms state-of-the-art fully supervised competitors with less than 1% actively selected annotations.

    Accurate, fast, and robust 3D city-scale reconstruction using wide area motion imagery

    Multi-view stereopsis (MVS) is a core problem in computer vision: it takes a set of scene views together with known camera poses and produces a geometric representation of the underlying 3D model. Using 3D reconstruction, one can determine any object's 3D profile, as well as the 3D coordinates of any point on the profile. The 3D reconstruction of objects is a general scientific problem and a core technology in a wide variety of fields, such as Computer Aided Geometric Design (CAGD), computer graphics, computer animation, computer vision, medical imaging, computational science, virtual reality, digital media, etc. However, although MVS problems have been studied for decades, many challenges still exist in current state-of-the-art algorithms. For example, many algorithms still lack accuracy and completeness when tested on large city-scale datasets, and most available MVS algorithms require a large amount of execution time and/or specialized hardware and software, which results in high cost. This dissertation work tries to address all of these challenges and proposes multiple solutions. More specifically, it proposes multiple novel MVS algorithms to automatically and accurately reconstruct the underlying 3D scenes. By proposing a novel volumetric voxel-based method, one of our algorithms achieves near real-time runtime speed, does not require any special hardware or software, and can be deployed onto power-constrained embedded systems. By developing a new camera clustering module and a novel weighted voting-based surface likelihood estimation module, our algorithm generalizes to different datasets and achieves the best performance in terms of accuracy and completeness when compared with existing algorithms. This dissertation work also performs the very first quantitative evaluation in terms of precision, recall, and F-score using real-world LiDAR ground-truth data.
Last but not least, this dissertation work proposes an automatic workflow that can stitch multiple point cloud models with limited overlapping areas into one larger 3D model for better geographical coverage. All the results presented in this dissertation work have been evaluated on our wide area motion imagery (WAMI) dataset and improve on state-of-the-art performance by a large margin. The results generated by this dissertation work have been successfully used in many applications, including city digitization, improving detection and tracking performance, real-time dynamic shadow detection, 3D change detection, visibility map generation, VR environments, and visualization combined with other information, such as building footprints and roads.
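
The precision, recall, and F-score evaluation against LiDAR ground truth mentioned above can be illustrated with a minimal sketch. It assumes a common distance-threshold protocol for point clouds, which the abstract does not confirm: a reconstructed point counts as correct if it lies within a hypothetical threshold `tau` of some ground-truth point, and a ground-truth point counts as covered if a reconstructed point lies within `tau` of it. The function names, the toy coordinates, and the brute-force nearest-neighbor search are illustrative assumptions only.

```python
import math

def fraction_within(src, dst, tau):
    """Fraction of points in src whose nearest neighbor in dst is within tau."""
    # Brute-force search; suitable for tiny examples only (assumption, not the
    # dissertation's method; real clouds would need a spatial index).
    hits = sum(1 for p in src if min(math.dist(p, q) for q in dst) <= tau)
    return hits / len(src)

def precision_recall_fscore(reconstructed, ground_truth, tau):
    """Threshold-based precision, recall, and F-score for two point clouds."""
    precision = fraction_within(reconstructed, ground_truth, tau)  # accuracy
    recall = fraction_within(ground_truth, reconstructed, tau)     # completeness
    total = precision + recall
    fscore = 2.0 * precision * recall / total if total > 0 else 0.0
    return precision, recall, fscore

# Toy data (hypothetical): four ground-truth points on a line, three
# reconstructed points of which one is an outlier.
ground_truth = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
reconstructed = [(0.0, 0.05, 0.0), (1.0, 0.0, 0.05), (5.0, 5.0, 5.0)]

p, r, f = precision_recall_fscore(reconstructed, ground_truth, tau=0.1)
print(f"precision={p:.3f} recall={r:.3f} f-score={f:.3f}")
```

In this toy case two of the three reconstructed points are correct and two of the four ground-truth points are covered, so precision and recall differ; the F-score is their harmonic mean.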

    Non-contact Multimodal Indoor Human Monitoring Systems: A Survey

    Indoor human monitoring systems leverage a wide range of sensors, including cameras, radio devices, and inertial measurement units, to collect extensive data from users and the environment. These sensors contribute diverse data modalities, such as video feeds from cameras, received signal strength indicators and channel state information from WiFi devices, and three-axis acceleration data from inertial measurement units. In this context, we present a comprehensive survey of multimodal approaches for indoor human monitoring systems, with a specific focus on their relevance in elderly care. Our survey primarily highlights non-contact technologies, particularly cameras and radio devices, as key components in the development of indoor human monitoring systems. Throughout this article, we explore well-established techniques for extracting features from multimodal data sources. Our exploration extends to methodologies for fusing these features and harnessing multiple modalities to improve the accuracy and robustness of machine learning models. Furthermore, we conduct a comparative analysis across different data modalities in diverse human monitoring tasks and undertake a comprehensive examination of existing multimodal datasets. This extensive survey not only highlights the significance of indoor human monitoring systems but also affirms their versatile applications. In particular, we emphasize their critical role in enhancing the quality of elderly care, offering valuable insights into the development of non-contact monitoring solutions applicable to the needs of aging populations. Comment: 19 pages, 5 figures

    Perception of Unstructured Environments for Autonomous Off-Road Vehicles

    Autonomous vehicles require perception as a necessary prerequisite for controllable and safe interaction, in order to sense and understand their environment. Perception for structured indoor and outdoor environments covers economically lucrative areas such as autonomous passenger transport or industrial robotics, whereas perception of unstructured environments is strongly underrepresented in the research field of environment perception. The unstructured environments analyzed here pose a particular challenge because the existing natural, grown geometries mostly exhibit no homogeneous structure, and similar textures and hard-to-separate objects dominate. This complicates the capture and interpretation of these environments, so perception methods must be designed and optimized specifically for this application area. In this dissertation, novel and optimized perception methods for unstructured environments are proposed and combined in a holistic, three-level pipeline for autonomous off-road vehicles: low-level, mid-level, and high-level perception. The proposed classical methods and machine learning (ML) methods for perception complement each other. Furthermore, the combination of perception and validation methods at each level enables reliable perception of the possibly unknown environment, with loosely and tightly coupled validation methods being combined to ensure a sufficient yet flexible assessment of the proposed perception methods. All methods were developed as individual modules within the perception and validation pipeline proposed in this work, and their flexible combination enables different pipeline designs for a variety of off-road vehicles and use cases as needed.
Low-level perception provides a tightly coupled confidence assessment for raw 2D and 3D sensor data in order to detect sensor failures and to ensure sufficient accuracy of the sensor data. In addition, novel calibration and registration approaches for multi-sensor systems in perception are presented that use only the structure of the environment to register the captured sensor data: a semi-automatic registration approach for registering multiple 3D Light Detection and Ranging (LiDAR) sensors, and a confidence-based framework that combines different registration methods and enables the registration of different sensors with different measurement principles. The combination of multiple registration methods validates the registration results in a tightly coupled manner. Mid-level perception enables the 3D reconstruction of unstructured environments with two methods for estimating disparity from stereo images: a classical, correlation-based method for hyperspectral images, which requires a limited amount of test and validation data, and a second method that estimates disparity from grayscale images with convolutional neural networks (CNNs). Novel disparity error metrics and an evaluation toolbox for the 3D reconstruction of stereo images complement the proposed disparity estimation methods and enable their loosely coupled validation. High-level perception focuses on the interpretation of single 3D point clouds for traversability analysis, object detection, and obstacle avoidance. A domain transfer analysis of state-of-the-art methods for 3D semantic segmentation provides recommendations for achieving the most accurate segmentation possible in new target domains without generating new training data.
The presented training approach for 3D segmentation methods with CNNs can further reduce the required amount of training data. Methods for explainable artificial intelligence before and after modeling enable a loosely coupled validation of the proposed high-level methods with dataset assessment and model-agnostic explanations for CNN predictions. Remediation of contaminated sites and military logistics are the two main use cases in unstructured environments addressed in this work. These application scenarios also show how the gap between the development of individual methods and their integration into the processing chain for autonomous off-road vehicles, with localization, mapping, planning, and control, can be closed. In summary, the proposed pipeline offers flexible perception solutions for autonomous off-road vehicles, and the accompanying validation ensures accurate and trustworthy perception of unstructured environments.
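
The mid-level stage above reconstructs 3D structure from stereo disparity. As a minimal sketch of why disparity suffices for reconstruction, the standard relation for a rectified stereo pair is Z = f · B / d, with focal length f in pixels, baseline B in meters, and disparity d in pixels. The function name and the camera parameters below are hypothetical and are not taken from the dissertation.

```python
def disparity_to_depth(disparity_px, focal_px, baseline_m):
    """Depth in meters for a rectified stereo pair: Z = f * B / d.

    Returns None for non-positive disparities, which carry no valid depth.
    """
    if disparity_px <= 0:
        return None
    return focal_px * baseline_m / disparity_px

# Hypothetical camera: 700 px focal length, 0.5 m baseline.
for d in (70.0, 35.0, 7.0):
    print(d, "px ->", disparity_to_depth(d, focal_px=700.0, baseline_m=0.5), "m")
```

Because depth is inversely proportional to disparity, a fixed disparity error of one pixel translates into a much larger depth error for distant points than for near ones, which is one reason dedicated disparity error metrics are useful.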

    TractorEYE: Vision-based Real-time Detection for Autonomous Vehicles in Agriculture

    Agricultural vehicles such as tractors and harvesters have for decades been able to navigate automatically and more efficiently using commercially available products such as auto-steering and tractor-guidance systems. However, a human operator is still required inside the vehicle to ensure the safety of the vehicle and especially of its surroundings, such as humans and animals. For fully autonomous vehicles to be certified for farming, computer vision algorithms and sensor technologies must detect obstacles with performance equivalent to or better than that of a human. Furthermore, detection must run in real time to allow vehicles to actuate and avoid collisions. This thesis proposes a detection system (TractorEYE), a dataset (FieldSAFE), and procedures to fuse information from multiple sensor technologies to improve the detection of obstacles and to generate a map. TractorEYE is a multi-sensor detection system for autonomous vehicles in agriculture. The multi-sensor system consists of three hardware-synchronized and registered sensors (stereo camera, thermal camera, and multi-beam lidar) mounted on/in a ruggedized and water-resistant casing. Algorithms have been developed to run six detection algorithms in total (four for the RGB camera, one for the thermal camera, and one for the multi-beam lidar) and to fuse the detection information in a common format using either 3D positions or Inverse Sensor Models. A GPU-powered computational platform is able to run the detection algorithms online. For the RGB camera, a deep learning algorithm, DeepAnomaly, is proposed to perform real-time anomaly detection of distant, heavily occluded, and unknown obstacles in agriculture. Compared to the state-of-the-art object detector Faster R-CNN, DeepAnomaly detects humans better and at longer ranges (45-90 m) in an agricultural use case, with a smaller memory footprint and 7.3 times faster processing.
The low memory footprint and fast processing make DeepAnomaly suitable for real-time applications running on an embedded GPU. FieldSAFE is a multi-modal dataset for the detection of static and moving obstacles in agriculture. The dataset includes synchronized recordings from an RGB camera, stereo camera, thermal camera, 360-degree camera, lidar, and radar. Precise localization and pose are provided using IMU and GPS. Ground truth for static and moving obstacles (humans, mannequin dolls, barrels, buildings, vehicles, and vegetation) is available as an annotated orthophoto, with GPS coordinates for moving obstacles. Detection information from multiple detection algorithms and sensors is fused into a map using Inverse Sensor Models and occupancy grid maps. This thesis presents many scientific contributions to the state of the art in perception for autonomous tractors, including a dataset, a sensor platform, detection algorithms, and procedures to perform multi-sensor fusion. Furthermore, important engineering contributions to autonomous farming vehicles are presented, such as easily applicable, open-source software packages and algorithms that have been demonstrated in an end-to-end real-time detection system. The contributions of this thesis have demonstrated, addressed, and solved critical issues in utilizing camera-based perception systems, which are essential to make autonomous vehicles in agriculture a reality.
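
The fusion of detections into a map via Inverse Sensor Models and occupancy grids can be sketched for a single grid cell. This is the textbook Bayesian log-odds update, shown under the assumption that each sensor's inverse sensor model yields a probability that the cell is occupied given its measurement; it is not necessarily the exact formulation used in the thesis, and the function names and probabilities are hypothetical.

```python
import math

def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))

def fuse_cell(prior, inverse_model_outputs):
    """Fuse per-sensor occupancy probabilities for one grid cell.

    inverse_model_outputs holds p(occupied | measurement) from each sensor's
    inverse sensor model; measurements are assumed conditionally independent.
    """
    prior_log_odds = logit(prior)
    log_odds = prior_log_odds
    for p in inverse_model_outputs:
        # Each measurement adds its evidence relative to the prior.
        log_odds += logit(p) - prior_log_odds
    return 1.0 - 1.0 / (1.0 + math.exp(log_odds))

# Hypothetical case: camera and lidar both report 0.7 for the same cell.
agree = fuse_cell(0.5, [0.7, 0.7])
# Hypothetical case: the two sensors disagree symmetrically.
conflict = fuse_cell(0.5, [0.7, 0.3])
print(f"agree={agree:.3f} conflict={conflict:.3f}")
```

Agreeing sensors reinforce each other, so the fused probability exceeds either individual estimate, while symmetric disagreement returns the cell to the prior.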

    Depth Estimation Using 2D RGB Images

    Single image depth estimation is an ill-posed problem. That is, it is not mathematically possible to uniquely estimate the third dimension (depth) from a single 2D image. Hence, additional constraints need to be incorporated in order to regularize the solution space. In the first part of this dissertation, we therefore explore the idea of constraining the model for more accurate depth estimation by taking advantage of the similarity between the RGB image and the corresponding depth map at the geometric edges of the 3D scene. Although deep learning based methods are very successful in computer vision and handle noise very well, they suffer from poor generalization when the test and training distributions are not close. Geometric methods, in contrast, do not have this generalization problem since they benefit from temporal information in an unsupervised manner; they are, however, sensitive to noise. At the same time, explicit modeling of dynamic scenes and flexible objects is a big challenge for traditional computer vision methods. Considering the advantages and disadvantages of each approach, a hybrid method that benefits from both is proposed here by extending the ability of traditional geometric models to handle flexible and dynamic objects in the scene. This is made possible by relaxing geometric computer vision rules from one motion model for some areas of the scene into one for every pixel in the scene. This enables the model to detect even small, flexible, floating debris in a dynamic scene. However, it makes the optimization under-constrained. To change the optimization from under-constrained to over-constrained while maintaining the model's flexibility, a "moving object detection loss" and a "synchrony loss" are designed. The algorithm is trained in an unsupervised fashion. The primary results are in no way comparable to the current state of the art. Because the training process is so slow, it is difficult to compare it to the current state of the art.
Also, the algorithm lacks stability. In addition, the optical flow model is extremely noisy and naive. Finally, some solutions are suggested to address these issues.
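
The ill-posedness stated at the start of this abstract can be made concrete with a minimal sketch of the scale ambiguity under a pinhole camera model: scaling a 3D point along its viewing ray changes its depth but not its pixel. The intrinsics and coordinates below are hypothetical and serve only as an illustration.

```python
def project(point, focal_px=500.0, cx=320.0, cy=240.0):
    """Pinhole projection of a camera-frame 3D point (x, y, z) to a pixel (u, v)."""
    x, y, z = point
    return (focal_px * x / z + cx, focal_px * y / z + cy)

# Two hypothetical points on the same viewing ray, at depths 2 m and 10 m.
near = (0.2, -0.1, 2.0)
far = tuple(5.0 * c for c in near)

u_near, v_near = project(near)
u_far, v_far = project(far)
print("near ->", (u_near, v_near))
print("far  ->", (u_far, v_far))
```

Both points land on the same pixel, so a single image cannot distinguish them without additional constraints such as edge similarity, temporal information, or learned priors.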
