39 research outputs found

    Combined Learned and Classical Methods for Real-Time Visual Perception in Autonomous Driving

    Full text link
    Autonomy, robotics, and Artificial Intelligence (AI) are among the main defining themes of next-generation societies. Among the most important applications of these technologies is driving automation, which spans from Advanced Driver Assistance Systems (ADAS) to fully self-driving vehicles. Driving automation promises to reduce accidents, increase safety, and expand access to mobility for more people, such as the elderly and people with disabilities. However, one of the main challenges facing autonomous vehicles is robust perception, which is needed to enable safe interaction and decision making. Among the many sensors available for perceiving the environment, each with its own capabilities and limitations, vision is one of the main sensing modalities. Cameras are cheap and can provide rich information about the observed scene. Therefore, this dissertation develops a set of visual perception algorithms with a focus on autonomous driving as the target application area. This dissertation starts by addressing the problem of real-time motion estimation of an agent using only the visual input from a camera attached to it, a problem known as visual odometry. The visual odometry algorithm can achieve low drift rates over long traveled distances, made possible through the innovative local mapping approach used. This visual odometry algorithm was then combined with my multi-object detection and tracking system. The tracking system operates in a tracking-by-detection paradigm where an object detector based on convolutional neural networks (CNNs) is used. Therefore, the combined system can detect and track other traffic participants both in the image domain and in the 3D world frame while simultaneously estimating vehicle motion. This is a necessary requirement for obstacle avoidance and safe navigation. Finally, the operational range of traditional monocular cameras was expanded with the capability to infer depth and thus replace stereo and RGB-D cameras. This is accomplished through a single-stream convolutional neural network which can output both depth prediction and semantic segmentation. Semantic segmentation is the process of classifying each pixel in an image and is an important step toward scene understanding. A literature survey, algorithm descriptions, and comprehensive evaluations on real-world datasets are presented.
    Ph.D., College of Engineering & Computer Science, University of Michigan. https://deepblue.lib.umich.edu/bitstream/2027.42/153989/1/Mohamed Aladem Final Dissertation.pdf
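    A minimal sketch of the data-association step in a tracking-by-detection pipeline like the one described above. The dissertation does not state its matching criterion, so IoU-based assignment via the Hungarian algorithm is shown purely as a common baseline; all function and parameter names are illustrative.

```python
# Hedged sketch: matching existing track boxes to new CNN detections by IoU,
# one common way to realise the association step of tracking-by-detection.
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def associate(tracks, detections, iou_threshold=0.3):
    """Return (track_index, detection_index) pairs for matches above the threshold."""
    if not tracks or not detections:
        return []
    cost = np.array([[1.0 - iou(t, d) for d in detections] for t in tracks])
    rows, cols = linear_sum_assignment(cost)
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= 1.0 - iou_threshold]
```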

    Visual-Inertial Sensor Fusion Models and Algorithms for Context-Aware Indoor Navigation

    Get PDF
    Positioning in navigation systems is predominantly performed by Global Navigation Satellite Systems (GNSSs). However, while GNSS-enabled devices have become commonplace for outdoor navigation, their use for indoor navigation is hindered by GNSS signal degradation or blockage. For this reason, the development of alternative positioning approaches and techniques for navigation systems is an ongoing research topic. In this dissertation, I present a new approach and address three major navigational problems: indoor positioning, obstacle detection, and keyframe detection. The proposed approach utilizes the inertial and visual sensors available on smartphones and is focused on developing: a framework for monocular visual-inertial odometry (VIO) to position humans/objects using sensor fusion and deep learning in tandem; an unsupervised algorithm to detect obstacles using a sequence of visual data; and a supervised context-aware keyframe detection method. The underlying technique for monocular VIO is a recurrent convolutional neural network that computes six-degree-of-freedom (6DoF) pose in an end-to-end fashion, together with an extended Kalman filter module that fine-tunes the scale parameter based on inertial observations and manages errors. I compare the results of my featureless technique with the results of conventional feature-based VIO techniques and manually scaled results. The comparison shows that while the framework is more effective than other featureless methods and improves accuracy, feature-based methods still outperform the proposed approach. The approach for obstacle detection is based on processing two consecutive images to detect obstacles. Experiments comparing my approach with two other widely used algorithms show that my algorithm performs better, achieving 82% precision compared with 69%. In order to determine an appropriate frame-extraction rate from the video stream, I analyzed the movement patterns of the camera and inferred the context of the user to generate a model associating movement anomalies with a proper frame-extraction rate. The output of this model was used to determine the rate of keyframe extraction in visual odometry (VO). I defined and computed the effective frames for VO and used this approach for context-aware keyframe detection. The results show that the number of frames is decreased when inertial data are used to infer the appropriate frames
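    As a rough illustration of how inertial observations might fine-tune the scale of learned 6DoF estimates, the sketch below runs a one-state Kalman filter on the scale factor. It is not the dissertation's EKF module; the random-walk state model, noise values, and names are all assumptions made for the example.

```python
# Hedged sketch: refining a metric scale factor for network-predicted translations
# against inertially derived displacement with a scalar Kalman filter.
import numpy as np

class ScaleFilter:
    def __init__(self, scale0=1.0, var0=1.0, process_var=1e-4, meas_var=1e-2):
        self.s, self.P = scale0, var0           # scale estimate and its variance
        self.Q, self.R = process_var, meas_var  # process / measurement noise (assumed)

    def update(self, vo_translation, imu_displacement):
        """Fuse one measurement: ratio of IMU displacement to VO translation norm."""
        self.P += self.Q                        # predict (random-walk scale model)
        z = np.linalg.norm(imu_displacement) / max(np.linalg.norm(vo_translation), 1e-9)
        K = self.P / (self.P + self.R)          # Kalman gain
        self.s += K * (z - self.s)              # correct the scale estimate
        self.P *= (1.0 - K)
        return self.s
```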

    Visual-Inertial State Estimation With Information Deficiency

    Get PDF
    State estimation is an essential part of intelligent navigation and mapping systems where tracking the location of a smartphone, car, robot, or human-worn device is required. For autonomous systems such as micro aerial vehicles and self-driving cars, it is a prerequisite for control and motion planning. For AR/VR applications, it is the first step to image rendering. Visual-inertial odometry (VIO) is the de facto standard algorithm for embedded platforms because it lends itself to lightweight sensors and processors and benefits from mature research and industrial development. Various approaches have been proposed to achieve accurate real-time tracking, and numerous open-source software packages and datasets are available. However, errors and outliers are common due to the complexity of visual measurement processes and environmental changes, and in practice, estimation drift is inevitable. In this thesis, we introduce the concept of information deficiency in state estimation and show how to utilize this concept to develop and improve VIO systems. We look into the information deficiencies in visual-inertial state estimation, which are often present yet ignored, causing system failures and drift. In particular, we investigate three critical cases of information deficiency in visual-inertial odometry: a low-texture environment with limited computation, monocular visual odometry, and inertial odometry. We consider these systems under three specific application settings: a lightweight quadrotor platform in autonomous flight, driving scenarios, and an AR/VR headset for pedestrians. We address the challenges in each application setting and explore how the tight fusion of deep learning and model-based VIO can improve state-of-the-art system performance and compensate for the lack of information in real time. We identify deep learning as a key technology in tackling the information deficiencies in state estimation. We argue that developing hybrid frameworks that leverage its advantages and enable supervision for performance guarantees provides the most accurate and robust solution to state estimation
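    One simple instance of tightly fusing a learned estimate with a model-based one is inverse-variance weighting in information form, sketched below. The thesis's hybrid frameworks are more elaborate; this is only meant to make the idea of fusing the two sources concrete, and all symbols are illustrative.

```python
# Hedged sketch: information-form fusion of a model-based VIO translation estimate
# with a learned motion prior, each carrying its own covariance.
import numpy as np

def fuse_translation(t_vio, cov_vio, t_learned, cov_learned):
    """Fuse two 3D translation estimates by inverse-variance (information) weighting."""
    info_vio = np.linalg.inv(cov_vio)
    info_net = np.linalg.inv(cov_learned)
    cov_fused = np.linalg.inv(info_vio + info_net)
    t_fused = cov_fused @ (info_vio @ t_vio + info_net @ t_learned)
    return t_fused, cov_fused
```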

    Portable Robotic Navigation Aid for the Visually Impaired

    Get PDF
    This dissertation aims to address the limitations of existing visual-inertial (VI) SLAM methods - lack of needed robustness and accuracy - for assistive navigation in a large indoor space. Several improvements are made to existing SLAM technology, and the improved methods are used to enable two robotic assistive devices, a robot cane and a robotic object manipulation aid, for the visually impaired for assistive wayfinding and object detection/grasping. First, depth measurements are incorporated into the optimization process for device pose estimation to improve the success rate of VI SLAM's initialization and reduce scale drift. The improved method, called depth-enhanced visual-inertial odometry (DVIO), initializes itself immediately as the environment's metric scale can be derived from the depth data. Second, a hybrid PnP (perspective n-point) method is introduced for a more accurate estimation of the pose change between two camera frames by using the 3D data from both frames. Third, to implement DVIO on a smartphone with variable camera intrinsic parameters (CIP), a method called CIP-VMobile is devised to simultaneously estimate the intrinsic parameters and motion states of the camera. CIP-VMobile estimates in real time the CIP, which varies with the smartphone's pose due to the camera's optical image stabilization mechanism, resulting in more accurate device pose estimates. Various experiments are performed to validate the VI-SLAM methods with the two robotic assistive devices. Beyond these primary objectives, SM-SLAM is proposed as a potential extension of the existing SLAM methods for dynamic environments. This forward-looking exploration is premised on the potential that incorporating dynamic object detection capabilities in the front-end could improve SLAM's overall accuracy and robustness. Various experiments have been conducted to validate the efficacy of this newly proposed method, using both public and self-collected datasets. The results obtained substantiate the viability of this innovation, leaving deeper investigation for future work
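    The hybrid PnP method itself is not reproduced here, but a standard 3D-3D rigid alignment (Kabsch/Umeyama without scale) illustrates the kind of building block such a method could combine with 2D-3D reprojection terms when depth is available in both frames. Function and variable names are assumptions for the sketch.

```python
# Hedged sketch: rigid alignment of matched 3D points from two depth-equipped frames,
# a common ingredient when pose change is estimated from 3D data on both sides.
import numpy as np

def rigid_align(P, Q):
    """Find R, t minimising ||R @ P_i + t - Q_i|| for matched 3D point sets (N x 3)."""
    cP, cQ = P.mean(axis=0), Q.mean(axis=0)
    H = (P - cP).T @ (Q - cQ)                  # cross-covariance of centred points
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T                         # proper rotation (det = +1)
    t = cQ - R @ cP
    return R, t
```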

    Localization Algorithms for GNSS-denied and Challenging Environments

    Get PDF
    In this dissertation, the problem of localization in GNSS-denied and challenging environments is addressed. Specifically, the challenging environments discussed in this dissertation are of two types: environments containing only low-resolution features and environments containing moving objects. To achieve accurate pose estimates, errors are typically bounded by matching sensor observations against the surrounding environment. These challenging environments, unfortunately, cause trouble for matching-based methods such as fingerprint matching and ICP. For instance, in environments with low-resolution features, the on-board sensor measurements can match multiple positions on a map, which creates ambiguity; in environments containing moving objects, localization accuracy is degraded by the moving objects during matching. In this dissertation, two sensor-fusion-based strategies are proposed to solve the localization problem for these two types of challenging environments, respectively. For environments with only low-resolution features, such as flying over sea or desert, a multi-agent localization algorithm using pairwise communication with ranging and magnetic anomaly measurements is proposed. A scalable framework is then presented to extend the multi-agent localization algorithm to a large group of agents (e.g., 128 agents) by applying the CI algorithm. The simulation results show that the proposed algorithm is able to handle large group sizes and achieve 10-meter-level localization performance over a 180 km traveled distance, while under restrictive communication constraints. For environments containing moving objects, lidar-inertial-based solutions are proposed and tested in this dissertation. Inspired by the CI algorithm mentioned above, a potential solution using motion estimation and tracking of multiple features is analyzed. To improve the performance and effectiveness of this potential solution, a lidar-inertial-based SLAM algorithm is then proposed. In this method, an efficient tightly coupled iterated Kalman filter with a built-in dynamic object filter is designed as the front-end of the SLAM algorithm, and a factor-graph strategy using scan context for loop closure detection is utilized as the back-end. The performance of the proposed lidar-inertial-based SLAM algorithm is evaluated on several datasets collected in environments containing moving objects, and compared with state-of-the-art lidar-inertial-based SLAM algorithms
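    Assuming CI here refers to covariance intersection, the sketch below shows the standard trace-minimising CI fusion of two estimates whose cross-correlation is unknown, the kind of step a scalable multi-agent framework could apply between agents. The dissertation's exact CI variant is not specified here; the grid search over omega is a simple illustrative choice.

```python
# Hedged sketch: covariance intersection of two (state, covariance) estimates.
import numpy as np

def covariance_intersection(x1, P1, x2, P2, steps=100):
    """Fuse two estimates without knowing their cross-correlation (trace criterion)."""
    best = None
    for w in np.linspace(0.0, 1.0, steps + 1):
        info = w * np.linalg.inv(P1) + (1.0 - w) * np.linalg.inv(P2)
        P = np.linalg.inv(info)
        if best is None or np.trace(P) < np.trace(best[1]):
            x = P @ (w * np.linalg.inv(P1) @ x1 + (1.0 - w) * np.linalg.inv(P2) @ x2)
            best = (x, P)
    return best  # fused state and consistent covariance
```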

    Vision based localization: from humanoid robots to visually impaired people

    Get PDF
    Nowadays, 3D applications have become an increasingly popular topic in robotics, computer vision, and augmented reality. By means of cameras and computer vision techniques, it is possible to obtain accurate 3D models of large-scale environments such as cities. In addition, cameras are low-cost, non-intrusive sensors compared to other sensors such as laser scanners. Furthermore, cameras offer rich information about the environment. One application of great interest is vision-based localization in a prior 3D map. Robots need to perform tasks in the environment autonomously, and for this purpose it is very important to know precisely the location of the robot in the map. In the same way, providing accurate information about the location and spatial orientation of the user in a large-scale environment can benefit those who suffer from visual impairments. Safe and autonomous navigation in known or unknown environments can be a great challenge for those who are blind or visually impaired. Most of the commercial solutions for visually impaired localization and navigation assistance are based on the satellite Global Positioning System (GPS). However, these solutions are not suitable enough for the visually impaired community in urban environments. The errors are on the order of several meters, and there are other problems such as GPS signal loss or line-of-sight restrictions. In addition, GPS does not work if an insufficient number of satellites are directly visible, and therefore it cannot be used in indoor environments. Thus, it is important to do further research on new, more robust, and more accurate localization systems. In this thesis we propose several algorithms to obtain accurate real-time vision-based localization from a prior 3D map. For that purpose, it is necessary to compute a 3D map of the environment beforehand. For computing that 3D map, we employ well-known techniques such as Simultaneous Localization and Mapping (SLAM) or Structure from Motion (SfM). In this thesis, we implement a visual SLAM system using a stereo camera as the only sensor, which allows us to obtain accurate 3D reconstructions of the environment. The proposed SLAM system is also capable of detecting moving objects, especially in a close range to the camera of up to approximately 5 meters, thanks to a moving object detection module. This is possible thanks to a dense scene flow representation of the environment, which provides the 3D motion of the world points. This moving object detection module proves very effective in highly crowded and dynamic environments, where there is a huge number of dynamic objects such as pedestrians. By means of the moving object detection module we avoid adding erroneous 3D points into the SLAM process, yielding much better and more consistent 3D reconstruction results. To the best of our knowledge, this is the first time that dense scene flow and the derived detection of moving objects have been applied in the context of visual SLAM for challenging crowded and dynamic environments, such as the ones presented in this thesis. In SLAM and vision-based localization approaches, 3D map points are usually described by means of appearance descriptors. With these appearance descriptors, the data association between 3D map elements and perceived 2D image features can be performed. In this thesis we have investigated a novel family of appearance descriptors known as Gauge-Speeded Up Robust Features (G-SURF).
    Those descriptors are based on the use of gauge coordinates. By means of these coordinates, every pixel in the image is fixed separately in its own local coordinate frame defined by the local structure itself, consisting of the gradient vector and its perpendicular direction. We have carried out an extensive experimental evaluation on different applications such as image matching, visual object categorization, and 3D SfM, which shows the usefulness and improved results of G-SURF descriptors compared with other state-of-the-art descriptors such as the Scale Invariant Feature Transform (SIFT) or SURF. In vision-based localization applications, one of the most computationally expensive steps is the data association between a large map of 3D points and the perceived 2D features in the image. Traditional approaches often rely on purely appearance-based information for solving the data association step. These algorithms can have a high computational demand, and for environments with highly repetitive textures, such as cities, this data association can lead to erroneous results due to the ambiguities introduced by visually similar features. In this thesis we have developed an algorithm for predicting the visibility of 3D points by means of a memory-based learning approach applied to a prior 3D reconstruction. Thanks to this learning approach, we can speed up the data association step by predicting the visible 3D points given a prior camera pose. We have implemented and evaluated visual SLAM and vision-based localization algorithms for two different applications of great interest: humanoid robots and visually impaired people. Regarding humanoid robots, a monocular vision-based localization algorithm with visibility prediction has been evaluated under different scenarios and different types of sequences such as square and circular trajectories, sequences with moving objects, changes in lighting, etc. A comparison of the localization and mapping error has been made with respect to a precise motion capture system, yielding errors on the order of a few centimeters. Furthermore, we also compared our vision-based localization system with the Parallel Tracking and Mapping (PTAM) approach, obtaining much better results with our localization algorithm. With respect to the vision-based localization approach for the visually impaired, we have evaluated the vision-based localization system in indoor and cluttered office-like environments. In addition, we have evaluated the visual SLAM algorithm with moving object detection in tests with real visually impaired users in very dynamic environments such as inside the Atocha railway station (Madrid, Spain) and in the city center of Alcalá de Henares (Madrid, Spain). The obtained results highlight the potential benefits of our approach for the localization of the visually impaired in large and cluttered environments
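    To make the gauge-coordinate idea concrete, the sketch below constructs the per-pixel gauge frame (the gradient direction and its perpendicular) from Gaussian derivatives. The full G-SURF descriptor built on top of this frame is not reproduced here; the sigma value and filter choices are illustrative assumptions.

```python
# Hedged sketch: per-pixel gauge frame (w along the gradient, v along the isophote),
# the local coordinate system that gauge-coordinate descriptors are expressed in.
import numpy as np
from scipy import ndimage

def gauge_frame(image, sigma=1.0):
    """Return per-pixel unit vectors w (gradient direction) and v (perpendicular)."""
    Lx = ndimage.gaussian_filter(image, sigma, order=(0, 1))  # Gaussian derivative d/dx
    Ly = ndimage.gaussian_filter(image, sigma, order=(1, 0))  # Gaussian derivative d/dy
    mag = np.sqrt(Lx**2 + Ly**2) + 1e-12
    w = np.stack([Lx / mag, Ly / mag], axis=-1)   # gauge w axis: gradient direction
    v = np.stack([-Ly / mag, Lx / mag], axis=-1)  # gauge v axis: perpendicular direction
    return w, v
```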

    Robust ego-localization using monocular visual odometry

    Get PDF

    On Deep Learning Enhanced Multi-Sensor Odometry and Depth Estimation

    Get PDF
    In this thesis, we systematically study the integration of deep learning and simultaneous localization and mapping (SLAM) and advance the research frontier by making the following contributions. (1) We devise a unified information-theoretic framework for end-to-end learning methods aimed at odometry estimation, which not only improves accuracy empirically, but also provides an elegant theoretical tool for performance evaluation and understanding in information-theoretic language. (2) For the integration of learning and geometry, we focus on the scale ambiguity problem in monocular SLAM and odometry systems. To this end, we first propose VRVO (Virtual-to-Real Visual Odometry), which retrieves the absolute scale from virtual data, adapts the learnt features between the real and virtual domains, and establishes a mutual reinforcement pipeline between learning and optimization to further leverage the complementary information. The depth maps are used to carry the scale information and are then integrated with classical SLAM systems by providing initialization values and dense virtual stereo objectives. (3) Since modern sensor suites usually contain multiple sensors, including a camera and an IMU, we further propose DynaDepth, an unsupervised monocular depth estimation method that integrates IMU motion dynamics. A differentiable camera-centric extended Kalman filter (EKF) framework is derived to exploit the complementary information from both the camera and the IMU, and it also provides an uncertainty measure for the ego-motion predictions. The proposed depth network not only learns the absolute scale, but also exhibits better generalization ability and robustness against vision degradation. The resulting depth predictions can be integrated into classical SLAM systems in a similar way to VRVO, yielding a scale-aware monocular SLAM system at inference time
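    As a rough illustration of the IMU motion dynamics that a camera-centric EKF could exploit, the sketch below propagates position, velocity, and rotation through one IMU sample. It is not DynaDepth's derivation; gravity handling, biases, and noise terms are simplified, and all names are assumptions for the example.

```python
# Hedged sketch: one discrete-time IMU propagation step supplying metric ego-motion.
import numpy as np

GRAVITY = np.array([0.0, 0.0, -9.81])  # assumed world gravity vector

def imu_predict(p, v, R, accel, gyro, dt):
    """Propagate position p, velocity v, and rotation R (world-from-body) by one sample."""
    a_world = R @ accel + GRAVITY            # body-frame acceleration into the world frame
    p = p + v * dt + 0.5 * a_world * dt**2
    v = v + a_world * dt
    # first-order rotation update from body rates (small-angle approximation)
    wx, wy, wz = gyro * dt
    dR = np.array([[1.0, -wz, wy], [wz, 1.0, -wx], [-wy, wx, 1.0]])
    R = R @ dR
    return p, v, R
```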

    Robust multimodal dense SLAM

    Get PDF
    To enable increasingly intelligent behaviours, autonomous robots will need to be equipped with a deep understanding of their surrounding environment. It would be particularly desirable if this level of perception could be achieved automatically through vision-based sensing, as passive cameras make a compelling sensor choice for robotic platforms due to their low cost, low weight, and low power consumption. Fundamental to extracting a high-level understanding from a set of 2D images is an understanding of the underlying 3D geometry of the environment. In mobile robotics, the most popular and successful technique for building a representation of 3D geometry from 2D images is Visual Simultaneous Localisation and Mapping (SLAM). While sparse, landmark-based SLAM systems have demonstrated high levels of accuracy and robustness, they are only capable of producing sparse maps. In general, to move beyond simple navigation to scene understanding and interaction, dense 3D reconstructions are required. Dense SLAM systems naturally allow for online dense scene reconstruction, but suffer from a lack of robustness because the dense image alignment used in the tracking step has a narrow convergence basin, and the photometric depth estimation used in the mapping step is typically poorly constrained in the presence of occlusions and homogeneous textures. This thesis develops methods that increase the robustness of dense SLAM by fusing additional sensing modalities into standard dense SLAM pipelines. In particular, it looks at two sensing modalities: acceleration and rotation-rate measurements from an inertial measurement unit (IMU) to address the tracking issue, and learned priors on dense reconstructions from deep neural networks (DNNs) to address the mapping issue.
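    The narrow convergence basin mentioned above concerns the photometric residual minimised by dense image alignment; a deliberately simplified version of that residual is sketched below (nearest-neighbour sampling, no occlusion or boundary handling), with all parameter names as assumptions rather than the thesis's actual implementation.

```python
# Hedged sketch: per-pixel photometric error after warping reference pixels into the
# current frame using a depth map, intrinsics K, and a relative pose (R, t).
import numpy as np

def photometric_residual(img_ref, img_cur, depth_ref, K, R, t):
    """Return the flattened intensity error between warped reference and current image."""
    h, w = img_ref.shape
    vs, us = np.mgrid[0:h, 0:w]
    rays = np.linalg.inv(K) @ np.stack([us.ravel(), vs.ravel(), np.ones(h * w)])
    pts = rays * depth_ref.ravel()                        # back-project to 3D points
    proj = K @ (R @ pts + t[:, None])                     # transform and project
    u2 = np.clip(np.round(proj[0] / proj[2]).astype(int), 0, w - 1)
    v2 = np.clip(np.round(proj[1] / proj[2]).astype(int), 0, h - 1)
    return img_ref.ravel() - img_cur[v2, u2]              # photometric error per pixel
```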