
    Drive Anywhere: Generalizable End-to-end Autonomous Driving with Multi-modal Foundation Models

    As autonomous driving technology matures, end-to-end methodologies have emerged as a leading strategy, promising seamless integration from perception to control via deep learning. However, existing systems grapple with challenges such as unexpected open-set environments and the complexity of black-box models. At the same time, the evolution of deep learning introduces larger, multimodal foundation models, offering multi-modal visual and textual understanding. In this paper, we harness these multimodal foundation models to enhance the robustness and adaptability of autonomous driving systems, enabling out-of-distribution, end-to-end, multimodal, and more explainable autonomy. Specifically, we present an approach to end-to-end open-set (any environment/scene) autonomous driving that is capable of providing driving decisions from representations queryable by image and text. To do so, we introduce a method to extract nuanced spatial (pixel/patch-aligned) features from transformers to enable the encapsulation of both spatial and semantic features. Our approach (i) demonstrates unparalleled results in diverse tests while achieving significantly greater robustness in out-of-distribution situations, and (ii) allows the incorporation of latent space simulation (via text) for improved training (data augmentation via text) and policy debugging. We encourage the reader to check our explainer video at https://www.youtube.com/watch?v=4n-DJf8vXxo&feature=youtu.be and to view the code and demos on our project webpage at https://drive-anywhere.github.io/. Comment: Project webpage: https://drive-anywhere.github.io Explainer video: https://www.youtube.com/watch?v=4n-DJf8vXxo&feature=youtu.be
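    The key ingredient described above is a set of pixel/patch-aligned transformer features that live in the same latent space as text, so driving-relevant content can be queried by language. The sketch below illustrates only that general idea, assuming a CLIP-style joint embedding; the function name and tensor shapes are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def patch_text_similarity(patch_features, text_embedding):
    """Score each spatial patch against a text query via cosine similarity.

    patch_features: (H*W, D) patch-aligned features from a vision transformer
    text_embedding: (D,) embedding of a text query in the same latent space
    Returns a (H*W,) relevance map a downstream driving policy could consume.
    """
    patch_features = F.normalize(patch_features, dim=-1)
    text_embedding = F.normalize(text_embedding, dim=-1)
    return patch_features @ text_embedding

# Toy usage with random tensors standing in for real model outputs.
D = 512
patches = torch.randn(14 * 14, D)   # e.g. a 14x14 patch grid
query = torch.randn(D)              # e.g. an embedding of "a pedestrian crossing"
relevance = patch_text_similarity(patches, query)
print(relevance.shape)  # torch.Size([196])
```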

    Research on obstacle avoidance optimization and path planning of autonomous vehicles based on attention mechanism combined with multimodal information decision-making thoughts of robots

    With the development of machine perception and multimodal information decision-making techniques, autonomous driving technology has become a crucial area of advancement in the transportation industry. The optimization of vehicle navigation, path planning, and obstacle avoidance tasks is of paramount importance. In this study, we explore the use of attention mechanisms in an end-to-end architecture for optimizing obstacle avoidance and path planning in autonomous driving vehicles. We position our research within the broader context of robotics, emphasizing the fusion of information and decision-making capabilities. The introduction of attention mechanisms enables vehicles to perceive the environment more accurately by focusing on important information and making informed decisions in complex scenarios. By inputting multimodal information, such as images and LiDAR data, into the attention mechanism module, the system can automatically learn and weigh crucial environmental features, thereby placing greater emphasis on key information during obstacle avoidance decisions. Additionally, we leverage the end-to-end architecture and draw on classical theories and algorithms from the field of robotics to enhance the perception and decision-making abilities of autonomous driving vehicles. Furthermore, we address the optimization of path planning using attention mechanisms. We transform the vehicle's navigation task into a sequential decision-making problem and employ LSTM (Long Short-Term Memory) models to handle dynamic navigation in varying environments. By applying attention mechanisms to weigh key points along the navigation path, the vehicle can flexibly select the optimal route and dynamically adjust it based on real-time conditions. Finally, we conducted extensive experimental evaluations and software experiments of the proposed end-to-end architecture on real road datasets. The method effectively avoids obstacles, adheres to traffic rules, and achieves stable, safe, and efficient autonomous driving in diverse road scenarios. This research provides an effective solution for optimizing obstacle avoidance and path planning in the field of autonomous driving. Moreover, it contributes to the advancement and practical application of multimodal information fusion in navigation, localization, and human-robot interaction.
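    The core mechanism described here is attention over multimodal inputs: image and LiDAR features are weighted so the decision module emphasizes the most relevant information. The following PyTorch sketch shows one minimal way such a fusion block could look; the class name and layer sizes are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class MultimodalAttentionFusion(nn.Module):
    """Weigh image and LiDAR feature vectors with learned attention scores."""

    def __init__(self, img_dim=256, lidar_dim=128, fused_dim=256):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, fused_dim)
        self.lidar_proj = nn.Linear(lidar_dim, fused_dim)
        self.attn = nn.Linear(fused_dim, 1)  # one scalar score per modality

    def forward(self, img_feat, lidar_feat):
        # Project both modalities into a shared space: (B, 2, fused_dim)
        tokens = torch.stack(
            [self.img_proj(img_feat), self.lidar_proj(lidar_feat)], dim=1)
        weights = torch.softmax(self.attn(tokens), dim=1)  # (B, 2, 1)
        return (weights * tokens).sum(dim=1)  # attention-weighted fusion

fusion = MultimodalAttentionFusion()
fused = fusion(torch.randn(4, 256), torch.randn(4, 128))
print(fused.shape)  # torch.Size([4, 256])
```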

    Multimodal End-to-End Learning for Autonomous Steering in Adverse Road and Weather Conditions

    Autonomous driving is challenging in adverse road and weather conditions, in which there might not be lane lines, the road might be covered in snow, and the visibility might be poor. We extend previous work on end-to-end learning for autonomous steering to operate in these adverse real-life conditions with multimodal data. We collected 28 hours of driving data in several road and weather conditions and trained convolutional neural networks to predict the car's steering wheel angle from front-facing color camera images and lidar range and reflectance data. We compared CNN model performance across the different modalities, and our results show that the lidar modality improves the performance of different multimodal sensor-fusion models. We also performed on-road tests with different models, and they support this observation.
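    The model described above maps stacked camera and lidar channels directly to a steering wheel angle. A minimal sketch of such an end-to-end steering regressor is shown below; the channel count, layer sizes, and input resolution are illustrative assumptions (loosely in the style of NVIDIA's PilotNet), not the paper's exact network.

```python
import torch
import torch.nn as nn

class SteeringCNN(nn.Module):
    """Predict a steering wheel angle from stacked camera and LiDAR channels.

    Input channels (illustrative): 3 RGB + 1 LiDAR range + 1 LiDAR reflectance.
    """

    def __init__(self, in_channels=5):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 24, 5, stride=2), nn.ReLU(),
            nn.Conv2d(24, 36, 5, stride=2), nn.ReLU(),
            nn.Conv2d(36, 48, 5, stride=2), nn.ReLU(),
            nn.Conv2d(48, 64, 3), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Sequential(
            nn.Flatten(), nn.Linear(64, 50), nn.ReLU(), nn.Linear(50, 1))

    def forward(self, x):
        return self.head(self.features(x))  # (B, 1) steering angle

model = SteeringCNN()
angle = model(torch.randn(2, 5, 66, 200))  # assumed input resolution
print(angle.shape)  # torch.Size([2, 1])
```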

    Multimodal Manoeuvre and Trajectory Prediction for Autonomous Vehicles Using Transformer Networks

    Predicting the behaviour (i.e. manoeuvre/trajectory) of other road users, including vehicles, is critical for the safe and efficient operation of autonomous vehicles (AVs), a.k.a. automated driving systems (ADSs). Due to the uncertain future behaviour of vehicles, multiple future behaviour modes are often plausible for a vehicle in a given driving scene. Therefore, multimodal prediction can provide richer information than single-mode prediction, enabling AVs to perform a better risk assessment. To this end, we propose a novel multimodal prediction framework that can predict multiple plausible behaviour modes and their likelihoods. The proposed framework includes a bespoke problem formulation for manoeuvre prediction, a novel transformer-based prediction model, and a tailored training method for multimodal manoeuvre and trajectory prediction. The performance of the framework is evaluated using two public benchmark highway driving datasets, namely NGSIM and highD. The results show that the proposed framework outperforms the state-of-the-art multimodal methods in the literature in terms of prediction error and is capable of predicting plausible manoeuvre and trajectory modes. Comment: 8 pages, 3 figures, submitted to IEEE RA
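    At its core, such a model encodes the observed motion with a transformer and emits several candidate futures together with a probability for each mode. The sketch below is a generic, minimal version of that pattern in PyTorch; the dimensions, horizon, and number of modes are assumptions for illustration, not the framework proposed in the paper.

```python
import torch
import torch.nn as nn

class MultimodalTrajectoryPredictor(nn.Module):
    """Encode an observed track with a transformer and output K future
    trajectory modes plus their likelihoods (a generic sketch)."""

    def __init__(self, d_model=64, num_modes=3, horizon=25):
        super().__init__()
        self.num_modes, self.horizon = num_modes, horizon
        self.embed = nn.Linear(2, d_model)  # embed (x, y) history points
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.traj_head = nn.Linear(d_model, num_modes * horizon * 2)
        self.mode_head = nn.Linear(d_model, num_modes)

    def forward(self, history):                          # history: (B, T, 2)
        h = self.encoder(self.embed(history)).mean(dim=1)  # (B, d_model)
        trajs = self.traj_head(h).view(-1, self.num_modes, self.horizon, 2)
        mode_probs = torch.softmax(self.mode_head(h), dim=-1)
        return trajs, mode_probs

model = MultimodalTrajectoryPredictor()
trajs, probs = model(torch.randn(8, 15, 2))  # 15 observed (x, y) steps
print(trajs.shape, probs.shape)  # (8, 3, 25, 2) and (8, 3)
```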

    Integrating state-of-the-art CNNs for multi-sensor 3D vehicle detection in real autonomous driving environments

    2019 IEEE Intelligent Transportation Systems Conference (ITSC), Auckland, New Zealand, 27-30 Oct. 2019. This paper presents two new approaches to detect surrounding vehicles in 3D urban driving scenes and their corresponding Bird's Eye View (BEV). The proposals integrate two state-of-the-art Convolutional Neural Networks (CNNs), namely YOLOv3 and Mask-RCNN, into a framework presented by the authors in [1] for 3D vehicle detection that fuses semantic image segmentation and the LIDAR point cloud. Our proposals take advantage of multimodal fusion, geometric constraints, and pre-trained modules inside our framework. The methods have been tested using the KITTI object detection benchmark and a comparison is presented. Experiments show that the new approaches improve results with respect to the baseline and are on par with other competitive state-of-the-art proposals, while being the only ones that do not apply an end-to-end learning process. In this way, they remove the need to train on a specific dataset and show a good capability of generalization to any domain, a key point for self-driving systems. Finally, we have tested our best proposal on KITTI in our own driving environment, without any adaptation, obtaining results suitable for our autonomous driving application. Ministerio de Economía y Competitividad; Comunidad de Madrid.
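    The fusion step in this family of approaches typically projects the LIDAR point cloud into the image plane and keeps the points that fall inside the 2D detections returned by YOLOv3 or Mask-RCNN, from which a 3D box can then be estimated. The NumPy sketch below shows only that projection-and-filter idea, with a synthetic projection matrix and box; it is not the framework of [1].

```python
import numpy as np

def points_in_detection(points_cam, P, box_2d):
    """Project points into the image plane and keep those inside a 2D
    detection box (e.g. from YOLOv3 or Mask-RCNN).

    points_cam: (N, 3) LiDAR points already in the camera frame (synthetic)
    P:          (3, 4) camera projection matrix
    box_2d:     (x1, y1, x2, y2) detection in pixel coordinates
    """
    homog = np.hstack([points_cam, np.ones((len(points_cam), 1))])
    uvw = homog @ P.T
    uv = uvw[:, :2] / uvw[:, 2:3]          # perspective divide
    x1, y1, x2, y2 = box_2d
    mask = ((uv[:, 0] >= x1) & (uv[:, 0] <= x2) &
            (uv[:, 1] >= y1) & (uv[:, 1] <= y2) &
            (uvw[:, 2] > 0))               # keep points in front of the camera
    return points_cam[mask]

# Synthetic example: identity-like projection and a box around the image centre.
P = np.hstack([np.eye(3), np.zeros((3, 1))])
pts = np.random.uniform(-5, 5, size=(1000, 3)) + np.array([0, 0, 10.0])
subset = points_in_detection(pts, P, box_2d=(-0.2, -0.2, 0.2, 0.2))
print(subset.shape)
```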

    Multimodal manoeuvre and trajectory prediction for automated driving on highways using transformer networks

    Predicting the behaviour (i.e., manoeuvre/trajectory) of other road users, including vehicles, is critical for the safe and efficient operation of autonomous vehicles (AVs), a.k.a. automated driving systems (ADSs). Due to the uncertain future behaviour of vehicles, multiple future behaviour modes are often plausible for a vehicle in a given driving scene. Therefore, multimodal prediction can provide richer information than single-mode prediction, enabling AVs to perform a better risk assessment. To this end, we propose a novel multimodal prediction framework that can predict multiple plausible behaviour modes and their likelihoods. The proposed framework includes a bespoke problem formulation for manoeuvre prediction, a novel transformer-based prediction model, and a tailored training method for multimodal manoeuvre and trajectory prediction. The performance of the framework is evaluated using three public highway driving datasets, namely NGSIM, highD, and exiD. The results show that our framework outperforms the state-of-the-art multimodal methods in terms of prediction error and is capable of predicting plausible manoeuvre and trajectory modes.
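    This entry extends the framework above (evaluated here on NGSIM, highD, and exiD). Multi-mode outputs of this kind are commonly scored with a minimum-over-modes displacement error; the snippet below sketches such a metric (minADE) as a generic example, not necessarily the exact error measure used in these papers.

```python
import numpy as np

def min_ade(pred_modes, ground_truth):
    """Minimum Average Displacement Error over K predicted modes.

    pred_modes:   (K, T, 2) predicted future trajectories
    ground_truth: (T, 2) observed future trajectory
    """
    errors = np.linalg.norm(pred_modes - ground_truth[None], axis=-1)  # (K, T)
    return errors.mean(axis=1).min()

# Toy example with 3 modes over a 5-step horizon.
gt = np.cumsum(np.ones((5, 2)), axis=0)      # straight diagonal path
modes = np.stack([gt, gt + 0.5, gt * 0.0])   # one mode matches exactly
print(min_ade(modes, gt))  # 0.0
```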

    Multimodal 3D Object Detection from Simulated Pretraining

    The need for simulated data in autonomous driving applications has become increasingly important, both for validation of pretrained models and for training new models. In order for these models to generalize to real-world applications, it is critical that the underlying dataset contains a variety of driving scenarios and that simulated sensor readings closely mimic real-world sensors. We present the Carla Automated Dataset Extraction Tool (CADET), a novel tool for generating training data from the CARLA simulator to be used in autonomous driving research. The tool is able to export high-quality, synchronized LIDAR and camera data with object annotations, and offers configuration to accurately reflect a real-life sensor array. Furthermore, we use this tool to generate a dataset consisting of 10,000 samples and use this dataset to train the 3D object detection network AVOD-FPN, with fine-tuning on the KITTI dataset in order to evaluate the potential for effective pretraining. We also present two novel LIDAR feature map configurations in Bird's Eye View for use with AVOD-FPN that can be easily modified. These configurations are tested on the KITTI and CADET datasets in order to evaluate their performance as well as the usability of the simulated dataset for pretraining. Although simulated data is insufficient to fully replace real-world data, and generally cannot exceed the performance of systems fully trained on real data, our results indicate that it can considerably reduce the amount of training on real data required to achieve satisfactory levels of accuracy. Comment: 12 pages, part of proceedings for the NAIS 2019 symposium
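    The LIDAR feature map configurations mentioned here are Bird's Eye View rasterisations of the point cloud. The NumPy sketch below shows one common way to build such a BEV input (a max-height channel and a log-density channel); the grid extent and cell size are illustrative assumptions, not the exact configurations proposed in the paper.

```python
import numpy as np

def lidar_to_bev(points, x_range=(0, 70), y_range=(-35, 35), cell=0.1):
    """Rasterise LiDAR points (x, y, z) into BEV height and density maps."""
    nx = int((x_range[1] - x_range[0]) / cell)
    ny = int((y_range[1] - y_range[0]) / cell)
    height = np.zeros((nx, ny), dtype=np.float32)
    density = np.zeros((nx, ny), dtype=np.float32)

    ix = ((points[:, 0] - x_range[0]) / cell).astype(int)
    iy = ((points[:, 1] - y_range[0]) / cell).astype(int)
    valid = (ix >= 0) & (ix < nx) & (iy >= 0) & (iy < ny)
    for x, y, z in zip(ix[valid], iy[valid], points[valid, 2]):
        height[x, y] = max(height[x, y], z)   # max height per cell
        density[x, y] += 1                    # point count per cell

    return np.stack([height, np.log1p(density)])  # (2, nx, ny) feature map

# Synthetic point cloud standing in for a real LiDAR sweep.
cloud = np.random.uniform([0, -35, -2], [70, 35, 1], size=(5000, 3))
bev = lidar_to_bev(cloud)
print(bev.shape)  # (2, 700, 700)
```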