2,072 research outputs found

    Reasoning From Point Clouds

    Get PDF
    Over the past two years, 3D object detection has been a major focus across industry and academia, largely because of the difficulty of learning from point cloud data. While camera images have a fixed size and can therefore be processed directly with convolutions, point clouds are unstructured sets of points in three dimensions: there is no fixed number of features and no regular grid on which to run convolution. Researchers have developed many ways of learning from such data, but there is no clear consensus on the best method, as each has advantages and disadvantages. For this project, I focused on understanding and implementing VoxelNet, a voxel-based method for object detection from point cloud data. I used the VoxelNet architecture to detect objects in the surrounding environment and to produce 3D bounding boxes around them. I trained the models on the Waymo Open Dataset and then measured performance in the Carla simulator. The goal of training on the Waymo Open Dataset was to gain experience with the dataset and familiarity with its features, and then to evaluate the practicality of the Carla simulator by running a model trained on real-world data inside it.
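    The voxelization step referred to above can be illustrated with a minimal sketch: points are clipped to a detection range, binned into a regular 3D grid, and capped at a fixed number of points per voxel, which is the form a VoxelNet-style feature-encoding layer consumes. The grid resolution, range, and per-voxel cap below are illustrative assumptions, not the values used in this project.

```python
import numpy as np

def voxelize(points, voxel_size=(0.2, 0.2, 0.4),
             pc_range=(-40.0, -40.0, -3.0, 40.0, 40.0, 1.0),
             max_points_per_voxel=35):
    """Group an (N, 3+) point cloud into a sparse set of voxels.

    Returns a dict mapping integer voxel indices (ix, iy, iz) to an array of
    at most `max_points_per_voxel` points, which a VoxelNet-style feature
    encoder could then consume.
    """
    pts = np.asarray(points, dtype=np.float32)
    lo = np.array(pc_range[:3], dtype=np.float32)
    hi = np.array(pc_range[3:], dtype=np.float32)
    size = np.array(voxel_size, dtype=np.float32)

    # Keep only points inside the detection range.
    mask = np.all((pts[:, :3] >= lo) & (pts[:, :3] < hi), axis=1)
    pts = pts[mask]

    # Integer voxel coordinates for each remaining point.
    coords = np.floor((pts[:, :3] - lo) / size).astype(np.int32)

    voxels = {}
    for point, coord in zip(pts, coords):
        bucket = voxels.setdefault(tuple(coord), [])
        if len(bucket) < max_points_per_voxel:  # overflow points are simply dropped here
            bucket.append(point)
    return {k: np.stack(v) for k, v in voxels.items()}
```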

    PV-SSD: A Projection and Voxel-based Double Branch Single-Stage 3D Object Detector

    Full text link
    LiDAR-based 3D object detection and classification is crucial for autonomous driving. However, real-time inference from extremely sparse 3D data poses a formidable challenge. A common approach is to project point clouds onto a bird's-eye or perspective view, effectively converting them into an image-like data format; however, this excessive compression of point cloud data often leads to a loss of information. This paper proposes a 3D object detector based on voxel and projection double-branch feature extraction (PV-SSD) to address this information loss. We add a voxel feature input containing rich local semantic information, which is fully fused with the projected features during feature extraction to reduce the local information lost through projection. Good performance is achieved compared to previous work. In addition, this paper makes the following contributions: 1) a voxel feature extraction method with variable receptive fields is proposed; 2) a weighted feature-point sampling method is used to select the feature points most conducive to the detection task; 3) the MSSFA module is proposed, building on the SSFA module. To verify the effectiveness of our method, we conducted comparison experiments.
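    As a rough illustration of the projection branch described above, the following sketch rasterizes a LiDAR point cloud into a bird's-eye-view pseudo-image. The channel choices (max height, max intensity, log point density) and the 0.1 m cell size are assumptions for the example, not PV-SSD's actual configuration.

```python
import numpy as np

def bev_pseudo_image(points, resolution=0.1,
                     x_range=(0.0, 70.0), y_range=(-40.0, 40.0)):
    """Project an (N, 4) LiDAR cloud (x, y, z, intensity) onto a BEV grid.

    Produces a 3-channel pseudo-image: max height, max intensity, and
    log point density per cell.
    """
    x, y, z, intensity = points[:, 0], points[:, 1], points[:, 2], points[:, 3]
    keep = (x >= x_range[0]) & (x < x_range[1]) & (y >= y_range[0]) & (y < y_range[1])
    x, y, z, intensity = x[keep], y[keep], z[keep], intensity[keep]

    h = int((x_range[1] - x_range[0]) / resolution)
    w = int((y_range[1] - y_range[0]) / resolution)
    rows = ((x - x_range[0]) / resolution).astype(np.int32)
    cols = ((y - y_range[0]) / resolution).astype(np.int32)

    bev = np.zeros((3, h, w), dtype=np.float32)
    np.maximum.at(bev[0], (rows, cols), z)          # max height per cell (zero-initialized grid clamps heights below 0 m)
    np.maximum.at(bev[1], (rows, cols), intensity)  # max intensity per cell
    np.add.at(bev[2], (rows, cols), 1.0)            # point count per cell
    bev[2] = np.log1p(bev[2])                       # compress the density channel
    return bev
```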

    Talk2BEV: Language-enhanced Bird's-eye View Maps for Autonomous Driving

    Full text link
    Talk2BEV is a large vision-language model (LVLM) interface for bird's-eye view (BEV) maps in autonomous driving contexts. While existing perception systems for autonomous driving have largely focused on a pre-defined (closed) set of object categories and driving scenarios, Talk2BEV blends recent advances in general-purpose language and vision models with BEV-structured map representations, eliminating the need for task-specific models. This enables a single system to serve a variety of autonomous driving tasks encompassing visual and spatial reasoning, predicting the intents of traffic actors, and decision-making based on visual cues. We extensively evaluate Talk2BEV on a large number of scene understanding tasks that rely both on the ability to interpret free-form natural language queries and on grounding these queries in the visual context embedded in the language-enhanced BEV map. To enable further research in LVLMs for autonomous driving, we develop and release Talk2BEV-Bench, a benchmark encompassing 1000 human-annotated BEV scenarios, with more than 20,000 questions and ground-truth responses from the NuScenes dataset. Project page: https://llmbev.github.io/talk2bev
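    A minimal sketch of what a language-enhanced BEV representation could look like in code, assuming a hypothetical schema (object id, category, ego-frame center, free-form caption) rather than Talk2BEV's actual format: the geometry supports spatial grounding of a query, while the captions carry the open-vocabulary description an LVLM can reason over.

```python
import json
import math

# Hypothetical language-enhanced BEV map: each entry pairs BEV geometry
# (ego-frame position in metres) with a free-form caption that an LVLM
# could have produced for the corresponding image crop.
bev_objects = [
    {"id": 1, "category": "vehicle", "center": [12.4, -2.1],
     "caption": "white SUV slowing down with brake lights on"},
    {"id": 2, "category": "pedestrian", "center": [6.8, 3.5],
     "caption": "person pushing a stroller toward the crosswalk"},
]

def objects_within(bev, radius_m):
    """Ground a spatial query ('what is near the ego vehicle?') in BEV geometry."""
    return [o for o in bev if math.hypot(*o["center"]) <= radius_m]

def to_prompt_context(bev):
    """Serialize the map so it can be placed into an LVLM prompt as context."""
    return json.dumps(bev, indent=2)

if __name__ == "__main__":
    print(to_prompt_context(objects_within(bev_objects, 10.0)))
```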

    ์‹ค์‹œ๊ฐ„ ์ž์œจ์ฃผํ–‰ ์ธ์ง€ ์‹œ์Šคํ…œ์„ ์œ„ํ•œ ์‹ ๊ฒฝ ๋„คํŠธ์›Œํฌ์™€ ๊ตฐ์ง‘ํ™” ๊ธฐ๋ฐ˜ ๋ฏธํ•™์Šต ๋ฌผ์ฒด ๊ฐ์ง€๊ธฐ ํ†ตํ•ฉ

    Get PDF
    ํ•™์œ„๋…ผ๋ฌธ (๋ฐ•์‚ฌ) -- ์„œ์šธ๋Œ€ํ•™๊ต ๋Œ€ํ•™์› : ๊ณต๊ณผ๋Œ€ํ•™ ๊ธฐ๊ณ„ํ•ญ๊ณต๊ณตํ•™๋ถ€, 2020. 8. ์ด๊ฒฝ์ˆ˜.์ตœ๊ทผ ๋ช‡ ๋…„๊ฐ„, ์„ผ์„œ ๊ธฐ์ˆ ์˜ ๋ฐœ์ „๊ณผ ์ปดํ“จํ„ฐ ๊ณตํ•™ ๋ถ„์•ผ์˜ ์„ฑ๊ณผ๋“ค๋กœ ์ธํ•˜์—ฌ ์ž์œจ์ฃผํ–‰ ์—ฐ๊ตฌ๊ฐ€ ๋”์šฑ ํ™œ๋ฐœํ•ด์ง€๊ณ  ์žˆ๋‹ค. ์ž์œจ์ฃผํ–‰ ์‹œ์Šคํ…œ์— ์žˆ์–ด์„œ ์ฐจ๋Ÿ‰ ์ฃผ๋ณ€ ํ™˜๊ฒฝ์„ ์ธ์‹ํ•˜๋Š” ๊ฒƒ์€ ์•ˆ์ „ ๋ฐ ์‹ ๋ขฐ์„ฑ ์žˆ๋Š” ์ฃผํ–‰์„ ํ•˜๊ธฐ ์œ„ํ•ด ํ•„์š”ํ•œ ๊ฐ€์žฅ ์ค‘์š”ํ•œ ๊ธฐ๋Šฅ์ด๋‹ค. ์ž์œจ์ฃผํ–‰ ์‹œ์Šคํ…œ์€ ํฌ๊ฒŒ ์ธ์ง€, ํŒ๋‹จ, ์ œ์–ด๋กœ ๊ตฌ์„ฑ๋˜์–ด ์žˆ๋Š”๋ฐ, ์ธ์ง€ ๋ชจ๋“ˆ์€ ์ž์œจ์ฃผํ–‰ ์ฐจ๋Ÿ‰์ด ๊ฒฝ๋กœ๋ฅผ ์„ค์ •ํ•˜๊ณ  ํŒ๋‹จ, ์ œ์–ด๋ฅผ ํ•จ์— ์•ž์„œ ์ฃผ๋ณ€ ๋ฌผ์ฒด์˜ ์œ„์น˜์™€ ์›€์ง์ž„์„ ํŒŒ์•…ํ•ด์•ผํ•˜๊ธฐ ๋•Œ๋ฌธ์— ์ค‘์š”ํ•œ ์ •๋ณด๋ฅผ ์ œ๊ณตํ•œ๋‹ค. ์ž์œจ์ฃผํ–‰ ์ธ์ง€ ๋ชจ๋“ˆ์€ ์ฃผํ–‰ ํ™˜๊ฒฝ์„ ํŒŒ์•…ํ•˜๊ธฐ ์œ„ํ•ด ๋‹ค์–‘ํ•œ ์„ผ์„œ๊ฐ€ ์‚ฌ์šฉ๋œ๋‹ค. ๊ทธ ์ค‘์—์„œ๋„ LiDAR์€ ํ˜„์žฌ ๋งŽ์€ ์ž์œจ์ฃผํ–‰ ์—ฐ๊ตฌ์—์„œ ๊ฐ€์žฅ ๋„๋ฆฌ ์‚ฌ์šฉ๋˜๋Š” ์„ผ์„œ ์ค‘ ํ•˜๋‚˜๋กœ, ๋ฌผ์ฒด์˜ ๊ฑฐ๋ฆฌ ์ •๋ณด ํš๋“์— ์žˆ์–ด์„œ ๋งค์šฐ ์œ ์šฉํ•˜๋‹ค. ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” LiDAR์—์„œ ์ƒ์„ฑ๋˜๋Š” ํฌ์ธํŠธ ํด๋ผ์šฐ๋“œ raw ๋ฐ์ดํ„ฐ๋ฅผ ํ™œ์šฉํ•˜์—ฌ ์žฅ์• ๋ฌผ์˜ 3D ์ •๋ณด๋ฅผ ํŒŒ์•…ํ•˜๊ณ  ์ด๋“ค์„ ์ถ”์ ํ•˜๋Š” ์ธ์ง€ ๋ชจ๋“ˆ์„ ์ œ์•ˆํ•œ๋‹ค. ์ธ์ง€ ๋ชจ๋“ˆ์˜ ์ „์ฒด ํ”„๋ ˆ์ž„์›Œํฌ๋Š” ํฌ๊ฒŒ ์„ธ ๋‹จ๊ณ„๋กœ ๊ตฌ์„ฑ๋œ๋‹ค. 1๋‹จ๊ณ„๋Š” ๋น„์ง€๋ฉด ํฌ์ธํŠธ ์ถ”์ •์„ ์œ„ํ•œ ๋งˆ์Šคํฌ ์ƒ์„ฑ, 2๋‹จ๊ณ„๋Š” ํŠน์ง• ์ถ”์ถœ ๋ฐ ์žฅ์• ๋ฌผ ๊ฐ์ง€, 3๋‹จ๊ณ„๋Š” ์žฅ์• ๋ฌผ ์ถ”์ ์œผ๋กœ ๊ตฌ์„ฑ๋œ๋‹ค. ํ˜„์žฌ ๋Œ€๋ถ€๋ถ„์˜ ์‹ ๊ฒฝ๋ง ๊ธฐ๋ฐ˜์˜ ๋ฌผ์ฒด ํƒ์ง€๊ธฐ๋Š” ์ง€๋„ํ•™์Šต์„ ํ†ตํ•ด ํ•™์Šต๋œ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ ์ง€๋„ํ•™์Šต ๊ธฐ๋ฐ˜ ์žฅ์• ๋ฌผ ํƒ์ง€๊ธฐ๋Š” ํ•™์Šตํ•œ ์žฅ์• ๋ฌผ์„ ์ฐพ๋Š”๋‹ค๋Š” ๋ฐฉ๋ฒ•๋ก ์  ํ•œ๊ณ„๋ฅผ ์ง€๋‹ˆ๊ณ  ์žˆ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ ์‹ค์ œ ์ฃผํ–‰์ƒํ™ฉ์—์„œ๋Š” ๋ฏธ์ฒ˜ ํ•™์Šตํ•˜์ง€ ๋ชปํ•œ ๋ฌผ์ฒด๋ฅผ ๋งˆ์ฃผํ•˜๊ฑฐ๋‚˜ ์‹ฌ์ง€์–ด ํ•™์Šตํ•œ ๋ฌผ์ฒด๋„ ๋†“์น  ์ˆ˜ ์žˆ๋‹ค. ์ธ์ง€ ๋ชจ๋“ˆ์˜ 1๋‹จ๊ณ„์—์„œ ์ด๋Ÿฌํ•œ ์ง€๋„ํ•™์Šต์˜ ๋ฐฉ๋ฒ•๋ก ์  ํ•œ๊ณ„์— ๋Œ€์ฒ˜ํ•˜๊ธฐ ์œ„ํ•ด ํฌ์ธํŠธ ํด๋ผ์šฐ๋“œ๋ฅผ ์ผ์ •ํ•œ ๊ฐ„๊ฒฉ์œผ๋กœ ๊ตฌ์„ฑ๋œ 3D ๋ณต์…€(voxel)๋กœ ๋ถ„ํ• ํ•˜๊ณ , ์ด๋กœ๋ถ€ํ„ฐ ๋น„์ ‘์ง€์ ๋“ค์„ ์ถ”์ถœํ•œ ๋’ค ๋ฏธ์ง€์˜ ๋ฌผ์ฒด(Unknown object)๋ฅผ ํƒ์ง€ํ•œ๋‹ค. 2๋‹จ๊ณ„์—์„œ๋Š” ๊ฐ ๋ณต์…€์˜ ํŠน์„ฑ์„ ์ถ”์ถœ ๋ฐ ํ•™์Šตํ•˜๊ณ  ๋„คํŠธ์›Œํฌ๋ฅผ ํ•™์Šต์‹œํ‚ด์œผ๋กœ์จ ๊ฐ์ฒด ๊ฐ์ง€๊ธฐ๋ฅผ ๊ตฌ์„ฑํ•œ๋‹ค. ๋งˆ์ง€๋ง‰ 3๋‹จ๊ณ„์—์„œ๋Š” ์นผ๋งŒ ํ•„ํ„ฐ์™€ ํ—๊ฐ€๋ฆฌ์•ˆ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ํ™œ์šฉํ•œ ๋‹ค์ค‘ ๊ฐ์ฒด ํƒ์ง€๊ธฐ๋ฅผ ์ œ์•ˆํ•œ๋‹ค. ์ด๋ ‡๊ฒŒ ๊ตฌ์„ฑ๋œ ์ธ์ง€ ๋ชจ๋“ˆ์€ ๋น„์ง€๋ฉด ์ ๋“ค์„ ์ถ”์ถœํ•˜์—ฌ ํ•™์Šตํ•˜์ง€ ์•Š์€ ๋ฌผ์ฒด์— ๋Œ€ํ•ด์„œ๋„ ๋ฏธ์ง€์˜ ๋ฌผ์ฒด(Unknown object)๋กœ ๊ฐ์ง€ํ•˜์—ฌ ์‹ค์‹œ๊ฐ„์œผ๋กœ ์žฅ์• ๋ฌผ ํƒ์ง€๊ธฐ๋ฅผ ๋ณด์™„ํ•œ๋‹ค. ์ตœ๊ทผ ๋ผ์ด๋‹ค๋ฅผ ํ™œ์šฉํ•œ ์ž์œจ์ฃผํ–‰ ์šฉ ๊ฐ์ฒด ํƒ์ง€๊ธฐ์— ๋Œ€ํ•œ ์—ฐ๊ตฌ๊ฐ€ ํ™œ๋ฐœํžˆ ์ง„ํ–‰๋˜๊ณ  ์žˆ์œผ๋‚˜ ๋Œ€๋ถ€๋ถ„์˜ ์—ฐ๊ตฌ๋“ค์€ ๋‹จ์ผ ํ”„๋ ˆ์ž„์˜ ๋ฌผ์ฒด ์ธ์‹์— ๋Œ€ํ•ด ์ง‘์ค‘ํ•˜์—ฌ ์ •ํ™•๋„๋ฅผ ์˜ฌ๋ฆฌ๋Š” ๋ฐ ์ง‘์ค‘ํ•˜๊ณ  ์žˆ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ ์ด๋Ÿฌํ•œ ์—ฐ๊ตฌ๋Š” ๊ฐ์ง€ ์ค‘์š”๋„์™€ ํ”„๋ ˆ์ž„ ๊ฐ„์˜ ๊ฐ์ง€ ์—ฐ์†์„ฑ ๋“ฑ์— ๋Œ€ํ•œ ๊ณ ๋ ค๊ฐ€ ๋˜์–ด์žˆ์ง€ ์•Š๋‹ค๋Š” ํ•œ๊ณ„์ ์ด ์กด์žฌํ•œ๋‹ค. ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” ์‹ค์‹œ๊ฐ„ ์„ฑ๋Šฅ์„ ์–ป๊ธฐ ์œ„ํ•ด ์ด๋Ÿฌํ•œ ๋ถ€๋ถ„์„ ๊ณ ๋ คํ•œ ์„ฑ๋Šฅ ์ง€์ˆ˜๋ฅผ ์ œ์•ˆํ•˜๊ณ , ์‹ค์ฐจ ์‹คํ—˜์„ ํ†ตํ•ด ์ œ์•ˆํ•œ ์ธ์ง€ ๋ชจ๋“ˆ์„ ํ…Œ์ŠคํŠธ, ์ œ์•ˆํ•œ ์„ฑ๋Šฅ ์ง€์ˆ˜๋ฅผ ํ†ตํ•ด ํ‰๊ฐ€ํ•˜์˜€๋‹ค.In recent few years, the interest in automotive researches on autonomous driving system has been grown up due to advances in sensing technologies and computer science. In the development of autonomous driving system, knowledge about the subject vehicles surroundings is the most essential function for safe and reliable driving. 
When it comes to making decisions and planning driving scenarios, to know the location and movements of surrounding objects and to distinguish whether an object is a car or pedestrian give valuable information to the autonomous driving system. In the autonomous driving system, various sensors are used to understand the surrounding environment. Since LiDAR gives the distance information of surround objects, it has been the one of the most commonly used sensors in the development of perception system. Despite achievement of the deep neural network research field, its application and research trends on 3D object detection using LiDAR point cloud tend to pursue higher accuracy without considering a practical application. A deep neural-network-based perception module heavily depends on the training dataset, but it is impossible to cover all the possibilities and corner cases. To apply the perception module in actual driving, it needs to detect unknown objects and unlearned objects, which may face on the road. To cope with these problems, in this dissertation, a perception module using LiDAR point cloud is proposed, and its performance is validated via real vehicle test. The whole framework is composed of three stages : stage-1 for the ground estimation playing as a mask for point filtering which are considered as non-ground and stage-2 for feature extraction and object detection, and stage-3 for object tracking. In the first stage, to cope with the methodological limit of supervised learning that only finds learned object, we divide a point cloud into equally spaced 3D voxels the point cloud and extract non-ground points and cluster the points to detect unknown objects. In the second stage, the voxelization is utilized to learn the characteristics of point clouds organized in vertical columns. The trained network can distinguish the object through the extracted features from point clouds. In non-maximum suppression process, we sort the predictions according to IoU between prediction and polygon to select a prediction close to the actual heading angle of the object. The last stage presents a 3D multiple object tracking solution. Through Kalman filter, the learned and unlearned objects next movement is predicted and this prediction updated by measurement detection. Through this process, the proposed object detector complements the detector based on supervised learning by detecting the unlearned object as an unknown object through non-ground point extraction. Recent researches on object detection for autonomous driving have been actively conducted, but recent works tend to focus more on the recognition of the objects at every single frame and developing accurate system. To obtain a real-time performance, this paper focuses on more practical aspects by propose a performance index considering detection priority and detection continuity. The performance of the proposed algorithm has been investigated via real-time vehicle test.Chapter 1 Introduction 1 1.1. Background and Motivation 1 1.2. Overview and Previous Researches 4 1.3. Thesis Objectives 12 1.4. Thesis Outline 14 Chapter 2 Overview of a Perception in Automated Driving 15 Chapter 3 Object Detector 18 3.1. Voxelization & Feature Extraction 22 3.2. Backbone Network 25 3.3. Detection Head & Loss Function Design 28 3.4. Loss Function Design 30 3.5. Data Augmentation 33 3.6. Post Process 39 Chapter 4 Non-Ground Point Clustering 42 4.1. Previous Researches for Ground Removal 44 4.2. Non-Ground Estimation using Voxelization 45 4.3. 
Non-ground Object Segmentation 50 4.3.1. Object Clustering 52 4.3.2. Bounding Polygon 55 Chapter 5 . Object Tracking 57 5.1. State Prediction and Update 58 5.2. Data Matching Association 60 Chapter 6 Test result for KITTI dataset 62 6.1. Quantitative Analysis 62 6.2. Qualitative Analysis 72 6.3. Additional Training 76 6.3.1. Additional data acquisition 78 6.3.2. Qualitative Analysis 81 Chapter 7 Performance Evaluation 85 7.1. Current Evaluation Metrics 85 7.2. Limitations of Evaluation Metrics 87 7.2.1. Detection Continuity 87 7.2.2. Detection Priority 89 7.3. Criteria for Performance Index 91 Chapter 8 Vehicle Tests based Performance Evaluation 95 8.1. Configuration of Vehicle Tests 95 8.2. Qualitative Analysis 100 8.3. Quantitative Analysis 105 Chapter 9 Conclusions and Future Works 107 Bibliography 109 ๊ตญ๋ฌธ ์ดˆ๋ก 114Docto
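    The tracking stage described above relies on a Kalman filter for prediction and the Hungarian algorithm for data association. The sketch below shows only the association step, matching predicted track boxes to detections with an IoU cost solved by scipy's linear_sum_assignment; the axis-aligned BEV box format and the 0.3 gating threshold are assumptions for illustration, not the dissertation's exact setup.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou_2d(a, b):
    """IoU of two axis-aligned BEV boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def associate(predicted_boxes, detected_boxes, min_iou=0.3):
    """Match Kalman-predicted track boxes to current detections (Hungarian).

    Returns (matches, unmatched_tracks, unmatched_detections); unmatched
    detections would seed new tracks, e.g. for previously unseen objects.
    """
    if len(predicted_boxes) == 0 or len(detected_boxes) == 0:
        return [], list(range(len(predicted_boxes))), list(range(len(detected_boxes)))

    # Cost = 1 - IoU, so the optimal assignment maximizes total overlap.
    cost = np.array([[1.0 - iou_2d(p, d) for d in detected_boxes]
                     for p in predicted_boxes])
    rows, cols = linear_sum_assignment(cost)

    matches = [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= 1.0 - min_iou]
    matched_r = {r for r, _ in matches}
    matched_c = {c for _, c in matches}
    unmatched_tracks = [i for i in range(len(predicted_boxes)) if i not in matched_r]
    unmatched_dets = [j for j in range(len(detected_boxes)) if j not in matched_c]
    return matches, unmatched_tracks, unmatched_dets
```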

    On the Road with GPT-4V(ision): Early Explorations of Visual-Language Model on Autonomous Driving

    Full text link
    The pursuit of autonomous driving technology hinges on the sophisticated integration of perception, decision-making, and control systems. Traditional approaches, both data-driven and rule-based, have been hindered by their inability to grasp the nuances of complex driving environments and the intentions of other road users. This has been a significant bottleneck, particularly in the development of the common-sense reasoning and nuanced scene understanding necessary for safe and reliable autonomous driving. The advent of Visual Language Models (VLM) represents a novel frontier in realizing fully autonomous vehicle driving. This report provides an exhaustive evaluation of the latest state-of-the-art VLM, GPT-4V(ision), and its application in autonomous driving scenarios. We explore the model's abilities to understand and reason about driving scenes, make decisions, and ultimately act in the capacity of a driver. Our comprehensive tests span from basic scene recognition to complex causal reasoning and real-time decision-making under varying conditions. Our findings reveal that GPT-4V demonstrates superior performance in scene understanding and causal reasoning compared to existing autonomous systems. It shows the potential to handle out-of-distribution scenarios, recognize intentions, and make informed decisions in real driving contexts. However, challenges remain, particularly in direction discernment, traffic light recognition, vision grounding, and spatial reasoning tasks. These limitations underscore the need for further research and development. The project is available on GitHub for interested parties to access and use: https://github.com/PJLab-ADG/GPT4V-AD-Exploration
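    For readers who want to probe a vision-language model on driving scenes in a similar spirit, a query can be issued along the following lines through the OpenAI Python client. The model name, prompt, and image URL are placeholders, and this is not the report's actual evaluation harness.

```python
from openai import OpenAI  # assumes the official openai Python package (>= 1.x)

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def describe_driving_scene(image_url: str, question: str, model: str = "gpt-4o") -> str:
    """Ask a vision-language model a free-form question about a driving scene."""
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    )
    return response.choices[0].message.content

# Example (hypothetical URL):
# print(describe_driving_scene("https://example.com/front_camera.jpg",
#                              "Is it safe to change into the left lane? Explain."))
```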
    • โ€ฆ
    corecore