Improving LiDAR-Based 3D Object Detection Using Part-Aware Data Augmentation and Mixture Density Networks
Thesis (Ph.D.) -- Seoul National University Graduate School: Graduate School of Convergence Science and Technology, Department of Transdisciplinary Studies, 2023. 2. Nojun Kwak.
LiDAR (Light Detection And Ranging), widely used as a sensor for autonomous vehicles and robots, emits laser pulses and measures their return time to sense the surrounding environment as a point cloud. The most important part of sensing the environment is recognizing which objects are nearby and where they are located, and 3D object detection methods that operate on point clouds have been actively studied for this task.
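The ranging principle mentioned above can be illustrated with a small sketch (not taken from the dissertation). A pulse that returns after a round-trip time t has travelled to the target and back, so the range is c·t/2. The function name `tof_to_point` and the azimuth/elevation conversion are assumptions made for illustration; real sensors also apply per-beam calibration.

```python
import math

SPEED_OF_LIGHT_M_S = 299_792_458.0  # speed of light in vacuum, metres per second


def tof_to_point(round_trip_s, azimuth_rad, elevation_rad):
    """Convert one laser return into an (x, y, z) point in the sensor frame.

    Hypothetical helper for illustration only.
    """
    # The pulse travels out and back, so halve the total path length.
    rng = SPEED_OF_LIGHT_M_S * round_trip_s / 2.0
    horizontal = rng * math.cos(elevation_rad)
    return (
        horizontal * math.cos(azimuth_rad),
        horizontal * math.sin(azimuth_rad),
        rng * math.sin(elevation_rad),
    )
```

Collecting such points over all beams and rotation angles gives one point cloud frame.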
Various backbone networks for point cloud-based 3D object detection have been proposed, differing according to how the point cloud data is preprocessed. Although advanced backbone networks have greatly improved detection performance, their structures differ so much that they are largely incompatible with one another. The question this dissertation addresses is: how can the performance of 3D object detectors be improved regardless of their diverse backbone network structures? To that end, it proposes two general methods for improving point cloud-based 3D object detectors.
First, we propose a part-aware data augmentation (PA-AUG) method that maximizes the use of the structural information in 3D bounding boxes. Because 3D bounding box labels fit object boundaries tightly and include an orientation value, they carry structural information about the object inside the box. To fully exploit this intra-object structure, we propose a novel part-aware partitioning method that divides each 3D bounding box into characteristic sub-parts. PA-AUG then applies newly proposed data augmentation operations at the partition level. It makes various types of 3D object detectors more robust and yields an effect equivalent to increasing the training data by about 2.5×.
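A minimal sketch of partition-level augmentation follows. The abstract does not specify how boxes are partitioned or which operations are applied, so the 2x2x2 octant split, the dropout operation, and the names `partition_index` and `drop_partition` are all assumptions for illustration, not the dissertation's scheme. The sketch only shows why an oriented box label is enough to assign each object point to a sub-part.

```python
import math
import random


def partition_index(point, box):
    """Return the octant (0..7) of `point` in the box frame, or None if outside.

    box = (cx, cy, cz, length, width, height, yaw). The 2x2x2 split is an
    assumption made for this sketch.
    """
    cx, cy, cz, length, width, height, yaw = box
    dx, dy, dz = point[0] - cx, point[1] - cy, point[2] - cz
    # Rotate by -yaw so the box axes align with the coordinate axes.
    c, s = math.cos(yaw), math.sin(yaw)
    lx = c * dx + s * dy
    ly = -s * dx + c * dy
    if abs(lx) > length / 2 or abs(ly) > width / 2 or abs(dz) > height / 2:
        return None
    return (lx >= 0) * 4 + (ly >= 0) * 2 + (dz >= 0)


def drop_partition(points, box, rng):
    """Partition-level dropout: remove the object's points in one random octant.

    Points outside the box are always kept.
    """
    victim = rng.randrange(8)
    return [p for p in points if partition_index(p, box) != victim]
```

For example, `drop_partition(points, box, random.Random(0))` returns the scene with at most one octant of the object emptied, imitating partial occlusion.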
Second, we propose a mixture-density-based 3D object detection method (MD3D). MD3D predicts the distribution of 3D bounding boxes using a Gaussian mixture model (GMM), reformulating conventional box regression as a density estimation problem. Thus, unlike conventional target assignment methods, it can be applied to any 3D object detector regardless of the point cloud preprocessing method. In addition, because it requires significantly fewer hyper-parameters than existing methods, its detection performance is easier to optimize. MD3D also increases detection speed thanks to its simple structure.
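To make the density-estimation view concrete, the sketch below computes the negative log-likelihood of one ground-truth box under a Gaussian mixture with diagonal covariance. It is a generic mixture-density loss under stated assumptions, not the loss defined in the dissertation: the box parameterization, the diagonal covariance, and the function name `gmm_box_nll` are illustrative choices, and treating yaw as a plain Gaussian variable ignores its periodicity.

```python
import math


def gmm_box_nll(box, weights, means, sigmas):
    """Negative log-likelihood of one box under a diagonal Gaussian mixture.

    box: sequence of D box parameters (for example x, y, z, l, w, h, yaw).
    weights: K positive mixture weights that sum to 1.
    means, sigmas: K sequences of length D (sigmas must be positive).
    """
    log_terms = []
    for pi_k, mu_k, sigma_k in zip(weights, means, sigmas):
        log_pdf = 0.0
        for b, mu, sigma in zip(box, mu_k, sigma_k):
            # Log of a 1-D Gaussian density, summed over independent dimensions.
            log_pdf += (
                -0.5 * ((b - mu) / sigma) ** 2
                - math.log(sigma)
                - 0.5 * math.log(2.0 * math.pi)
            )
        log_terms.append(math.log(pi_k) + log_pdf)
    # Log-sum-exp over components for numerical stability.
    m = max(log_terms)
    return -(m + math.log(sum(math.exp(t - m) for t in log_terms)))
```

Minimizing this quantity over all ground-truth boxes pushes mixture components toward the labels without assigning individual anchors or points to targets by hand, which is the property the abstract contrasts with conventional target assignment.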
Both PA-AUG and MD3D can be applied to any 3D object detector, and both show a marked increase in detection performance. The two proposed methods cover different stages of the object detection pipeline, so they can be used simultaneously, and the experimental results show a synergistic effect when they are applied together.
1 Introduction
1.1 Problem Definition
1.2 Challenges
1.3 Contributions
1.3.1 Part-Aware Data Augmentation (PA-AUG)
1.3.2 Mixture-Density-based 3D Object Detection (MD3D)
1.3.3 Combination of PA-AUG and MD3D
1.4 Outline
2 Related Works
2.1 Data augmentation for Object Detection
2.1.1 2D Data augmentation
2.1.2 3D Data augmentation
2.2 LiDAR-based 3D Object Detection
2.3 Mixture Density Networks in Computer Vision
2.4 Datasets
2.4.1 KITTI Dataset
2.4.2 Waymo Open Dataset
2.5 Evaluation metric
2.5.1 Average Precision (AP)
2.5.2 Average Orientation Similarity (AOS)
2.5.3 Average Precision weighted by Heading (APH)
3 Part-Aware Data Augmentation (PA-AUG)
3.1 Introduction
3.2 Methods
3.2.1 Part-Aware Partitioning
3.2.2 Part-Aware Data Augmentation
3.3 Experiments
3.3.1 Results on the KITTI Dataset
3.3.2 Robustness Test
3.3.3 Data Efficiency Test
3.3.4 Ablation Study
3.4 Discussion
3.5 Conclusion
4 Mixture-Density-based 3D Object Detection (MD3D)
4.1 Introduction
4.2 Methods
4.2.1 Modeling Point-cloud-based 3D Object Detection with Mixture Density Network
4.2.2 Network Architecture
4.2.3 Loss function
4.3 Experiments
4.3.1 Datasets
4.3.2 Experiment Settings
4.3.3 Results on the KITTI Dataset
4.3.4 Latency of Each Module
4.3.5 Results on the Waymo Open Dataset
4.3.6 Analyzing Recall by object size
4.3.7 Ablation Study
4.3.8 Discussion
4.4 Conclusion
5 Combination of PA-AUG and MD3D
5.1 Methods
5.2 Experiments
5.2.1 Settings
5.2.2 Results on the KITTI Dataset
5.3 Discussion
6 Conclusion
6.1 Summary
6.2 Limitations and Future works
6.2.1 Hyper-parameter-free PA-AUG
6.2.2 Redefinition of Part-aware Partitioning
6.2.3 Application to other tasks
Abstract (In Korean)
Acknowledgements
Multi-View 3D Object Detection Network for Autonomous Driving
This paper aims at high-accuracy 3D object detection in autonomous driving
scenarios. We propose Multi-View 3D networks (MV3D), a sensory-fusion framework
that takes both LIDAR point cloud and RGB images as input and predicts oriented
3D bounding boxes. We encode the sparse 3D point cloud with a compact
multi-view representation. The network is composed of two subnetworks: one for
3D object proposal generation and another for multi-view feature fusion. The
proposal network generates 3D candidate boxes efficiently from the bird's eye
view representation of 3D point cloud. We design a deep fusion scheme to
combine region-wise features from multiple views and enable interactions
between intermediate layers of different paths. Experiments on the challenging
KITTI benchmark show that our approach outperforms the state-of-the-art by
around 25% and 30% AP on the tasks of 3D localization and 3D detection. In
addition, for 2D detection, our approach obtains 10.3% higher AP than the
state-of-the-art on the hard data among the LIDAR-based methods.
Comment: To appear in IEEE Conference on Computer Vision and Pattern
Recognition (CVPR) 201
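As a rough illustration of a bird's eye view encoding, the sketch below projects a point cloud onto a top-down grid and keeps the maximum height per cell. This is a simplified stand-in, not the MV3D encoding itself: the abstract does not describe which channels the bird's eye view contains, and the grid extents, cell size, and the name `bev_height_map` are assumptions.

```python
def bev_height_map(points, x_range, y_range, cell):
    """Project (x, y, z) points onto a top-down grid, keeping the max z per cell.

    Empty cells hold None. Grid extents and cell size are illustrative values.
    """
    cols = int(round((x_range[1] - x_range[0]) / cell))
    rows = int(round((y_range[1] - y_range[0]) / cell))
    grid = [[None] * cols for _ in range(rows)]
    for x, y, z in points:
        c = int((x - x_range[0]) // cell)
        r = int((y - y_range[0]) // cell)
        # Skip points that fall outside the chosen region of interest.
        if 0 <= r < rows and 0 <= c < cols:
            if grid[r][c] is None or z > grid[r][c]:
                grid[r][c] = z
    return grid
```

A dense grid like this can be processed by ordinary 2D convolutions, which is one reason top-down projections are convenient for generating box proposals.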
DecideNet: Counting Varying Density Crowds Through Attention Guided Detection and Density Estimation
In real-world crowd counting applications, the crowd densities vary greatly
in spatial and temporal domains. A detection-based counting method estimates
crowds accurately in low-density scenes, but its reliability degrades in
congested areas. A regression-based approach, on the other hand, captures the
general density information in crowded regions. Without knowing the location
of each person, however, it tends to overestimate the count in low-density
areas. Thus, using either one exclusively is not sufficient to handle
all kinds of scenes with varying densities. To address this issue, a novel
end-to-end crowd counting framework, named DecideNet (DEteCtIon and Density
Estimation Network) is proposed. It can adaptively decide the appropriate
counting mode for different locations on the image based on their real density
conditions. DecideNet starts by estimating the crowd density, generating
detection- and regression-based density maps separately. To capture inevitable
variation in densities, it incorporates an attention module, meant to
adaptively assess the reliability of the two types of estimations. The final
crowd counts are obtained with the guidance of the attention module to adopt
suitable estimations from the two kinds of density maps. Experimental results
show that our method achieves state-of-the-art performance on three challenging
crowd counting datasets.
Comment: CVPR 201
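The attention-guided combination described above can be sketched as a per-pixel blend of the two density maps. The abstract does not give the exact fusion rule, so the convex combination and the name `fuse_density_maps` are assumptions for illustration.

```python
def fuse_density_maps(det_map, reg_map, attention):
    """Blend two density maps pixel by pixel and return (fused_map, count).

    attention[i][j] in [0, 1] is the weight given to the detection-based map;
    the convex blend is an assumption made for this sketch.
    """
    fused = [
        [a * d + (1.0 - a) * r for d, r, a in zip(det_row, reg_row, att_row)]
        for det_row, reg_row, att_row in zip(det_map, reg_map, attention)
    ]
    # The crowd count is the integral (here, the sum) of the density map.
    count = sum(sum(row) for row in fused)
    return fused, count
```

Where the attention is close to 1 the detection-based estimate dominates (sparse regions), and where it is close to 0 the regression-based estimate dominates (congested regions).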
Event-Based Motion Segmentation by Motion Compensation
In contrast to traditional cameras, whose pixels have a common exposure time,
event-based cameras are novel bio-inspired sensors whose pixels work
independently and asynchronously output intensity changes (called "events"),
with microsecond resolution. Since events are caused by the apparent motion of
objects, event-based cameras sample visual information based on the scene
dynamics and are, therefore, a more natural fit than traditional cameras to
acquire motion, especially at high speeds, where traditional cameras suffer
from motion blur. However, distinguishing between events caused by different
moving objects and by the camera's ego-motion is a challenging task. We present
the first per-event segmentation method for splitting a scene into
independently moving objects. Our method jointly estimates the event-object
associations (i.e., segmentation) and the motion parameters of the objects (or
the background) by maximization of an objective function, which builds upon
recent results on event-based motion-compensation. We provide a thorough
evaluation of our method on a public dataset, outperforming the
state-of-the-art by as much as 10%. We also show the first quantitative
evaluation of a segmentation algorithm for event cameras, yielding around 90%
accuracy at 4 pixels relative displacement.
Comment: When viewed in Acrobat Reader, several of the figures animate. Video:
https://youtu.be/0q6ap_OSBA
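The idea behind motion compensation can be sketched in one dimension: events are warped back along a candidate motion, and the correct motion is the one that makes the accumulated image sharpest. Using image variance as the sharpness objective is an assumption here, since the abstract does not state the objective function, and `warped_image_contrast` is a hypothetical name. The per-event segmentation itself, which associates each event with one of several motion models, is not implemented.

```python
from collections import Counter


def warped_image_contrast(events, velocity, width):
    """Variance of the image of warped events for a 1-D scene.

    events: iterable of (x, t) with x in pixels and t in seconds.
    velocity: candidate motion in pixels per second.
    """
    counts = Counter()
    for x, t in events:
        # Transport each event back to its position at t = 0.
        xw = round(x - velocity * t)
        if 0 <= xw < width:
            counts[xw] += 1
    pixels = [counts.get(i, 0) for i in range(width)]
    mean = sum(pixels) / width
    return sum((p - mean) ** 2 for p in pixels) / width
```

For an edge moving at 10 pixels per second, the candidate velocity 10 stacks all events onto one pixel and scores a higher variance than the candidate 0, which leaves them smeared across the image.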
Road pollution estimation using static cameras and neural networks
This article presents a methodology for estimating road pollution by analyzing traffic video sequences. The goal is to take advantage of the large network of IP cameras already installed in the road system of any state or country to estimate pollution in each area. The proposal uses deep learning neural networks for object detection, together with a pollution estimation model based on vehicle frequency and speed. The experiments show promising results, suggesting that the system can be used on its own or combined with existing systems to measure road pollution.
Universidad de Málaga. Campus de Excelencia Internacional Andalucía Tech
- …