Improving Object Detection under Challenging Conditions of Size, Occlusion, and Labels

Thesis (Ph.D.)--Seoul National University Graduate School: College of Engineering, Department of Computer Science and Engineering, February 2020. Advisor: Gunhee Kim.

Object detection is one of the most essential and fundamental fields in computer vision. It is the foundation not only of high-level vision tasks such as instance segmentation, object tracking, image captioning, scene understanding, and action recognition, but also of real-world applications such as video surveillance, self-driving cars, robot vision, and augmented reality. Due to its important role in computer vision, object detection has been studied for decades and has developed drastically with the emergence of deep neural networks. Despite this rapid advancement, however, the performance of many detection models remains limited under certain conditions. In this thesis, we examine three challenging conditions that hinder the robust application of object detection models and propose novel approaches to resolve the problems they cause.
We first investigate how to improve the detection of occluded objects and hard negatives in the domain of pedestrian detection. Occluded pedestrians are often recognized as background, whereas hard negative examples such as vertical objects are mistaken for pedestrians, which significantly degrades detection performance. Since pedestrian detection often requires real-time processing, we propose a method that alleviates both problems by improving a single-stage detection model, which has an advantage in terms of speed. More specifically, we introduce an additional post-processing module that refines the initial predictions based on reliable classification results over a person's body parts and over a grid of image regions.
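As a rough illustration of score-level post-refinement, the toy sketch below fuses a detector's initial confidence with the mean confidences of auxiliary part and grid classifiers. The blending rule, the `alpha` weight, and the function name are illustrative assumptions, not the thesis's actual formulation:

```python
import numpy as np

def refine_score(initial_score, part_scores, grid_scores, alpha=0.5):
    """Hypothetical post-refinement: blend the detector's initial
    confidence with the mean confidence of auxiliary body-part and
    image-grid classifiers (all scores assumed to lie in [0, 1])."""
    # Average the auxiliary classification results.
    aux = 0.5 * (np.mean(part_scores) + np.mean(grid_scores))
    # Blend with the initial prediction; alpha trades off how much
    # the auxiliary classifiers are trusted over the base detector.
    return (1 - alpha) * initial_score + alpha * aux
```

With such a rule, a box whose visible parts are confidently classified as a person keeps (or gains) confidence, while a hard negative whose parts score low is suppressed.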
We then study how to better detect small objects of general object classes. Although two-stage object detection models significantly outperform single-stage models in terms of accuracy, even their accuracy on small objects falls far below human-level performance, mainly due to the lack of information in the features of a small region of interest. In this thesis, we propose a feature-level super-resolution method built on two-stage object detection models to improve small-object detection. More specifically, by properly pairing the input and target features for super-resolution, we stabilize the training process and, as a result, significantly improve detection performance on small objects.
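The input/target pairing idea can be illustrated with a toy backbone. Here a simple strided average pooling stands in for a CNN feature extractor; the *input* feature comes from the original image, while the *target* feature comes from a 2x-upsampled version of the same region, so it covers the same content at twice the spatial resolution. The function names and the naive nearest-neighbor upsampling are illustrative assumptions; the actual method trains a generator/discriminator on CNN features:

```python
import numpy as np

def pooled_feature(image, box, stride=4):
    """Toy stand-in for a CNN backbone: average-pool the 2-D image with
    the given stride, then crop the cells covered by box=(x0, y0, x1, y1)."""
    h, w = image.shape
    fmap = (image[:h - h % stride, :w - w % stride]
            .reshape(h // stride, stride, w // stride, stride)
            .mean(axis=(1, 3)))
    x0, y0, x1, y1 = [v // stride for v in box]
    return fmap[y0:y1, x0:x1]

def make_sr_pair(image, box):
    """Hypothetical pairing for feature-level super-resolution training."""
    lr = pooled_feature(image, box)                     # low-res input feature
    up = np.kron(image, np.ones((2, 2)))                # naive 2x upsampling
    hr = pooled_feature(up, tuple(2 * v for v in box))  # high-res target feature
    return lr, hr
```

Because both features are extracted with the same stride, the target has exactly twice the resolution of the input over the identical image region, which is the kind of consistent input/target pair the training needs.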
Lastly, we address object detection under the setting of weak supervision. In particular, weakly supervised object localization (WSOL) assumes there is only one object per image and provides only class labels for training. In the absence of bounding box annotations, the dominant approach to WSOL has been class activation mapping (CAM): a model trained for classification produces class activation maps that are then used to estimate object locations. However, since a classification model is trained to concentrate on the most discriminative features, the localization results are often limited to a small part of the object region. To resolve this problem, we propose methods that properly utilize the information contained in class activation maps.
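The baseline that Chapter 5 rectifies (via thresholded average pooling, negative weight clamping, and percentile thresholding) can be sketched as follows: a CAM is the classifier-weighted sum of the last feature maps, and a box is drawn around the region exceeding a fraction of the map's maximum. The 0.2 threshold ratio is an assumed value for illustration:

```python
import numpy as np

def cam(features, weights):
    """Class activation map: weighted sum of the K feature maps.
    features: (K, H, W), weights: (K,) -> returns (H, W)."""
    return np.tensordot(weights, features, axes=1)

def localize(cam_map, thresh_ratio=0.2):
    """Baseline localization: bounding box (x0, y0, x1, y1) of the region
    where the CAM exceeds a fraction of its maximum activation."""
    mask = cam_map >= thresh_ratio * cam_map.max()
    ys, xs = np.nonzero(mask)
    return int(xs.min()), int(ys.min()), int(xs.max()) + 1, int(ys.max()) + 1
```

Because classification concentrates the activations (and thus the max) on a few discriminative cells, this max-relative threshold tends to keep only a small region, which is exactly the underestimation problem described above.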
Our proposed methods significantly improved the performance of base models on the benchmark datasets for each problem and achieved state-of-the-art performance in some settings. Since the methods are flexible enough to be applied to various models, we also expect that applying them to more recent detection models will bring additional performance improvements.

1. Introduction
1.1. Contributions
1.2. Thesis Organization
2. Related Work
2.1. General Methods
2.2. Methods for Occluded Objects and Hard Negatives
2.3. Methods for Small Objects
2.4. Methods for Weakly Labeled Objects
3. Part and Grid Classification Based Post-Refinement for Occluded Objects and Hard Negatives
3.1. Overview
3.2. Our Approach
3.2.1. A Unified View of Output Tensors
3.2.2. Refinement for Occlusion Handling
3.2.3. Refinement for Hard Negative Handling
3.3. Experiment Settings
3.3.1. Datasets
3.3.2. Configuration Details
3.4. Experiment Results
3.4.1. Quantitative Results
3.4.2. Ablation Experiments
3.4.3. Memory and Computation Time Analysis
3.4.4. Qualitative Results
3.5. Conclusion
4. Self-Supervised Feature Super-Resolution for Small Objects
4.1. Overview
4.2. Mismatch of Relative Receptive Fields
4.3. Our Approach
4.3.1. Super-Resolution Target Extractor
4.3.2. Super-Resolution Feature Generator and Discriminator
4.3.3. Training
4.3.4. Inference
4.4. Experiment Settings
4.4.1. Datasets
4.4.2. Configuration Details
4.5. Experiment Results
4.5.1. Quantitative Results
4.5.2. Ablation Experiments
4.5.3. Qualitative Results
4.6. Conclusion
5. Rectified Class Activation Mapping for Weakly Labeled Objects
5.1. Overview
5.2. Our Approach
5.2.1. GAP-based CAM Localization
5.2.2. Thresholded Average Pooling
5.2.3. Negative Weight Clamping
5.2.4. Percentile as a Thresholding Standard
5.3. Experiment Settings
5.3.1. Datasets
5.3.2. Configuration Details
5.4. Experiment Results
5.4.1. Quantitative Results
5.4.2. Ablation Experiments
5.4.3. Qualitative Results
5.5. Conclusion
6. Conclusion
6.1. Summary
6.2. Future Works