Improving Object Detection under Challenging Conditions of Size, Occlusion, and Labels

Thesis (Ph.D.)--Seoul National University Graduate School: College of Engineering, Department of Computer Science and Engineering, February 2020. Advisor: Gunhee Kim.

Object detection is one of the most essential and fundamental fields in computer vision. It is the foundation not only of high-level vision tasks such as instance segmentation, object tracking, image captioning, scene understanding, and action recognition, but also of real-world applications such as video surveillance, self-driving cars, robot vision, and augmented reality. Due to its important role in computer vision, object detection has been studied for decades and has developed drastically with the emergence of deep neural networks. Despite this rapid advancement, however, the performance of many detection models remains limited under certain conditions. In this thesis, we examine three challenging conditions that hinder the robust application of object detection models and propose novel approaches to resolve the problems they cause.
We first investigate how to improve the detection of occluded objects and hard negatives in the domain of pedestrian detection. Occluded pedestrians are often recognized as background, whereas hard negative examples such as vertical objects are mistaken for pedestrians, which significantly degrades detection performance. Since pedestrian detection often requires real-time processing, we propose a method that alleviates both problems by improving a single-stage detection model, which has an advantage in terms of speed. More specifically, we introduce an additional post-processing module that refines the initial predictions based on reliable classification results over a person's body parts and over a grid of image regions.
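As a rough illustration of score-level post-refinement, the toy sketch below fuses a detector's initial confidence with the mean confidences of auxiliary part and grid classifiers. The blending rule, the `alpha` weight, and the function name are illustrative assumptions, not the thesis's actual formulation:

```python
import numpy as np

def refine_score(initial_score, part_scores, grid_scores, alpha=0.5):
    """Hypothetical post-refinement: blend the detector's initial
    confidence with the mean confidence of auxiliary body-part and
    image-grid classifiers (all scores assumed to lie in [0, 1])."""
    # Average the auxiliary classification results.
    aux = 0.5 * (np.mean(part_scores) + np.mean(grid_scores))
    # Blend with the initial prediction; alpha trades off how much
    # the auxiliary classifiers are trusted over the base detector.
    return (1 - alpha) * initial_score + alpha * aux
```

With such a rule, a box whose visible parts are confidently classified as a person keeps (or gains) confidence, while a hard negative whose parts score low is suppressed.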
We then study how to better detect small objects of general object classes. Although two-stage object detection models significantly outperform single-stage models in terms of accuracy, even their accuracy on small objects falls far below human-level performance, mainly due to the lack of information in the features of a small region of interest. In this thesis, we propose a feature-level super-resolution method built on two-stage object detection models to improve small-object detection. More specifically, by properly pairing the input and target features for super-resolution, we stabilize the training process and, as a result, significantly improve detection performance on small objects.
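The input/target pairing idea can be illustrated with a toy backbone. Here a simple strided average pooling stands in for a CNN feature extractor; the *input* feature comes from the original image, while the *target* feature comes from a 2x-upsampled version of the same region, so it covers the same content at twice the spatial resolution. The function names and the naive nearest-neighbor upsampling are illustrative assumptions; the actual method trains a generator/discriminator on CNN features:

```python
import numpy as np

def pooled_feature(image, box, stride=4):
    """Toy stand-in for a CNN backbone: average-pool the 2-D image with
    the given stride, then crop the cells covered by box=(x0, y0, x1, y1)."""
    h, w = image.shape
    fmap = (image[:h - h % stride, :w - w % stride]
            .reshape(h // stride, stride, w // stride, stride)
            .mean(axis=(1, 3)))
    x0, y0, x1, y1 = [v // stride for v in box]
    return fmap[y0:y1, x0:x1]

def make_sr_pair(image, box):
    """Hypothetical pairing for feature-level super-resolution training."""
    lr = pooled_feature(image, box)                     # low-res input feature
    up = np.kron(image, np.ones((2, 2)))                # naive 2x upsampling
    hr = pooled_feature(up, tuple(2 * v for v in box))  # high-res target feature
    return lr, hr
```

Because both features are extracted with the same stride, the target has exactly twice the resolution of the input over the identical image region, which is the kind of consistent input/target pair the training needs.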
Lastly, we address object detection under the setting of weak supervision. In particular, weakly supervised object localization (WSOL) assumes there is only one object per image and provides only class labels for training. In the absence of bounding box annotations, the dominant approach to WSOL has been class activation mapping (CAM): a model trained for classification produces class activation maps that are then used to estimate object locations. However, since a classification model is trained to concentrate on the most discriminative features, the localization results are often limited to a small part of the object region. To resolve this problem, we propose methods that properly utilize the information contained in class activation maps.
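The baseline that Chapter 5 rectifies (via thresholded average pooling, negative weight clamping, and percentile thresholding) can be sketched as follows: a CAM is the classifier-weighted sum of the last feature maps, and a box is drawn around the region exceeding a fraction of the map's maximum. The 0.2 threshold ratio is an assumed value for illustration:

```python
import numpy as np

def cam(features, weights):
    """Class activation map: weighted sum of the K feature maps.
    features: (K, H, W), weights: (K,) -> returns (H, W)."""
    return np.tensordot(weights, features, axes=1)

def localize(cam_map, thresh_ratio=0.2):
    """Baseline localization: bounding box (x0, y0, x1, y1) of the region
    where the CAM exceeds a fraction of its maximum activation."""
    mask = cam_map >= thresh_ratio * cam_map.max()
    ys, xs = np.nonzero(mask)
    return int(xs.min()), int(ys.min()), int(xs.max()) + 1, int(ys.max()) + 1
```

Because classification concentrates the activations (and thus the max) on a few discriminative cells, this max-relative threshold tends to keep only a small region, which is exactly the underestimation problem described above.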
Our proposed methods significantly improved the performance of base models on the benchmark datasets for each problem and achieved state-of-the-art performance in some settings. Since the methods are flexible enough to be applied to various models, we also expect that applying them to more recent detection models will bring additional performance improvements.

1. Introduction
1.1. Contributions
1.2. Thesis Organization
2. Related Work
2.1. General Methods
2.2. Methods for Occluded Objects and Hard Negatives
2.3. Methods for Small Objects
2.4. Methods for Weakly Labeled Objects
3. Part and Grid Classification Based Post-Refinement for Occluded Objects and Hard Negatives
3.1. Overview
3.2. Our Approach
3.2.1. A Unified View of Output Tensors
3.2.2. Refinement for Occlusion Handling
3.2.3. Refinement for Hard Negative Handling
3.3. Experiment Settings
3.3.1. Datasets
3.3.2. Configuration Details
3.4. Experiment Results
3.4.1. Quantitative Results
3.4.2. Ablation Experiments
3.4.3. Memory and Computation Time Analysis
3.4.4. Qualitative Results
3.5. Conclusion
4. Self-Supervised Feature Super-Resolution for Small Objects
4.1. Overview
4.2. Mismatch of Relative Receptive Fields
4.3. Our Approach
4.3.1. Super-Resolution Target Extractor
4.3.2. Super-Resolution Feature Generator and Discriminator
4.3.3. Training
4.3.4. Inference
4.4. Experiment Settings
4.4.1. Datasets
4.4.2. Configuration Details
4.5. Experiment Results
4.5.1. Quantitative Results
4.5.2. Ablation Experiments
4.5.3. Qualitative Results
4.6. Conclusion
5. Rectified Class Activation Mapping for Weakly Labeled Objects
5.1. Overview
5.2. Our Approach
5.2.1. GAP-based CAM Localization
5.2.2. Thresholded Average Pooling
5.2.3. Negative Weight Clamping
5.2.4. Percentile as a Thresholding Standard
5.3. Experiment Settings
5.3.1. Datasets
5.3.2. Configuration Details
5.4. Experiment Results
5.4.1. Quantitative Results
5.4.2. Ablation Experiments
5.4.3. Qualitative Results
5.5. Conclusion
6. Conclusion
6.1. Summary
6.2. Future Works