151 research outputs found

    Improving Object Detection under Challenging Conditions of Size, Occlusion, and Labels

    Ph.D. dissertation, Seoul National University Graduate School, College of Engineering, Dept. of Computer Science and Engineering, Feb. 2020. Advisor: Gunhee Kim.
Object detection is one of the most essential and fundamental fields in computer vision. It is the foundation not only of high-level vision tasks such as instance segmentation, object tracking, image captioning, scene understanding, and action recognition, but also of real-world applications such as video surveillance, self-driving cars, robot vision, and augmented reality.
Due to its important role in computer vision, object detection has been studied for decades and has developed drastically with the emergence of deep neural networks. Despite this rapid advancement, however, the performance of many detection models remains limited under certain conditions. In this thesis, we examine three challenging conditions that hinder the robust application of object detection models and propose novel approaches to resolve the problems they cause. We first investigate how to improve the detection of occluded objects and hard negatives in the domain of pedestrian detection. Occluded pedestrians are often recognized as background, whereas hard negative examples such as vertical objects are mistaken for pedestrians, which significantly degrades detection performance. Since pedestrian detection often requires real-time processing, we propose a method that alleviates both problems by improving a single-stage detection model, which has an advantage in speed. More specifically, we introduce a post-processing module that refines initial predictions based on reliable classification of a person's body parts and of image grid cells. We then study how to better detect small objects across general object classes. Although two-stage object detection models significantly outperform single-stage models in terms of accuracy, even two-stage models perform far below human level on small objects, mainly because the features of a small region of interest carry too little information. In this thesis, we propose a feature-level super-resolution method built on two-stage object detection models to improve the detection of small objects. More specifically, by properly pairing input and target features for super-resolution, we stabilize the training process and, as a result, significantly improve detection performance on small objects.
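To illustrate the pairing idea in the paragraph above, here is a minimal, hedged sketch: the "generator" is a made-up per-pixel linear map over channels (not the thesis's actual architecture), and the point is only that each low-resolution RoI feature is regressed toward an explicit target feature taken from a higher-resolution view of the same region.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical generator: a per-pixel linear map over C channels,
# standing in for the super-resolution feature generator.
C = 8
W_gen = rng.normal(size=(C, C)) * 0.1

def generate_sr_feature(roi_feat):
    """Apply the (assumed) generator to a (C, H, W) RoI feature map."""
    return np.einsum('oc,chw->ohw', W_gen, roi_feat)

def sr_feature_loss(roi_feat, target_feat):
    """L2 distance between generated and target features. Pairing each
    low-resolution input with an explicit target is what the abstract
    credits for stabilizing training."""
    return np.mean((generate_sr_feature(roi_feat) - target_feat) ** 2)

small = rng.normal(size=(C, 7, 7))   # RoI feature of a small object
target = rng.normal(size=(C, 7, 7))  # same region, higher-resolution input
loss = sr_feature_loss(small, target)
```

In practice the generator would be a learned convolutional network and the loss would drive its training; the sketch only fixes the interface.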
Lastly, we address the object detection problem under weak supervision. In particular, weakly supervised object localization (WSOL) assumes there is only one object per image and provides only class labels for training. In the absence of bounding box annotations, one dominant approach to WSOL uses class activation maps (CAM), which are generated by training for classification and then used to estimate object locations. However, since a classification model is trained to concentrate on discriminative features, the localization results are often limited to a small part of the object. To resolve this problem, we propose methods that properly utilize the information in class activation maps. Our proposed methods significantly improve the performance of base models on the benchmark datasets for each problem and achieve state-of-the-art performance in some settings. Since they are flexible enough to apply to various models, we also expect them to bring additional performance gains when applied to more recent models.
Contents:
1. Introduction
   1.1. Contributions
   1.2. Thesis Organization
2. Related Work
   2.1. General Methods
   2.2. Methods for Occluded Objects and Hard Negatives
   2.3. Methods for Small Objects
   2.4. Methods for Weakly Labeled Objects
3. Part and Grid Classification Based Post-Refinement for Occluded Objects and Hard Negatives
   3.1. Overview
   3.2. Our Approach
        3.2.1. A Unified View of Output Tensors
        3.2.2. Refinement for Occlusion Handling
        3.2.3. Refinement for Hard Negative Handling
   3.3. Experiment Settings
        3.3.1. Datasets
        3.3.2. Configuration Details
   3.4. Experiment Results
        3.4.1. Quantitative Results
        3.4.2. Ablation Experiments
        3.4.3. Memory and Computation Time Analysis
        3.4.4. Qualitative Results
   3.5. Conclusion
4. Self-Supervised Feature Super-Resolution for Small Objects
   4.1. Overview
   4.2. Mismatch of Relative Receptive Fields
   4.3. Our Approach
        4.3.1. Super-Resolution Target Extractor
        4.3.2. Super-Resolution Feature Generator and Discriminator
        4.3.3. Training
        4.3.4. Inference
   4.4. Experiment Settings
        4.4.1. Datasets
        4.4.2. Configuration Details
   4.5. Experiment Results
        4.5.1. Quantitative Results
        4.5.2. Ablation Experiments
        4.5.3. Qualitative Results
   4.6. Conclusion
5. Rectified Class Activation Mapping for Weakly Labeled Objects
   5.1. Overview
   5.2. Our Approach
        5.2.1. GAP-based CAM Localization
        5.2.2. Thresholded Average Pooling
        5.2.3. Negative Weight Clamping
        5.2.4. Percentile as a Thresholding Standard
   5.3. Experiment Settings
        5.3.1. Datasets
        5.3.2. Configuration Details
   5.4. Experiment Results
        5.4.1. Quantitative Results
        5.4.2. Ablation Experiments
        5.4.3. Qualitative Results
   5.5. Conclusion
6. Conclusion
   6.1. Summary
   6.2. Future Works
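As background for the GAP-based CAM localization that Chapter 5 builds on, here is a minimal NumPy sketch: the class activation map is the classifier-weighted sum of feature channels, and a box is taken around activations above a fraction of the maximum. The shapes and the 0.2 threshold are illustrative assumptions, not the thesis's configuration.

```python
import numpy as np

def class_activation_map(features, weights, class_idx):
    """CAM for one class: weighted sum of feature channels.

    features: (C, H, W) conv feature map before global average pooling.
    weights:  (num_classes, C) weights of the final linear classifier.
    """
    return np.tensordot(weights[class_idx], features, axes=1)  # (H, W)

def localize(cam, thresh_ratio=0.2):
    """Bounding box of the region where the CAM exceeds a fraction of its max."""
    mask = cam >= thresh_ratio * cam.max()
    ys, xs = np.where(mask)
    return xs.min(), ys.min(), xs.max(), ys.max()  # x1, y1, x2, y2

# Toy example: one hot region in channel 0, classifier attends to channel 0.
features = np.zeros((3, 8, 8)); features[0, 2:5, 3:6] = 1.0
weights = np.zeros((10, 3)); weights[7, 0] = 1.0
box = localize(class_activation_map(features, weights, 7))  # (3, 2, 5, 4)
```

Because classification training concentrates weight on discriminative channels, such boxes tend to shrink to the most discriminative part of the object, which is exactly the failure mode the thesis's rectified CAM addresses.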

    Text-Based Guidance for Improved Image Retrieval on Archival Image Dataset

    Digitised archival photo collections allow members of the public to view images relating to history and democracy. Recent advances in visual tasks such as Content-Based Image Retrieval and the development of deep neural networks provide modern methods to analyse digitised images and perform image queries for retrieval. We explore the image retrieval task using several publicly available datasets and a set of archival images from the National Archives of Australia, and propose a simple change to an existing pooling method that improves retrieval performance on the archival set. Another visual task, object localisation, considers the ability of a model to be trained to locate the positions of objects in an image, given English text phrases. With recent advances in large-scale text embedding models, pre-trained text models retain rich semantic structure. While other object localisation methods train text pathways within their deep neural models, we explore the direct use of a large-scale text embedding for this task and demonstrate its ability to localise objects, even for unseen words.
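The abstract does not name the pooling change. As hedged background only, one widely used pooling operator in image retrieval is generalized-mean (GeM) pooling, which interpolates between average pooling (p=1) and max pooling (large p); a minimal sketch:

```python
import numpy as np

def gem_pool(features, p=3.0, eps=1e-6):
    """Generalized-mean (GeM) pooling over the spatial dimensions.

    features: (C, H, W) convolutional feature map.
    p=1 recovers average pooling; large p approaches max pooling.
    """
    clamped = np.clip(features, eps, None)  # GeM assumes non-negative activations
    return np.mean(clamped ** p, axis=(1, 2)) ** (1.0 / p)

# Pool a feature map into a (C,)-dimensional global image descriptor.
feat = np.abs(np.random.default_rng(0).normal(size=(4, 8, 8)))
descriptor = gem_pool(feat)
```

The resulting descriptor is typically L2-normalised and compared by cosine similarity against the descriptors of the image collection.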