15 research outputs found

    ์•ฝํ•œ ์ง€๋„ํ•™์Šต ๊ธฐ๋ฐ˜์˜ ๋ฌผ์ฒด ํƒ์ง€์—์„œ์˜ ํ•™์Šต ๋ถ€๋‹ด์„ ์ค„์ด๊ธฐ ์œ„ํ•œ ์—ฐ๊ตฌ

    Get PDF
    Doctoral dissertation -- Seoul National University Graduate School: Department of Mathematical Sciences, College of Natural Sciences, February 2023. Advisor: Myungjoo Kang.
    In this thesis, we propose two models for weakly supervised object localization (WSOL). Many existing WSOL models carry various training burdens, e.g., the non-negligible cost of hyperparameter search for the loss function. We therefore first propose a model called SFPN to reduce the cost of hyperparameter search for the loss function. SFPN enriches the information in the feature maps by exploiting the structure of a feature pyramid network; these feature maps are then used to predict the bounding box. This process both improves performance and allows training with only the cross-entropy loss. We further propose a second model, named A2E Net, which uses a smaller number of parameters. A2E Net consists of a spatial attention branch and a refinement branch. The spatial attention branch enhances spatial information using few parameters. The refinement branch is composed of an attention module and an erasing module, neither of which has trainable parameters. Taking the output feature map of the spatial attention branch, the attention module produces a feature map with more accurate information by exploiting connections between pixels, while the erasing module erases the most discriminative region so that the network also attends to less discriminative regions. Moreover, we boost performance by erasing at multiple sizes. Finally, we sum the output feature maps of the attention and erasing modules to utilize the information from both. Extensive experiments on CUB-200-2011 and ILSVRC show the strong performance of SFPN and A2E Net compared to existing WSOL models.
๊ธฐ์กด์˜ ๋งŽ์€ ์•ฝํ•œ ์ง€๋„ ๊ธฐ๋ฐ˜์˜ ๋ฌผ์ฒดํƒ์ง€๋ฅผ ์œ„ํ•œ ๋ชจ๋ธ๋“ค์€ ์†์‹คํ•จ์ˆ˜์˜ ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ ์ฐพ๊ธฐ์— ๋“ค์–ด๊ฐ€๋Š” ๋น„์šฉ์ด ๋ฌด์‹œํ•˜๊ธฐ ์–ด๋ ต๋‹ค๋Š” ๋“ฑ์˜ ํ•œ๊ณ„์ ์ด ์žˆ๋‹ค. ๊ทธ๋ž˜์„œ ์šฐ๋ฆฌ๋Š” ๋จผ์ € ์ด ์†์‹คํ•จ์ˆ˜์˜ ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ ์ฐพ๊ธฐ์— ๋“ค์–ด๊ฐ€๋Š” ๋น„์šฉ์„ ์ค„์ด๊ธฐ ์œ„ํ•ด์„œ SFPN์ด๋ผ๋Š” ์ด๋ฆ„์„ ๊ฐ€์ง„ ๋ชจ๋ธ์„ ์ œ์•ˆํ•œ๋‹ค. SFPN์€ ํŠน์ง• ํ”ผ๋ผ๋ฏธ๋“œ ๋„คํŠธ์›Œํฌ์˜ ๊ตฌ์กฐ๋ฅผ ํ™œ์šฉํ•˜์—ฌ ํŠน์ง• ๋งต๋“ค์˜ ์ •๋ณด๋ฅผ ๊ฐ•ํ™”์‹œ์ผฐ๋‹ค. ์ดํ›„์— ์ด ํŠน์ง• ๋งต๋“ค์€ ๊ฒฝ๊ณ„ ์ƒ์ž์˜ ์˜ˆ์ธก์— ์ฐธ์—ฌํ•œ๋‹ค. ์ด ๊ณผ์ •์€ ์„ฑ๋Šฅ ํ–ฅ์ƒ๋ฟ๋งŒ ์•„๋‹ˆ๋ผ ์˜ค์ง ๊ต์ฐจ ์—”ํŠธ๋กœํ”ผ ํ•จ์ˆ˜๋งŒ์„ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ๊ฒŒ ํ•˜๋Š” ํšจ๊ณผ๋ฅผ ๊ฐ€์ ธ์™”๋‹ค. ๋ฟ๋งŒ ์•„๋‹ˆ๋ผ ์šฐ๋ฆฌ๋Š” ์ข€ ๋” ์ ์€ ๊ฐœ์ˆ˜์˜ ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ํ™œ์šฉํ•˜๊ธฐ ์œ„ํ•˜์—ฌ ๋‘ ๋ฒˆ์งธ ๋ชจ๋ธ์ธ A2E Net์„ ์ œ์•ˆํ•œ๋‹ค. ์ด ๋ชจ๋ธ์€ ๊ณต๊ฐ„ ์ง‘์ค‘ ๋ถ„๊ธฐ, ์ •์ œ ๋ถ„๊ธฐ๋กœ ๊ตฌ์„ฑ๋œ๋‹ค. ์šฐ์„ , ๊ณต๊ฐ„ ์ง‘์ค‘ ๋ถ„๊ธฐ๋Š” ์ ์€ ๊ฐœ์ˆ˜์˜ ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๊ณต๊ฐ„ ์ •๋ณด๋ฅผ ๊ฐ•ํ™”์‹œํ‚จ๋‹ค. ๊ทธ๋ฆฌ๊ณ  ์ •์ œ ๋ถ„๊ธฐ๋Š” ์ง‘์ค‘ ๋ชจ๋“ˆ๊ณผ ์ง€์šฐ๊ธฐ ๋ชจ๋“ˆ๋กœ ๊ตฌ์„ฑ๋˜๊ณ , ์ด ๋ชจ๋“ˆ๋“ค์€ ๋ชจ๋‘ ํ•™์Šต ๊ฐ€๋Šฅํ•œ ํŒŒ๋ผ๋ฏธํ„ฐ๊ฐ€ ์—†๋‹ค. ๊ณต๊ฐ„ ์ง‘์ค‘ ๋ถ„๊ธฐ์˜ ๊ฒฐ๊ณผ๋ฅผ ์ž…๋ ฅ์œผ๋กœ ์‚ฌ์šฉํ•˜์—ฌ, ์ง‘์ค‘ ๋ชจ๋“ˆ์€ ํ”ฝ์…€ ๊ฐ„์˜ ๊ด€๊ณ„๋ฅผ ๊ณ ๋ คํ•˜์—ฌ ํŠน์ง• ๋งต์˜ ์ •๋ณด๋ฅผ ์ข€ ๋” ์ •๊ตํ•˜๊ฒŒ ๋งŒ๋“ ๋‹ค. ๋˜ํ•œ, ์ง€์šฐ๊ธฐ ๋ชจ๋“ˆ์€ ๊ณต๊ฐ„ ์ง‘์ค‘ ๋ถ„๊ธฐ์˜ ์ถœ๋ ฅ ํŠน์ง• ๋งต์˜ ๊ฐ€์žฅ ๊ตฌ๋ณ„๋˜๋Š” ์˜์—ญ์„ ์ง€์›Œ์„œ ๋„คํŠธ์›Œํฌ๊ฐ€ ๋œ ๊ตฌ๋ณ„๋˜๋Š” ์˜์—ญ๋„ ๊ณ ๋ คํ•  ์ˆ˜ ์žˆ๋„๋ก ํ•œ๋‹ค. ๋”์šฑ์ด ์ง€์šฐ๋Š” ์˜์—ญ์˜ ํฌ๊ธฐ๋ฅผ ๋‹ค์–‘ํ•˜๊ฒŒ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ๊ฒŒ ํ•˜์—ฌ ์„ฑ๋Šฅ์„ ๋” ํ–ฅ์ƒ์‹œ์ผฐ๋‹ค. ๋งˆ์ง€๋ง‰์œผ๋กœ, ์ง‘์ค‘๊ณผ ์ง€์šฐ๊ธฐ์—์„œ ๋‚˜์˜ค๋Š” ์ •๋ณด๋ฅผ ๋ชจ๋‘ ํ™œ์šฉํ•˜๊ธฐ ์œ„ํ•˜์—ฌ ์ด ๋‘ ๋ชจ๋“ˆ์˜ ์ถœ๋ ฅ ํŠน์ง• ๋งต๋“ค์„ ๋”ํ•œ๋‹ค. 
์ด๋ ‡๊ฒŒ ์ œ์•ˆ๋œ SFPN๊ณผ A2E Net์€ CUB-200-2011๊ณผ ILSVRC ์—์„œ์˜ ์‹คํ—˜์„ ํ†ตํ•ด ๊ธฐ์กด์˜ ์•ฝ์ง€๋„ ๋ฌผ์ฒด ํƒ์ง€ ๊ธฐ๋ฒ•๋“ค๋ณด๋‹ค ์ข‹์€ ์„ฑ๋Šฅ์„ ๊ฐ€์ง์„ ๋ณด์˜€๋‹ค.1 Introduction 1 2 Preliminaries 5 2.1 Convolutional Neural Networks 5 2.1.1 Convolution Operation 5 2.1.2 Some Convolutional Neural Networks 7 3 SFPN: Simple Feature Pyramid Network for Weakly Supervised Object Localization 12 3.1 Introduction 12 3.2 Related works 14 3.2.1 Some Object Detection Methods 14 3.2.2 Existing Methods for Weakly Supervised Object Localization 18 3.3 Proposed Method 23 3.4 Experiment 26 3.4.1 Datasets 26 3.4.2 Evaluation Metrics 27 3.4.3 Implementation Details 28 3.4.4 Result 28 3.4.5 Ablation Study 30 4 A2E Net: Aggregation of Attention and Erasing for Weakly Supervised Object Localization 33 4.1 Introduction 33 4.2 Related Works 35 4.2.1 Attention Mechanism 35 4.2.2 Erasing Methods 40 4.2.3 Existing Methods for Weakly Supervised Object Localization 43 4.3 Proposed Method 48 4.3.1 Spatial Attention Branch 48 4.3.2 Refinement Branch 49 4.4 Experiment 56 4.4.1 Implementation Details 56 4.4.2 Result 57 4.4.3 Ablation Study 60 5 Conclusion 67 The bibliography 70 Abstract (in Korean) 78๋ฐ•

    Semantic-Constraint Matching Transformer for Weakly Supervised Object Localization

    Full text link
    Weakly supervised object localization (WSOL) strives to learn to localize objects with only image-level supervision. Due to the local receptive fields produced by convolution operations, previous CNN-based methods suffer from partial activation, concentrating on an object's discriminative part instead of its entire extent. Benefiting from the self-attention mechanism's capability to capture long-range feature dependencies, the Vision Transformer has recently been applied to alleviate this local activation drawback. However, since the transformer lacks the inductive localization bias inherent in CNNs, it may cause a divergent activation problem, resulting in an uncertain distinction between foreground and background. In this work, we propose a novel Semantic-Constraint Matching Network (SCMN) based on a transformer to counteract the divergent activation. Specifically, we first propose a local patch shuffle strategy to construct image pairs, disrupting local patches while guaranteeing global consistency. The paired images, which contain a common object, are then fed into a Siamese network encoder. We further design a semantic-constraint matching module that mines the co-object part by matching the coarse class activation maps (CAMs) extracted from the paired images, thus implicitly guiding and calibrating the transformer network to alleviate the divergent activation. Extensive experiments on two challenging benchmarks, the CUB-200-2011 and ILSVRC datasets, show that our method achieves new state-of-the-art performance and outperforms previous methods by a large margin.

    Novel meta-learning approaches for few-shot image classification

    Get PDF
    In recent years, there has been rapid progress in computing performance and communication techniques, leading to surging interest in artificial intelligence. Artificial intelligence aims to achieve human intelligence by making a machine think. However, current machine learning and optimisation techniques are far from fully accomplishing this, suffering from several limitations. For example, humans can learn a new concept quickly from very few examples, while artificial intelligence algorithms usually require a large number of examples to extract useful patterns. To tackle this issue, the computer science community has recently delved into the challenge of learning from very limited data, also known as few-shot learning. Few-shot image classification is the most studied research field of few-shot learning, which attempts to learn a new visual concept from limited labelled images. Conventional deep learning techniques cannot simply be applied to solve the problem, hindered by two core issues of few-shot learning, namely lack of information and intrinsic uncertainties. Lack of information is related to the insufficient visual patterns in limited training data, and intrinsic uncertainties are reflected by unrepresentative examples and background clutter. To tackle these problems, recent approaches mostly incorporate meta-learning methods, which learn general knowledge about how to make few-shot learning easier and quicker from a collection of learning tasks. However, existing meta-learning approaches mostly focus on only one of the two key problems of few-shot image classification; very few existing works consider both of them at the same time. Therefore, there is a need for novel meta-learning approaches that take both problems into account simultaneously for few-shot image classification. 
The thesis focuses on developing novel strategies of meta-learning approaches for few-shot image classification through three progressive stages, with the goal of addressing the aforementioned two core issues concurrently from different perspectives. In the first stage, we tackle the two main problems from the viewpoint of maximising the use of limited training data. Concretely, we propose learning to aggregate embeddings based on a channel-wise attention module. In this stage, we assume that the embeddings produced by feature extraction contain sufficient useful features. However, a feature extraction process can also lose relevant features. Hence, in the second stage, we aim to ensure that as many useful features as possible are extracted during feature extraction. Specifically, we design a spatial attention-based adaptive pooling module, in which a learnable pooling weight generation block is trained to assign different pooling weights to the features at different spatial locations. To further improve classification performance, in the third stage, we leverage auxiliary information, such as saliency maps which can highlight the target object in an image, to compensate for the lack of information and mitigate background clutter. A comprehensive exploration of the suitable auxiliary information and how to use it effectively is provided. In summary, the research presented here introduces novel strategies of meta-learning approaches for few-shot image classification, addressing its two core issues from three different perspectives. The conducted works provide insights and solutions about how to effectively overcome the lack of information and intrinsic uncertainties in few-shot image classification. Our proposed methods lead to competitive results on various few-shot learning benchmarks with respect to the state of the art. 
In addition, they contribute to the few-shot learning research community new meta-learning strategies that address the two main problems of few-shot image classification simultaneously.
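The spatial attention-based adaptive pooling described in the second stage can be illustrated as follows. In this sketch, a fixed linear scoring vector stands in for the learnable pooling weight generation block; both that substitution and the softmax normalisation are assumptions made for illustration.

```python
import numpy as np

def adaptive_spatial_pool(features, score_weights):
    """Pool a C x H x W feature map with per-location attention weights.

    A linear scoring vector (a stand-in for a trained weight generation
    block) maps each spatial location's C-dim feature to a scalar score;
    a softmax turns the scores into pooling weights; the pooled embedding
    is the weighted sum over spatial locations.
    """
    c, h, w = features.shape
    flat = features.reshape(c, h * w)        # C x (H*W)
    scores = score_weights @ flat            # one score per spatial location
    scores = scores - scores.max()           # numerically stable softmax
    weights = np.exp(scores) / np.exp(scores).sum()
    return flat @ weights                    # C-dim pooled embedding
```

With all-zero scores the weights are uniform and the sketch reduces to global average pooling; a trained scoring block would instead up-weight informative locations and down-weight background clutter.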