14 research outputs found

    A mutual GrabCut method to solve co-segmentation

    Extent: 11 p.

    Co-segmentation aims at segmenting common objects from a group of images. Markov random fields (MRFs) have been widely used for co-segmentation by introducing a global constraint that makes the foregrounds similar to each other; however, the resulting model is difficult to minimize. In this paper, we propose a new MRF-based co-segmentation model that avoids this difficult minimization: the foreground-similarity constraint is added to the unary term of the MRF rather than as a global term, so the model can be minimized by the graph cut method. A new energy function is designed that considers both foreground similarity and background consistency, and a mutual optimization approach is used to minimize it. We test the proposed method on many pairs of images, and the experimental results demonstrate the effectiveness of the proposed method.

    Zhisheng Gao, Peng Shi, Hamid Reza Karimi and Zheng Pei
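The key idea — folding the foreground-similarity constraint into the unary term so each image can be solved with an ordinary binary minimizer — can be sketched as follows. This is a minimal, hedged illustration: the intensity-histogram colour models, the ICM solver standing in for graph cut, and all parameter names (`lam`, `beta`) are assumptions, not the authors' formulation.

```python
import numpy as np

def histogram(values, bins=8):
    """Smoothed, normalised intensity histogram used as a colour model."""
    h, _ = np.histogram(values, bins=bins, range=(0.0, 1.0))
    h = h.astype(float) + 1e-6
    return h / h.sum()

def unary(img, fg_hist, bg_hist, other_fg_hist, lam=0.5, bins=8):
    """Per-pixel energies for labels (bg, fg). The co-segmentation twist:
    the fg cost also penalises pixels that fit poorly under the *other*
    image's foreground histogram, so similarity lives in the unary term."""
    idx = np.clip((img * bins).astype(int), 0, bins - 1)
    e_fg = -np.log(fg_hist[idx]) - lam * np.log(other_fg_hist[idx])
    e_bg = -np.log(bg_hist[idx])
    return e_bg, e_fg

def icm(e_bg, e_fg, beta=0.3, iters=5):
    """Iterated conditional modes with a Potts pairwise term, standing in
    for the graph cut solver the paper actually uses."""
    lab = (e_fg < e_bg).astype(int)
    H, W = lab.shape
    for _ in range(iters):
        for y in range(H):
            for x in range(W):
                nb = [lab[yy, xx] for yy, xx in
                      ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1))
                      if 0 <= yy < H and 0 <= xx < W]
                costs = [e_bg[y, x] + beta * sum(n != 0 for n in nb),
                         e_fg[y, x] + beta * sum(n != 1 for n in nb)]
                lab[y, x] = int(np.argmin(costs))
    return lab
```

With histograms estimated from one image's current foreground and plugged into the other image's unary term, alternating minimization of the two images plays the role of the paper's mutual optimization.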

    Global optimisation techniques for image segmentation with higher order models

    Energy minimisation methods are one of the most successful approaches to image segmentation. Typically used energy functions are limited to pairwise interactions due to the increased complexity when working with higher-order functions. However, some important assumptions about objects are not translatable to pairwise interactions. The goal of this thesis is to explore higher order models for segmentation that are applicable to a wide range of objects. We consider: (1) a connectivity constraint, (2) a joint model over the segmentation and the appearance, and (3) a model for segmenting the same object in multiple images. We start by investigating a connectivity prior, which is a natural assumption about objects. We show how this prior can be formulated in the energy minimisation framework and explore the complexity of the underlying optimisation problem, introducing two different algorithms for optimisation. This connectivity prior is useful to overcome the "shrinking bias" of the pairwise model, in particular in interactive segmentation systems. Secondly, we consider an existing model that treats the appearance of the image segments as variables. We show how to globally optimise this model using a Dual Decomposition technique and show that this optimisation method outperforms existing ones. Finally, we explore the current limits of the energy minimisation framework. We consider the cosegmentation task and show that a preference for object-like segmentations is an important addition to cosegmentation. This preference is, however, not easily encoded in the energy minimisation framework. Instead, we use a practical proposal generation approach that allows not only the inclusion of a preference for object-like segmentations but also the learning of the similarity measure needed to define the cosegmentation task. We conclude that higher order models are useful for different object segmentation tasks.
We show how some of these models can be formulated in the energy minimisation framework. Furthermore, we introduce global optimisation methods for these energies and make extensive use of the Dual Decomposition optimisation approach, which proves to be suitable for this type of model.
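The Dual Decomposition strategy the thesis relies on can be illustrated on a deliberately tiny problem. This is a hedged toy, not the thesis's segmentation energies: a modular energy is split into two "slave" problems that are each trivial to minimise independently, and a subgradient step on Lagrange multipliers pushes the slaves toward agreement.

```python
import numpy as np

def dual_decomposition(u1, u2, steps=50, step0=1.0):
    """Toy dual decomposition for E(x) = u1·x + u2·x over x in {0,1}^n.
    Slave 1 minimises (u1 + lam)·x, slave 2 minimises (u2 - lam)·x;
    the multipliers lam are updated by a diminishing subgradient step
    until the two slave solutions agree."""
    lam = np.zeros_like(u1, dtype=float)
    for t in range(1, steps + 1):
        x1 = (u1 + lam < 0).astype(int)   # slave 1: independent minimisation
        x2 = (u2 - lam < 0).astype(int)   # slave 2: independent minimisation
        if np.array_equal(x1, x2):
            return x1                      # agreement: primal solution found
        lam += (step0 / t) * (x1 - x2)     # subgradient step on the constraint x1 = x2
    # for this modular toy the joint unary minimum is the exact optimum
    return ((u1 + u2) < 0).astype(int)
```

In the thesis's setting the slaves are far richer (e.g. an appearance model and a segmentation model), but the agree-via-multipliers mechanism is the same.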

    ๊ฐ•์ธํ•œ ๋Œ€ํ™”ํ˜• ์˜์ƒ ๋ถ„ํ•  ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์œ„ํ•œ ์‹œ๋“œ ์ •๋ณด ํ™•์žฅ ๊ธฐ๋ฒ•์— ๋Œ€ํ•œ ์—ฐ๊ตฌ

    Thesis (Ph.D.) -- Graduate School of Seoul National University: College of Engineering, Department of Electrical and Computer Engineering, February 2021. Advisor: Kyoung Mu Lee.

    Segmentation of an area corresponding to a desired object in an image is essential to computer vision problems, because most algorithms operate on semantic units when interpreting or analyzing images. However, segmenting the desired object from a given image is an ambiguous task: the target object varies depending on the user and the purpose. To solve this problem, interactive segmentation techniques have been proposed, in which segmentation is steered in the desired direction through interaction with the user. Here, the seed information provided by the user plays an important role. If the seeds provided by a user contain abundant information, the accuracy of segmentation increases; however, providing rich seed information places a heavy burden on the user. Therefore, the main goal of the present study was to obtain satisfactory segmentation results from simple seed information. We primarily focused on converting the provided sparse seed information into a rich state from which accurate segmentation results can be derived. To this end, a minimal user input is taken and enriched through various seed enrichment techniques. A total of three interactive segmentation techniques are proposed, based on: (1) Seed Expansion, (2) Seed Generation, and (3) Seed Attention. These enrichment strategies comprise expanding the area around a seed, generating new seeds at new positions, and attending to semantic information. First, in seed expansion, we widen the scope of the seed: reliable pixels around the initial seed are integrated into the seed set through a two-stage expansion step. The extended seed covers a much wider area than the initial seed, resolving the seed's scarcity and imbalance problems. Next, in seed generation, we create seeds at new points rather than around the existing seed.
We train the system by imitating the user's behavior of providing a new seed point in an erroneous region. By learning the user's intention, our model can efficiently create new seed points. The generated seeds aid segmentation and can also be used as additional information for weakly supervised learning. Finally, through seed attention, we put semantic information into the seed. Unlike the previous models, we integrate both the segmentation process and the seed enrichment process: the seed information is reinforced by adding semantic information to the seed instead of spatial expansion, and is enriched through mutual attention with feature maps generated during the segmentation process. The proposed models show superiority over the existing techniques in various experiments. Notably, even with sparse seed information, the proposed seed enrichment techniques give far more accurate segmentation results than other existing methods.
Contents:
1 Introduction: 1.1 Previous Works; 1.2 Proposed Methods
2 Interactive Segmentation with Seed Expansion: 2.1 Introduction; 2.2 Proposed Method (Background, Pyramidal RWR, Seed Expansion, Refinement with Global Information); 2.3 Experiments (Dataset, Implementation Details, Performance, Contribution of Each Part, Seed Consistency, Running Time); 2.4 Summary
3 Interactive Segmentation with Seed Generation: 3.1 Introduction; 3.2 Related Works; 3.3 Proposed Method (System Overview, Markov Decision Process, Deep Q-Network, Model Architecture); 3.4 Experiments (Implementation Details, Performance, Ablation Study, Other Datasets); 3.5 Summary
4 Interactive Segmentation with Seed Attention: 4.1 Introduction; 4.2 Related Works; 4.3 Proposed Method (Interactive Segmentation Network, Bi-directional Seed Attention Module); 4.4 Experiments (Datasets, Metrics, Implementation Details, Performance, Ablation Study, Seed Enrichment Methods); 4.5 Summary
5 Conclusions: 5.1 Summary
Bibliography; Abstract (in Korean)
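The seed-expansion idea — turning a sparse user click into a larger reliable seed region — might be sketched as a simple flood fill. This is a hedged stand-in: the thesis's actual two-stage scheme built on pyramidal random walk with restart is far more elaborate, and the tolerance threshold here is an invented parameter.

```python
from collections import deque

import numpy as np

def expand_seeds(img, seeds, tol=0.1):
    """Grow each user seed into a larger region by absorbing 4-connected
    neighbours whose intensity stays within `tol` of the pixel they are
    grown from. Illustrates converting sparse clicks into richer seeds."""
    H, W = img.shape
    expanded = set(seeds)
    q = deque(seeds)
    while q:
        y, x = q.popleft()
        for yy, xx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if (0 <= yy < H and 0 <= xx < W and (yy, xx) not in expanded
                    and abs(img[yy, xx] - img[y, x]) <= tol):
                expanded.add((yy, xx))   # reliable neighbour joins the seed set
                q.append((yy, xx))
    return expanded
```

The enlarged seed set can then feed any interactive segmentation backend in place of the original sparse clicks.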

    CONTENT EXTRACTION BASED ON VIDEO CO-SEGMENTATION

    Ph.D. (Doctor of Philosophy) thesis.

    Visual object category discovery in images and videos

    The current trend in visual recognition research is to place a strict division between the supervised and unsupervised learning paradigms, which is problematic for two main reasons. On the one hand, supervised methods require training data for each and every category that the system learns; training data may not always be available and is expensive to obtain. On the other hand, unsupervised methods must determine the optimal visual cues and distance metrics that distinguish one category from another to group images into semantically meaningful categories; however, for unlabeled data, these are unknown a priori. I propose a visual category discovery framework that transcends the two paradigms and learns accurate models with few labeled exemplars. The main insight is to automatically focus on the prevalent objects in images and videos, and learn models from them for category grouping, segmentation, and summarization. To implement this idea, I first present a context-aware category discovery framework that discovers novel categories by leveraging context from previously learned categories. I devise a novel object-graph descriptor to model the interaction between a set of known categories and the unknown to-be-discovered categories, and group regions that have similar appearance and similar object-graphs. I then present a collective segmentation framework that simultaneously discovers the segmentations and groupings of objects by leveraging the shared patterns in the unlabeled image collection. It discovers an ensemble of representative instances for each unknown category, and builds top-down models from them to refine the segmentation of the remaining instances. Finally, building on these techniques, I show how to produce compact visual summaries for first-person egocentric videos that focus on the important people and objects.
The system leverages novel egocentric and high-level saliency features to predict important regions in the video, and produces a concise visual summary that is driven by those regions. I compare against existing state-of-the-art methods for category discovery and segmentation on several challenging benchmark datasets. I demonstrate that we can discover visual concepts more accurately by focusing on the prevalent objects in images and videos, and show clear advantages of departing from the status quo division between the supervised and unsupervised learning paradigms. The main impact of my thesis is that it lays the groundwork for building large-scale visual discovery systems that can automatically discover visual concepts with minimal human supervision.

Electrical and Computer Engineering
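The object-graph idea — describing an unknown region by the known-category predictions of its surroundings, then grouping regions with both similar appearance and similar context — might be sketched as follows. The feature layout (appearance vector plus mean posteriors of regions above and below) and the k-means grouping are illustrative assumptions, not the paper's exact descriptor or clustering procedure.

```python
import numpy as np
from sklearn.cluster import KMeans

def object_graph_descriptor(appearance, above_posteriors, below_posteriors):
    """Concatenate a region's own appearance with the averaged
    known-category posteriors of the regions above and below it, so that
    regions with similar surroundings end up close in descriptor space."""
    return np.concatenate([
        np.asarray(appearance, dtype=float),
        np.mean(above_posteriors, axis=0),   # context from regions above
        np.mean(below_posteriors, axis=0),   # context from regions below
    ])

def discover_categories(descriptors, k=2):
    """Group unknown regions by descriptor similarity (k-means stand-in
    for the paper's grouping step)."""
    X = np.asarray(descriptors)
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
```

Regions sharing a cluster label would then be treated as instances of one newly discovered category.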

    Learning to Complete 3D Scenes from Single Depth Images

    Building a complete 3D model of a scene given only a single depth image is underconstrained. To acquire a full volumetric model, one typically needs either multiple views, or a single view together with a library of unambiguous 3D models that will fit the shape of each individual object in the scene. In this thesis, we present alternative methods for inferring the hidden geometry of table-top scenes. We first introduce two depth-image datasets consisting of multiple scenes, each with a ground truth voxel occupancy grid. We then introduce three methods for predicting voxel occupancy. The first predicts the occupancy of each voxel using a novel feature vector which measures the relationship between the query voxel and surfaces in the scene observed by the depth camera. We use a Random Forest to map each voxel of unknown state to a prediction of occupancy. We observed that predicting the occupancy of each voxel independently can lead to noisy solutions. We hypothesize that objects of dissimilar semantic classes often share similar 3D shape components, enabling a limited dataset to model the shape of a wide range of objects and hence estimate their hidden geometry. Demonstrating this hypothesis, we propose an algorithm that can make structured completions of unobserved geometry. Finally, we propose an alternative framework for understanding the 3D geometry of scenes using the observation that individual objects can appear in multiple different scenes, but in different configurations. We introduce a supervised method to find regions corresponding to the same object across different scenes. We demonstrate that it is possible to then use these groupings of partially observed objects to reconstruct missing geometry. We then perform a critical review of the approaches we have taken, including an assessment of our metrics and datasets, before proposing extensions and future work.
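The first method — per-voxel occupancy prediction with a Random Forest over features relating each query voxel to observed surfaces — could be sketched like this. The two features (a signed distance to the observed surface and a height value) and the synthetic labels are invented stand-ins; the thesis's actual feature vector is richer.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def make_features(signed_dists, heights):
    """Stack per-voxel features: signed distance to the observed surface
    along the viewing ray, and voxel height (both illustrative)."""
    return np.stack([signed_dists, heights], axis=1)

rng = np.random.default_rng(0)
# Synthetic training voxels: occupied iff they lie behind the observed
# surface (positive signed distance) -- a toy labelling rule.
signed = rng.uniform(-1.0, 1.0, 200)
height = rng.uniform(0.0, 1.0, 200)
occupied = (signed > 0).astype(int)

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(make_features(signed, height), occupied)
```

At test time, each voxel of unknown state is featurised the same way and `clf.predict` supplies its occupancy estimate, exactly the per-voxel independence the thesis later improves on with structured completion.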

    DEEP NEURAL NETWORKS AND REGRESSION MODELS FOR OBJECT DETECTION AND POSE ESTIMATION

    Estimating the pose, orientation and location of objects has been a central problem addressed by the computer vision community for decades. In this dissertation, we propose new approaches to these important problems using deep neural networks as well as tree-based regression models. For the first topic, we look at the human body pose estimation problem and propose a novel regression-based approach. The goal of human body pose estimation is to predict the locations of body joints, given an image of a person. Due to significant variations introduced by pose, clothing and body styles, it is extremely difficult to address this task by a standard application of regression. Thus, we divide the whole body pose estimation problem into a set of local pose estimation problems by introducing a dependency graph which describes the dependencies among different body joints. For each local pose estimation problem, we train a boosted regression tree model and estimate the pose by progressively applying the regression along the paths in the dependency graph, starting from the root node. Our next work improves the traditional regression tree method and demonstrates its effectiveness for pose/orientation estimation tasks. The main issues with traditional regression tree training are: (1) node splitting is limited to binary splitting, (2) the splitting function is limited to thresholding on a single dimension of the input vector, and (3) the best splitting function is found by exhaustive search. We propose a novel node splitting algorithm for regression tree training which avoids these issues. The algorithm proceeds by first applying k-means clustering in the output space, then conducting multi-class classification with a support vector machine (SVM), and finally determining the constant estimate at each leaf node.
We apply the regression forest that includes our regression tree models to head pose estimation, car orientation estimation and pedestrian orientation estimation tasks, and demonstrate its superiority over various standard regression methods. Next, we turn our attention to the role of pose information in the object detection task. In particular, we focus on the detection of fashion items a person is wearing or carrying. It is clear that the locations of these items are strongly correlated with the pose of the person. To address this task, we first generate a set of candidate bounding boxes using an object proposal algorithm. For each candidate bounding box, image features are extracted by a deep convolutional neural network pre-trained on a large image dataset, and detection scores are generated by SVMs. We introduce a pose-dependent prior on the geometry of the bounding boxes and combine it with the SVM scores. We demonstrate that the proposed algorithm achieves significant improvement in detection performance. Lastly, we address the object detection task by exploring a way to incorporate an attention mechanism into the detection algorithm. Humans have the capability of allocating multiple fixation points, each of which attends to different locations and scales of the scene. However, such a mechanism is missing in the current state-of-the-art object detection methods. Inspired by the human vision system, we propose a novel deep network architecture that imitates this attention mechanism. For detecting objects in an image, the network adaptively places a sequence of glimpses at different locations in the image. Evidence of the presence of an object and its location is extracted from these glimpses, and then fused to estimate the object class and bounding box coordinates. Due to the lack of ground truth annotations for the visual attention mechanism, we train our network using a reinforcement learning algorithm.
    Experimental results on standard object detection benchmarks show that the proposed network consistently outperforms baseline networks that do not employ the attention mechanism.
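The proposed node-splitting idea from the regression-tree work — k-means clustering in the output space, an SVM in the input space trained to reproduce that partition, and a constant estimate stored at each resulting child — can be sketched for a single split. The data, kernel choice, and the fact that only one split (no recursion) is shown are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def kmeans_svm_split(X, y, k=2):
    """One node split in the style described by the dissertation:
    cluster the *outputs* y with k-means, train an SVM on the *inputs* X
    to reproduce that k-way partition, and keep the mean output of each
    child as its constant leaf estimate. A real tree would recurse."""
    clusters = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(y)
    svm = SVC(kernel="linear").fit(X, clusters)          # learned splitting function
    leaf_means = {c: y[clusters == c].mean(axis=0) for c in range(k)}

    def predict(x):
        # route the input through the learned split, return the leaf constant
        return leaf_means[int(svm.predict(np.atleast_2d(x))[0])]

    return predict
```

Compared with traditional training, the split is k-way rather than binary, uses all input dimensions rather than a single threshold, and is fit directly instead of found by exhaustive search.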