2,645 research outputs found

    Unsupervised Holistic Image Generation from Key Local Patches

    Full text link
    We introduce a new problem of generating an image based on a small number of key local patches without any geometric prior. In this work, key local patches are defined as informative regions of the target object or scene. This is a challenging problem since it requires generating realistic images and predicting locations of parts at the same time. We construct adversarial networks to tackle this problem. A generator network generates a fake image as well as a mask based on the encoder-decoder framework. On the other hand, a discriminator network aims to detect fake images. The network is trained with three losses to consider spatial, appearance, and adversarial information. The spatial loss determines whether the locations of predicted parts are correct. Input patches are restored in the output image without much modification due to the appearance loss. The adversarial loss ensures output images are realistic. The proposed network is trained without supervisory signals since no labels of key parts are required. Experimental results on six datasets demonstrate that the proposed algorithm performs favorably on challenging objects and scenes. Comment: 16 pages
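A minimal sketch of how the three losses described above might be combined during generator training. The module interfaces, loss choices, and weights here are assumptions for illustration, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def generator_loss(G, D, patches, real_image, part_mask,
                   w_spatial=1.0, w_app=10.0, w_adv=0.01):
    """Hypothetical combination of the spatial, appearance, and
    adversarial losses. Assumed interfaces:
    G(patches) -> (fake_image, pred_mask); D(image) -> real/fake logits."""
    fake_image, pred_mask = G(patches)

    # Spatial loss: the predicted part mask should match the mask
    # implied by where the input patches actually belong.
    l_spatial = F.binary_cross_entropy_with_logits(pred_mask, part_mask)

    # Appearance loss: input patches should reappear in the output
    # with little modification, enforced inside the part regions.
    l_app = F.l1_loss(fake_image * part_mask, real_image * part_mask)

    # Adversarial loss: the generated image should fool the discriminator.
    logits = D(fake_image)
    l_adv = F.binary_cross_entropy_with_logits(logits, torch.ones_like(logits))

    return w_spatial * l_spatial + w_app * l_app + w_adv * l_adv
```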

    Generative Face Completion

    Full text link
    In this paper, we propose an effective face completion algorithm using a deep generative model. Different from well-studied background completion, the face completion task is more challenging as it often requires generating semantically new pixels for the missing key components (e.g., eyes and mouths) that contain large appearance variations. Unlike existing nonparametric algorithms that search for patches to synthesize, our algorithm directly generates contents for missing regions based on a neural network. The model is trained with a combination of a reconstruction loss, two adversarial losses, and a semantic parsing loss, which ensures pixel faithfulness and local-global content consistency. With extensive experimental results, we demonstrate qualitatively and quantitatively that our model is able to deal with a large area of missing pixels in arbitrary shapes and generate realistic face completion results. Comment: Accepted by CVPR 2017
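Read as a weighted sum, the training objective above might look as follows; the weights $\lambda_i$ and the assignment of the two adversarial losses to local and global discriminators are illustrative assumptions, not values stated in the abstract:

$$
\mathcal{L} \;=\; \mathcal{L}_{r} \;+\; \lambda_{1}\,\mathcal{L}_{a_1} \;+\; \lambda_{2}\,\mathcal{L}_{a_2} \;+\; \lambda_{3}\,\mathcal{L}_{p},
$$

where $\mathcal{L}_r$ is the reconstruction loss, $\mathcal{L}_{a_1}$ and $\mathcal{L}_{a_2}$ are the two adversarial losses, and $\mathcal{L}_p$ is the semantic parsing loss.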

    Exploiting Deep Features for Remote Sensing Image Retrieval: A Systematic Investigation

    Full text link
    Remote sensing (RS) image retrieval is of great significance for geological information mining. Over the past two decades, a large amount of research on this task has been carried out, mainly focusing on three core issues: feature extraction, similarity metrics, and relevance feedback. Due to the complexity and multiformity of ground objects in high-resolution remote sensing (HRRS) images, there is still room for improvement in current retrieval approaches. In this paper, we analyze the three core issues of RS image retrieval and provide a comprehensive review of existing methods. Furthermore, with the goal of advancing the state of the art in HRRS image retrieval, we focus on the feature extraction issue and investigate how powerful deep representations can be used to address this task. We systematically evaluate the factors that may affect the performance of deep features. By optimizing each factor, we achieve remarkable retrieval results on publicly available HRRS datasets. Finally, we explain the experimental phenomena in detail and draw conclusions from our analysis. Our work can serve as a guide for research on content-based RS image retrieval.
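A minimal sketch of the content-based retrieval pipeline the paper investigates: extract one deep feature per image with a fixed pretrained CNN, then rank the gallery by a similarity metric. The backbone, pooling, and cosine similarity are assumptions, not the paper's evaluated configuration:

```python
import numpy as np
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Assumed setup: a pretrained ResNet-50 as a fixed feature extractor.
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()  # keep the 2048-d pooled feature
backbone.eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def extract(path):
    x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    f = backbone(x).squeeze(0).numpy()
    return f / np.linalg.norm(f)  # L2-normalize for cosine similarity

def rank(query_path, gallery_paths):
    q = extract(query_path)
    feats = np.stack([extract(p) for p in gallery_paths])
    sims = feats @ q  # cosine similarity on normalized features
    return [gallery_paths[i] for i in np.argsort(-sims)]
```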

    Structured Understanding of Visual Data Using Partial Information: Sparsity, Randomness, Relevance, and Deep Networks

    Get PDF
    Ph.D. dissertation -- Seoul National University Graduate School: Department of Electrical and Computer Engineering, College of Engineering, February 2019. Oh, Songhwai.
    For a deeper understanding of visual data, the relationship between local parts and the global scene has to be carefully examined. Examples of such relationships in vision problems include, but are not limited to, detecting a region of interest in a scene, classifying an image based on limited visual cues, and synthesizing new images conditioned on local or global inputs. In this thesis, we aim to learn this relationship and demonstrate its importance by showing that it is one of the critical keys to addressing four challenging vision problems. For each problem, we construct deep neural networks that suit the task.

The first problem considered in the thesis is object detection. It requires not only finding local patches that look like target objects conditioned on the context of the input scene, but also comparing the local patches themselves to assign a single detection to each object. To this end, we introduce the individualness of detection candidates as a complement to objectness for object detection. The individualness assigns a single detection to each object out of raw detection candidates given by either object proposals or sliding windows. We show that conventional approaches, such as non-maximum suppression, are sub-optimal since they suppress nearby detections using only detection scores. Instead, we use a determinantal point process combined with the individualness to optimally select the final detections. It models each detection using its quality and its similarity to other detections based on the individualness. Detections with high detection scores and low correlations are then selected by measuring their probability using the determinant of a matrix, which is composed of quality terms on the diagonal entries and similarities on the off-diagonal entries. For concreteness, we focus on pedestrian detection, as it is one of the most challenging detection problems due to frequent occlusions and unpredictable human motions. Experimental results demonstrate that the proposed algorithm performs favorably against existing methods, including non-maximum suppression and a quadratic unconstrained binary optimization based method.
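Concretely, the selection step just described can be written as a determinantal point process over subsets $Y$ of detection candidates; the notation below is one standard way to express it and may differ from the thesis's exact symbols:

$$
P(Y) \;\propto\; \det(L_Y), \qquad L_{ii} = q_i, \qquad L_{ij} = s_{ij} \;\; (i \neq j),
$$

where $q_i$ is the quality (detection score) of candidate $i$ and $s_{ij}$ is the individualness-based similarity between candidates $i$ and $j$. Subsets containing high-quality, mutually dissimilar detections yield a large determinant and are therefore the most likely to be selected.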
For the second problem, we classify images based on observations of local patches. More specifically, we consider the problem of estimating the head pose and body orientation of a person from a low-resolution image. Under this setting, it is difficult to reliably extract facial features or detect body parts. We propose a convolutional random projection forest (CRPforest) algorithm for these tasks. A convolutional random projection network (CRPnet) is used at each node of the forest. It maps an input image to a high-dimensional feature space using a rich filter bank. The filter bank is designed to generate sparse responses so that they can be efficiently computed by compressive sensing: a sparse random projection matrix can capture the most essential information contained in the filter bank without applying all of its filters. The CRPnet is therefore fast, e.g., it requires 0.04ms to process a 50×50-pixel image due to the small number of convolutions (e.g., 0.01% of a layer of a neural network), at the expense of less than 2% in accuracy. The overall forest estimates head and body pose well on benchmark datasets, e.g., over 98% on the HIIT dataset, while requiring only 3.8ms without using a GPU. Extensive experiments on challenging datasets show that the proposed algorithm performs favorably against state-of-the-art methods on low-resolution images with noise, occlusion, and motion blur.

Then, we shift our attention to image synthesis based on the local-global relationship. Learning how to synthesize and place object instances into an image (semantic map) based on the scene context is a challenging and interesting problem in vision and learning. On one hand, solving this problem requires a joint decision of (a) generating an object mask from a certain class at a plausible scale, location, and shape, and (b) inserting the object instance mask into an existing scene so that the synthesized content is semantically realistic. On the other hand, such a model can synthesize realistic outputs to potentially facilitate numerous image editing and scene parsing tasks. We propose an end-to-end trainable neural network that can synthesize and insert object instances into an image via a semantic map. The proposed network contains two generative modules that determine where the inserted object should be (i.e., location and scale) and what the object shape (and pose) should look like. The two modules are connected by a spatial transformation network and jointly trained and optimized in a purely data-driven way. Specifically, we propose a novel network architecture with parallel supervised and unsupervised paths to guarantee diverse results. We show that the proposed network learns the context-aware distribution of the location and shape of object instances to be inserted, and that it can generate realistic and statistically meaningful object instances that simultaneously address the where and what sub-problems. A sketch of how the two modules can be coupled is given after this abstract.

As the final topic of the thesis, we introduce a new vision problem: generating an image based on a small number of key local patches without any geometric prior. In this work, key local patches are defined as informative regions of the target object or scene. This is a challenging problem since it requires generating realistic images and predicting locations of parts at the same time. We construct adversarial networks to tackle this problem. A generator network generates a fake image as well as a mask based on the encoder-decoder framework, while a discriminator network aims to detect fake images. The network is trained with three losses to consider spatial, appearance, and adversarial information. The spatial loss determines whether the locations of predicted parts are correct. Input patches are restored in the output image without much modification due to the appearance loss. The adversarial loss ensures output images are realistic. The proposed network is trained without supervisory signals since no labels of key parts are required. Experimental results on seven datasets demonstrate that the proposed algorithm performs favorably on challenging objects and scenes.
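For the instance synthesis and placement problem above, a minimal sketch of how the where-module and what-module might be connected through an affine spatial transformation; the function names, tensor shapes, and affine parameterization are illustrative assumptions, not the thesis's implementation:

```python
import torch
import torch.nn.functional as F

def place_instance(instance_mask, theta, canvas_size):
    """Warp a generated instance mask into the scene with an affine
    spatial transformer: the where-module predicts theta (location
    and scale), the what-module generates instance_mask (shape).

    instance_mask: (N, 1, h, w) mask from the what-module
    theta:         (N, 2, 3) affine parameters from the where-module
    canvas_size:   (H, W) of the target semantic map
    """
    N = instance_mask.size(0)
    H, W = canvas_size
    # Build a sampling grid over the output canvas, then warp the
    # mask differentiably so both modules can be trained jointly.
    grid = F.affine_grid(theta, size=(N, 1, H, W), align_corners=False)
    return F.grid_sample(instance_mask, grid, align_corners=False)
```

Because both `affine_grid` and `grid_sample` are differentiable, gradients flow from the composited semantic map back into both generative modules, which is what makes the joint, purely data-driven training described above possible.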
์ด ๋…ผ๋ฌธ์—์„œ๋Š”, ๊ทธ ์—ฐ๊ด€์„ฑ์„ ํ•™์Šตํ•˜๋Š” ๊ฒƒ์ด ์•ž์„œ ์–ธ๊ธ‰๋œ ๋‹ค์–‘ํ•œ ๋ฌธ์ œ๋“ค์„ ํ‘ธ๋Š”๋ฐ ์ค‘์š”ํ•œ ์—ด์‡ ๊ฐ€ ๋œ๋‹ค๋Š” ๊ฒƒ์„ ๋ณด์—ฌ์ฃผ๊ณ ์ž ํ•œ๋‹ค. ์ด์— ๋”ํ•ด์„œ, ๊ฐ๊ฐ์˜ ๋ฌธ์ œ์— ์•Œ๋งž๋Š” ๋”ฅ ๋„คํŠธ์›Œํฌ์˜ ๋””์ž์ธ ๋˜ํ•œ ํ† ์˜ํ•˜๊ณ ์ž ํ•œ๋‹ค. ์ฒซ ์ฃผ์ œ๋กœ, ๋ฌผ์ฒด ๊ฒ€์ถœ ๋ฐฉ์‹์— ๋Œ€ํ•ด ๋ถ„์„ํ•˜๊ณ ์ž ํ•œ๋‹ค. ์ด ๋ฌธ์ œ๋Š” ํƒ€๊ฒŸ ๋ฌผ์ฒด์™€ ๋น„์Šทํ•˜๊ฒŒ ์ƒ๊ธด ์˜์—ญ์„ ์ฐพ์•„์•ผ ํ•  ๋ฟ ์•„๋‹ˆ๋ผ, ์ฐพ์•„์ง„ ์˜์—ญ๋“ค ์‚ฌ์ด์— ์—ฐ๊ด€์„ฑ์„ ๋ถ„์„ํ•จ์œผ๋กœ์จ ๊ฐ ๋ฌผ์ฒด ๋งˆ๋‹ค ๋‹จ ํ•˜๋‚˜์˜ ๊ฒ€์ถœ ๊ฒฐ๊ณผ๋ฅผ ํ• ๋‹น์‹œ์ผœ์•ผ ํ•œ๋‹ค. ์ด๋ฅผ ์œ„ํ•ด, ์šฐ๋ฆฌ๋Š” objectness์— ๋Œ€ํ•œ ๋ณด์™„์œผ๋กœ์จ individualness๋ผ๋Š” ๊ฐœ๋…์„ ์ œ์•ˆ ํ•˜์˜€๋‹ค. ์ด๋Š” ์ž„์˜์˜ ๋ฐฉ์‹์œผ๋กœ ์–ป์–ด์ง„ ํ›„๋ณด ๋ฌผ์ฒด ์˜์—ญ ์ค‘ ํ•˜๋‚˜์”ฉ์„ ๋ฌผ์ฒด ๋งˆ๋‹ค ํ• ๋‹นํ•˜๋Š”๋ฐ ์“ฐ์ด๋Š”๋ฐ, ์ด๊ฒƒ์€ ๊ฒ€์ถœ ์Šค์ฝ”์–ด๋งŒ์„ ๋ฐ”ํƒ•์œผ๋กœ ํ›„์ฒ˜๋ฆฌ๋ฅผ ํ•˜๋Š” ๊ธฐ์กด์˜ non-maximum suppression ๋“ฑ์˜ ๋ฐฉ์‹์ด sub-optimal ๊ฒฐ๊ณผ๋ฅผ ์–ป์„ ์ˆ˜ ๋ฐ–์— ์—†๊ธฐ ๋•Œ๋ฌธ์— ์ด๋ฅผ ๊ฐœ์„ ํ•˜๊ณ ์ž ๋„์ž…ํ•˜์˜€๋‹ค. ์šฐ๋ฆฌ๋Š” ํ›„๋ณด ๋ฌผ์ฒด ์˜์—ญ์œผ๋กœ๋ถ€ํ„ฐ ์ตœ์ ์˜ ์˜์—ญ๋“ค์„ ์„ ํƒํ•˜๊ธฐ ์œ„ํ•ด์„œ, determinantal point process๋ผ๋Š” random process์˜ ์ผ์ข…์„ ์‚ฌ์šฉํ•˜์˜€๋‹ค. ์ด๊ฒƒ์€ ๋จผ์ € ๊ฐ๊ฐ์˜ ๊ฒ€์ถœ ๊ฒฐ๊ณผ๋ฅผ ๊ทธ๊ฒƒ์˜ quality(๊ฒ€์ถœ ์Šค์ฝ”์–ด)์™€ ๋‹ค๋ฅธ ๊ฒ€์ถœ ๊ฒฐ๊ณผ๋“ค ์‚ฌ์ด์— individualness๋ฅผ ๋ฐ”ํƒ•์œผ ๋กœ ๊ณ„์‚ฐ๋œ similarity(์ƒ๊ด€ ๊ด€๊ณ„)๋ฅผ ์ด์šฉํ•ด ๋ชจ๋ธ๋ง ํ•œ๋‹ค. ๊ทธ ํ›„, ๊ฐ๊ฐ์˜ ๊ฒ€์ถœ ๊ฒฐ๊ณผ๊ฐ€ ์„ ํƒ๋  ํ™•๋ฅ ์„ quality์™€ similarity์— ๊ธฐ๋ฐ˜ํ•œ ์ปค๋„์˜ determinant๋กœ ํ‘œํ˜„ํ•œ๋‹ค. ๊ทธ ์ปค๋„์— diagonal ๋ถ€๋ถ„์—๋Š” quality๊ฐ€ ๋“ค์–ด๊ฐ€๊ณ , off-diagonal์—๋Š” similarity๊ฐ€ ๋Œ€์ž… ๋œ๋‹ค. ๋”ฐ๋ผ์„œ, ์–ด๋–ค ๊ฒ€์ถœ ํ›„๋ณด๊ฐ€ ์ตœ์ข… ๊ฒ€์ถœ ๊ฒฐ๊ณผ๋กœ ์„ ํƒ๋  ํ™•๋ฅ ์ด ๋†’์•„์ง€๊ธฐ ์œ„ํ•ด์„œ๋Š”, ๋†’์€ quality๋ฅผ ๊ฐ€์ง๊ณผ ๋™์‹œ์— ๋‹ค๋ฅธ ๊ฒ€์ถœ ๊ฒฐ๊ณผ๋“ค๊ณผ ๋‚ฎ์€ similarity๋ฅผ ๊ฐ€์ ธ์•ผ ํ•œ๋‹ค. ์ด ๋…ผ๋ฌธ์—์„œ๋Š” ๋ณดํ–‰์ž ๊ฒ€์ถœ์— ์ง‘์ค‘ํ•˜์˜€๋Š”๋ฐ, ์ด๋Š” ๋ณดํ–‰์ž ๊ฒ€์ถœ์ด ์ค‘์š”ํ•œ ๋ฌธ์ œ์ด๋ฉด์„œ๋„, ๋‹ค๋ฅธ ๋ฌผ์ฒด๋“ค์— ๋น„ํ•ด ์ž์ฃผ ๊ฐ€๋ ค์ง€๊ณ  ๋‹ค์–‘ํ•œ ์›€์ง์ž„์„ ๋ณด์ด๋Š” ๊ฒ€์ถœ์ด ์–ด๋ ค์šด ๋ฌผ์ฒด์ด๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค. ์‹คํ—˜ ๊ฒฐ๊ณผ๋Š” ์ œ์•ˆํ•œ ๋ฐฉ๋ฒ•์ด non-maximum suppression ํ˜น์€ quadratic unconstrained binary optimization ๋ฐฉ์‹๋“ค ๋ณด๋‹ค ์šฐ์ˆ˜ํ•จ์„ ๋ณด์—ฌ์ฃผ์—ˆ๋‹ค. ๋‹ค์Œ ๋ฌธ์ œ๋กœ๋Š”, ๋ถ€๋ถ„ ์ •๋ณด๋ฅผ ์ด์šฉํ•ด์„œ ์ „์ฒด ์ด๋ฏธ์ง€๋ฅผ classifyํ•˜๋Š” ๊ฒƒ์„ ๊ณ ๋ คํ•œ๋‹ค. ๋‹ค์–‘ํ•œ classification ๋ฌธ์ œ ์ค‘์—, ์ด ๋…ผ๋ฌธ์—์„œ๋Š” ์ €ํ•ด์ƒ๋„ ์ด๋ฏธ์ง€๋กœ๋ถ€ํ„ฐ ์‚ฌ๋žŒ์˜ ๋จธ๋ฆฌ์™€ ๋ชธ์ด ํ–ฅํ•˜๋Š” ๋ฐฉํ–ฅ์„ ์•Œ์•„๋‚ด๋Š” ๋ฌธ์ œ์— ์ง‘์ค‘ํ•˜์˜€๋‹ค. ์ด ๊ฒฝ์šฐ์—๋Š”, ๋ˆˆ, ์ฝ”, ์ž… ๋“ฑ์„ ์ฐพ๊ฑฐ๋‚˜, ๋ชธ์˜ ํŒŒํŠธ๋ฅผ ์ •ํ™•ํžˆ ์•Œ์•„๋‚ด๋Š” ๊ฒƒ์ด ์–ด๋ ต๋‹ค. ์ด๋ฅผ ์œ„ํ•ด, ์šฐ๋ฆฌ๋Š” convolutional random projection forest (CRPforest)๋ผ๋Š” ๋ฐฉ์‹์„ ์ œ์•ˆํ•˜์˜€๋‹ค. ์ด forest์— ๊ฐ๊ฐ์˜ node ์•ˆ์—๋Š” convolutional random projection network (CRPnet)์ด ๋“ค์–ด์žˆ๋Š”๋ฐ, ์ด๋Š” ๋‹ค์–‘ํ•œ ํ•„ํ„ฐ๋ฅผ ์ด์šฉํ•ด์„œ ์ธํ’‹ ์ด๋ฏธ์ง€๋ฅผ ๋†’์€ ์ฐจ์›์œผ๋กœ mapping ํ•œ๋‹ค. ์ด๋ฅผ ํšจ์œจ์ ์œผ๋กœ ๋‹ค๋ฃจ๊ธฐ ์œ„ํ•ด sparseํ•œ ๊ฒฐ๊ณผ๋ฅผ ์–ป์„ ์ˆ˜ ์žˆ๋Š” ํ•„ํ„ฐ๋“ค์„ ์‚ฌ์šฉํ•จ์œผ๋กœ์จ, ์••์ถ• ์„ผ์‹ฑ ๊ฐœ๋…์„ ๋„์ž… ํ•  ์ˆ˜ ์žˆ๋„๋ก ํ•˜์˜€๋‹ค. ์ฆ‰, ์‹ค์ œ๋กœ๋Š” ์ ์€ ์ˆ˜์˜ ํ•„ํ„ฐ๋งŒ์„ ์‚ฌ์šฉํ•ด์„œ ์ „์ฒด ์ด๋ฏธ์ง€์˜ ์ค‘์š”ํ•œ ์ •๋ณด๋ฅผ ๋ชจ๋‘ ๋‹ด๊ณ ์ž ํ•˜๋Š” ๊ฒƒ์ด๋‹ค. ๋”ฐ๋ผ์„œ CRPnet์€ 50ร—50 ํ”ฝ์…€ ์ด๋ฏธ์ง€์—์„œ 0.04ms ๋งŒ์— ๋™์ž‘ ํ•  ์ˆ˜ ์žˆ์„ ์ •๋„๋กœ ๋งค์šฐ ๋น ๋ฅด๋ฉฐ, ๋™์‹œ์— ์„ฑ๋Šฅ ํ•˜๋ฝ์€ 2% ์ •๋„๋กœ ๋ฏธ๋ฏธํ•œ ๊ฒฐ๊ณผ๋ฅผ ๋ณด์—ฌ์ฃผ์—ˆ๋‹ค. 
Contents:
1 Introduction
    1.1 Organization of the Dissertation
2 Related Work
    2.1 Detection methods
    2.2 Orientation estimation methods
    2.3 Instance synthesis methods
    2.4 Image generation methods
3 Pedestrian detection
    3.1 Introduction
    3.2 Proposed Algorithm
        3.2.1 Determinantal Point Process Formulation
        3.2.2 Quality Term
        3.2.3 Individualness and Diversity Feature
        3.2.4 Mode Finding
        3.2.5 Relationship to Quadratic Unconstrained Binary Optimization
    3.3 Experiments
        3.3.1 Experimental Settings
        3.3.2 Evaluation Results
        3.3.3 DET curves
        3.3.4 Sensitivity analysis
        3.3.5 Effectiveness of the quality and similarity term design
    3.4 Summary
4 Head and body orientation estimation
    4.1 Introduction
    4.2 Algorithmic Overview
    4.3 Rich Filter Bank
        4.3.1 Compressed Filter Bank
        4.3.2 Box Filter Bank
    4.4 Convolutional Random Projection Net
        4.4.1 Input Layer
        4.4.2 Convolutional and ReLU Layers
        4.4.3 Random Projection Layer
        4.4.4 Fully-Connected and Output Layers
    4.5 Convolutional Random Projection Forest
    4.6 Experimental Results
        4.6.1 Evaluation Datasets
        4.6.2 CRPnet Characteristics
        4.6.3 Head and Body Orientation Estimation
        4.6.4 Analysis of the Proposed Algorithm
        4.6.5 Classification Examples
        4.6.6 Regression Examples
        4.6.7 Experiments on the Original Datasets
        4.6.8 Dataset Corrections
    4.7 Summary
5 Instance synthesis and placement
    5.1 Introduction
    5.2 Approach
        5.2.1 The where module: learning a spatial distribution of object instances
        5.2.2 The what module: learning a shape distribution of object instances
        5.2.3 The complete pipeline
    5.3 Experimental Results
    5.4 Summary
6 Image generation
    6.1 Introduction
    6.2 Proposed Algorithm
        6.2.1 Key Part Detection
        6.2.2 Part Encoding Network
        6.2.3 Mask Prediction Network
        6.2.4 Image Generation Network
        6.2.5 Real-Fake Discriminator Network
    6.3 Experiments
        6.3.1 Datasets
        6.3.2 Image Generation Results
        6.3.3 Experimental Details
        6.3.4 Image Generation from Local Patches
        6.3.5 Part Combination
        6.3.6 Unsupervised Feature Learning
        6.3.7 An Alternative Objective Function
        6.3.8 An Alternative Network Structure
        6.3.9 Different Number of Input Patches
        6.3.10 Smaller Size of Input Patches
        6.3.11 Degraded Input Patches
        6.3.12 User Study
        6.3.13 Failure cases
    6.4 Summary
7 Conclusion and Future Work

    A Survey on Deep Learning in Medical Image Analysis

    Full text link
    Deep learning algorithms, in particular convolutional networks, have rapidly become a methodology of choice for analyzing medical images. This paper reviews the major deep learning concepts pertinent to medical image analysis and summarizes over 300 contributions to the field, most of which appeared in the last year. We survey the use of deep learning for image classification, object detection, segmentation, registration, and other tasks and provide concise overviews of studies per application area. Open challenges and directions for future research are discussed. Comment: Revised survey includes an expanded discussion section and a reworked introductory section on common deep architectures. Added missed papers from before Feb 1st, 2017.