4,844 research outputs found

    PDANet: Pyramid Density-aware Attention Net for Accurate Crowd Counting

    Full text link
    Crowd counting, i.e., estimating the number of people in a crowded area, has attracted much interest in the research community. Although many attempts have been reported, crowd counting remains an open real-world problem due to the vast scale variations in crowd density within the interested area, and severe occlusion among the crowd. In this paper, we propose a novel Pyramid Density-Aware Attention-based network, abbreviated as PDANet, that leverages the attention, pyramid scale feature and two branch decoder modules for density-aware crowd counting. The PDANet utilizes these modules to extract different scale features, focus on the relevant information, and suppress the misleading ones. We also address the variation of crowdedness levels among different images with an exclusive Density-Aware Decoder (DAD). For this purpose, a classifier evaluates the density level of the input features and then passes them to the corresponding high and low crowded DAD modules. Finally, we generate an overall density map by considering the summation of low and high crowded density maps as spatial attention. Meanwhile, we employ two losses to create a precise density map for the input scene. Extensive evaluations conducted on the challenging benchmark datasets well demonstrate the superior performance of the proposed PDANet in terms of the accuracy of counting and generated density maps over the well-known state of the arts

    Learning Segmentation Masks with the Independence Prior

    Full text link
    An instance with a bad mask might make a composite image that uses it look fake. This encourages us to learn segmentation by generating realistic composite images. To achieve this, we propose a novel framework that exploits a new proposed prior called the independence prior based on Generative Adversarial Networks (GANs). The generator produces an image with multiple category-specific instance providers, a layout module and a composition module. Firstly, each provider independently outputs a category-specific instance image with a soft mask. Then the provided instances' poses are corrected by the layout module. Lastly, the composition module combines these instances into a final image. Training with adversarial loss and penalty for mask area, each provider learns a mask that is as small as possible but enough to cover a complete category-specific instance. Weakly supervised semantic segmentation methods widely use grouping cues modeling the association between image parts, which are either artificially designed or learned with costly segmentation labels or only modeled on local pairs. Unlike them, our method automatically models the dependence between any parts and learns instance segmentation. We apply our framework in two cases: (1) Foreground segmentation on category-specific images with box-level annotation. (2) Unsupervised learning of instance appearances and masks with only one image of homogeneous object cluster (HOC). We get appealing results in both tasks, which shows the independence prior is useful for instance segmentation and it is possible to unsupervisedly learn instance masks with only one image.Comment: 7+5 pages, 13 figures, Accepted to AAAI 201

    ๊ตฐ์ค‘ ๋ฐ€๋„ ์˜ˆ์ธก์„ ์œ„ํ•œ ๋„คํŠธ์›Œํฌ ๊ตฌ์กฐ์™€ ํ›ˆ๋ จ๋ฐฉ๋ฒ•์˜ ํ˜ผ์žก๋„ ๋ฐ ํฌ๊ธฐ ์ธ์‹ ์„ค๊ณ„

    Get PDF
    ํ•™์œ„๋…ผ๋ฌธ(๋ฐ•์‚ฌ) -- ์„œ์šธ๋Œ€ํ•™๊ต๋Œ€ํ•™์› : ๊ณต๊ณผ๋Œ€ํ•™ ์ „๊ธฐยท์ •๋ณด๊ณตํ•™๋ถ€, 2022.2. ์ตœ์ง„์˜.This dissertation presents novel deep learning-based crowd density estimation methods considering the crowd congestion and scale of people. Crowd density estimation is one of the important tasks for the intelligent surveillance system. Using the crowd density estimation, the region of interest for public security and safety can be easily indicated. It can also help advanced computer vision algorithms that are computationally expensive, such as pedestrian detection and tracking. After the introduction of deep learning to the crowd density estimation, most researches follow the conventional scheme that uses a convolutional neural network to learn the network to estimate crowd density map with training images. The deep learning-based crowd density estimation researches can consist of two perspectives; network structure perspective and training strategy perspective. In general, researches of network structure perspective propose a novel network structure to extract features to represent crowd well. On the other hand, those of the training strategy perspective propose a novel training methodology or a loss function to improve the counting performance. In this dissertation, I propose several works in both perspectives in deep learning-based crowd density estimation. In particular, I design the network models to be had rich crowd representation characteristics according to the crowd congestion and the scale of people. I propose two novel network structures: selective ensemble network and cascade residual dilated network. Also, I propose one novel loss function for the crowd density estimation: congestion-aware Bayesian loss. First, I propose a selective ensemble deep network architecture for crowd density estimation. In contrast to existing deep network-based methods, the proposed method incorporates two sub-networks for local density estimation: one to learn sparse density regions and one to learn dense density regions. Locally estimated density maps from the two sub-networks are selectively combined in an ensemble fashion using a gating network to estimate an initial crowd density map. The initial density map is refined as a high-resolution map, using another sub-network that draws on contextual information in the image. In training, a novel adaptive loss scheme is applied to resolve ambiguity in the crowded region. The proposed scheme improves both density map accuracy and counting accuracy by adjusting the weighting value between density loss and counting loss according to the degree of crowdness and training epochs. Second, I propose a novel crowd density estimation architecture, which is composed of multiple dilated convolutional neural network blocks with different scales. The proposed architecture is motivated by an empirical analysis that small-scale dilated convolution well estimates the center area density of each person, whereas large-scale dilated convolution well estimates the periphery area density of a person. To estimate the crowd density map gradually from the center to the periphery of each person in a crowd, the multiple dilated CNN blocks are trained in cascading from the small dilated CNN block to the large one. Third, I propose a novel congestion-aware Bayesian loss method that considers the person-scale and crowd-sparsity. Deep learning-based crowd density estimation can greatly improve the accuracy of crowd counting. Though a Bayesian loss method resolves the two problems of the need of a hand-crafted ground truth (GT) density and noisy annotations, counting accurately in high-congested scenes remains a challenging issue. In a crowd scene, people's appearances change according to the scale of each individual (i.e., the person-scale). Also, the lower the sparsity of a local region (i.e., the crowd-sparsity), the more difficult it is to estimate the crowd density. I estimate the person-scale based on scene geometry, and I then estimate the crowd-sparsity using the estimated person-scale. The estimated person-scale and crowd-sparsity are utilized in the novel congestion-aware Bayesian loss method to improve the supervising representation of the point annotations. The effectiveness of the proposed density estimators is validated through comparative experiments with state-of-the-art methods on widely-used crowd counting benchmark datasets. The proposed methods are achieved superior performance to the state-of-the-art density estimators on diverse surveillance environments. In addition, for all proposed crowd density estimation methods, the efficiency of each component is verified through several ablation experiments.๋ณธ ํ•™์œ„๋…ผ๋ฌธ์—์„œ๋Š” ๊ตฐ์ค‘์˜ ํ˜ผ์žก๋„์™€ ์‚ฌ๋žŒ์˜ ํฌ๊ธฐ๋ฅผ ๊ณ ๋ คํ•œ ๋”ฅ๋Ÿฌ๋‹ ๊ธฐ๋ฐ˜์˜ ์ƒˆ๋กœ์šด ๊ตฐ์ค‘ ๋ฐ€๋„ ์ถ”์ • ๋ฐฉ๋ฒ•์„ ์ œ์‹œํ•ฉ๋‹ˆ๋‹ค. ๊ตฐ์ค‘ ๋ฐ€๋„ ์ถ”์ •์€ ์ง€๋Šฅํ˜• ๊ฐ์‹œ ์‹œ์Šคํ…œ์˜ ์ค‘์š”ํ•œ ๊ณผ์ œ๋“ค ์ค‘ ํ•˜๋‚˜์ž…๋‹ˆ๋‹ค. ๊ตฐ์ค‘ ๋ฐ€๋„ ์ถ”์ •์„ ์‚ฌ์šฉํ•˜์—ฌ ๊ณต๊ณต ๋ณด์•ˆ ๋ฐ ์•ˆ์ „์— ๋Œ€ํ•œ ๊ด€์‹ฌ ์˜์—ญ์„ ์‰ฝ๊ฒŒ ํ‘œ์‹œํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๋˜ํ•œ ์ด๋ฅผ ์ด์šฉํ•˜๋ฉด ๋ณดํ–‰์ž ๊ฐ์ง€, ์ถ”์  ๋“ฑ ์—ฐ์‚ฐ ๋ถ€๋‹ด์ด ๋†’์€ ๊ณ ๊ธ‰ ์ปดํ“จํ„ฐ ๋น„์ „ ์•Œ๊ณ ๋ฆฌ์ฆ˜์ด ์ง€๋Šฅํ˜• ๊ฐ์‹œ ์‹œ์Šคํ…œ์— ํšจ๊ณผ์ ์œผ๋กœ ์ ์šฉํ•˜๋Š” ๊ฒƒ์„ ๋„์šธ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๊ตฐ์ค‘ ๋ฐ€๋„ ์ถ”์ •์— ๋”ฅ ๋Ÿฌ๋‹์ด ๋„์ž…๋œ ํ›„ ๋Œ€๋ถ€๋ถ„์˜ ์—ฐ๊ตฌ๋Š” ํ›ˆ๋ จ ์ด๋ฏธ์ง€๋กœ ๊ตฐ์ค‘ ๋ฐ€๋„ ๋งต์„ ์ถ”์ •ํ•˜๋Š” ๋„คํŠธ์›Œํฌ๋ฅผ ํ•™์Šตํ•˜๊ธฐ ์œ„ํ•ด ์ปจ๋ณผ๋ฃจ์…˜ ์‹ ๊ฒฝ๋ง์„ ์‚ฌ์šฉํ•˜๋Š” ๊ด€์Šต์ ์ธ ๋ฐฉ์‹์„ ๋”ฐ๋ฆ…๋‹ˆ๋‹ค. ๋”ฅ ๋Ÿฌ๋‹ ๊ธฐ๋ฐ˜ ๊ตฐ์ค‘ ๋ฐ€๋„ ์ถ”์ • ์—ฐ๊ตฌ๋Š” ๋„คํŠธ์›Œํฌ ๊ตฌ์กฐ ๊ด€์ ๊ณผ ํ›ˆ๋ จ ์ „๋žต ๊ด€์ ์˜ ๋‘ ๊ฐ€์ง€ ๊ด€์ ์œผ๋กœ ๋‚˜๋‰  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์ผ๋ฐ˜์ ์œผ๋กœ ๋„คํŠธ์›Œํฌ ๊ตฌ์กฐ ๊ด€์ ์˜ ์—ฐ๊ตฌ์—์„œ๋Š” ๊ตฐ์ค‘์„ ์ž˜ ํ‘œํ˜„ํ•˜๊ธฐ ์œ„ํ•œ ํŠน์ง•์„ ์ถ”์ถœํ•˜๊ธฐ ์œ„ํ•œ ์ƒˆ๋กœ์šด ๋„คํŠธ์›Œํฌ ๊ตฌ์กฐ๋ฅผ ์ œ์•ˆํ•ฉ๋‹ˆ๋‹ค. ๋ฐ˜๋ฉด ํ›ˆ๋ จ ์ „๋žต ๊ด€์ ์—์„œ๋Š” ๊ณ„์ˆ˜ ์„ฑ๋Šฅ์„ ํ–ฅ์ƒ์‹œํ‚ค๊ธฐ ์œ„ํ•ด ์ƒˆ๋กœ์šด ํ›ˆ๋ จ ๋ฐฉ๋ฒ•๋ก ์ด๋‚˜ ์†์‹ค ํ•จ์ˆ˜๋ฅผ ์ œ์•ˆํ•ฉ๋‹ˆ๋‹ค. ๋ณธ ํ•™์œ„๋…ผ๋ฌธ์—์„œ๋Š” ๋”ฅ๋Ÿฌ๋‹ ๊ธฐ๋ฐ˜ ๊ตฐ์ค‘๋ฐ€๋„ ์ถ”์ •์—์„œ ๋‘ ๊ฐ€์ง€ ๊ด€์ ์—์„œ ์—ฌ๋Ÿฌ ์—ฐ๊ตฌ๋ฅผ ์ œ์•ˆํ•ฉ๋‹ˆ๋‹ค. ํŠนํžˆ, ๊ฐ ์‚ฌ๋žŒ์˜ ๊ตฐ์ค‘ ํ˜ผ์žก๋„์™€ ๊ทœ๋ชจ์— ๋”ฐ๋ผ ํ’๋ถ€ํ•œ ๊ตฐ์ค‘ ํ‘œํ˜„ ํŠน์„ฑ์„ ๊ฐ–๋„๋ก ์ œ์•ˆํ•˜๋Š” ๋ชจ๋ธ์„ ์„ค๊ณ„ํ•ฉ๋‹ˆ๋‹ค. ์„ ํƒ์  ์•™์ƒ๋ธ” ๋„คํŠธ์›Œํฌ์™€ ๊ณ„๋‹จ์‹ ์ž”์—ฌ ํ™•์žฅ ๋„คํŠธ์›Œํฌ์˜ ๋‘ ๊ฐ€์ง€ ์ƒˆ๋กœ์šด ๋„คํŠธ์›Œํฌ ๊ตฌ์กฐ๋ฅผ ์ œ์•ˆํ•ฉ๋‹ˆ๋‹ค. ๋˜ํ•œ ๊ตฐ์ค‘ ๋ฐ€๋„ ์ถ”์ •์„ ์œ„ํ•œ ์ƒˆ๋กœ์šด ์†์‹ค ํ•จ์ˆ˜์ธ ํ˜ผ์žก ์ธ์‹ ๋ฒ ์ด์ง€์•ˆ ์†์‹ค์„ ์ œ์•ˆํ•ฉ๋‹ˆ๋‹ค. ๋จผ์ €, ์ •ํ™•ํ•œ ๊ตฐ์ค‘๋ฐ€๋„ ์ถ”์ •๊ณผ ์ธ์› ๊ณ„์ˆ˜๋ฅผ ์œ„ํ•œ ์„ ํƒ์  ์•™์ƒ๋ธ” ๋”ฅ ๋„คํŠธ์›Œํฌ ๊ตฌ์กฐ๋ฅผ ์ œ์•ˆํ•ฉ๋‹ˆ๋‹ค. ๊ธฐ์กด ๋”ฅ ๋„คํŠธ์›Œํฌ ๊ธฐ๋ฐ˜ ๋ฐฉ๋ฒ•๊ณผ ๋‹ฌ๋ฆฌ ์ œ์•ˆ๋œ ๋ฐฉ๋ฒ•์€ ์ง€์—ญ ๋ฐ€๋„ ์ถ”์ •์„ ์œ„ํ•ด ๋‘ ๊ฐœ์˜ ํ•˜์œ„ ๋„คํŠธ์›Œํฌ๋ฅผ ํ†ตํ•ฉํ•ฉ๋‹ˆ๋‹ค. ํ•˜๋‚˜๋Š” ํฌ์†Œ ๋ฐ€๋„ ์˜์—ญ ํ•™์Šต์šฉ์ด๊ณ  ๋‹ค๋ฅธ ํ•˜๋‚˜๋Š” ๋ฐ€์ง‘ ๋ฐ€๋„ ์˜์—ญ ํ•™์Šต์šฉ์ž…๋‹ˆ๋‹ค. ๋‘ ๊ฐœ์˜ ํ•˜์œ„ ๋„คํŠธ์›Œํฌ์—์„œ ์ง€์—ญ์ ์œผ๋กœ ์ถ”์ •๋œ ๋ฐ€๋„๋งต์€ ์ดˆ๊ธฐ ๊ตฐ์ค‘๋ฐ€๋„๋กœ ์ถ”์ •๋˜๋ฉฐ ๊ฒŒ์ดํŒ… ๋„คํŠธ์›Œํฌ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ์•™์ƒ๋ธ” ๋ฐฉ์‹์œผ๋กœ ์„ ํƒ์ ์œผ๋กœ ๊ฒฐํ•ฉ๋ฉ๋‹ˆ๋‹ค. ์ดˆ๊ธฐ ๋ฐ€๋„๋งต์€ ์ด๋ฏธ์ง€์˜ ์ปจํ…์ŠคํŠธ ์ •๋ณด๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ํ•˜๋Š” ๋˜ ๋‹ค๋ฅธ ํ•˜์œ„ ๋„คํŠธ์›Œํฌ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๊ณ ํ•ด์ƒ๋„ ๋งต์œผ๋กœ ๊ฐœ์„ ๋ฉ๋‹ˆ๋‹ค. ๋„คํŠธ์›Œํฌ ํ›ˆ๋ จ์—์„œ ์ƒˆ๋กœ์šด ์ ์‘ํ˜• ์†์‹ค ์ฒด๊ณ„๋ฅผ ์ ์šฉํ•˜์—ฌ ํ˜ผ์žกํ•œ ์ง€์—ญ์˜ ๋ชจํ˜ธ์„ฑ์„ ํ•ด๊ฒฐํ•ฉ๋‹ˆ๋‹ค. ์ œ์•ˆ๋œ ๊ธฐ๋ฒ•์€ ๋ฐ€์ง‘๋„ ๋ฐ ํ›ˆ๋ จ ์ •๋„์— ๋”ฐ๋ผ ๋ฐ€๋„ ์†์‹ค๊ณผ ๊ณ„์ˆ˜ ์†์‹ค ์‚ฌ์ด์˜ ๊ฐ€์ค‘์น˜๋ฅผ ์กฐ์ •ํ•˜์—ฌ ๋ฐ€๋„๋งต ์ •ํ™•๋„์™€ ๊ณ„์ˆ˜ ์ •ํ™•๋„๋ฅผ ๋ชจ๋‘ ํ–ฅ์ƒ์‹œํ‚ต๋‹ˆ๋‹ค. ๋‘ ๋ฒˆ์งธ๋กœ, ์Šค์ผ€์ผ์ด ๋‹ค๋ฅธ ๋‹ค์ค‘ ํ™•์žฅ ์ปจ๋ณผ๋ฃจ์…˜ ๋ธ”๋ก์œผ๋กœ ๊ตฌ์„ฑ๋œ ์ƒˆ๋กœ์šด ๊ตฐ์ค‘๋ฐ€๋„ ์ถ”์ • ๋„คํŠธ์›Œํฌ ๊ตฌ์กฐ๋ฅผ ์ œ์•ˆํ•ฉ๋‹ˆ๋‹ค. ์ œ์•ˆ๋œ ๋„คํŠธ์›Œํฌ ๊ตฌ์กฐ๋Š” ์†Œ๊ทœ๋ชจ ํ™•์žฅ ์ปจ๋ณผ๋ฃจ์…˜์€ ๊ฐ ์‚ฌ๋žŒ์˜ ์ค‘์‹ฌ ์˜์—ญ ๋ฐ€๋„๋ฅผ ์ •ํ™•ํžˆ ์ถ”์ •ํ•˜๋Š” ๋ฐ˜๋ฉด ๋Œ€๊ทœ๋ชจ ํ™•์žฅ ์ปจ๋ณผ๋ฃจ์…˜์€ ์‚ฌ๋žŒ์˜ ์ฃผ๋ณ€ ์˜์—ญ ๋ฐ€๋„๋ฅผ ์ž˜ ์ถ”์ •ํ•œ๋‹ค๋Š” ๊ฒฝํ—˜์  ๋ถ„์„์—์„œ ๋น„๋กฏ๋˜์—ˆ์Šต๋‹ˆ๋‹ค. ๊ตฐ์ค‘์— ์žˆ๋Š” ๊ฐ ์‚ฌ๋žŒ์˜ ์ค‘์‹ฌ์—์„œ ์ฃผ๋ณ€์œผ๋กœ ์ ์ฐจ์ ์œผ๋กœ ๊ตฐ์ค‘๋ฐ€๋„๋งต์„ ์ถ”์ •ํ•˜๊ธฐ ์œ„ํ•ด ์—ฌ๋Ÿฌ ํ™•์žฅ๋œ ์ปจ๋ณผ๋ฃจ์…˜ ๋ธ”๋ก์ด ์ž‘์€ ํ™•์žฅ ์ปจ๋ณผ๋ฃจ์…˜ ๋ธ”๋ก์—์„œ ํฐ ๋ธ”๋ก์œผ๋กœ ๊ณ„๋‹จ์‹์œผ๋กœ ํ›ˆ๋ จ๋ฉ๋‹ˆ๋‹ค. ๋งˆ์ง€๋ง‰์œผ๋กœ, ์‚ฌ๋žŒ ๊ทœ๋ชจ์™€ ๊ตฐ์ค‘ ํฌ์†Œ์„ฑ์„ ๊ณ ๋ คํ•œ ์ƒˆ๋กœ์šด ํ˜ผ์žก ์ธ์‹ ๋ฒ ์ด์ง€์•ˆ ์†์‹ค ๋ฐฉ๋ฒ•์„ ์ œ์•ˆํ•ฉ๋‹ˆ๋‹ค. ๋”ฅ ๋Ÿฌ๋‹ ๊ธฐ๋ฐ˜ ๊ตฐ์ค‘ ๋ฐ€๋„ ์ถ”์ •์€ ๊ตฐ์ค‘ ๊ณ„์‚ฐ์˜ ์ •ํ™•๋„๋ฅผ ํฌ๊ฒŒ ํ–ฅ์ƒ์‹œํ‚ฌ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๋ฒ ์ด์ง€์•ˆ ์†์‹ค ๋ฐฉ๋ฒ•์€ ์†์œผ๋กœ ๋งŒ๋“  ์ง€์ƒ ์ง„์‹ค ๋ฐ€๋„์™€ ์žก์Œ์ด ์žˆ๋Š” ์ฃผ์„์˜ ํ•„์š”์„ฑ์ด๋ผ๋Š” ๋‘ ๊ฐ€์ง€ ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜์ง€๋งŒ ํ˜ผ์žกํ•œ ์žฅ๋ฉด์—์„œ ์ •ํ™•ํ•˜๊ฒŒ ๊ณ„์‚ฐํ•˜๋Š” ๊ฒƒ์€ ์—ฌ์ „ํžˆ ์–ด๋ ค์šด ๋ฌธ์ œ์ž…๋‹ˆ๋‹ค. ๊ตฐ์ค‘ ์žฅ๋ฉด์—์„œ ์‚ฌ๋žŒ์˜ ์™ธ๋ชจ๋Š” ๊ฐ ์‚ฌ๋žŒ์˜ ํฌ๊ธฐ('์‚ฌ๋žŒ ํฌ๊ธฐ')์— ๋”ฐ๋ผ ๋ฐ”๋€๋‹ˆ๋‹ค. ๋˜ํ•œ ๊ตญ๋ถ€ ์˜์—ญ์˜ ํฌ์†Œ์„ฑ('๊ตฐ์ค‘ ํฌ์†Œ์„ฑ')์ด ๋‚ฎ์„์ˆ˜๋ก ๊ตฐ์ค‘ ๋ฐ€๋„๋ฅผ ์ถ”์ •ํ•˜๊ธฐ๊ฐ€ ๋” ์–ด๋ ต์Šต๋‹ˆ๋‹ค. ์žฅ๋ฉด ๊ธฐํ•˜์ •๋ณด๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ '์‚ฌ๋žŒ ํฌ๊ธฐ'๋ฅผ ์ถ”์ •ํ•œ ๋‹ค์Œ ์ถ”์ •๋œ '์‚ฌ๋žŒ ํฌ๊ธฐ'๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ '๊ตฐ์ค‘ ํฌ์†Œ์„ฑ'์„ ์ถ”์ •ํ•ฉ๋‹ˆ๋‹ค. ์ถ”์ •๋œ '์‚ฌ๋žŒ ํฌ๊ธฐ' ๋ฐ '๊ตฐ์ค‘ ํฌ์†Œ์„ฑ'์€ ์ƒˆ๋กœ์šด ํ˜ผ์žก ์ธ์‹ ๋ฒ ์ด์ง€์•ˆ ์†์‹ค ๋ฐฉ๋ฒ•์—์„œ ์‚ฌ์šฉ๋˜์–ด ์  ์ฃผ์„์˜ ๊ต์‚ฌ ํ‘œํ˜„์„ ๊ฐœ์„ ํ•ฉ๋‹ˆ๋‹ค. ์ œ์•ˆ๋œ ๋ฐ€๋„ ์ถ”์ •๊ธฐ์˜ ํšจ์œจ์„ฑ์€ ๋„๋ฆฌ ์‚ฌ์šฉ๋˜๋Š” ๊ตฐ์ค‘ ๊ณ„์‚ฐ ๋ฒค์น˜๋งˆํฌ ๋ฐ์ดํ„ฐ ์„ธํŠธ์— ๋Œ€ํ•œ ์ตœ์ฒจ๋‹จ ๋ฐฉ๋ฒ•๊ณผ์˜ ๋น„๊ต ์‹คํ—˜์„ ํ†ตํ•ด ๊ฒ€์ฆ๋˜์—ˆ์Šต๋‹ˆ๋‹ค. ์ œ์•ˆ๋œ ๋ฐฉ๋ฒ•์€ ๋‹ค์–‘ํ•œ ๊ฐ์‹œ ํ™˜๊ฒฝ์—์„œ ์ตœ์ฒจ๋‹จ ๋ฐ€๋„ ์ถ”์ •๊ธฐ๋ณด๋‹ค ์šฐ์ˆ˜ํ•œ ์„ฑ๋Šฅ์„ ๋‹ฌ์„ฑํ–ˆ์Šต๋‹ˆ๋‹ค. ๋˜ํ•œ ์ œ์•ˆ๋œ ๋ชจ๋“  ๊ตฐ์ค‘ ๋ฐ€๋„ ์ถ”์ • ๋ฐฉ๋ฒ•์— ๋Œ€ํ•ด ์—ฌ๋Ÿฌ ์ž๊ฐ€๋น„๊ต ์‹คํ—˜์„ ํ†ตํ•ด ๊ฐ ๊ตฌ์„ฑ ์š”์†Œ์˜ ํšจ์œจ์„ฑ์„ ๊ฒ€์ฆํ–ˆ์Šต๋‹ˆ๋‹ค.Abstract i Contents iv List of Tables vii List of Figures viii 1 Introduction 1 2 Related Works 4 2.1 Detection-based Approaches 4 2.2 Regression-based Approaches 5 2.3 Deep learning-based Approaches 5 2.3.1 Network Structure Perspective 6 2.3.2 Training Strategy Perspective 7 3 Selective Ensemble Network for Accurate Crowd Density Estimation 9 3.1 Overview 9 3.2 Combining Patch-based and Image-based Approaches 11 3.2.1 Local-Global Cascade Network 14 3.2.2 Experiments 20 3.2.3 Summary 24 3.3 Selective Ensemble Network with Adjustable Counting Loss (SEN-ACL) 25 3.3.1 Overall Scheme 25 3.3.2 Data Description 27 3.3.3 Gating Network 27 3.3.4 Sparse / Dense Network 29 3.3.5 Refinement Network 32 3.4 Experiments 34 3.4.1 Implementation Details 34 3.4.2 Dataset and Evaluation Metrics 35 3.4.3 Self-evaluation on WorldExpo'10 dataset 35 3.4.4 Comparative Evaluation with State of the Art Methods 38 3.4.5 Analysis on the Proposed Components 40 3.5 Summary 40 4 Sequential Crowd Density Estimation from Center to Periphery of Crowd 43 4.1 Overview 43 4.2 Cascade Residual Dilated Network (CRDN) 47 4.2.1 Effects of Dilated Convolution in Crowd Counting 47 4.2.2 The Proposed Network 48 4.3 Experiments 52 4.3.1 Datasets and Experimental Settings 52 4.3.2 Implementation Details 52 4.3.3 Comparison with Other Methods 55 4.3.4 Ablation Study 56 4.3.5 Analysis on the Proposed Components 63 4.4 Conclusion 63 5 Congestion-aware Bayesian Loss for Crowd Counting 64 5.1 Overview 64 5.2 Congestion-aware Bayesian Loss 67 5.2.1 Person-Scale Estimation 67 5.2.2 Crowd-Sparsity Estimation 70 5.2.3 Design of The Proposed Loss 70 5.3 Experiments 74 5.3.1 Datasets 76 5.3.2 Implementation Details 77 5.3.3 Evaluation Metrics 77 5.3.4 Ablation Study 78 5.3.5 Comparisons with State of the Art 80 5.3.6 Differences from Existing Person-scale Inference 87 5.3.7 Analysis on the Proposed Components 88 5.4 Summary 90 6 Conclusion 91 Abstract (In Korean) 105๋ฐ•

    Effective Use of Dilated Convolutions for Segmenting Small Object Instances in Remote Sensing Imagery

    Full text link
    Thanks to recent advances in CNNs, solid improvements have been made in semantic segmentation of high resolution remote sensing imagery. However, most of the previous works have not fully taken into account the specific difficulties that exist in remote sensing tasks. One of such difficulties is that objects are small and crowded in remote sensing imagery. To tackle with this challenging task we have proposed a novel architecture called local feature extraction (LFE) module attached on top of dilated front-end module. The LFE module is based on our findings that aggressively increasing dilation factors fails to aggregate local features due to sparsity of the kernel, and detrimental to small objects. The proposed LFE module solves this problem by aggregating local features with decreasing dilation factor. We tested our network on three remote sensing datasets and acquired remarkably good results for all datasets especially for small objects
    • โ€ฆ
    corecore