472 research outputs found

    Congestion- and Scale-Aware Design of Network Structures and Training Methods for Crowd Density Estimation

    Get PDF
    Doctoral dissertation (Ph.D.) -- Seoul National University Graduate School, Department of Electrical and Computer Engineering, February 2022. Advisor: Jin Young Choi.

This dissertation presents novel deep learning-based crowd density estimation methods that consider crowd congestion and the scale of people. Crowd density estimation is one of the important tasks for intelligent surveillance systems. Using crowd density estimation, regions of interest for public security and safety can be easily indicated. It can also support computationally expensive computer vision algorithms such as pedestrian detection and tracking. Since the introduction of deep learning to crowd density estimation, most studies follow the conventional scheme of training a convolutional neural network to estimate a crowd density map from training images. Deep learning-based crowd density estimation research can be divided into two perspectives: the network structure perspective and the training strategy perspective. In general, work from the network structure perspective proposes novel network structures that extract features representing the crowd well, whereas work from the training strategy perspective proposes novel training methodologies or loss functions to improve counting performance. In this dissertation, I propose contributions from both perspectives. In particular, I design the network models to have rich crowd representation characteristics according to the crowd congestion and the scale of people. I propose two novel network structures, a selective ensemble network and a cascade residual dilated network, as well as one novel loss function for crowd density estimation, a congestion-aware Bayesian loss.

First, I propose a selective ensemble deep network architecture for crowd density estimation. In contrast to existing deep network-based methods, the proposed method incorporates two sub-networks for local density estimation: one to learn sparse density regions and one to learn dense density regions. Locally estimated density maps from the two sub-networks are selectively combined in an ensemble fashion using a gating network to estimate an initial crowd density map. The initial density map is then refined into a high-resolution map using another sub-network that draws on contextual information in the image. In training, a novel adaptive loss scheme is applied to resolve ambiguity in crowded regions. The proposed scheme improves both density map accuracy and counting accuracy by adjusting the weight between the density loss and the counting loss according to the degree of crowdedness and the training epoch.

Second, I propose a novel crowd density estimation architecture composed of multiple dilated convolutional neural network blocks at different scales. The architecture is motivated by an empirical analysis showing that small-scale dilated convolutions estimate the density at the center of each person well, whereas large-scale dilated convolutions better estimate the density at the periphery of a person. To estimate the crowd density map gradually from the center to the periphery of each person in a crowd, the dilated CNN blocks are trained in a cascade, from the small dilated block to the large one.

Third, I propose a novel congestion-aware Bayesian loss method that considers the person-scale and the crowd-sparsity. Deep learning-based crowd density estimation can greatly improve the accuracy of crowd counting.
Though the Bayesian loss method resolves two problems, the need for a hand-crafted ground-truth (GT) density map and noisy annotations, counting accurately in highly congested scenes remains challenging. In a crowd scene, people's appearances change according to the scale of each individual (i.e., the person-scale). Also, the lower the sparsity of a local region (i.e., the crowd-sparsity), the more difficult it is to estimate the crowd density. I estimate the person-scale from the scene geometry and then estimate the crowd-sparsity using the estimated person-scale. The estimated person-scale and crowd-sparsity are used in the proposed congestion-aware Bayesian loss to improve the supervisory representation of the point annotations. The effectiveness of the proposed density estimators is validated through comparative experiments with state-of-the-art methods on widely used crowd counting benchmark datasets. The proposed methods achieve superior performance to state-of-the-art density estimators in diverse surveillance environments. In addition, for all proposed crowd density estimation methods, the contribution of each component is verified through several ablation studies.
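As a rough illustration of the adaptive loss scheme from the first contribution above, the sketch below weights a pixel-wise density (MSE) loss against a counting loss using a crowdedness- and epoch-dependent factor. The crowdedness proxy, the 500-person scale, and the linear schedule are illustrative assumptions, not the dissertation's exact formulation.

```python
# Hedged sketch: adaptive weighting between a density-map loss and a counting loss.
# The crowdedness measure and epoch schedule below are illustrative assumptions.
import torch
import torch.nn.functional as F

def adaptive_density_count_loss(pred_density, gt_density, epoch, max_epoch):
    """pred_density, gt_density: (B, 1, H, W) density maps."""
    # Pixel-wise density loss (mean squared error over the map).
    density_loss = F.mse_loss(pred_density, gt_density)

    # Counting loss: absolute error between integrated densities.
    pred_count = pred_density.sum(dim=(1, 2, 3))
    gt_count = gt_density.sum(dim=(1, 2, 3))
    count_loss = torch.abs(pred_count - gt_count).mean()

    # Crowdedness proxy: higher GT count -> rely more on the counting term.
    crowdedness = torch.clamp(gt_count.mean() / 500.0, 0.0, 1.0)  # assumed scale
    # Shift weight toward the counting loss as training progresses.
    progress = epoch / max_epoch
    alpha = 0.5 * crowdedness + 0.5 * progress  # assumed schedule

    return (1.0 - alpha) * density_loss + alpha * count_loss
```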
Table of contents:
Abstract
Contents
List of Tables
List of Figures
1 Introduction
2 Related Works
2.1 Detection-based Approaches
2.2 Regression-based Approaches
2.3 Deep learning-based Approaches
2.3.1 Network Structure Perspective
2.3.2 Training Strategy Perspective
3 Selective Ensemble Network for Accurate Crowd Density Estimation
3.1 Overview
3.2 Combining Patch-based and Image-based Approaches
3.2.1 Local-Global Cascade Network
3.2.2 Experiments
3.2.3 Summary
3.3 Selective Ensemble Network with Adjustable Counting Loss (SEN-ACL)
3.3.1 Overall Scheme
3.3.2 Data Description
3.3.3 Gating Network
3.3.4 Sparse / Dense Network
3.3.5 Refinement Network
3.4 Experiments
3.4.1 Implementation Details
3.4.2 Dataset and Evaluation Metrics
3.4.3 Self-evaluation on WorldExpo'10 dataset
3.4.4 Comparative Evaluation with State of the Art Methods
3.4.5 Analysis on the Proposed Components
3.5 Summary
4 Sequential Crowd Density Estimation from Center to Periphery of Crowd
4.1 Overview
4.2 Cascade Residual Dilated Network (CRDN)
4.2.1 Effects of Dilated Convolution in Crowd Counting
4.2.2 The Proposed Network
4.3 Experiments
4.3.1 Datasets and Experimental Settings
4.3.2 Implementation Details
4.3.3 Comparison with Other Methods
4.3.4 Ablation Study
4.3.5 Analysis on the Proposed Components
4.4 Conclusion
5 Congestion-aware Bayesian Loss for Crowd Counting
5.1 Overview
5.2 Congestion-aware Bayesian Loss
5.2.1 Person-Scale Estimation
5.2.2 Crowd-Sparsity Estimation
5.2.3 Design of The Proposed Loss
5.3 Experiments
5.3.1 Datasets
5.3.2 Implementation Details
5.3.3 Evaluation Metrics
5.3.4 Ablation Study
5.3.5 Comparisons with State of the Art
5.3.6 Differences from Existing Person-scale Inference
5.3.7 Analysis on the Proposed Components
5.4 Summary
6 Conclusion
Abstract (In Korean)

    Focus for Free in Density-Based Counting

    Full text link
    This work considers supervised learning to count from images and their corresponding point annotations. Where density-based counting methods typically use the point annotations only to create Gaussian density maps, which act as the supervision signal, the starting point of this work is that point annotations have counting potential beyond density map generation. We introduce two methods that repurpose the available point annotations to enhance counting performance. The first is a counting-specific augmentation that leverages point annotations to simulate occluded objects in both input and density images to enhance the network's robustness to occlusions. The second method, foreground distillation, generates foreground masks from the point annotations, from which we train an auxiliary network on images with blacked-out backgrounds. By doing so, it learns to extract foreground counting knowledge without interference from the background. These methods can be seamlessly integrated with existing counting advances and are adaptable to different loss functions. We demonstrate complementary effects of the approaches, allowing us to achieve robust counting results even in challenging scenarios such as background clutter, occlusion, and varying crowd densities. Our proposed approach achieves strong counting results on multiple datasets, including ShanghaiTech Part_A and Part_B, UCF_QNRF, JHU-Crowd++, and NWPU-Crowd.
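Since the density-based baseline mentioned above turns point annotations into Gaussian density maps as the supervision signal, here is a minimal, generic sketch of that step (fixed kernel width; the function and parameter names are illustrative, not taken from the paper):

```python
# Hedged sketch: standard Gaussian density-map generation from point annotations.
# A fixed sigma is used for simplicity; adaptive, geometry-aware kernels are also common.
import numpy as np
from scipy.ndimage import gaussian_filter

def points_to_density_map(points, height, width, sigma=4.0):
    """points: iterable of (x, y) head annotations; returns an (H, W) float map
    whose sum approximately equals the number of annotated people."""
    density = np.zeros((height, width), dtype=np.float32)
    for x, y in points:
        xi, yi = int(round(x)), int(round(y))
        if 0 <= xi < width and 0 <= yi < height:
            density[yi, xi] += 1.0
    # Smoothing spreads each unit impulse into a Gaussian blob.
    return gaussian_filter(density, sigma=sigma)
```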

    A Recent Trend in Individual Counting Approach Using Deep Network

    Get PDF
    In video surveillance schemes, counting individuals is regarded as a crucial task. Of all the individual counting techniques in existence, the regression technique can offer enhanced performance in overcrowded areas. However, this technique cannot provide details about the counted individuals, so it fails to locate them. In contrast, the density map approach is very effective at overcoming counting problems in various situations such as heavy overlapping and low resolution. Nevertheless, this approach may break down when only the heads of individuals appear in video scenes, and it is also restricted by the types of features used. The popular technique for obtaining the pertinent information automatically is the Convolutional Neural Network (CNN). However, CNN-based counting schemes are unable to sufficiently tackle three difficulties, namely non-uniform density distributions, scale changes, and drastic scale variation. In this study, we provide a review of current counting techniques related to deep networks in different crowded-scene applications. The goal of this work is to assess the effectiveness of CNNs applied to popular individual counting approaches for attaining higher-precision results.
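To make the contrast between regression-based counting and the density map approach concrete, here is a minimal sketch of the two output heads on top of a shared CNN backbone (generic PyTorch; layer sizes and names are assumptions for illustration):

```python
# Hedged sketch: the two CNN counting styles contrasted in the review.
import torch
import torch.nn as nn

class CountingHeads(nn.Module):
    def __init__(self, backbone_channels=512):
        super().__init__()
        # Regression head: pool features and predict one scalar count per image.
        self.regression_head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(backbone_channels, 1)
        )
        # Density head: 1x1 convolution producing a per-pixel density map,
        # whose spatial sum is the count (and which also localizes the crowd).
        self.density_head = nn.Conv2d(backbone_channels, 1, kernel_size=1)

    def forward(self, features):          # features: (B, C, H, W) from a backbone
        count = self.regression_head(features).squeeze(1)
        density = torch.relu(self.density_head(features))
        return count, density, density.sum(dim=(1, 2, 3))
```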

    Semi-Supervised Crowd Counting with Contextual Modeling: Facilitating Holistic Understanding of Crowd Scenes

    Full text link
    To alleviate the heavy annotation burden of training a reliable crowd counting model, and thus make the model more practical and accurate by benefiting from more data, this paper presents a new semi-supervised method based on the mean teacher framework. When there is a scarcity of labeled data available, the model is prone to overfitting local patches. In such contexts, the conventional approach of solely improving the accuracy of local patch predictions through unlabeled data proves inadequate. Consequently, we propose a more nuanced approach: fostering the model's intrinsic 'subitizing' capability. This ability allows the model to accurately estimate the count in regions by leveraging its understanding of the crowd scene, mirroring the human cognitive process. To achieve this goal, we apply masking to unlabeled data, guiding the model to make predictions for these masked patches based on holistic cues. Furthermore, to help with feature learning, we incorporate a fine-grained density classification task. Our method is general and applicable to most existing crowd counting methods, as it imposes no strict structural or loss constraints. In addition, we observe that a model trained with our framework exhibits 'subitizing'-like behavior: it accurately predicts low-density regions with only a 'glance', while incorporating local details to predict high-density regions. Our method achieves state-of-the-art performance, surpassing previous approaches by a large margin on challenging benchmarks such as ShanghaiTech A and UCF-QNRF. The code is available at: https://github.com/cha15yq/MRC-Crowd
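A rough sketch of the masking-based consistency idea in a mean-teacher setup is given below (generic PyTorch; the mask ratio, patch size, and loss choice are assumptions; see the linked repository for the authors' actual implementation):

```python
# Hedged sketch: mean-teacher consistency on masked unlabeled crowd images.
# The teacher sees the full image; the student must predict densities for the
# hidden patches from holistic context. Details differ from the paper's code.
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(teacher, student, momentum=0.999):
    # Teacher weights track an exponential moving average of the student.
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(momentum).add_(s_param, alpha=1.0 - momentum)

def masked_consistency_loss(student, teacher, unlabeled, patch=64, ratio=0.3):
    # Assumes image height and width are divisible by `patch`.
    b, _, h, w = unlabeled.shape
    keep = (torch.rand(b, 1, h // patch, w // patch, device=unlabeled.device) > ratio)
    keep = keep.float().repeat_interleave(patch, 2).repeat_interleave(patch, 3)
    with torch.no_grad():
        target = teacher(unlabeled)          # full-context pseudo density map
    pred = student(unlabeled * keep)         # student sees the masked image
    # Supervise only the masked (hidden) regions, resized to the output grid.
    hidden = F.interpolate(1.0 - keep, size=pred.shape[-2:])
    return (F.mse_loss(pred, target, reduction='none') * hidden).mean()
```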

    Human Centered Computer Vision Techniques for Intelligent Video Surveillance Systems

    Get PDF
    Nowadays, intelligent video surveillance systems are being developed to support human operators in different monitoring and investigation tasks. Although relevant results have been achieved by the research community in several computer vision tasks, some real applications still exhibit several open issues. In this context, this thesis focuses on two challenging computer vision tasks: person re-identification and crowd counting. Person re-identification aims to retrieve images of a person of interest, selected by the user, in different locations over time, reducing the time required for the user to analyse all the available videos. Crowd counting consists of estimating the number of people in a given image or video. Both tasks present several complex issues. In this thesis, a challenging video surveillance application scenario is considered in which it is not possible to collect and manually annotate images of a target scene (e.g., when a new camera installation is made by a Law Enforcement Agency) to train a supervised model. Two human-centered solutions for the above-mentioned tasks are then proposed, in which the role of the human operator is fundamental. For person re-identification, a human-in-the-loop approach is proposed, which exploits operator feedback on retrieved pedestrian images during system operation to improve the system's effectiveness. The proposed solution is based on revisiting relevance feedback algorithms for content-based image retrieval and on developing a specific feedback protocol, to find a trade-off between human effort and re-identification performance. For crowd counting, the use of a synthetic training set is proposed to develop a scene-specific model, based on a minimal amount of information about the target scene required from the user. Both solutions are empirically investigated using state-of-the-art supervised models based on Convolutional Neural Networks, on benchmark data sets.
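As a loose illustration of the relevance-feedback principle used for person re-identification, here is a classic Rocchio-style update on feature vectors (this is not the thesis's specific feedback protocol; names and weights are assumptions):

```python
# Hedged sketch: Rocchio-style relevance feedback for re-ranking a gallery.
# The thesis develops its own feedback protocol; this only shows the principle.
import numpy as np

def rocchio_rerank(query, gallery, relevant_idx, irrelevant_idx,
                   alpha=1.0, beta=0.75, gamma=0.25):
    """query: (D,) feature; gallery: (N, D) features; index lists come from
    the operator's relevant / irrelevant feedback on retrieved images."""
    updated = alpha * query
    if len(relevant_idx) > 0:
        updated = updated + beta * gallery[relevant_idx].mean(axis=0)
    if len(irrelevant_idx) > 0:
        updated = updated - gamma * gallery[irrelevant_idx].mean(axis=0)
    # Re-rank the gallery by cosine similarity to the updated query.
    sims = gallery @ updated / (np.linalg.norm(gallery, axis=1)
                                * np.linalg.norm(updated) + 1e-8)
    return np.argsort(-sims)
```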

    Object Counting with Deep Learning

    Get PDF
    This thesis explores various empirical aspects of deep learning, or convolutional network-based, models for efficient object counting. First, we train moderately large convolutional networks from scratch on comparatively small datasets containing a few hundred samples, using conventional image processing-based data augmentation. Then, we extend this approach to unconstrained, outdoor images using more advanced architectural concepts. Additionally, we propose an efficient, randomized data augmentation strategy based on sub-regional pixel distribution for low-resolution images. Next, the effectiveness of depth-to-space shuffling of feature elements for efficient segmentation is investigated for simpler problems like binary segmentation -- often required in the counting framework. This depth-to-space operation violates the basic assumption of encoder-decoder-style segmentation architectures. Consequently, it helps to train the encoder model as a sparsely connected graph. Nonetheless, we have found accuracy comparable to that of standard encoder-decoder architectures with our depth-to-space models. After that, the subtleties regarding the lack of localization information in the conventional scalar count loss for one-look models are illustrated. At this point, without using additional annotations, a possible solution is proposed based on the regulation of a network-generated heatmap in the form of a weak, subsidiary loss. The models trained with this auxiliary loss alongside the conventional loss perform much better than their baseline counterparts, both qualitatively and quantitatively. Lastly, the intricacies of tiled prediction for high-resolution images are studied in detail, and a simple and effective trick of eliminating the normalization factor in an existing computational block is demonstrated. All of the approaches employed here are thoroughly benchmarked across multiple heterogeneous datasets for object counting against previous state-of-the-art approaches.
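A minimal sketch of the depth-to-space idea for a binary segmentation head is shown below, using PyTorch's PixelShuffle (the layer sizes and upscale factor are illustrative assumptions, not the thesis's exact models):

```python
# Hedged sketch: depth-to-space (PixelShuffle) upsampling head for binary
# segmentation, as an alternative to a learned encoder-decoder upsampling path.
import torch
import torch.nn as nn

class DepthToSpaceSegHead(nn.Module):
    def __init__(self, in_channels=256, upscale=8):
        super().__init__()
        # Predict upscale*upscale logits per coarse location, then rearrange
        # the channel dimension into spatial resolution (depth-to-space).
        self.project = nn.Conv2d(in_channels, upscale * upscale, kernel_size=1)
        self.shuffle = nn.PixelShuffle(upscale)

    def forward(self, features):              # (B, C, H/8, W/8) encoder features
        logits = self.shuffle(self.project(features))    # (B, 1, H, W)
        return torch.sigmoid(logits)          # per-pixel foreground probability

# Example: an 8x-downsampled feature map becomes a full-resolution mask.
head = DepthToSpaceSegHead(in_channels=256, upscale=8)
mask = head(torch.randn(1, 256, 32, 32))       # -> (1, 1, 256, 256)
```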

    Learning by correlation for computer vision applications: from Kernel methods to deep learning

    Get PDF
    Learning to spot analogies and differences within/across visual categories is an arguably powerful approach in machine learning and pattern recognition which is directly inspired by human cognition. In this thesis, we investigate a variety of approaches which are primarily driven by correlation and tackle several computer vision applications