3,449 research outputs found

    Online real-time crowd behavior detection in video sequences

    Automatically detecting events in crowded scenes is a challenging task in computer vision. A number of offline approaches have been proposed for solving the problem of crowd behavior detection; however, the offline assumption limits their application in real-world video surveillance systems. In this paper, we propose an online, real-time method for detecting events in crowded video sequences. The proposed approach is based on the combination of visual feature extraction and image segmentation, and it works without the need for a training phase. A quantitative experimental evaluation has been carried out on multiple publicly available video sequences, containing data from various crowd scenarios and different types of events, to demonstrate the effectiveness of the approach.
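    To make the pipeline concrete, the minimal sketch below shows the kind of training-free, online detector the abstract describes, assuming dense optical flow as the visual feature and simple magnitude thresholding as the segmentation step; the function name, thresholds, and running-baseline rule are illustrative assumptions, not the paper's actual algorithm.

        # Hypothetical online event-detection sketch: dense optical flow as the
        # visual feature, magnitude thresholding as a crude segmentation step,
        # and a running baseline to flag sudden changes in crowd motion.
        # Thresholds and update rate are illustrative, not from the paper.
        import cv2
        import numpy as np

        def detect_events(video_path, seg_thresh=2.0, spike_ratio=2.5, alpha=0.05):
            cap = cv2.VideoCapture(video_path)
            ok, prev = cap.read()
            if not ok:
                return
            prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
            baseline = None
            frame_idx = 0
            while True:
                ok, frame = cap.read()
                if not ok:
                    break
                gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
                flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                                    0.5, 3, 15, 3, 5, 1.2, 0)
                mag = np.linalg.norm(flow, axis=2)
                moving = mag > seg_thresh              # segment high-motion pixels
                energy = mag[moving].mean() if moving.any() else 0.0
                if baseline is None:
                    baseline = energy
                elif baseline > 0 and energy > spike_ratio * baseline:
                    yield frame_idx                    # flag an event at this frame
                baseline = (1 - alpha) * baseline + alpha * energy  # online update
                prev_gray = gray
                frame_idx += 1
            cap.release()

        # usage (hypothetical file): events = list(detect_events("crowd_scene.mp4"))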

    DecideNet: Counting Varying Density Crowds Through Attention Guided Detection and Density Estimation

    In real-world crowd counting applications, crowd densities vary greatly in the spatial and temporal domains. A detection-based counting method estimates crowds accurately in low-density scenes, but its reliability degrades in congested areas. A regression-based approach, on the other hand, captures the general density information in crowded regions, but without knowing the location of each person it tends to overestimate the count in low-density areas. Thus, exclusively using either one of them is not sufficient to handle all kinds of scenes with varying densities. To address this issue, a novel end-to-end crowd counting framework, named DecideNet (DEteCtIon and Density Estimation Network), is proposed. It adaptively decides the appropriate counting mode for different locations in the image based on the local density conditions. DecideNet starts by generating detection-based and regression-based density maps separately. To capture the inevitable variation in densities, it incorporates an attention module that adaptively assesses the reliability of the two types of estimates. The final crowd counts are obtained under the guidance of the attention module, which adopts suitable estimates from the two kinds of density maps. Experimental results show that our method achieves state-of-the-art performance on three challenging crowd counting datasets. Comment: CVPR 201
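    As a rough illustration of the fusion step, the sketch below shows how an attention map can arbitrate pixel-wise between a detection-based and a regression-based density map, in the spirit of DecideNet; the module, channel sizes, and attention head are assumptions for illustration, not the authors' architecture.

        # Sketch of attention-guided fusion of two density maps.  The tiny
        # attention head and channel widths are illustrative assumptions.
        import torch
        import torch.nn as nn

        class AttentionFusion(nn.Module):
            def __init__(self, in_ch=64):
                super().__init__()
                # attention head looks at image features plus both density maps
                self.attn = nn.Sequential(
                    nn.Conv2d(in_ch + 2, 32, 3, padding=1), nn.ReLU(inplace=True),
                    nn.Conv2d(32, 1, 1), nn.Sigmoid())

            def forward(self, feats, det_density, reg_density):
                # feats: B x C x H x W image features; density maps: B x 1 x H x W
                a = self.attn(torch.cat([feats, det_density, reg_density], dim=1))
                fused = a * det_density + (1 - a) * reg_density  # pixel-wise arbitration
                count = fused.sum(dim=(1, 2, 3))                 # crowd count per image
                return fused, count, a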

    Understanding Traffic Density from Large-Scale Web Camera Data

    Understanding traffic density from large-scale web camera (webcam) videos is a challenging problem because such videos have low spatial and temporal resolution, heavy occlusion, and large perspective distortion. To understand traffic density in depth, we explore both deep-learning-based and optimization-based methods. To avoid individual vehicle detection and tracking, both methods map the image to a vehicle density map, one based on rank-constrained regression and the other on fully convolutional networks (FCN). The regression-based method learns different weights for different blocks in the image to increase the degrees of freedom of the weights and to embed perspective information. The FCN-based method jointly estimates the vehicle density map and the vehicle count with a residual learning framework, performing end-to-end dense prediction, allowing arbitrary image resolution, and adapting to different vehicle scales and perspectives. We analyze and compare both methods, and use insights from the optimization-based method to improve the deep model. Since existing datasets do not cover all the challenges in our work, we collected and labelled a large-scale traffic video dataset containing 60 million frames from 212 webcams. Both methods are extensively evaluated and compared on different counting tasks and datasets. The FCN-based method significantly reduces the mean absolute error from 10.99 to 5.31 on the public TRANCOS dataset compared with the state-of-the-art baseline. Comment: Accepted by CVPR 2017. Preprint version was uploaded on http://welcome.isr.tecnico.ulisboa.pt/publications/understanding-traffic-density-from-large-scale-web-camera-data
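    A compact sketch of the FCN idea, predicting a density map whose integral gives the count and supervising both with a joint loss, is given below; the toy architecture and the loss weight lambda_count are illustrative assumptions rather than the paper's exact model.

        # Sketch of an FCN-style density estimator with a joint density/count
        # loss.  Layer sizes and lambda_count are illustrative assumptions.
        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        class DensityFCN(nn.Module):
            def __init__(self):
                super().__init__()
                self.features = nn.Sequential(
                    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(inplace=True),
                    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(inplace=True),
                    nn.MaxPool2d(2),
                    nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(inplace=True))
                self.head = nn.Conv2d(64, 1, 1)  # 1-channel density map

            def forward(self, x):
                d = F.relu(self.head(self.features(x)))
                # dense prediction at the input resolution (arbitrary image size)
                return F.interpolate(d, size=x.shape[2:], mode='bilinear',
                                     align_corners=False)

        def joint_loss(pred_density, gt_density, lambda_count=0.01):
            # pixel-wise density loss plus a global count loss
            density_loss = F.mse_loss(pred_density, gt_density)
            count_loss = F.l1_loss(pred_density.sum(dim=(1, 2, 3)),
                                   gt_density.sum(dim=(1, 2, 3)))
            return density_loss + lambda_count * count_loss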

    Class-Agnostic Counting

    Nearly all existing counting methods are designed for a specific object class. Our work, however, aims to create a counting model able to count any class of object. To achieve this goal, we formulate counting as a matching problem, enabling us to exploit the image self-similarity property that naturally exists in object counting problems. We make the following three contributions: first, a Generic Matching Network (GMN) architecture that can potentially count any object in a class-agnostic manner; second, by reformulating the counting problem as one of matching objects, we can take advantage of the abundance of video data labeled for tracking, which contains natural repetitions suitable for training a counting model, and such data enables us to train the GMN; third, to customize the GMN to different user requirements, an adapter module is used to specialize the model with minimal effort, i.e., using a few labeled examples and adapting only a small fraction of the trained parameters. This is a form of few-shot learning, which is practical for domains where labels are limited because they require expert knowledge (e.g., microbiology). We demonstrate the flexibility of our method on a diverse set of existing counting benchmarks, specifically cells, cars, and human crowds. The model achieves competitive performance on the cell and crowd counting datasets, and surpasses the state of the art on the car dataset using only three training images. When trained on the entire dataset, the proposed method outperforms all previous methods by a large margin. Comment: Asian Conference on Computer Vision (ACCV), 201
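    The matching formulation can be sketched as follows: a shared backbone embeds both an exemplar patch and the query image, their correlation yields a similarity map, and a small head regresses a density map; only the adapter would be fine-tuned for a new class. The backbone, adapter, and head below are illustrative assumptions, not the GMN implementation.

        # Sketch of counting-as-matching: correlate an exemplar embedding with
        # image features and regress a density map from the similarity.
        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        class MatchingCounter(nn.Module):
            def __init__(self, feat_ch=64):
                super().__init__()
                self.backbone = nn.Sequential(                 # shared embedding
                    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(inplace=True),
                    nn.Conv2d(32, feat_ch, 3, padding=1), nn.ReLU(inplace=True))
                self.adapter = nn.Conv2d(feat_ch, feat_ch, 1)  # few-shot adapter:
                self.head = nn.Conv2d(1, 1, 3, padding=1)      # only these are tuned

            def forward(self, image, exemplar):
                f_img = self.adapter(self.backbone(image))       # B x C x H x W
                f_ex = self.backbone(exemplar).mean(dim=(2, 3))  # B x C global embedding
                sim = (f_img * f_ex[:, :, None, None]).sum(dim=1, keepdim=True)
                density = F.relu(self.head(sim))
                return density, density.sum(dim=(1, 2, 3))       # map and count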

    ๊ตฐ์ค‘ ๋ฐ€๋„ ์˜ˆ์ธก์„ ์œ„ํ•œ ๋„คํŠธ์›Œํฌ ๊ตฌ์กฐ์™€ ํ›ˆ๋ จ๋ฐฉ๋ฒ•์˜ ํ˜ผ์žก๋„ ๋ฐ ํฌ๊ธฐ ์ธ์‹ ์„ค๊ณ„

    Ph.D. dissertation, Seoul National University Graduate School, College of Engineering, Department of Electrical and Computer Engineering, February 2022 (advisor: Jin Young Choi).
    This dissertation presents novel deep learning-based crowd density estimation methods that consider crowd congestion and the scale of people. Crowd density estimation is one of the important tasks for intelligent surveillance systems: it can easily indicate regions of interest for public security and safety, and it can support computationally expensive computer vision algorithms such as pedestrian detection and tracking. Since the introduction of deep learning to crowd density estimation, most studies have followed the conventional scheme of training a convolutional neural network to estimate a crowd density map from training images. Deep learning-based crowd density estimation research can be divided into two perspectives, the network structure perspective and the training strategy perspective. In general, studies from the network structure perspective propose a novel network structure to extract features that represent the crowd well, whereas studies from the training strategy perspective propose a novel training methodology or loss function to improve counting performance. In this dissertation, I propose several works from both perspectives. In particular, I design the network models to have rich crowd representation characteristics according to the crowd congestion and the scale of people. I propose two novel network structures, a selective ensemble network and a cascade residual dilated network, as well as one novel loss function for crowd density estimation, a congestion-aware Bayesian loss.
    First, I propose a selective ensemble deep network architecture for crowd density estimation. In contrast to existing deep network-based methods, the proposed method incorporates two sub-networks for local density estimation: one to learn sparse density regions and one to learn dense density regions. Locally estimated density maps from the two sub-networks are selectively combined in an ensemble fashion using a gating network to estimate an initial crowd density map. The initial density map is then refined into a high-resolution map using another sub-network that draws on contextual information in the image. In training, a novel adaptive loss scheme is applied to resolve ambiguity in crowded regions; it improves both density-map accuracy and counting accuracy by adjusting the weight between the density loss and the counting loss according to the degree of crowdedness and the training epoch.
    Second, I propose a novel crowd density estimation architecture composed of multiple dilated convolutional neural network blocks with different scales. The architecture is motivated by the empirical observation that small-scale dilated convolutions estimate the density near the center of each person well, whereas large-scale dilated convolutions estimate the density in the periphery of each person well. To estimate the crowd density map gradually from the center to the periphery of each person in a crowd, the dilated CNN blocks are trained in a cascade from the smallest dilation to the largest.
    Third, I propose a novel congestion-aware Bayesian loss that considers the person-scale and the crowd-sparsity. Deep learning-based crowd density estimation has greatly improved the accuracy of crowd counting, and although the Bayesian loss resolves two problems, the need for a hand-crafted ground-truth (GT) density and noisy annotations, counting accurately in highly congested scenes remains challenging. In a crowd scene, people's appearances change according to the scale of each individual (the person-scale), and the lower the sparsity of a local region (the crowd-sparsity), the more difficult it is to estimate the crowd density. I estimate the person-scale from scene geometry and then estimate the crowd-sparsity using the estimated person-scale; both are used in the congestion-aware Bayesian loss to improve the supervision provided by the point annotations.
    The effectiveness of the proposed density estimators is validated through comparative experiments with state-of-the-art methods on widely used crowd counting benchmark datasets. The proposed methods achieve superior performance to state-of-the-art density estimators in diverse surveillance environments. In addition, the contribution of each component of the proposed methods is verified through ablation experiments.
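    The first contribution, the selective ensemble, can be illustrated with the following sketch, in which two sub-networks specialize in sparse and dense regions and a gating network blends their local estimates; the tiny stand-in CNNs and the soft (sigmoid) gating are assumptions for illustration, not the dissertation's actual networks.

        # Sketch of a gating-based selective ensemble of a sparse-region and a
        # dense-region density estimator.  All sub-networks are stand-ins.
        import torch
        import torch.nn as nn

        def small_cnn(out_ch=1):
            return nn.Sequential(
                nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(32, out_ch, 1))

        class SelectiveEnsemble(nn.Module):
            def __init__(self):
                super().__init__()
                self.sparse_net = small_cnn()          # tuned for low-density regions
                self.dense_net = small_cnn()           # tuned for high-density regions
                self.gate = nn.Sequential(small_cnn(), nn.Sigmoid())

            def forward(self, x):
                g = self.gate(x)                       # per-pixel mixing weight in [0, 1]
                density = g * self.dense_net(x) + (1 - g) * self.sparse_net(x)
                return density, density.sum(dim=(1, 2, 3))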
๋‘ ๊ฐœ์˜ ํ•˜์œ„ ๋„คํŠธ์›Œํฌ์—์„œ ์ง€์—ญ์ ์œผ๋กœ ์ถ”์ •๋œ ๋ฐ€๋„๋งต์€ ์ดˆ๊ธฐ ๊ตฐ์ค‘๋ฐ€๋„๋กœ ์ถ”์ •๋˜๋ฉฐ ๊ฒŒ์ดํŒ… ๋„คํŠธ์›Œํฌ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ์•™์ƒ๋ธ” ๋ฐฉ์‹์œผ๋กœ ์„ ํƒ์ ์œผ๋กœ ๊ฒฐํ•ฉ๋ฉ๋‹ˆ๋‹ค. ์ดˆ๊ธฐ ๋ฐ€๋„๋งต์€ ์ด๋ฏธ์ง€์˜ ์ปจํ…์ŠคํŠธ ์ •๋ณด๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ํ•˜๋Š” ๋˜ ๋‹ค๋ฅธ ํ•˜์œ„ ๋„คํŠธ์›Œํฌ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๊ณ ํ•ด์ƒ๋„ ๋งต์œผ๋กœ ๊ฐœ์„ ๋ฉ๋‹ˆ๋‹ค. ๋„คํŠธ์›Œํฌ ํ›ˆ๋ จ์—์„œ ์ƒˆ๋กœ์šด ์ ์‘ํ˜• ์†์‹ค ์ฒด๊ณ„๋ฅผ ์ ์šฉํ•˜์—ฌ ํ˜ผ์žกํ•œ ์ง€์—ญ์˜ ๋ชจํ˜ธ์„ฑ์„ ํ•ด๊ฒฐํ•ฉ๋‹ˆ๋‹ค. ์ œ์•ˆ๋œ ๊ธฐ๋ฒ•์€ ๋ฐ€์ง‘๋„ ๋ฐ ํ›ˆ๋ จ ์ •๋„์— ๋”ฐ๋ผ ๋ฐ€๋„ ์†์‹ค๊ณผ ๊ณ„์ˆ˜ ์†์‹ค ์‚ฌ์ด์˜ ๊ฐ€์ค‘์น˜๋ฅผ ์กฐ์ •ํ•˜์—ฌ ๋ฐ€๋„๋งต ์ •ํ™•๋„์™€ ๊ณ„์ˆ˜ ์ •ํ™•๋„๋ฅผ ๋ชจ๋‘ ํ–ฅ์ƒ์‹œํ‚ต๋‹ˆ๋‹ค. ๋‘ ๋ฒˆ์งธ๋กœ, ์Šค์ผ€์ผ์ด ๋‹ค๋ฅธ ๋‹ค์ค‘ ํ™•์žฅ ์ปจ๋ณผ๋ฃจ์…˜ ๋ธ”๋ก์œผ๋กœ ๊ตฌ์„ฑ๋œ ์ƒˆ๋กœ์šด ๊ตฐ์ค‘๋ฐ€๋„ ์ถ”์ • ๋„คํŠธ์›Œํฌ ๊ตฌ์กฐ๋ฅผ ์ œ์•ˆํ•ฉ๋‹ˆ๋‹ค. ์ œ์•ˆ๋œ ๋„คํŠธ์›Œํฌ ๊ตฌ์กฐ๋Š” ์†Œ๊ทœ๋ชจ ํ™•์žฅ ์ปจ๋ณผ๋ฃจ์…˜์€ ๊ฐ ์‚ฌ๋žŒ์˜ ์ค‘์‹ฌ ์˜์—ญ ๋ฐ€๋„๋ฅผ ์ •ํ™•ํžˆ ์ถ”์ •ํ•˜๋Š” ๋ฐ˜๋ฉด ๋Œ€๊ทœ๋ชจ ํ™•์žฅ ์ปจ๋ณผ๋ฃจ์…˜์€ ์‚ฌ๋žŒ์˜ ์ฃผ๋ณ€ ์˜์—ญ ๋ฐ€๋„๋ฅผ ์ž˜ ์ถ”์ •ํ•œ๋‹ค๋Š” ๊ฒฝํ—˜์  ๋ถ„์„์—์„œ ๋น„๋กฏ๋˜์—ˆ์Šต๋‹ˆ๋‹ค. ๊ตฐ์ค‘์— ์žˆ๋Š” ๊ฐ ์‚ฌ๋žŒ์˜ ์ค‘์‹ฌ์—์„œ ์ฃผ๋ณ€์œผ๋กœ ์ ์ฐจ์ ์œผ๋กœ ๊ตฐ์ค‘๋ฐ€๋„๋งต์„ ์ถ”์ •ํ•˜๊ธฐ ์œ„ํ•ด ์—ฌ๋Ÿฌ ํ™•์žฅ๋œ ์ปจ๋ณผ๋ฃจ์…˜ ๋ธ”๋ก์ด ์ž‘์€ ํ™•์žฅ ์ปจ๋ณผ๋ฃจ์…˜ ๋ธ”๋ก์—์„œ ํฐ ๋ธ”๋ก์œผ๋กœ ๊ณ„๋‹จ์‹์œผ๋กœ ํ›ˆ๋ จ๋ฉ๋‹ˆ๋‹ค. ๋งˆ์ง€๋ง‰์œผ๋กœ, ์‚ฌ๋žŒ ๊ทœ๋ชจ์™€ ๊ตฐ์ค‘ ํฌ์†Œ์„ฑ์„ ๊ณ ๋ คํ•œ ์ƒˆ๋กœ์šด ํ˜ผ์žก ์ธ์‹ ๋ฒ ์ด์ง€์•ˆ ์†์‹ค ๋ฐฉ๋ฒ•์„ ์ œ์•ˆํ•ฉ๋‹ˆ๋‹ค. ๋”ฅ ๋Ÿฌ๋‹ ๊ธฐ๋ฐ˜ ๊ตฐ์ค‘ ๋ฐ€๋„ ์ถ”์ •์€ ๊ตฐ์ค‘ ๊ณ„์‚ฐ์˜ ์ •ํ™•๋„๋ฅผ ํฌ๊ฒŒ ํ–ฅ์ƒ์‹œํ‚ฌ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๋ฒ ์ด์ง€์•ˆ ์†์‹ค ๋ฐฉ๋ฒ•์€ ์†์œผ๋กœ ๋งŒ๋“  ์ง€์ƒ ์ง„์‹ค ๋ฐ€๋„์™€ ์žก์Œ์ด ์žˆ๋Š” ์ฃผ์„์˜ ํ•„์š”์„ฑ์ด๋ผ๋Š” ๋‘ ๊ฐ€์ง€ ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜์ง€๋งŒ ํ˜ผ์žกํ•œ ์žฅ๋ฉด์—์„œ ์ •ํ™•ํ•˜๊ฒŒ ๊ณ„์‚ฐํ•˜๋Š” ๊ฒƒ์€ ์—ฌ์ „ํžˆ ์–ด๋ ค์šด ๋ฌธ์ œ์ž…๋‹ˆ๋‹ค. ๊ตฐ์ค‘ ์žฅ๋ฉด์—์„œ ์‚ฌ๋žŒ์˜ ์™ธ๋ชจ๋Š” ๊ฐ ์‚ฌ๋žŒ์˜ ํฌ๊ธฐ('์‚ฌ๋žŒ ํฌ๊ธฐ')์— ๋”ฐ๋ผ ๋ฐ”๋€๋‹ˆ๋‹ค. ๋˜ํ•œ ๊ตญ๋ถ€ ์˜์—ญ์˜ ํฌ์†Œ์„ฑ('๊ตฐ์ค‘ ํฌ์†Œ์„ฑ')์ด ๋‚ฎ์„์ˆ˜๋ก ๊ตฐ์ค‘ ๋ฐ€๋„๋ฅผ ์ถ”์ •ํ•˜๊ธฐ๊ฐ€ ๋” ์–ด๋ ต์Šต๋‹ˆ๋‹ค. ์žฅ๋ฉด ๊ธฐํ•˜์ •๋ณด๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ '์‚ฌ๋žŒ ํฌ๊ธฐ'๋ฅผ ์ถ”์ •ํ•œ ๋‹ค์Œ ์ถ”์ •๋œ '์‚ฌ๋žŒ ํฌ๊ธฐ'๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ '๊ตฐ์ค‘ ํฌ์†Œ์„ฑ'์„ ์ถ”์ •ํ•ฉ๋‹ˆ๋‹ค. ์ถ”์ •๋œ '์‚ฌ๋žŒ ํฌ๊ธฐ' ๋ฐ '๊ตฐ์ค‘ ํฌ์†Œ์„ฑ'์€ ์ƒˆ๋กœ์šด ํ˜ผ์žก ์ธ์‹ ๋ฒ ์ด์ง€์•ˆ ์†์‹ค ๋ฐฉ๋ฒ•์—์„œ ์‚ฌ์šฉ๋˜์–ด ์  ์ฃผ์„์˜ ๊ต์‚ฌ ํ‘œํ˜„์„ ๊ฐœ์„ ํ•ฉ๋‹ˆ๋‹ค. ์ œ์•ˆ๋œ ๋ฐ€๋„ ์ถ”์ •๊ธฐ์˜ ํšจ์œจ์„ฑ์€ ๋„๋ฆฌ ์‚ฌ์šฉ๋˜๋Š” ๊ตฐ์ค‘ ๊ณ„์‚ฐ ๋ฒค์น˜๋งˆํฌ ๋ฐ์ดํ„ฐ ์„ธํŠธ์— ๋Œ€ํ•œ ์ตœ์ฒจ๋‹จ ๋ฐฉ๋ฒ•๊ณผ์˜ ๋น„๊ต ์‹คํ—˜์„ ํ†ตํ•ด ๊ฒ€์ฆ๋˜์—ˆ์Šต๋‹ˆ๋‹ค. ์ œ์•ˆ๋œ ๋ฐฉ๋ฒ•์€ ๋‹ค์–‘ํ•œ ๊ฐ์‹œ ํ™˜๊ฒฝ์—์„œ ์ตœ์ฒจ๋‹จ ๋ฐ€๋„ ์ถ”์ •๊ธฐ๋ณด๋‹ค ์šฐ์ˆ˜ํ•œ ์„ฑ๋Šฅ์„ ๋‹ฌ์„ฑํ–ˆ์Šต๋‹ˆ๋‹ค. 
๋˜ํ•œ ์ œ์•ˆ๋œ ๋ชจ๋“  ๊ตฐ์ค‘ ๋ฐ€๋„ ์ถ”์ • ๋ฐฉ๋ฒ•์— ๋Œ€ํ•ด ์—ฌ๋Ÿฌ ์ž๊ฐ€๋น„๊ต ์‹คํ—˜์„ ํ†ตํ•ด ๊ฐ ๊ตฌ์„ฑ ์š”์†Œ์˜ ํšจ์œจ์„ฑ์„ ๊ฒ€์ฆํ–ˆ์Šต๋‹ˆ๋‹ค.Abstract i Contents iv List of Tables vii List of Figures viii 1 Introduction 1 2 Related Works 4 2.1 Detection-based Approaches 4 2.2 Regression-based Approaches 5 2.3 Deep learning-based Approaches 5 2.3.1 Network Structure Perspective 6 2.3.2 Training Strategy Perspective 7 3 Selective Ensemble Network for Accurate Crowd Density Estimation 9 3.1 Overview 9 3.2 Combining Patch-based and Image-based Approaches 11 3.2.1 Local-Global Cascade Network 14 3.2.2 Experiments 20 3.2.3 Summary 24 3.3 Selective Ensemble Network with Adjustable Counting Loss (SEN-ACL) 25 3.3.1 Overall Scheme 25 3.3.2 Data Description 27 3.3.3 Gating Network 27 3.3.4 Sparse / Dense Network 29 3.3.5 Refinement Network 32 3.4 Experiments 34 3.4.1 Implementation Details 34 3.4.2 Dataset and Evaluation Metrics 35 3.4.3 Self-evaluation on WorldExpo'10 dataset 35 3.4.4 Comparative Evaluation with State of the Art Methods 38 3.4.5 Analysis on the Proposed Components 40 3.5 Summary 40 4 Sequential Crowd Density Estimation from Center to Periphery of Crowd 43 4.1 Overview 43 4.2 Cascade Residual Dilated Network (CRDN) 47 4.2.1 Effects of Dilated Convolution in Crowd Counting 47 4.2.2 The Proposed Network 48 4.3 Experiments 52 4.3.1 Datasets and Experimental Settings 52 4.3.2 Implementation Details 52 4.3.3 Comparison with Other Methods 55 4.3.4 Ablation Study 56 4.3.5 Analysis on the Proposed Components 63 4.4 Conclusion 63 5 Congestion-aware Bayesian Loss for Crowd Counting 64 5.1 Overview 64 5.2 Congestion-aware Bayesian Loss 67 5.2.1 Person-Scale Estimation 67 5.2.2 Crowd-Sparsity Estimation 70 5.2.3 Design of The Proposed Loss 70 5.3 Experiments 74 5.3.1 Datasets 76 5.3.2 Implementation Details 77 5.3.3 Evaluation Metrics 77 5.3.4 Ablation Study 78 5.3.5 Comparisons with State of the Art 80 5.3.6 Differences from Existing Person-scale Inference 87 5.3.7 Analysis on the Proposed Components 88 5.4 Summary 90 6 Conclusion 91 Abstract (In Korean) 105๋ฐ•
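    The dissertation's second contribution, the cascade of residual dilated blocks, can be sketched as follows: blocks with increasing dilation add residual density estimates, refining the map from the center of each person outward. Block count, channel widths, and dilation rates are illustrative assumptions.

        # Sketch of a cascade of residual dilated blocks: small dilations for
        # person centers, larger dilations for the periphery.  Illustrative only.
        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        def dilated_block(ch, dilation):
            return nn.Sequential(
                nn.Conv2d(ch, ch, 3, padding=dilation, dilation=dilation),
                nn.ReLU(inplace=True),
                nn.Conv2d(ch, 1, 1))

        class CascadeDilatedDensity(nn.Module):
            def __init__(self, ch=32, dilations=(1, 2, 4)):
                super().__init__()
                self.stem = nn.Sequential(nn.Conv2d(3, ch, 3, padding=1),
                                          nn.ReLU(inplace=True))
                self.blocks = nn.ModuleList([dilated_block(ch, d) for d in dilations])

            def forward(self, x):
                f = self.stem(x)
                density = None
                for block in self.blocks:              # small -> large dilation
                    residual = block(f)
                    density = residual if density is None else density + residual
                return F.relu(density)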

    Shallow feature based dense attention network for crowd counting

    While the performance of crowd counting via deep learning has improved dramatically in recent years, it remains a challenging problem due to cluttered backgrounds and the varying scales of people within an image. In this paper, we propose a Shallow feature based Dense Attention Network (SDANet) for crowd counting from still images, which diminishes the impact of backgrounds by incorporating a shallow-feature-based attention model and, meanwhile, captures multi-scale information by densely connecting hierarchical image features. Specifically, inspired by the observation that backgrounds and human crowds generally have noticeably different responses in shallow features, we build our attention model upon shallow feature maps, which results in accurate background-pixel detection. Moreover, considering that the most representative features of people at different scales can appear in different layers of a feature extraction network, we propose to densely connect the hierarchical image features of different layers and subsequently encode them for estimating the crowd density, so as to better retain them all. Experimental results on three benchmark datasets clearly demonstrate the superiority of SDANet when dealing with different scenarios. In particular, on the challenging UCF CC 50 dataset, our method outperforms existing methods by a large margin, as is evident from the remarkable 11.9% drop in Mean Absolute Error (MAE) achieved by SDANet.
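    A minimal sketch of the shallow-feature attention idea follows: an attention mask predicted from early-layer features suppresses background responses in densely concatenated multi-level features before density regression. The layer choices and channel sizes are assumptions for illustration, not the SDANet architecture.

        # Sketch of shallow-feature attention over densely concatenated
        # multi-level features.  Channel widths and depths are illustrative.
        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        class ShallowAttentionCounter(nn.Module):
            def __init__(self):
                super().__init__()
                self.shallow = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1),
                                             nn.ReLU(inplace=True))
                self.mid = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1, stride=2),
                                         nn.ReLU(inplace=True))
                self.deep = nn.Sequential(nn.Conv2d(32, 64, 3, padding=1, stride=2),
                                          nn.ReLU(inplace=True))
                self.attn = nn.Sequential(nn.Conv2d(16, 1, 1), nn.Sigmoid())
                self.head = nn.Conv2d(16 + 32 + 64, 1, 1)

            def forward(self, x):
                s = self.shallow(x)                        # shallow features
                m = self.mid(s)
                d = self.deep(m)
                size = s.shape[2:]
                up = lambda t: F.interpolate(t, size=size, mode='bilinear',
                                             align_corners=False)
                dense = torch.cat([s, up(m), up(d)], dim=1)  # dense multi-level features
                a = self.attn(s)                             # background-suppressing mask
                density = F.relu(self.head(dense * a))
                return density, density.sum(dim=(1, 2, 3))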