44 research outputs found

    Pedestrian Attribute Recognition: A Survey

    Full text link
    Recognizing pedestrian attributes is an important task in computer vision community due to it plays an important role in video surveillance. Many algorithms has been proposed to handle this task. The goal of this paper is to review existing works using traditional methods or based on deep learning networks. Firstly, we introduce the background of pedestrian attributes recognition (PAR, for short), including the fundamental concepts of pedestrian attributes and corresponding challenges. Secondly, we introduce existing benchmarks, including popular datasets and evaluation criterion. Thirdly, we analyse the concept of multi-task learning and multi-label learning, and also explain the relations between these two learning algorithms and pedestrian attribute recognition. We also review some popular network architectures which have widely applied in the deep learning community. Fourthly, we analyse popular solutions for this task, such as attributes group, part-based, \emph{etc}. Fifthly, we shown some applications which takes pedestrian attributes into consideration and achieve better performance. Finally, we summarized this paper and give several possible research directions for pedestrian attributes recognition. The project page of this paper can be found from the following website: \url{https://sites.google.com/view/ahu-pedestrianattributes/}.Comment: Check our project page for High Resolution version of this survey: https://sites.google.com/view/ahu-pedestrianattributes

    Adaptive Temporal Encoding Network for Video Instance-level Human Parsing

    Full text link
    Beyond the existing single-person and multiple-person human parsing tasks in static images, this paper makes the first attempt to investigate a more realistic video instance-level human parsing that simultaneously segments out each person instance and parses each instance into more fine-grained parts (e.g., head, leg, dress). We introduce a novel Adaptive Temporal Encoding Network (ATEN) that alternatively performs temporal encoding among key frames and flow-guided feature propagation from other consecutive frames between two key frames. Specifically, ATEN first incorporates a Parsing-RCNN to produce the instance-level parsing result for each key frame, which integrates both the global human parsing and instance-level human segmentation into a unified model. To balance between accuracy and efficiency, the flow-guided feature propagation is used to directly parse consecutive frames according to their identified temporal consistency with key frames. On the other hand, ATEN leverages the convolution gated recurrent units (convGRU) to exploit temporal changes over a series of key frames, which are further used to facilitate the frame-level instance-level parsing. By alternatively performing direct feature propagation between consistent frames and temporal encoding network among key frames, our ATEN achieves a good balance between frame-level accuracy and time efficiency, which is a common crucial problem in video object segmentation research. To demonstrate the superiority of our ATEN, extensive experiments are conducted on the most popular video segmentation benchmark (DAVIS) and a newly collected Video Instance-level Parsing (VIP) dataset, which is the first video instance-level human parsing dataset comprised of 404 sequences and over 20k frames with instance-level and pixel-wise annotations.Comment: To appear in ACM MM 2018. Code link: https://github.com/HCPLab-SYSU/ATEN. Dataset link: http://sysu-hcp.net/li

    Deep Learning Techniques for Video Instance Segmentation: A Survey

    Full text link
    Video instance segmentation, also known as multi-object tracking and segmentation, is an emerging computer vision research area introduced in 2019, aiming at detecting, segmenting, and tracking instances in videos simultaneously. By tackling the video instance segmentation tasks through effective analysis and utilization of visual information in videos, a range of computer vision-enabled applications (e.g., human action recognition, medical image processing, autonomous vehicle navigation, surveillance, etc) can be implemented. As deep-learning techniques take a dominant role in various computer vision areas, a plethora of deep-learning-based video instance segmentation schemes have been proposed. This survey offers a multifaceted view of deep-learning schemes for video instance segmentation, covering various architectural paradigms, along with comparisons of functional performance, model complexity, and computational overheads. In addition to the common architectural designs, auxiliary techniques for improving the performance of deep-learning models for video instance segmentation are compiled and discussed. Finally, we discuss a range of major challenges and directions for further investigations to help advance this promising research field

    Prediction of social dynamic agents and long-tailed learning challenges: a survey

    Get PDF
    Autonomous robots that can perform common tasks like driving, surveillance, and chores have the biggest potential for impact due to frequency of usage, and the biggest potential for risk due to direct interaction with humans. These tasks take place in openended environments where humans socially interact and pursue their goals in complex and diverse ways. To operate in such environments, such systems must predict this behaviour, especially when the behavior is unexpected and potentially dangerous. Therefore, we summarize trends in various types of tasks, modeling methods, datasets, and social interaction modules aimed at predicting the future location of dynamic, socially interactive agents. Furthermore, we describe long-tailed learning techniques from classification and regression problems that can be applied to prediction problems. To our knowledge this is the first work that reviews social interaction modeling within prediction, and long-tailed learning techniques within regression and prediction

    ๊ตฐ์ค‘ ๋ฐ€๋„ ์˜ˆ์ธก์„ ์œ„ํ•œ ๋„คํŠธ์›Œํฌ ๊ตฌ์กฐ์™€ ํ›ˆ๋ จ๋ฐฉ๋ฒ•์˜ ํ˜ผ์žก๋„ ๋ฐ ํฌ๊ธฐ ์ธ์‹ ์„ค๊ณ„

    Get PDF
    ํ•™์œ„๋…ผ๋ฌธ(๋ฐ•์‚ฌ) -- ์„œ์šธ๋Œ€ํ•™๊ต๋Œ€ํ•™์› : ๊ณต๊ณผ๋Œ€ํ•™ ์ „๊ธฐยท์ •๋ณด๊ณตํ•™๋ถ€, 2022.2. ์ตœ์ง„์˜.This dissertation presents novel deep learning-based crowd density estimation methods considering the crowd congestion and scale of people. Crowd density estimation is one of the important tasks for the intelligent surveillance system. Using the crowd density estimation, the region of interest for public security and safety can be easily indicated. It can also help advanced computer vision algorithms that are computationally expensive, such as pedestrian detection and tracking. After the introduction of deep learning to the crowd density estimation, most researches follow the conventional scheme that uses a convolutional neural network to learn the network to estimate crowd density map with training images. The deep learning-based crowd density estimation researches can consist of two perspectives; network structure perspective and training strategy perspective. In general, researches of network structure perspective propose a novel network structure to extract features to represent crowd well. On the other hand, those of the training strategy perspective propose a novel training methodology or a loss function to improve the counting performance. In this dissertation, I propose several works in both perspectives in deep learning-based crowd density estimation. In particular, I design the network models to be had rich crowd representation characteristics according to the crowd congestion and the scale of people. I propose two novel network structures: selective ensemble network and cascade residual dilated network. Also, I propose one novel loss function for the crowd density estimation: congestion-aware Bayesian loss. First, I propose a selective ensemble deep network architecture for crowd density estimation. In contrast to existing deep network-based methods, the proposed method incorporates two sub-networks for local density estimation: one to learn sparse density regions and one to learn dense density regions. Locally estimated density maps from the two sub-networks are selectively combined in an ensemble fashion using a gating network to estimate an initial crowd density map. The initial density map is refined as a high-resolution map, using another sub-network that draws on contextual information in the image. In training, a novel adaptive loss scheme is applied to resolve ambiguity in the crowded region. The proposed scheme improves both density map accuracy and counting accuracy by adjusting the weighting value between density loss and counting loss according to the degree of crowdness and training epochs. Second, I propose a novel crowd density estimation architecture, which is composed of multiple dilated convolutional neural network blocks with different scales. The proposed architecture is motivated by an empirical analysis that small-scale dilated convolution well estimates the center area density of each person, whereas large-scale dilated convolution well estimates the periphery area density of a person. To estimate the crowd density map gradually from the center to the periphery of each person in a crowd, the multiple dilated CNN blocks are trained in cascading from the small dilated CNN block to the large one. Third, I propose a novel congestion-aware Bayesian loss method that considers the person-scale and crowd-sparsity. Deep learning-based crowd density estimation can greatly improve the accuracy of crowd counting. Though a Bayesian loss method resolves the two problems of the need of a hand-crafted ground truth (GT) density and noisy annotations, counting accurately in high-congested scenes remains a challenging issue. In a crowd scene, people's appearances change according to the scale of each individual (i.e., the person-scale). Also, the lower the sparsity of a local region (i.e., the crowd-sparsity), the more difficult it is to estimate the crowd density. I estimate the person-scale based on scene geometry, and I then estimate the crowd-sparsity using the estimated person-scale. The estimated person-scale and crowd-sparsity are utilized in the novel congestion-aware Bayesian loss method to improve the supervising representation of the point annotations. The effectiveness of the proposed density estimators is validated through comparative experiments with state-of-the-art methods on widely-used crowd counting benchmark datasets. The proposed methods are achieved superior performance to the state-of-the-art density estimators on diverse surveillance environments. In addition, for all proposed crowd density estimation methods, the efficiency of each component is verified through several ablation experiments.๋ณธ ํ•™์œ„๋…ผ๋ฌธ์—์„œ๋Š” ๊ตฐ์ค‘์˜ ํ˜ผ์žก๋„์™€ ์‚ฌ๋žŒ์˜ ํฌ๊ธฐ๋ฅผ ๊ณ ๋ คํ•œ ๋”ฅ๋Ÿฌ๋‹ ๊ธฐ๋ฐ˜์˜ ์ƒˆ๋กœ์šด ๊ตฐ์ค‘ ๋ฐ€๋„ ์ถ”์ • ๋ฐฉ๋ฒ•์„ ์ œ์‹œํ•ฉ๋‹ˆ๋‹ค. ๊ตฐ์ค‘ ๋ฐ€๋„ ์ถ”์ •์€ ์ง€๋Šฅํ˜• ๊ฐ์‹œ ์‹œ์Šคํ…œ์˜ ์ค‘์š”ํ•œ ๊ณผ์ œ๋“ค ์ค‘ ํ•˜๋‚˜์ž…๋‹ˆ๋‹ค. ๊ตฐ์ค‘ ๋ฐ€๋„ ์ถ”์ •์„ ์‚ฌ์šฉํ•˜์—ฌ ๊ณต๊ณต ๋ณด์•ˆ ๋ฐ ์•ˆ์ „์— ๋Œ€ํ•œ ๊ด€์‹ฌ ์˜์—ญ์„ ์‰ฝ๊ฒŒ ํ‘œ์‹œํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๋˜ํ•œ ์ด๋ฅผ ์ด์šฉํ•˜๋ฉด ๋ณดํ–‰์ž ๊ฐ์ง€, ์ถ”์  ๋“ฑ ์—ฐ์‚ฐ ๋ถ€๋‹ด์ด ๋†’์€ ๊ณ ๊ธ‰ ์ปดํ“จํ„ฐ ๋น„์ „ ์•Œ๊ณ ๋ฆฌ์ฆ˜์ด ์ง€๋Šฅํ˜• ๊ฐ์‹œ ์‹œ์Šคํ…œ์— ํšจ๊ณผ์ ์œผ๋กœ ์ ์šฉํ•˜๋Š” ๊ฒƒ์„ ๋„์šธ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๊ตฐ์ค‘ ๋ฐ€๋„ ์ถ”์ •์— ๋”ฅ ๋Ÿฌ๋‹์ด ๋„์ž…๋œ ํ›„ ๋Œ€๋ถ€๋ถ„์˜ ์—ฐ๊ตฌ๋Š” ํ›ˆ๋ จ ์ด๋ฏธ์ง€๋กœ ๊ตฐ์ค‘ ๋ฐ€๋„ ๋งต์„ ์ถ”์ •ํ•˜๋Š” ๋„คํŠธ์›Œํฌ๋ฅผ ํ•™์Šตํ•˜๊ธฐ ์œ„ํ•ด ์ปจ๋ณผ๋ฃจ์…˜ ์‹ ๊ฒฝ๋ง์„ ์‚ฌ์šฉํ•˜๋Š” ๊ด€์Šต์ ์ธ ๋ฐฉ์‹์„ ๋”ฐ๋ฆ…๋‹ˆ๋‹ค. ๋”ฅ ๋Ÿฌ๋‹ ๊ธฐ๋ฐ˜ ๊ตฐ์ค‘ ๋ฐ€๋„ ์ถ”์ • ์—ฐ๊ตฌ๋Š” ๋„คํŠธ์›Œํฌ ๊ตฌ์กฐ ๊ด€์ ๊ณผ ํ›ˆ๋ จ ์ „๋žต ๊ด€์ ์˜ ๋‘ ๊ฐ€์ง€ ๊ด€์ ์œผ๋กœ ๋‚˜๋‰  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์ผ๋ฐ˜์ ์œผ๋กœ ๋„คํŠธ์›Œํฌ ๊ตฌ์กฐ ๊ด€์ ์˜ ์—ฐ๊ตฌ์—์„œ๋Š” ๊ตฐ์ค‘์„ ์ž˜ ํ‘œํ˜„ํ•˜๊ธฐ ์œ„ํ•œ ํŠน์ง•์„ ์ถ”์ถœํ•˜๊ธฐ ์œ„ํ•œ ์ƒˆ๋กœ์šด ๋„คํŠธ์›Œํฌ ๊ตฌ์กฐ๋ฅผ ์ œ์•ˆํ•ฉ๋‹ˆ๋‹ค. ๋ฐ˜๋ฉด ํ›ˆ๋ จ ์ „๋žต ๊ด€์ ์—์„œ๋Š” ๊ณ„์ˆ˜ ์„ฑ๋Šฅ์„ ํ–ฅ์ƒ์‹œํ‚ค๊ธฐ ์œ„ํ•ด ์ƒˆ๋กœ์šด ํ›ˆ๋ จ ๋ฐฉ๋ฒ•๋ก ์ด๋‚˜ ์†์‹ค ํ•จ์ˆ˜๋ฅผ ์ œ์•ˆํ•ฉ๋‹ˆ๋‹ค. ๋ณธ ํ•™์œ„๋…ผ๋ฌธ์—์„œ๋Š” ๋”ฅ๋Ÿฌ๋‹ ๊ธฐ๋ฐ˜ ๊ตฐ์ค‘๋ฐ€๋„ ์ถ”์ •์—์„œ ๋‘ ๊ฐ€์ง€ ๊ด€์ ์—์„œ ์—ฌ๋Ÿฌ ์—ฐ๊ตฌ๋ฅผ ์ œ์•ˆํ•ฉ๋‹ˆ๋‹ค. ํŠนํžˆ, ๊ฐ ์‚ฌ๋žŒ์˜ ๊ตฐ์ค‘ ํ˜ผ์žก๋„์™€ ๊ทœ๋ชจ์— ๋”ฐ๋ผ ํ’๋ถ€ํ•œ ๊ตฐ์ค‘ ํ‘œํ˜„ ํŠน์„ฑ์„ ๊ฐ–๋„๋ก ์ œ์•ˆํ•˜๋Š” ๋ชจ๋ธ์„ ์„ค๊ณ„ํ•ฉ๋‹ˆ๋‹ค. ์„ ํƒ์  ์•™์ƒ๋ธ” ๋„คํŠธ์›Œํฌ์™€ ๊ณ„๋‹จ์‹ ์ž”์—ฌ ํ™•์žฅ ๋„คํŠธ์›Œํฌ์˜ ๋‘ ๊ฐ€์ง€ ์ƒˆ๋กœ์šด ๋„คํŠธ์›Œํฌ ๊ตฌ์กฐ๋ฅผ ์ œ์•ˆํ•ฉ๋‹ˆ๋‹ค. ๋˜ํ•œ ๊ตฐ์ค‘ ๋ฐ€๋„ ์ถ”์ •์„ ์œ„ํ•œ ์ƒˆ๋กœ์šด ์†์‹ค ํ•จ์ˆ˜์ธ ํ˜ผ์žก ์ธ์‹ ๋ฒ ์ด์ง€์•ˆ ์†์‹ค์„ ์ œ์•ˆํ•ฉ๋‹ˆ๋‹ค. ๋จผ์ €, ์ •ํ™•ํ•œ ๊ตฐ์ค‘๋ฐ€๋„ ์ถ”์ •๊ณผ ์ธ์› ๊ณ„์ˆ˜๋ฅผ ์œ„ํ•œ ์„ ํƒ์  ์•™์ƒ๋ธ” ๋”ฅ ๋„คํŠธ์›Œํฌ ๊ตฌ์กฐ๋ฅผ ์ œ์•ˆํ•ฉ๋‹ˆ๋‹ค. ๊ธฐ์กด ๋”ฅ ๋„คํŠธ์›Œํฌ ๊ธฐ๋ฐ˜ ๋ฐฉ๋ฒ•๊ณผ ๋‹ฌ๋ฆฌ ์ œ์•ˆ๋œ ๋ฐฉ๋ฒ•์€ ์ง€์—ญ ๋ฐ€๋„ ์ถ”์ •์„ ์œ„ํ•ด ๋‘ ๊ฐœ์˜ ํ•˜์œ„ ๋„คํŠธ์›Œํฌ๋ฅผ ํ†ตํ•ฉํ•ฉ๋‹ˆ๋‹ค. ํ•˜๋‚˜๋Š” ํฌ์†Œ ๋ฐ€๋„ ์˜์—ญ ํ•™์Šต์šฉ์ด๊ณ  ๋‹ค๋ฅธ ํ•˜๋‚˜๋Š” ๋ฐ€์ง‘ ๋ฐ€๋„ ์˜์—ญ ํ•™์Šต์šฉ์ž…๋‹ˆ๋‹ค. ๋‘ ๊ฐœ์˜ ํ•˜์œ„ ๋„คํŠธ์›Œํฌ์—์„œ ์ง€์—ญ์ ์œผ๋กœ ์ถ”์ •๋œ ๋ฐ€๋„๋งต์€ ์ดˆ๊ธฐ ๊ตฐ์ค‘๋ฐ€๋„๋กœ ์ถ”์ •๋˜๋ฉฐ ๊ฒŒ์ดํŒ… ๋„คํŠธ์›Œํฌ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ์•™์ƒ๋ธ” ๋ฐฉ์‹์œผ๋กœ ์„ ํƒ์ ์œผ๋กœ ๊ฒฐํ•ฉ๋ฉ๋‹ˆ๋‹ค. ์ดˆ๊ธฐ ๋ฐ€๋„๋งต์€ ์ด๋ฏธ์ง€์˜ ์ปจํ…์ŠคํŠธ ์ •๋ณด๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ํ•˜๋Š” ๋˜ ๋‹ค๋ฅธ ํ•˜์œ„ ๋„คํŠธ์›Œํฌ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๊ณ ํ•ด์ƒ๋„ ๋งต์œผ๋กœ ๊ฐœ์„ ๋ฉ๋‹ˆ๋‹ค. ๋„คํŠธ์›Œํฌ ํ›ˆ๋ จ์—์„œ ์ƒˆ๋กœ์šด ์ ์‘ํ˜• ์†์‹ค ์ฒด๊ณ„๋ฅผ ์ ์šฉํ•˜์—ฌ ํ˜ผ์žกํ•œ ์ง€์—ญ์˜ ๋ชจํ˜ธ์„ฑ์„ ํ•ด๊ฒฐํ•ฉ๋‹ˆ๋‹ค. ์ œ์•ˆ๋œ ๊ธฐ๋ฒ•์€ ๋ฐ€์ง‘๋„ ๋ฐ ํ›ˆ๋ จ ์ •๋„์— ๋”ฐ๋ผ ๋ฐ€๋„ ์†์‹ค๊ณผ ๊ณ„์ˆ˜ ์†์‹ค ์‚ฌ์ด์˜ ๊ฐ€์ค‘์น˜๋ฅผ ์กฐ์ •ํ•˜์—ฌ ๋ฐ€๋„๋งต ์ •ํ™•๋„์™€ ๊ณ„์ˆ˜ ์ •ํ™•๋„๋ฅผ ๋ชจ๋‘ ํ–ฅ์ƒ์‹œํ‚ต๋‹ˆ๋‹ค. ๋‘ ๋ฒˆ์งธ๋กœ, ์Šค์ผ€์ผ์ด ๋‹ค๋ฅธ ๋‹ค์ค‘ ํ™•์žฅ ์ปจ๋ณผ๋ฃจ์…˜ ๋ธ”๋ก์œผ๋กœ ๊ตฌ์„ฑ๋œ ์ƒˆ๋กœ์šด ๊ตฐ์ค‘๋ฐ€๋„ ์ถ”์ • ๋„คํŠธ์›Œํฌ ๊ตฌ์กฐ๋ฅผ ์ œ์•ˆํ•ฉ๋‹ˆ๋‹ค. ์ œ์•ˆ๋œ ๋„คํŠธ์›Œํฌ ๊ตฌ์กฐ๋Š” ์†Œ๊ทœ๋ชจ ํ™•์žฅ ์ปจ๋ณผ๋ฃจ์…˜์€ ๊ฐ ์‚ฌ๋žŒ์˜ ์ค‘์‹ฌ ์˜์—ญ ๋ฐ€๋„๋ฅผ ์ •ํ™•ํžˆ ์ถ”์ •ํ•˜๋Š” ๋ฐ˜๋ฉด ๋Œ€๊ทœ๋ชจ ํ™•์žฅ ์ปจ๋ณผ๋ฃจ์…˜์€ ์‚ฌ๋žŒ์˜ ์ฃผ๋ณ€ ์˜์—ญ ๋ฐ€๋„๋ฅผ ์ž˜ ์ถ”์ •ํ•œ๋‹ค๋Š” ๊ฒฝํ—˜์  ๋ถ„์„์—์„œ ๋น„๋กฏ๋˜์—ˆ์Šต๋‹ˆ๋‹ค. ๊ตฐ์ค‘์— ์žˆ๋Š” ๊ฐ ์‚ฌ๋žŒ์˜ ์ค‘์‹ฌ์—์„œ ์ฃผ๋ณ€์œผ๋กœ ์ ์ฐจ์ ์œผ๋กœ ๊ตฐ์ค‘๋ฐ€๋„๋งต์„ ์ถ”์ •ํ•˜๊ธฐ ์œ„ํ•ด ์—ฌ๋Ÿฌ ํ™•์žฅ๋œ ์ปจ๋ณผ๋ฃจ์…˜ ๋ธ”๋ก์ด ์ž‘์€ ํ™•์žฅ ์ปจ๋ณผ๋ฃจ์…˜ ๋ธ”๋ก์—์„œ ํฐ ๋ธ”๋ก์œผ๋กœ ๊ณ„๋‹จ์‹์œผ๋กœ ํ›ˆ๋ จ๋ฉ๋‹ˆ๋‹ค. ๋งˆ์ง€๋ง‰์œผ๋กœ, ์‚ฌ๋žŒ ๊ทœ๋ชจ์™€ ๊ตฐ์ค‘ ํฌ์†Œ์„ฑ์„ ๊ณ ๋ คํ•œ ์ƒˆ๋กœ์šด ํ˜ผ์žก ์ธ์‹ ๋ฒ ์ด์ง€์•ˆ ์†์‹ค ๋ฐฉ๋ฒ•์„ ์ œ์•ˆํ•ฉ๋‹ˆ๋‹ค. ๋”ฅ ๋Ÿฌ๋‹ ๊ธฐ๋ฐ˜ ๊ตฐ์ค‘ ๋ฐ€๋„ ์ถ”์ •์€ ๊ตฐ์ค‘ ๊ณ„์‚ฐ์˜ ์ •ํ™•๋„๋ฅผ ํฌ๊ฒŒ ํ–ฅ์ƒ์‹œํ‚ฌ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๋ฒ ์ด์ง€์•ˆ ์†์‹ค ๋ฐฉ๋ฒ•์€ ์†์œผ๋กœ ๋งŒ๋“  ์ง€์ƒ ์ง„์‹ค ๋ฐ€๋„์™€ ์žก์Œ์ด ์žˆ๋Š” ์ฃผ์„์˜ ํ•„์š”์„ฑ์ด๋ผ๋Š” ๋‘ ๊ฐ€์ง€ ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜์ง€๋งŒ ํ˜ผ์žกํ•œ ์žฅ๋ฉด์—์„œ ์ •ํ™•ํ•˜๊ฒŒ ๊ณ„์‚ฐํ•˜๋Š” ๊ฒƒ์€ ์—ฌ์ „ํžˆ ์–ด๋ ค์šด ๋ฌธ์ œ์ž…๋‹ˆ๋‹ค. ๊ตฐ์ค‘ ์žฅ๋ฉด์—์„œ ์‚ฌ๋žŒ์˜ ์™ธ๋ชจ๋Š” ๊ฐ ์‚ฌ๋žŒ์˜ ํฌ๊ธฐ('์‚ฌ๋žŒ ํฌ๊ธฐ')์— ๋”ฐ๋ผ ๋ฐ”๋€๋‹ˆ๋‹ค. ๋˜ํ•œ ๊ตญ๋ถ€ ์˜์—ญ์˜ ํฌ์†Œ์„ฑ('๊ตฐ์ค‘ ํฌ์†Œ์„ฑ')์ด ๋‚ฎ์„์ˆ˜๋ก ๊ตฐ์ค‘ ๋ฐ€๋„๋ฅผ ์ถ”์ •ํ•˜๊ธฐ๊ฐ€ ๋” ์–ด๋ ต์Šต๋‹ˆ๋‹ค. ์žฅ๋ฉด ๊ธฐํ•˜์ •๋ณด๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ '์‚ฌ๋žŒ ํฌ๊ธฐ'๋ฅผ ์ถ”์ •ํ•œ ๋‹ค์Œ ์ถ”์ •๋œ '์‚ฌ๋žŒ ํฌ๊ธฐ'๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ '๊ตฐ์ค‘ ํฌ์†Œ์„ฑ'์„ ์ถ”์ •ํ•ฉ๋‹ˆ๋‹ค. ์ถ”์ •๋œ '์‚ฌ๋žŒ ํฌ๊ธฐ' ๋ฐ '๊ตฐ์ค‘ ํฌ์†Œ์„ฑ'์€ ์ƒˆ๋กœ์šด ํ˜ผ์žก ์ธ์‹ ๋ฒ ์ด์ง€์•ˆ ์†์‹ค ๋ฐฉ๋ฒ•์—์„œ ์‚ฌ์šฉ๋˜์–ด ์  ์ฃผ์„์˜ ๊ต์‚ฌ ํ‘œํ˜„์„ ๊ฐœ์„ ํ•ฉ๋‹ˆ๋‹ค. ์ œ์•ˆ๋œ ๋ฐ€๋„ ์ถ”์ •๊ธฐ์˜ ํšจ์œจ์„ฑ์€ ๋„๋ฆฌ ์‚ฌ์šฉ๋˜๋Š” ๊ตฐ์ค‘ ๊ณ„์‚ฐ ๋ฒค์น˜๋งˆํฌ ๋ฐ์ดํ„ฐ ์„ธํŠธ์— ๋Œ€ํ•œ ์ตœ์ฒจ๋‹จ ๋ฐฉ๋ฒ•๊ณผ์˜ ๋น„๊ต ์‹คํ—˜์„ ํ†ตํ•ด ๊ฒ€์ฆ๋˜์—ˆ์Šต๋‹ˆ๋‹ค. ์ œ์•ˆ๋œ ๋ฐฉ๋ฒ•์€ ๋‹ค์–‘ํ•œ ๊ฐ์‹œ ํ™˜๊ฒฝ์—์„œ ์ตœ์ฒจ๋‹จ ๋ฐ€๋„ ์ถ”์ •๊ธฐ๋ณด๋‹ค ์šฐ์ˆ˜ํ•œ ์„ฑ๋Šฅ์„ ๋‹ฌ์„ฑํ–ˆ์Šต๋‹ˆ๋‹ค. ๋˜ํ•œ ์ œ์•ˆ๋œ ๋ชจ๋“  ๊ตฐ์ค‘ ๋ฐ€๋„ ์ถ”์ • ๋ฐฉ๋ฒ•์— ๋Œ€ํ•ด ์—ฌ๋Ÿฌ ์ž๊ฐ€๋น„๊ต ์‹คํ—˜์„ ํ†ตํ•ด ๊ฐ ๊ตฌ์„ฑ ์š”์†Œ์˜ ํšจ์œจ์„ฑ์„ ๊ฒ€์ฆํ–ˆ์Šต๋‹ˆ๋‹ค.Abstract i Contents iv List of Tables vii List of Figures viii 1 Introduction 1 2 Related Works 4 2.1 Detection-based Approaches 4 2.2 Regression-based Approaches 5 2.3 Deep learning-based Approaches 5 2.3.1 Network Structure Perspective 6 2.3.2 Training Strategy Perspective 7 3 Selective Ensemble Network for Accurate Crowd Density Estimation 9 3.1 Overview 9 3.2 Combining Patch-based and Image-based Approaches 11 3.2.1 Local-Global Cascade Network 14 3.2.2 Experiments 20 3.2.3 Summary 24 3.3 Selective Ensemble Network with Adjustable Counting Loss (SEN-ACL) 25 3.3.1 Overall Scheme 25 3.3.2 Data Description 27 3.3.3 Gating Network 27 3.3.4 Sparse / Dense Network 29 3.3.5 Refinement Network 32 3.4 Experiments 34 3.4.1 Implementation Details 34 3.4.2 Dataset and Evaluation Metrics 35 3.4.3 Self-evaluation on WorldExpo'10 dataset 35 3.4.4 Comparative Evaluation with State of the Art Methods 38 3.4.5 Analysis on the Proposed Components 40 3.5 Summary 40 4 Sequential Crowd Density Estimation from Center to Periphery of Crowd 43 4.1 Overview 43 4.2 Cascade Residual Dilated Network (CRDN) 47 4.2.1 Effects of Dilated Convolution in Crowd Counting 47 4.2.2 The Proposed Network 48 4.3 Experiments 52 4.3.1 Datasets and Experimental Settings 52 4.3.2 Implementation Details 52 4.3.3 Comparison with Other Methods 55 4.3.4 Ablation Study 56 4.3.5 Analysis on the Proposed Components 63 4.4 Conclusion 63 5 Congestion-aware Bayesian Loss for Crowd Counting 64 5.1 Overview 64 5.2 Congestion-aware Bayesian Loss 67 5.2.1 Person-Scale Estimation 67 5.2.2 Crowd-Sparsity Estimation 70 5.2.3 Design of The Proposed Loss 70 5.3 Experiments 74 5.3.1 Datasets 76 5.3.2 Implementation Details 77 5.3.3 Evaluation Metrics 77 5.3.4 Ablation Study 78 5.3.5 Comparisons with State of the Art 80 5.3.6 Differences from Existing Person-scale Inference 87 5.3.7 Analysis on the Proposed Components 88 5.4 Summary 90 6 Conclusion 91 Abstract (In Korean) 105๋ฐ•
    corecore