44 research outputs found

Pedestrian Attribute Recognition: A Survey

Author: Luo Bin
Tang Jin
Wang Xiao
Yang Rui
Zheng Shaofei
Publication date: 22/01/2019

Recognizing pedestrian attributes is an important task in the computer vision community because it plays an important role in video surveillance. Many algorithms have been proposed to handle this task. The goal of this paper is to review existing works that use traditional methods or deep learning networks. Firstly, we introduce the background of pedestrian attribute recognition (PAR, for short), including the fundamental concepts of pedestrian attributes and the corresponding challenges. Secondly, we introduce existing benchmarks, including popular datasets and evaluation criteria. Thirdly, we analyse the concepts of multi-task learning and multi-label learning, and explain the relations between these two learning paradigms and pedestrian attribute recognition. We also review some popular network architectures that have been widely applied in the deep learning community. Fourthly, we analyse popular solutions for this task, such as attribute-group and part-based methods, etc. Fifthly, we show some applications that take pedestrian attributes into consideration and achieve better performance. Finally, we summarize this paper and give several possible research directions for pedestrian attribute recognition. The project page of this paper can be found at the following website: https://sites.google.com/view/ahu-pedestrianattributes/
Comment: Check our project page for a high-resolution version of this survey: https://sites.google.com/view/ahu-pedestrianattributes
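
The multi-label view of pedestrian attribute recognition mentioned in the abstract can be illustrated with a minimal sketch: each attribute gets its own independent sigmoid output, and the loss is the mean binary cross-entropy over attributes. The attribute names, the function names, and the 0.5 threshold are assumptions chosen for illustration; they are not taken from the survey.

```python
import math

# Hypothetical attribute vocabulary, for illustration only.
ATTRIBUTES = ["female", "backpack", "hat", "long_hair"]


def sigmoid(z):
    """Map a raw score (logit) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def multilabel_bce(logits, targets):
    """Mean binary cross-entropy over independent attribute labels."""
    total = 0.0
    for z, t in zip(logits, targets):
        p = sigmoid(z)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(logits)


def predict_attributes(logits, threshold=0.5):
    """Return every attribute whose probability exceeds the threshold."""
    return [name for name, z in zip(ATTRIBUTES, logits) if sigmoid(z) > threshold]
```

Unlike single-label classification with a softmax, any number of attributes can be active for the same pedestrian, which is why each output is thresholded separately.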

arXiv.org e-Print Archive

Adaptive Temporal Encoding Network for Video Instance-level Human Parsing

Author: Chen Liang-Chieh
Jin Xiaojie
Liu Si
Tokmakov Pavel
Zhu Xizhou
Publication date: 10/08/2018

Beyond the existing single-person and multiple-person human parsing tasks in static images, this paper makes the first attempt to investigate a more realistic video instance-level human parsing task that simultaneously segments out each person instance and parses each instance into more fine-grained parts (e.g., head, leg, dress). We introduce a novel Adaptive Temporal Encoding Network (ATEN) that alternates between temporal encoding among key frames and flow-guided feature propagation for the other consecutive frames between two key frames. Specifically, ATEN first incorporates a Parsing-RCNN to produce the instance-level parsing result for each key frame, which integrates both global human parsing and instance-level human segmentation into a unified model. To balance accuracy and efficiency, flow-guided feature propagation is used to directly parse consecutive frames according to their identified temporal consistency with key frames. In addition, ATEN leverages convolutional gated recurrent units (convGRU) to exploit temporal changes over a series of key frames, which are further used to facilitate frame-level instance-level parsing. By alternating between direct feature propagation between consistent frames and temporal encoding among key frames, ATEN achieves a good balance between frame-level accuracy and time efficiency, which is a common and crucial problem in video object segmentation research. To demonstrate the superiority of ATEN, extensive experiments are conducted on the most popular video segmentation benchmark (DAVIS) and a newly collected Video Instance-level Parsing (VIP) dataset, which is the first video instance-level human parsing dataset, comprising 404 sequences and over 20k frames with instance-level and pixel-wise annotations.
Comment: To appear in ACM MM 2018. Code link: https://github.com/HCPLab-SYSU/ATEN. Dataset link: http://sysu-hcp.net/li
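
The idea of flow-guided feature propagation described in the abstract can be sketched as follows: features computed on a key frame are reused on a nearby frame by sampling them along a flow field. This is a simplified illustration, not the ATEN implementation. The function name, the nested-list data layout, and the `(dy, dx)` flow convention are assumptions, and nearest-neighbour sampling is swapped in for brevity because the abstract does not specify the sampling scheme.

```python
def warp_features(key_features, flow):
    """Propagate key-frame features to another frame along a flow field.

    key_features: H x W grid (nested lists) of per-pixel feature values.
    flow: H x W grid of (dy, dx) offsets pointing from each pixel of the
          current frame to its matching location in the key frame
          (an assumed convention for this sketch).
    Nearest-neighbour sampling is used here for brevity.
    """
    height, width = len(key_features), len(key_features[0])
    warped = []
    for y in range(height):
        row = []
        for x in range(width):
            dy, dx = flow[y][x]
            # Clamp the source location so it stays inside the feature map.
            src_y = min(max(int(round(y + dy)), 0), height - 1)
            src_x = min(max(int(round(x + dx)), 0), width - 1)
            row.append(key_features[src_y][src_x])
        warped.append(row)
    return warped
```

Because warping is much cheaper than running the full parsing network, only key frames pay the full cost, which is the accuracy and efficiency trade-off the abstract refers to.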

arXiv.org e-Print Archive

Crossref

Deep Learning Techniques for Video Instance Segmentation: A Survey

Author: Creighton Douglas
Hu Yongjian
Li Chang-Tsun
Lim Chee Peng
Xu Chenhao
Publication date: 18/10/2023

Video instance segmentation, also known as multi-object tracking and segmentation, is an emerging computer vision research area introduced in 2019, aiming at detecting, segmenting, and tracking instances in videos simultaneously. By tackling video instance segmentation tasks through effective analysis and utilization of visual information in videos, a range of computer vision-enabled applications (e.g., human action recognition, medical image processing, autonomous vehicle navigation, surveillance, etc.) can be implemented. As deep-learning techniques take a dominant role in various computer vision areas, a plethora of deep-learning-based video instance segmentation schemes have been proposed. This survey offers a multifaceted view of deep-learning schemes for video instance segmentation, covering various architectural paradigms, along with comparisons of functional performance, model complexity, and computational overheads. In addition to the common architectural designs, auxiliary techniques for improving the performance of deep-learning models for video instance segmentation are compiled and discussed. Finally, we discuss a range of major challenges and directions for further investigations to help advance this promising research field.

arXiv.org e-Print Archive

Prediction of social dynamic agents and long-tailed learning challenges: a survey

Author: Kunze Lars
Thuremella Divya
Publication venue: AI Access Foundation
Publication date: 29/08/2023

Autonomous robots that can perform common tasks like driving, surveillance, and chores have the biggest potential for impact due to frequency of usage, and the biggest potential for risk due to direct interaction with humans. These tasks take place in open-ended environments where humans socially interact and pursue their goals in complex and diverse ways. To operate in such environments, such systems must predict this behaviour, especially when the behaviour is unexpected and potentially dangerous. Therefore, we summarize trends in various types of tasks, modeling methods, datasets, and social interaction modules aimed at predicting the future location of dynamic, socially interactive agents. Furthermore, we describe long-tailed learning techniques from classification and regression problems that can be applied to prediction problems. To our knowledge, this is the first work that reviews social interaction modeling within prediction, and long-tailed learning techniques within regression and prediction.

Oxford University Research Archive

Congestion- and Scale-Aware Design of Network Structures and Training Methods for Crowd Density Estimation

Author: 정지엽
Publication venue: Seoul National University Graduate School (서울대학교 대학원)
Publication date: 01/02/2022

Doctoral dissertation -- Seoul National University Graduate School: College of Engineering, Department of Electrical and Computer Engineering, 2022.2. 최진영.
This dissertation presents novel deep learning-based crowd density estimation methods that consider crowd congestion and the scale of people. Crowd density estimation is one of the important tasks for intelligent surveillance systems. Using crowd density estimation, the region of interest for public security and safety can be easily indicated. It can also help advanced computer vision algorithms that are computationally expensive, such as pedestrian detection and tracking. Since the introduction of deep learning to crowd density estimation, most research has followed the conventional scheme of training a convolutional neural network to estimate a crowd density map from training images. Deep learning-based crowd density estimation research can be divided into two perspectives: the network structure perspective and the training strategy perspective. In general, research from the network structure perspective proposes a novel network structure that extracts features representing the crowd well. Research from the training strategy perspective, on the other hand, proposes a novel training methodology or loss function to improve counting performance. In this dissertation, I propose several works from both perspectives. In particular, I design the network models to have rich crowd representation characteristics according to the crowd congestion and the scale of people. I propose two novel network structures: a selective ensemble network and a cascade residual dilated network. I also propose one novel loss function for crowd density estimation: a congestion-aware Bayesian loss. First, I propose a selective ensemble deep network architecture for crowd density estimation. In contrast to existing deep network-based methods, the proposed method incorporates two sub-networks for local density estimation: one to learn sparse density regions and one to learn dense density regions. 
Locally estimated density maps from the two sub-networks are selectively combined in an ensemble fashion using a gating network to estimate an initial crowd density map. The initial density map is refined into a high-resolution map using another sub-network that draws on contextual information in the image. In training, a novel adaptive loss scheme is applied to resolve ambiguity in crowded regions. The proposed scheme improves both density map accuracy and counting accuracy by adjusting the weighting between density loss and counting loss according to the degree of crowdedness and the training epoch. Second, I propose a novel crowd density estimation architecture composed of multiple dilated convolutional neural network blocks with different scales. The proposed architecture is motivated by the empirical observation that small-scale dilated convolution estimates the center-area density of each person well, whereas large-scale dilated convolution estimates the periphery-area density of a person well. To estimate the crowd density map gradually from the center to the periphery of each person in a crowd, the multiple dilated CNN blocks are trained in a cascade from the small dilated CNN block to the large one. Third, I propose a novel congestion-aware Bayesian loss method that considers the person-scale and crowd-sparsity. Deep learning-based crowd density estimation can greatly improve the accuracy of crowd counting. Although a Bayesian loss method resolves the two problems of needing a hand-crafted ground truth (GT) density and of noisy annotations, counting accurately in highly congested scenes remains a challenging issue. In a crowd scene, people's appearances change according to the scale of each individual (i.e., the person-scale). Also, the lower the sparsity of a local region (i.e., the crowd-sparsity), the more difficult it is to estimate the crowd density. 
I estimate the person-scale based on scene geometry, and I then estimate the crowd-sparsity using the estimated person-scale. The estimated person-scale and crowd-sparsity are utilized in the novel congestion-aware Bayesian loss method to improve the supervising representation of the point annotations. The effectiveness of the proposed density estimators is validated through comparative experiments with state-of-the-art methods on widely used crowd counting benchmark datasets. The proposed methods achieve superior performance to the state-of-the-art density estimators in diverse surveillance environments. In addition, for all proposed crowd density estimation methods, the efficiency of each component is verified through several ablation experiments.

Contents:
Abstract
Contents
List of Tables
List of Figures
1 Introduction
2 Related Works
2.1 Detection-based Approaches
2.2 Regression-based Approaches
2.3 Deep learning-based Approaches
2.3.1 Network Structure Perspective
2.3.2 Training Strategy Perspective
3 Selective Ensemble Network for Accurate Crowd Density Estimation
3.1 Overview
3.2 Combining Patch-based and Image-based Approaches
3.2.1 Local-Global Cascade Network
3.2.2 Experiments
3.2.3 Summary
3.3 Selective Ensemble Network with Adjustable Counting Loss (SEN-ACL)
3.3.1 Overall Scheme
3.3.2 Data Description
3.3.3 Gating Network
3.3.4 Sparse / Dense Network
3.3.5 Refinement Network
3.4 Experiments
3.4.1 Implementation Details
3.4.2 Dataset and Evaluation Metrics
3.4.3 Self-evaluation on WorldExpo'10 dataset
3.4.4 Comparative Evaluation with State of the Art Methods
3.4.5 Analysis on the Proposed Components
3.5 Summary
4 Sequential Crowd Density Estimation from Center to Periphery of Crowd
4.1 Overview
4.2 Cascade Residual Dilated Network (CRDN)
4.2.1 Effects of Dilated Convolution in Crowd Counting
4.2.2 The Proposed Network
4.3 Experiments
4.3.1 Datasets and Experimental Settings
4.3.2 Implementation Details
4.3.3 Comparison with Other Methods
4.3.4 Ablation Study
4.3.5 Analysis on the Proposed Components
4.4 Conclusion
5 Congestion-aware Bayesian Loss for Crowd Counting
5.1 Overview
5.2 Congestion-aware Bayesian Loss
5.2.1 Person-Scale Estimation
5.2.2 Crowd-Sparsity Estimation
5.2.3 Design of The Proposed Loss
5.3 Experiments
5.3.1 Datasets
5.3.2 Implementation Details
5.3.3 Evaluation Metrics
5.3.4 Ablation Study
5.3.5 Comparisons with State of the Art
5.3.6 Differences from Existing Person-scale Inference
5.3.7 Analysis on the Proposed Components
5.4 Summary
6 Conclusion
Abstract (In Korean)
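
The abstract's contrast between small-scale and large-scale dilated convolution rests on how dilation widens the area one kernel covers. A minimal 1-D sketch of that relationship is given below. It is an assumption-laden illustration only: the dissertation's network uses 2-D CNN blocks, and the function names and the "valid" boundary handling here are hypothetical choices, not the author's implementation.

```python
def receptive_field(kernel_size, dilation):
    """Span of input positions covered by one dilated kernel."""
    return (kernel_size - 1) * dilation + 1


def dilated_conv1d(signal, kernel, dilation):
    """'Valid' 1-D dilated convolution (cross-correlation form)."""
    span = receptive_field(len(kernel), dilation)
    out = []
    for start in range(len(signal) - span + 1):
        acc = 0.0
        for k, w in enumerate(kernel):
            # Kernel taps are spaced `dilation` positions apart.
            acc += w * signal[start + k * dilation]
        out.append(acc)
    return out
```

With a 3-tap kernel, dilation 1 covers 3 input positions while dilation 4 covers 9, using the same number of weights. This is consistent with the abstract's description of small dilations capturing the center of a person and large dilations capturing the periphery.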

SNU Open Repository and Archive