909 research outputs found

    Large-Scale Mapping of Human Activity using Geo-Tagged Videos

    This paper is the first work to perform spatio-temporal mapping of human activity using the visual content of geo-tagged videos. We utilize a recent deep-learning-based video analysis framework, termed hidden two-stream networks, to recognize a range of activities in YouTube videos. This framework is efficient and can run in real time or faster, which is important for recognizing events as they occur in streaming video and for reducing latency in analyzing already-captured video. This is, in turn, important for using video in smart-city applications. We perform a series of experiments to show that our approach can accurately map activities both spatially and temporally. We also demonstrate the advantages of using the visual content over the tags/titles.
    Comment: Accepted at ACM SIGSPATIAL 201
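
    The abstract describes the mapping pipeline only at a high level. As a minimal sketch of the final aggregation step, assuming per-video activity predictions with geotags and timestamps are already available (all records and names below are hypothetical, not the paper's data or API):

```python
from collections import Counter, defaultdict

# Hypothetical records: (latitude, longitude, hour_of_day, predicted_activity).
# In the paper these would come from running an activity recognizer such as
# hidden two-stream networks over geo-tagged YouTube videos.
predictions = [
    (40.7587, -73.9787, 18, "dancing"),
    (40.7484, -73.9857, 9,  "running"),
    (40.7580, -73.9790, 18, "dancing"),
]

def spatio_temporal_map(records, cell_deg=0.01):
    """Bin activity predictions into (lat, lon, hour) grid cells."""
    grid = defaultdict(Counter)
    for lat, lon, hour, activity in records:
        cell = (round(lat / cell_deg), round(lon / cell_deg), hour)
        grid[cell][activity] += 1
    return grid

for cell, counts in spatio_temporal_map(predictions).items():
    print(cell, counts.most_common(1))
```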

    Two-stream Multi-dimensional Convolutional Network for Real-time Violence Detection

    The increasing number of surveillance cameras and growing security concerns have made automatic violent-activity detection from surveillance footage an active area of research. Modern deep learning methods have achieved good accuracy in violence detection and have proved successful because of their applicability in intelligent surveillance systems. However, the models are computationally expensive and large because of their inefficient feature-extraction methods. This work presents a novel architecture for violence detection called the Two-stream Multi-dimensional Convolutional Network (2s-MDCN), which uses RGB frames and optical flow to detect violence. Our proposed method extracts temporal and spatial information independently via 1D, 2D, and 3D convolutions. Despite combining multi-dimensional convolutional networks, our models are lightweight and efficient due to reduced channel capacity, yet they learn to extract meaningful spatial and temporal information. Additionally, combining RGB frames and optical flow yields 2.2% higher accuracy than a single RGB stream. Despite their lower complexity, our models obtained state-of-the-art accuracy of 89.7% on the largest violence detection benchmark dataset.
    Comment: 8 pages, 6 figures
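
    The abstract gives the architecture only in outline. A minimal PyTorch sketch of a two-stream design in that spirit, with a 3D-convolutional RGB branch, a per-frame 2D plus temporal 1D optical-flow branch, and late fusion (layer sizes and names are illustrative assumptions, not the authors' published configuration):

```python
import torch
import torch.nn as nn

class TwoStreamSketch(nn.Module):
    """Illustrative two-stream network; not the published 2s-MDCN layout."""
    def __init__(self, num_classes=2):
        super().__init__()
        # RGB stream: 3D convolutions over (C, T, H, W) clips.
        self.rgb = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(),
        )
        # Flow stream: 2D convolution per frame, then a 1D temporal convolution.
        self.flow2d = nn.Sequential(
            nn.Conv2d(2, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.flow1d = nn.Sequential(
            nn.Conv1d(16, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
        )
        self.head = nn.Linear(16 + 16, num_classes)  # late fusion by concatenation

    def forward(self, rgb, flow):
        # rgb: (B, 3, T, H, W); flow: (B, T, 2, H, W) stacked optical flow.
        b, t = flow.shape[:2]
        r = self.rgb(rgb)
        f = self.flow2d(flow.reshape(b * t, 2, *flow.shape[3:]))  # per-frame 2D features
        f = self.flow1d(f.reshape(b, t, -1).transpose(1, 2))      # 1D conv over time
        return self.head(torch.cat([r, f], dim=1))

model = TwoStreamSketch()
out = model(torch.randn(1, 3, 8, 64, 64), torch.randn(1, 8, 2, 64, 64))
print(out.shape)  # torch.Size([1, 2])
```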

    Study on Deep Learning Techniques for Finding Suspicious Violence Detection in a Video Surveillance

    Detecting suspicious visual objects is essential for applying automatic violence detection (AVD) in video surveillance. Continuous monitoring of objects or any unusual activity is a tedious task, and learning from video surveillance is an emerging research problem in AVD applications. Deep learning is an intelligent and trustworthy technique for detecting or classifying suspicious data objects; it classifies suspicious video frames by modeling specific categories of videos. Current deep models such as the convolutional neural network (CNN), convolutional long short-term memory (ConvLSTM), AlexNet, VGG-16, MobileNet, and GoogleNet have been widely successful in real-time violence detection with video clips as input. This paper presents the findings of experimental studies of these deep models, using classification measures to demonstrate their efficacy for our AVD application. Benchmark violence (V), non-violence (NV), and weapon-violence (WV) video datasets are used in the experiments to characterize each model's performance when classifying suspicious videos for public safety.
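
    The classification measures themselves are standard. As a brief sketch of how per-class efficacy might be reported for the V/NV/WV setting (the labels and predictions below are made up for illustration):

```python
from sklearn.metrics import classification_report

# Hypothetical ground-truth and predicted labels for the three classes:
# violence (V), non-violence (NV), and weapon violence (WV).
y_true = ["V", "NV", "WV", "V", "NV", "WV", "V", "NV"]
y_pred = ["V", "NV", "V",  "V", "NV", "WV", "NV", "NV"]

# Per-class precision, recall, and F1, plus overall accuracy.
print(classification_report(y_true, y_pred, labels=["V", "NV", "WV"]))
```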

    A fully integrated violence detection system using CNN and LSTM

    Recently, the number of violence-related incidents in places such as remote roads, pathways, shopping malls, elevators, sports stadiums, and liquor shops has increased drastically, and such incidents are unfortunately discovered only after it's too late. The aim is to create a complete system that can perform real-time video analysis, recognize the presence of any violent activity, and notify the concerned authority, such as the police department of the corresponding area. Using the deep learning networks CNN and LSTM along with a well-defined system architecture, we have achieved an efficient solution for real-time analysis of video footage, so that the concerned authority can monitor the situation through a mobile application that immediately notifies them when a violent event occurs.
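
    The abstract does not give the exact layer configuration. A minimal PyTorch sketch of the general pattern it describes, a per-frame CNN feature extractor feeding an LSTM (all sizes are illustrative assumptions):

```python
import torch
import torch.nn as nn

class CnnLstmSketch(nn.Module):
    """Per-frame CNN features fed to an LSTM; an illustrative layout only."""
    def __init__(self, num_classes=2, feat_dim=32, hidden=64):
        super().__init__()
        self.cnn = nn.Sequential(            # lightweight per-frame encoder
            nn.Conv2d(3, feat_dim, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, clip):                 # clip: (B, T, 3, H, W)
        b, t = clip.shape[:2]
        feats = self.cnn(clip.reshape(b * t, *clip.shape[2:])).reshape(b, t, -1)
        _, (h, _) = self.lstm(feats)         # h[-1]: last layer's final state
        return self.head(h[-1])              # logits: violent vs. non-violent

logits = CnnLstmSketch()(torch.randn(2, 16, 3, 64, 64))
print(logits.shape)  # torch.Size([2, 2])
```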

    ์ƒํ™ฉ ํŒ๋‹จ์„ ๊ธฐ๋ฐ˜์œผ๋กœ ํ•œ ํญ๋ ฅ ๊ฐ์ง€ ์ธ๊ณต ์ง€๋Šฅ ๋ชจ๋ธ์˜ ํšจ์œจํ™”: ๊ฐ์‹œ ์นด๋ฉ”๋ผ ์‹œ๋‚˜๋ฆฌ์˜ค๋ฅผ ์ค‘์‹ฌ์œผ๋กœ

    ํ•™์œ„๋…ผ๋ฌธ(์„์‚ฌ) -- ์„œ์šธ๋Œ€ํ•™๊ต๋Œ€ํ•™์› : ๋ฐ์ดํ„ฐ์‚ฌ์ด์–ธ์Šค๋Œ€ํ•™์› ๋ฐ์ดํ„ฐ์‚ฌ์ด์–ธ์Šคํ•™๊ณผ, 2023. 2. ๊น€ํ˜•์‹ .Recently, CCTVs are installed everywhere and play an important role in crime prevention and investigation. However, there is a problem in that a huge amount of manpower is required to monitor CCTV recordings. From this point of view, deep learning (DNN) based models that can automatically detect violence have been developed. However, they used heavy architectures such as 3D convolution or LSTM to process video data. For this reason, they require offloading to the central server for recordings to be processed so that incur huge transmission cost and privacy concern. Furthermore, given violence does not occur frequently, it is inefficient to run heavy video recognition model all the time. To solve these problems, this study proposes WhenToWatch, to enhance efficiency of violence detection system on surveillance camera. Main goals of this study are as follows: (1) To devise DNN-based violence detection system fully run on the CCTV devices to avoid offloading cost and privacy issues. (2) To reduce energy consumption of the device and processing time by introducing pre-screening module checking existence of people and deciding whether violence detection model should be executed or not. (3) To minimize computation overhead of the pre-screening module by combining lightweight non-DNN based methods and executing them according to previous status. In conclusion, WhenToWatch can be helpful when running violence detection models on edge devices such as CCTV, where power and computing resources are limited. Experiments show that WhenToWatch can reduce the execution of the violence detection model by 17% on the RWF-2000 dataset and 31% on the CCTV-Busan dataset. In addition, WhenToWatch reduces average processing time per a video from 310.46 seconds to 255.60 seconds and average power consumption from 3,303mW to 3,100mW on Jetson Nano, confirming it contributes to efficient on-device system operation.์ตœ๊ทผ์—๋Š” ์•ˆ์ „์„ ์œ„ํ•ด CCTV๊ฐ€ ๊ณณ๊ณณ์— ์„ค์น˜๋˜์–ด ์žˆ์œผ๋ฉฐ ๋ฒ”์ฃ„ ์˜ˆ๋ฐฉ ๋ฐ ์ˆ˜์‚ฌ์— ์ค‘์š”ํ•œ ์—ญํ• ์„ ํ•˜๊ณ  ์žˆ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ CCTV ์˜์ƒ๋“ค์„ ์‹ค์‹œ๊ฐ„์œผ๋กœ ๊ฐ์‹œํ•˜๊ฑฐ๋‚˜ ๋…นํ™”๋œ ์˜์ƒ์„ ์žฌ๊ฒ€ํ† ํ•˜๊ธฐ ์œ„ํ•ด์„œ๋Š” ๋ง‰๋Œ€ํ•œ ์ธ๋ ฅ์ด ํ•„์š”ํ•˜๋‹ค๋Š” ๋ฌธ์ œ์ ์ด ์žˆ๋‹ค. ์ด๋Ÿฌํ•œ ๊ด€์ ์—์„œ ์ž๋™์œผ๋กœ ํญ๋ ฅ์„ ๊ฐ์ง€ํ•  ์ˆ˜ ์žˆ๋Š” ๋”ฅ๋Ÿฌ๋‹ ๋ชจ๋ธ๋“ค์ด ๊พธ์ค€ํžˆ ๊ฐœ๋ฐœ๋˜์–ด์™”๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ ๋Œ€๋ถ€๋ถ„์˜ ๋ชจ๋ธ์€ 3D ์ปจ๋ณผ๋ฃจ์…˜, LSTM ๋“ฑ์˜ ๋ฌด๊ฑฐ์šด ์˜์ƒ์ฒ˜๋ฆฌ ๋ชจ๋ธ์„ ์‚ฌ์šฉํ–ˆ๊ธฐ ๋•Œ๋ฌธ์— CCTV ๋””๋ฐ”์ด์Šค ๋‚ด์—์„œ์˜ ์ถ”๋ก ์€ ๊ฑฐ์˜ ๋ถˆ๊ฐ€๋Šฅํ–ˆ๊ณ , ์„œ๋ฒ„๋กœ ์˜์ƒ์„ ์ „์†กํ•˜์—ฌ ์ฒ˜๋ฆฌํ•˜๋Š” ๊ฒƒ์„ ์ „์ œ๋กœ ํ•œ๋‹ค. ์ด ๊ฒฝ์šฐ ๋ง‰๋Œ€ํ•œ ์ „์†ก ๋น„์šฉ์ด ๋ฐœ์ƒํ•  ๋ฟ๋งŒ ์•„๋‹ˆ๋ผ ์‚ฌ์ƒํ™œ ์นจํ•ด ๋ฌธ์ œ๊ฐ€ ๋ฐœ์ƒํ•  ์†Œ์ง€๊ฐ€ ์žˆ๋‹ค. ๋ฟ๋งŒ ์•„๋‹ˆ๋ผ, ํญ๋ ฅ์€ ์ผ๋ฐ˜์ ์ธ ์‚ฌ๊ฑด์— ๋น„ํ•ด ๋ฐœ์ƒ ๋นˆ๋„๊ฐ€ ๋‚ฎ๋‹ค๋Š” ์ ์„ ๊ณ ๋ คํ•œ๋‹ค๋ฉด CCTV ๋™์ž‘ ์‹œ๊ฐ„ ๋‚ด๋‚ด ๋ฌด๊ฑฐ์šด ํญ๋ ฅ ๊ฐ์ง€ ๋ชจ๋ธ์„ ๊ตฌ๋™ํ•˜๋Š” ๊ฒƒ์€ ๋น„ํšจ์œจ์ ์ด๋ผ๊ณ  ํ•  ์ˆ˜ ์žˆ๋‹ค. ์ด๋Ÿฌํ•œ ๋ฌธ์ œ์ ๋“ค์„ ํ•ด๊ฒฐํ•˜๊ณ  ํญ๋ ฅ ๊ฐ์ง€ ์‹œ์Šคํ…œ์˜ ํšจ์œจ์„ฑ์„ ์ œ๊ณ ํ•˜๊ธฐ ์œ„ํ•ด ๋ณธ ์—ฐ๊ตฌ์—์„œ๋Š” WhenToWatch๋ผ๋Š” ํญ๋ ฅ ๊ฐ์ง€ ์‹œ์Šคํ…œ์„ ์ œ์•ˆํ•œ๋‹ค. ๋ณธ ์—ฐ๊ตฌ์˜ ์ฃผ์š” ๋ชฉ์ ์€ ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค. (1) ๋ฐ์ดํ„ฐ ์ „์†ก ๋น„์šฉ์„ ์ตœ์†Œํ™”ํ•˜๊ณ  ๊ฐœ์ธ์ •๋ณด๋ฅผ ๋ณดํ˜ธํ•˜๊ธฐ ์œ„ํ•ด ๊ฐ์‹œ์นด๋ฉ”๋ผ ์žฅ์น˜ ๋‚ด์—์„œ ๊ตฌ๋™ ๊ฐ€๋Šฅํ•œ ๋”ฅ๋Ÿฌ๋‹ ๊ธฐ๋ฐ˜์˜ ํญ๋ ฅ ๊ฐ์ง€ ์‹œ์Šคํ…œ์„ ์ œ์•ˆํ•œ๋‹ค. 
(2) ๊ฐ์‹œ์นด๋ฉ”๋ผ ์žฅ์น˜์˜ ์ „๋ ฅ ์†Œ๋ชจ๋Ÿ‰๊ณผ ๋ฐ์ดํ„ฐ ์ฒ˜๋ฆฌ ์‹œ๊ฐ„์„ ์ค„์ด๊ธฐ ์œ„ํ•ด ์‚ฌ์ „ ํŒ๋‹จ ๋ชจ๋“ˆ์„ ๋„์ž…ํ•œ๋‹ค. ์ด๋ฅผ ํ†ตํ•ด ์‚ฌ๋žŒ์˜ ์กด์žฌ ์—ฌ๋ถ€๋ฅผ ํŒ๋‹จํ•˜๊ณ  ํญ๋ ฅ ๊ฐ์ง€ ๋ชจ๋ธ์˜ ์‹คํ–‰ ์—ฌ๋ถ€๋ฅผ ๊ฒฐ์ •ํ•จ์œผ๋กœ์จ ๋ถˆํ•„์š”ํ•œ ์—ฐ์‚ฐ๋Ÿ‰์„ ์ค„์ผ ์ˆ˜ ์žˆ๋‹ค. (3) ์‚ฌ์ „ ํŒ๋‹จ ๋ชจ๋“ˆ๋กœ ์ธํ•œ ์ถ”๊ฐ€์ ์ธ ์—ฐ์‚ฐ๋Ÿ‰ ๋ถ€๋‹ด์„ ์ตœ์†Œํ™”ํ•˜๊ธฐ ์œ„ํ•ด ์‹คํ–‰์†๋„๊ฐ€ ๋น ๋ฅธ ๋น„ ๋”ฅ๋Ÿฌ๋‹ ๊ธฐ๋ฐ˜์˜ ๋ฐฉ๋ฒ•๋ก ๋“ค์„ ๊ฒฐํ•ฉํ•œ ์‹œ์Šคํ…œ์„ ๋””์ž์ธํ•˜๊ณ , ์ด์ „ ์ƒํƒœ์— ๋”ฐ๋ผ ์ ์ ˆํ•œ ์—ฐ์‚ฐ์„ ์‹คํ–‰ํ•œ๋‹ค. ์ตœ์ข…์ ์œผ๋กœ WhenToWatch๋Š” CCTV์™€ ๊ฐ™์ด ๋ฆฌ์†Œ์Šค๊ฐ€ ์ œํ•œ๋œ ์—ฃ์ง€ ๋””๋ฐ”์ด์Šค์—์„œ ํญ๋ ฅ ๊ฐ์ง€ ๋ชจ๋ธ์„ ํšจ์œจ์ ์œผ๋กœ ๊ตฌ๋™ํ•  ์ˆ˜ ์žˆ๊ฒŒ ํ•œ๋‹ค. ์‹ค์ œ ์‹คํ—˜ ๊ฒฐ๊ณผ, ์ œ์•ˆ๋œ ์‚ฌ์ „ ํŒ๋‹จ ๋ชจ๋“ˆ์„ ์ ์šฉํ–ˆ์„ ๋•Œ, ํญ๋ ฅ ๊ฐ์ง€ ๋ชจ๋ธ์˜ ์‹คํ–‰ ํšŸ์ˆ˜๋Š” RWF-2000 ๋ฐ์ดํ„ฐ์…‹์—์„œ ์•ฝ 17% ๊ฐ์†Œํ–ˆ์œผ๋ฉฐ CCTV-Busan ๋ฐ์ดํ„ฐ์…‹์—์„œ๋Š” ์•ฝ 31% ๊ฐ์†Œํ•˜๋Š” ๊ฒƒ์œผ๋กœ ๋‚˜ํƒ€๋‚ฌ๋‹ค. ๋ณธ ๋…ผ๋ฌธ์˜ ์‹œ์Šคํ…œ ๊ตฌ์กฐ๋ฅผ ํ†ตํ•ด ๋ณด๋‹ค ํšจ์œจ์ ์ธ ์‹œ์Šคํ…œ ์šด์˜์ด ๊ฐ€๋Šฅํ•จ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ์—ˆ๋‹ค. ๋˜ํ•œ ์ ฏ์Šจ ๋‚˜๋…ธ์—์„œ ํ‰๊ท  ๋น„๋””์˜ค ์ฒ˜๋ฆฌ ์‹œ๊ฐ„์€ 310.46์ดˆ์—์„œ 255.60์ดˆ๋กœ ๊ฐ์†Œํ•˜์˜€์œผ๋ฉฐ ์ „๋ ฅ ์†Œ๋ชจ๋Ÿ‰์€ 3,303mW์—์„œ 3,100mW๋กœ ๊ฐ์†Œํ•˜์—ฌ WhenToWatch๊ฐ€ ํšจ์œจ์ ์ธ ์˜จ๋””๋ฐ”์ด์Šค ์‹œ์Šคํ…œ ์šด์˜์— ๊ธฐ์—ฌํ•  ์ˆ˜ ์žˆ์Œ์„ ๋ณด์—ฌ์ฃผ์—ˆ๋‹ค.1 Introduction 1 2 Related Work 6 2.1 Violence Detection 6 2.2 Edge AI 7 2.3 Early-skipping in Neural Networks 8 3 Methodology 9 3.1 WhenToWatch Overview 9 3.2 Implementation Details of Sub-modules 12 3.3 Dataset 15 3.4 On-device Inference 16 4 Evaluation 18 4.1 Performance of Violence Detector 18 4.2 Effect of Pre-screening Module 19 4.3 Efficiency Measurement on Jeton Nano 21 5 Discussion and Future Work 23 5.1 Discussion and Future Work 23 6 Conclusion 24 6.1 Conclusion 24 Bibliography 25 Abstract in Korean 35์„
    • โ€ฆ