Large-Scale Mapping of Human Activity using Geo-Tagged Videos
This paper is the first work to perform spatio-temporal mapping of human activity using the visual content of geo-tagged videos. We utilize a recent deep-learning based video analysis framework, termed hidden two-stream networks, to recognize a range of activities in YouTube videos. This framework is efficient and can run in real time or faster, which is important for recognizing events as they occur in streaming video and for reducing latency when analyzing already captured video. This is, in turn, important for using video in smart-city applications. We perform a series of experiments to show that our approach is able to accurately map activities both spatially and temporally. We also demonstrate the advantages of using the visual content over the tags/titles.
Comment: Accepted at ACM SIGSPATIAL 201
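The aggregation step this abstract describes, turning per-video activity labels into a spatio-temporal map, can be sketched as binning recognized activities into a coarse latitude/longitude/hour grid. The grid resolution, record layout, and sample records below are illustrative assumptions, not details from the paper:

```python
# Sketch of spatio-temporal aggregation of per-video activity labels,
# as one might do after running a recognizer over geo-tagged videos.
# Grid size (0.1 degree) and the sample records are assumptions.
from collections import Counter

def cell(lat, lon, hour, deg=0.1):
    # Snap a coordinate to a coarse lat/lon grid and an hour-of-day bucket.
    return (round(lat / deg) * deg, round(lon / deg) * deg, hour)

def activity_map(records, deg=0.1):
    # records: iterable of (lat, lon, hour, activity_label)
    counts = Counter()
    for lat, lon, hour, activity in records:
        counts[(cell(lat, lon, hour, deg), activity)] += 1
    return counts

records = [
    (40.71, -74.00, 18, "dancing"),
    (40.72, -74.01, 18, "dancing"),   # same coarse cell as the first record
    (40.71, -74.00, 9, "jogging"),
]
m = activity_map(records)
```

Querying `m` then gives per-cell, per-hour activity counts, which is the kind of spatial and temporal summary the paper evaluates.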
Two-stream Multi-dimensional Convolutional Network for Real-time Violence Detection
The increasing number of surveillance cameras and growing security concerns have made automatic violent-activity detection from surveillance footage an active research area. Modern deep learning methods achieve good accuracy in violence detection and have proved successful because of their applicability in intelligent surveillance systems. However, these models are computationally expensive and large because of their inefficient feature-extraction methods. This work presents a novel architecture for violence detection called the Two-stream Multi-dimensional Convolutional Network (2s-MDCN), which uses RGB frames and optical flow to detect violence. Our proposed method extracts temporal and spatial information independently with 1D, 2D, and 3D convolutions. Despite combining multi-dimensional convolutional networks, our models are lightweight and efficient due to reduced channel capacity, yet they learn to extract meaningful spatial and temporal information. Additionally, combining RGB frames and optical flow yields 2.2% higher accuracy than a single RGB stream. Despite their lower complexity, our models obtained state-of-the-art accuracy of 89.7% on the largest violence detection benchmark dataset.
Comment: 8 pages, 6 figures
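The two-stream pattern described above can be sketched as score-level (late) fusion of an RGB stream and an optical-flow stream. The per-stream classifiers below are stand-in stubs producing fixed logits; the actual 2s-MDCN convolutions and weights are not reproduced:

```python
# Minimal sketch of two-stream late fusion over {non-violence, violence}.
# Both stream classifiers are illustrative stubs, not the paper's networks.
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def rgb_stream(frames):
    # Stub: logits a spatial (RGB) network might output.
    return [0.2, 1.1]

def flow_stream(flows):
    # Stub: logits a temporal (optical-flow) network might output.
    return [0.5, 0.9]

def fuse(frames, flows, w_rgb=0.5, w_flow=0.5):
    # Average the per-stream class probabilities.
    p_rgb = softmax(rgb_stream(frames))
    p_flow = softmax(flow_stream(flows))
    return [w_rgb * a + w_flow * b for a, b in zip(p_rgb, p_flow)]

probs = fuse(None, None)
label = "violence" if probs[1] > probs[0] else "non-violence"
```

Averaging probabilities rather than logits is one common fusion choice; the paper's exact fusion scheme and stream weights may differ.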
Study on Deep Learning Techniques for Finding Suspicious Violence Detection in a Video Surveillance
Detecting suspicious visual objects is essential to applying automatic violence detection (AVD) in video surveillance. Continuously monitoring objects or any unusual events is a tedious task, and learning from video surveillance is an emerging research problem in AVD applications. Deep learning is an intelligent and trustworthy technique for detecting and classifying suspicious objects: it classifies suspicious video frames by modeling specific categories of videos. Current deep models, including the convolutional neural network (CNN), convolutional long short-term memory (ConvLSTM), AlexNet, VGG-16, MobileNet, and GoogleNet, have been widely successful in real-time violence detection on video-clip input. This paper presents the findings of experimental studies of these deep models, using classification measures to demonstrate their efficacy for our AVD application. Benchmark violence (V), non-violence (NV), and weapon violence (WV) video datasets are used in the experiments to characterize each model's performance while classifying suspicious videos for public safety.
A fully integrated violence detection system using CNN and LSTM
Recently, the number of violence-related incidents in places such as remote roads, pathways, shopping malls, elevators, sports stadiums, and liquor shops has increased drastically, and these incidents are unfortunately discovered only after it's too late. The aim is to create a complete system that performs real-time video analysis, recognizes the presence of any violent activity, and notifies the concerned authority, such as the police department of the corresponding area. Using the deep learning networks CNN and LSTM along with a well-defined system architecture, we have achieved an efficient solution for real-time analysis of video footage, so that the concerned authority can monitor the situation through a mobile application that notifies them immediately when a violent event occurs.
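The CNN+LSTM pattern this abstract relies on, a per-frame spatial feature extractor feeding a temporal model, with a notification hook at the end, can be sketched as follows. Every component here is a stand-in stub (the mean-intensity "CNN", the moving-average "LSTM", and the threshold are assumptions, not the paper's trained models):

```python
# Sketch of the CNN -> LSTM -> notify pipeline shape described above.
# All components are illustrative stubs, not the system's actual models.

def cnn_features(frame):
    # Stub spatial feature: mean intensity of the frame.
    return sum(frame) / len(frame)

def lstm_like(features, threshold=0.6):
    # Stub temporal model: an exponential moving average standing in
    # for an LSTM's recurrent state over the frame features.
    state = 0.0
    for f in features:
        state = 0.7 * state + 0.3 * f
    return state > threshold

def analyze(clip, notify):
    # clip: list of frames; notify: callback to alert the authority.
    feats = [cnn_features(frame) for frame in clip]
    if lstm_like(feats):
        notify("violence detected")

alerts = []
analyze([[1, 1], [1, 1], [1, 1]], alerts.append)
```

In a real deployment the `notify` callback would push to the mobile application the abstract mentions; here it simply appends to a list.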
Improving the Efficiency of an AI Violence Detection Model Based on Situation Assessment: Focusing on Surveillance Camera Scenarios
Thesis (Master's) -- Seoul National University, Graduate School of Data Science, February 2023. Recently, CCTVs have been installed everywhere and play an important role in crime prevention and investigation. However, monitoring CCTV recordings requires a huge amount of manpower. From this point of view, deep neural network (DNN) based models that can automatically detect violence have been developed. However, these models use heavy architectures such as 3D convolutions or LSTMs to process video data. For this reason, recordings must be offloaded to a central server for processing, which incurs a huge transmission cost and raises privacy concerns. Furthermore, given that violence does not occur frequently, it is inefficient to run a heavy video recognition model all the time.
To solve these problems, this study proposes WhenToWatch, a system that enhances the efficiency of violence detection on surveillance cameras. The main goals of this study are as follows: (1) to devise a DNN-based violence detection system that runs fully on the CCTV device, avoiding offloading costs and privacy issues; (2) to reduce the device's energy consumption and processing time by introducing a pre-screening module that checks for the presence of people and decides whether the violence detection model should be executed; and (3) to minimize the computational overhead of the pre-screening module itself by combining lightweight non-DNN methods and executing them according to the previous status. In conclusion, WhenToWatch is helpful when running violence detection models on edge devices such as CCTVs, where power and computing resources are limited.
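The pre-screening idea above, a cheap non-DNN check that gates the heavy detector and adapts to the previous status, can be sketched as frame differencing with a state-dependent threshold. The motion heuristic, threshold values, and status policy are assumptions for illustration, not the thesis's actual sub-modules:

```python
# Sketch of a pre-screening gate in the spirit of WhenToWatch: a cheap
# check decides per frame whether the expensive violence detector runs.
# Thresholds and the motion heuristic are illustrative assumptions.

def motion_score(prev_frame, frame):
    # Mean absolute pixel difference between consecutive grayscale frames.
    diff = sum(abs(a - b) for a, b in zip(prev_frame, frame))
    return diff / len(frame)

def should_run_detector(prev_frame, frame, prev_active, thresh=5.0):
    # State-dependent policy: if the detector was just active, lower the
    # threshold so an ongoing event is not dropped between frames.
    t = thresh * (0.5 if prev_active else 1.0)
    return motion_score(prev_frame, frame) > t

def process(frames, heavy_detector):
    active = False
    results = []
    for prev, cur in zip(frames, frames[1:]):
        if should_run_detector(prev, cur, active):
            active = heavy_detector(cur)   # expensive model, run rarely
            results.append(active)
        else:
            active = False
            results.append(None)           # detector skipped this frame
    return results

# Usage: frames as flat pixel lists; the heavy detector is a stub here.
frames = [[0, 0, 0, 0], [0, 0, 0, 0], [10, 10, 10, 10], [10, 10, 10, 10]]
results = process(frames, lambda frame: True)
```

The `None` entries mark frames where the heavy model was skipped, which is exactly where the reported execution-count and power savings would come from.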
Experiments show that WhenToWatch reduces executions of the violence detection model by 17% on the RWF-2000 dataset and by 31% on the CCTV-Busan dataset. In addition, WhenToWatch reduces the average processing time per video from 310.46 seconds to 255.60 seconds and the average power consumption from 3,303 mW to 3,100 mW on a Jetson Nano, confirming that it contributes to efficient on-device system operation.
1 Introduction
2 Related Work
2.1 Violence Detection
2.2 Edge AI
2.3 Early-skipping in Neural Networks
3 Methodology
3.1 WhenToWatch Overview
3.2 Implementation Details of Sub-modules
3.3 Dataset
3.4 On-device Inference
4 Evaluation
4.1 Performance of Violence Detector
4.2 Effect of Pre-screening Module
4.3 Efficiency Measurement on Jetson Nano
5 Discussion and Future Work
5.1 Discussion and Future Work
6 Conclusion
6.1 Conclusion
Bibliography
Abstract in Korean
- …