15 research outputs found
Artificial Intelligence Enabled Methods for Human Action Recognition using Surveillance Videos
Computer vision applications have been attracting researchers and academics, all the more so as cloud computing resources make such applications feasible. Analysing surveillance video has become an important research area due to its widespread applications: for instance, CCTV cameras are used in public places to monitor situations and identify theft or other crimes. With thousands of such surveillance videos streaming simultaneously, manual analysis is a tedious and time-consuming task, so there is a need for automated approaches that analyse the footage and deliver notifications or findings to the officers concerned. Such automation is useful to police and investigation agencies for ascertaining facts, recovering evidence, and even supporting digital forensics. In this context, this paper surveys methods of human action recognition (HAR) using machine learning (ML) and deep learning (DL), both of which fall under Artificial Intelligence (AI). It also reviews methods for privacy-preserving action recognition and Generative Adversarial Networks (GANs). Finally, the paper describes the datasets used in human action recognition research and gives an account of research gaps that can guide further work in the area.
Video-Based Human Activity Recognition Using Deep Learning Approaches
Due to its capacity to gather vast, high-level data about human activity from wearable or stationary sensors, human activity recognition substantially impacts people's day-to-day lives. A video may show multiple people and objects acting, dispersed across the frame in various places; visual reasoning for the action recognition task therefore requires modeling the interactions between many entities in the spatial dimensions. The main aim of this paper is to evaluate and map the current state of human action recognition in RGB (red, green, and blue) videos based on deep learning models. A residual network (ResNet) and a vision transformer architecture (ViT) are evaluated with a semi-supervised learning approach; DINO (self-DIstillation with NO labels) is used to enhance the potential of both the ResNet and the ViT. The evaluation benchmark is the human motion database (HMDB51), which tries to better capture the richness and complexity of human actions. The results obtained for video classification with the proposed ViT are promising, based on performance metrics and results from the recent literature. In particular, a two-dimensional ViT combined with long short-term memory demonstrated strong performance in human action recognition on the HMDB51 dataset, reaching 96.7 ± 0.35% and 41.0 ± 0.27% accuracy (mean ± standard deviation) in the train and test phases, respectively.
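The ViT-with-LSTM pipeline described above (per-frame embeddings fed through a recurrent classifier) can be sketched with a minimal, untrained LSTM over stand-in frame features. This is an illustrative toy, not the paper's implementation: the dimensions, the random weights, and the function names are all assumptions, with only the 51-class output mirroring HMDB51.

```python
import numpy as np

rng = np.random.default_rng(42)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_classify(frame_feats, W, U, b, W_out):
    """Run a single-layer LSTM over per-frame embeddings (hypothetical toy
    weights) and classify the clip from the final hidden state."""
    h_dim = U.shape[1]
    h = np.zeros(h_dim)
    c = np.zeros(h_dim)
    for x in frame_feats:                 # one embedding per video frame
        z = W @ x + U @ h + b             # all four gate pre-activations at once
        i, f, o, g = np.split(z, 4)
        i, f, o, g = sigmoid(i), sigmoid(f), sigmoid(o), np.tanh(g)
        c = f * c + i * g                 # cell-state update
        h = o * np.tanh(c)                # hidden-state update
    logits = W_out @ h
    return np.argmax(logits)

d_in, h_dim, n_classes, T = 32, 16, 51, 10   # 51 classes, as in HMDB51
W = rng.standard_normal((4 * h_dim, d_in)) * 0.1
U = rng.standard_normal((4 * h_dim, h_dim)) * 0.1
b = np.zeros(4 * h_dim)
W_out = rng.standard_normal((n_classes, h_dim))

feats = rng.standard_normal((T, d_in))       # stand-in for ViT frame embeddings
pred = lstm_classify(feats, W, U, b, W_out)
```

In the actual system, `feats` would come from a pretrained (e.g. DINO-enhanced) ViT backbone rather than a random generator, and the LSTM and output weights would be learned.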
Texture-Based Input Feature Selection for Action Recognition
The performance of video action recognition has been significantly boosted by using motion representations within a two-stream Convolutional Neural Network (CNN) architecture. However, action recognition in real scenarios still faces challenging problems, e.g., variations in viewpoint and pose and changes in background. The resulting domain discrepancy between the training data and the test data causes a performance drop. To improve model robustness, we propose a novel method to determine the task-irrelevant content in the inputs that increases the domain discrepancy. The method is based on a human parsing model (HP model) which jointly performs dense correspondence labelling and semantic part segmentation. The predictions from the HP model are also used to re-render the human regions in each video with the same set of textures, so that human appearance is identical across all classes. A revised dataset is generated for training and testing, making the action recognition model invariant to the irrelevant content in the inputs. Moreover, the predictions from the HP model are used to enrich the inputs to the action recognition model during both training and testing. Experimental results show that our proposed model is superior to existing models for action recognition on the HMDB-51 dataset and the Penn Action dataset.
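The re-rendering step above (painting every human region with one shared texture so appearance carries no class signal) reduces, at its core, to masked pixel replacement. The sketch below assumes a boolean human-parsing mask is already available; the function name, the flat grey texture, and the toy frame are illustrative assumptions, not the paper's pipeline.

```python
import numpy as np

def rerender_humans(frame, human_mask, texture):
    """Replace pixels inside the human-parsing mask with a shared texture,
    so human appearance is the same across all action classes."""
    out = frame.copy()                 # leave the original frame untouched
    out[human_mask] = texture[human_mask]
    return out

# Toy example: 4x4 RGB frame, humans occupy the top-left 2x2 block.
frame = np.zeros((4, 4, 3), dtype=np.uint8)
mask = np.zeros((4, 4), dtype=bool)
mask[:2, :2] = True
texture = np.full((4, 4, 3), 128, dtype=np.uint8)  # flat grey "canonical" texture

rendered = rerender_humans(frame, mask, texture)
```

In the paper's setting the mask would come from the HP model's semantic part segmentation, and the texture would be warped onto the body via the dense correspondence labelling rather than applied flat.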
Dual-Stream Action Recognition Network with Spatio-Temporal Attention
Thesis (M.S.)--Seoul National University Graduate School: College of Engineering, Department of Computer Science and Engineering, August 2019.
Two-stream architecture has been mainstream since the success of [1], but the two streams of information are processed independently and do not interact until the late fusion. We investigate a different spatio-temporal attention architecture based on two separate recognition streams (spatial and temporal), which interact with each other through cross attention. The spatial stream performs action recognition from still video frames, while the temporal stream is trained to recognise actions from motion in the form of dense optical flow. Each stream conveys its learned knowledge to the other in the form of attention maps. Cross attention allows us to exploit the availability of supplemental information and enhances the learning of both streams. To demonstrate the benefits of the proposed cross-stream spatio-temporal attention architecture, we evaluated it on two standard action recognition benchmarks, where it improves on previous performance.
Chapter 1 Introduction
Chapter 2 Related Work
2.1 Two-Stream Networks in Action Recognition
2.2 Attention in Action Recognition
Chapter 3 Dual-Stream Action Recognition Network with Spatio-Temporal Attention
3.1 Effective Attention Extraction
3.2 Action Pattern Learning
Chapter 4 Experiments
4.1 Datasets and Implementation Details
4.2 Performance Comparison
Chapter 5 Conclusion
ABSTRACT
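The cross-attention interaction described in the thesis abstract (each stream attending over the other's features) can be illustrated with plain scaled dot-product cross attention: queries come from one stream, keys and values from the other. This is a generic sketch, not the thesis's exact attention-map exchange; the token counts, dimensions, and names are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # stabilised softmax
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q_stream, kv_stream, d_k):
    """Tokens of one stream query the other stream's tokens, yielding an
    attention-weighted summary the first stream can fuse back in."""
    scores = q_stream @ kv_stream.T / np.sqrt(d_k)   # (T_q, T_k) similarity
    attn = softmax(scores, axis=-1)                  # each row sums to 1
    return attn @ kv_stream, attn

rng = np.random.default_rng(0)
spatial = rng.standard_normal((8, 64))    # 8 spatial (RGB) tokens, dim 64
temporal = rng.standard_normal((8, 64))   # 8 temporal (optical-flow) tokens

# Each stream queries the other, mirroring the bidirectional exchange.
s_from_t, attn_st = cross_attention(spatial, temporal, 64)
t_from_s, attn_ts = cross_attention(temporal, spatial, 64)
```

In a trained model the queries, keys, and values would each pass through learned linear projections first; they are omitted here to keep the interaction pattern itself in focus.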