12 research outputs found
Artificial Intelligence Enabled Methods for Human Action Recognition using Surveillance Videos
Computer vision applications have been attracting increasing attention from researchers and academia, all the more so now that cloud computing resources make such applications practical. Analysing video surveillance footage has become an important research area due to its widespread applications. For instance, CCTV cameras are used in public places to monitor situations and identify instances of theft or crime. With thousands of such surveillance videos streaming simultaneously, manual analysis is a tedious and time-consuming task, so there is a need for automated approaches that analyse the footage and deliver notifications or findings to the officers concerned. Such analysis is very useful to police and investigation agencies to ascertain facts, recover evidence, and even support digital forensics. In this context, this paper surveys different methods of human action recognition (HAR) using machine learning (ML) and deep learning (DL), which come under Artificial Intelligence (AI). It also reviews methods for privacy-preserving action recognition and Generative Adversarial Networks (GANs). Finally, the paper describes the datasets used in human action recognition research and gives an account of research gaps that can guide further work in the area.
A Two-Stream Action Recognition Network with Spatio-Temporal Attention
Thesis (Master's) — Seoul National University Graduate School: College of Engineering, Department of Computer Science and Engineering, August 2019.
With today's active research on deep neural networks and the progress of data storage and processing technology, recognition problems on large-scale data with a temporal dimension, such as video, are receiving ever more attention alongside those on still images. Among the proposed models, the two-stream network was the first to show that features learned by a neural network outperform hand-crafted features, and it has since become the mainstream architecture for video action recognition. This thesis extends that architecture and proposes a spatio-temporal attention architecture in which the two independently trained streams attend to each other. Through cross attention, the previously independent networks learn in a mutually complementary way, which leads to improved performance. The architecture was evaluated on the standard HMDB-51 video action recognition benchmark, where it outperformed the existing architecture.
The two-stream architecture has been mainstream since the success of [1], but its two sources of information are processed independently and do not interact until the late fusion. We investigate a different spatio-temporal attention architecture based on two separate recognition streams (spatial and temporal) that interact with each other through cross attention. The spatial stream performs action recognition from still video frames, whilst the temporal stream is trained to recognise actions from motion in the form of dense optical flow. Each stream conveys its learned knowledge to the other in the form of attention maps. Cross attention allows us to exploit the availability of supplemental information and enhance the learning of both streams. To demonstrate the benefits of our proposed cross-stream spatio-temporal attention architecture, we evaluate it on two standard action recognition benchmarks, where it improves on previous performance.
Contents:
Chapter 1 Introduction
Chapter 2 Related Work
2.1 Two-Stream Networks for Action Recognition
2.2 Attention in Action Recognition
Chapter 3 A Two-Stream Action Recognition Network with Spatio-Temporal Attention
3.1 Effective Attention Extraction
3.2 The Action-Pattern Learning Process
Chapter 4 Experiments
4.1 Datasets and Implementation Details
4.2 Performance Comparison
Chapter 5 Conclusion
Abstract
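The cross-stream attention idea in the abstract above, where each stream hands the other an attention map distilled from its own features, can be sketched roughly as follows. This is a hypothetical NumPy illustration with made-up names, shapes, and a simple energy-based attention map; it is not the thesis's actual model.

```python
# Hedged sketch of cross-stream attention between a spatial (RGB) and a
# temporal (optical-flow) stream. Feature shapes and the attention-map
# construction are illustrative assumptions, not the thesis's design.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_map(features):
    """Collapse a (C, H, W) feature map into a normalised (H, W) attention map."""
    energy = np.abs(features).sum(axis=0)        # channel-wise activation energy
    return softmax(energy.ravel()).reshape(energy.shape)

def cross_attend(spatial_feat, temporal_feat):
    """Each stream reweights its features by the *other* stream's attention map."""
    att_s = attention_map(spatial_feat)          # appearance-derived map
    att_t = attention_map(temporal_feat)         # motion-derived map
    spatial_out = spatial_feat * att_t           # spatial stream guided by motion
    temporal_out = temporal_feat * att_s         # temporal stream guided by appearance
    return spatial_out, temporal_out

rng = np.random.default_rng(0)
s = rng.standard_normal((64, 7, 7))              # spatial-stream feature map
t = rng.standard_normal((64, 7, 7))              # temporal-stream feature map
s2, t2 = cross_attend(s, t)
print(s2.shape, t2.shape)                        # (64, 7, 7) (64, 7, 7)
```

The key design point the abstract describes is that the exchange happens before the late fusion, so each stream's learning is shaped by the other's evidence rather than only being averaged with it at the end.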
Discriminative Video Representation Learning
Representation learning is a fundamental research problem in machine learning, refining raw data to discover the representations needed for various applications. However, real-world data, particularly video data, is neither mathematically nor computationally convenient to process due to its semantic redundancy and complexity. Video data, as opposed to images, includes temporal correlation and motion dynamics, but the ground-truth labels are normally limited to category labels, which makes video representation learning a challenging problem. To this end, this thesis addresses the problem of video representation learning, specifically discriminative video representation learning, which focuses on capturing useful data distributions and reliable feature representations that improve the performance of varied downstream tasks. We argue that neither all frames in a video nor all dimensions in a feature vector are useful, and that they should not be treated equally for video representation learning. Based on this argument, several novel algorithms are investigated in this thesis under multiple application scenarios, such as action recognition, action detection and one-class video anomaly detection. The proposed video representation learning methods produce discriminative video features in both deep and non-deep learning setups.
Specifically, they are presented in the form of: 1) an early fusion layer that adopts a temporal ranking SVM formulation, agglomerating several optical flow images from consecutive frames into a novel compact representation, named dynamic optical flow images; 2) an intermediate feature aggregation layer that applies weakly-supervised contrastive learning techniques, learning discriminative video representations by contrasting positive and negative samples from a sequence; 3) a new formulation for one-class feature learning that learns a set of discriminative subspaces with orthonormal hyperplanes to flexibly bound the one-class data distribution using Riemannian optimisation methods. We provide extensive experiments to gain intuition into why the learned representations are discriminative and useful. All the proposed methods in this thesis are evaluated on standard publicly available benchmarks, demonstrating state-of-the-art performance.
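The first contribution, collapsing consecutive optical-flow frames into one compact "dynamic" image via a temporal ranking formulation, is commonly approximated in the literature by rank pooling with fixed linear weights. The sketch below shows that generic approximation; the weighting scheme and function names are assumptions for illustration, not necessarily the thesis's exact formulation.

```python
# Illustrative approximate rank pooling: collapse a sequence of frames
# (e.g. optical-flow fields) into a single ordering-aware image. The
# weights 2t - T - 1 are the standard approximation from the
# dynamic-image literature, used here as an assumed stand-in.
import numpy as np

def approximate_rank_pooling(frames):
    """frames: (T, H, W[, C]) array -> one (H, W[, C]) dynamic image.

    Frame t (1-indexed) gets weight 2t - T - 1, so later frames contribute
    positively and earlier frames negatively, encoding temporal order in a
    single compact representation.
    """
    T = frames.shape[0]
    weights = 2 * np.arange(1, T + 1) - T - 1    # e.g. T=4 -> [-3, -1, 1, 3]
    return np.tensordot(weights, frames, axes=(0, 0))

flow = np.random.default_rng(1).standard_normal((10, 32, 32, 2))
dyn = approximate_rank_pooling(flow)
print(dyn.shape)                                 # (32, 32, 2)
```

Note that the weights sum to zero, so a static sequence (identical frames) pools to an all-zero image: only temporal change survives, which is what makes the representation discriminative for motion.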