
    Learning to Detect Violent Videos using Convolutional Long Short-Term Memory

    Developing a technique for the automatic analysis of surveillance videos in order to identify the presence of violence is of broad interest. In this work, we propose a deep neural network for recognizing violent videos. A convolutional neural network is used to extract frame-level features from a video. The frame-level features are then aggregated using a variant of the long short-term memory that uses convolutional gates. The convolutional neural network together with the convolutional long short-term memory captures localized spatio-temporal features, enabling the analysis of local motion taking place in the video. We also propose to use adjacent frame differences as the input to the model, thereby forcing it to encode the changes occurring in the video. The performance of the proposed feature extraction pipeline is evaluated on three standard benchmark datasets in terms of recognition accuracy. Comparison with state-of-the-art techniques reveals the promising capability of the proposed method in recognizing violent videos. Comment: Accepted at the International Conference on Advanced Video and Signal based Surveillance (AVSS 2017).
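    As a rough illustration of the pipeline this abstract describes, the PyTorch sketch below pairs a small frame-level CNN with an LSTM cell whose gates are convolutions, fed with adjacent-frame differences. The ConvLSTMCell and ViolenceNet names, the tiny CNN, and all layer sizes are illustrative assumptions, not the authors' implementation.

    import torch
    import torch.nn as nn

    class ConvLSTMCell(nn.Module):
        """LSTM cell whose gates are computed with convolutions, so the
        hidden state keeps a spatial layout (the ConvLSTM idea)."""
        def __init__(self, in_ch, hid_ch, k=3):
            super().__init__()
            self.hid_ch = hid_ch
            # One conv produces all four gates at once: input, forget, cell, output.
            self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

        def forward(self, x, state):
            h, c = state
            i, f, g, o = self.gates(torch.cat([x, h], dim=1)).chunk(4, dim=1)
            c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
            h = torch.sigmoid(o) * torch.tanh(c)
            return h, c

    class ViolenceNet(nn.Module):
        def __init__(self, feat_ch=64, hid_ch=64):
            super().__init__()
            # Small frame-level CNN stand-in for the paper's feature extractor.
            self.cnn = nn.Sequential(
                nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, feat_ch, 3, stride=2, padding=1), nn.ReLU(),
            )
            self.rnn = ConvLSTMCell(feat_ch, hid_ch)
            self.head = nn.Linear(hid_ch, 2)  # violent / non-violent

        def forward(self, video):                 # video: (B, T, 3, H, W)
            diffs = video[:, 1:] - video[:, :-1]  # adjacent-frame differences
            B, T = diffs.shape[:2]
            h = c = None
            for t in range(T):
                f = self.cnn(diffs[:, t])
                if h is None:
                    h = f.new_zeros(B, self.rnn.hid_ch, *f.shape[2:])
                    c = torch.zeros_like(h)
                h, c = self.rnn(f, (h, c))
            return self.head(h.mean(dim=(2, 3)))  # global average pool -> logits

    logits = ViolenceNet()(torch.randn(2, 8, 3, 64, 64))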

    Vision-based Fight Detection from Surveillance Cameras

    Vision-based action recognition is one of the most challenging research topics in computer vision and pattern recognition. A specific application of it, detecting fights from surveillance cameras in public areas, prisons, etc., is desirable for quickly bringing these violent incidents under control. This paper addresses this research problem and explores LSTM-based approaches to solve it, also utilizing an attention layer. In addition, a new dataset is collected, consisting of fight scenes from surveillance camera videos available on YouTube; this dataset is made publicly available. Extensive experiments on the Hockey Fight, Peliculas, and newly collected fight datasets show that the proposed approach, which integrates the Xception model, a Bi-LSTM, and attention, improves the state-of-the-art accuracy for fight scene classification. Comment: 6 pages, 5 figures, 4 tables, International Conference on Image Processing Theory, Tools and Applications, IPTA 2019.
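    A minimal sketch of the described pipeline: a per-frame encoder, a Bi-LSTM, and a soft temporal attention layer. The paper uses Xception as the encoder; a tiny CNN stands in for it here, and the FightClassifier name and all sizes are assumptions.

    import torch
    import torch.nn as nn

    class FightClassifier(nn.Module):
        """Frame encoder -> Bi-LSTM -> temporal attention -> fight / no-fight."""
        def __init__(self, feat_dim=128, hid=64):
            super().__init__()
            # Tiny CNN stand-in for the Xception frame encoder used in the paper.
            self.encoder = nn.Sequential(
                nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, feat_dim),
            )
            self.bilstm = nn.LSTM(feat_dim, hid, batch_first=True, bidirectional=True)
            self.att = nn.Linear(2 * hid, 1)   # additive attention score per time step
            self.head = nn.Linear(2 * hid, 2)

        def forward(self, video):              # video: (B, T, 3, H, W)
            B, T = video.shape[:2]
            f = self.encoder(video.flatten(0, 1)).view(B, T, -1)
            h, _ = self.bilstm(f)                    # (B, T, 2*hid)
            w = torch.softmax(self.att(h), dim=1)    # attention weights over time
            ctx = (w * h).sum(dim=1)                 # weighted sum of LSTM states
            return self.head(ctx)

    logits = FightClassifier()(torch.randn(2, 16, 3, 64, 64))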

    Spatio-temporal action localization with Deep Learning

    Master's dissertation in Informatics Engineering. A system that detects and identifies human activities is called human action recognition. In the video-based approach, human activity is classified into four categories depending on the complexity of the steps and the number of body parts involved in the action: gestures, actions, interactions, and activities. Capturing valuable and discriminative features is challenging for video human action recognition because of the variations of the human body. Deep learning techniques have provided practical applications in multiple fields of signal processing, usually surpassing traditional signal processing at scale. Recently, several applications, namely surveillance, human-computer interaction, and content-based video retrieval, have studied the detection and recognition of violence. In recent years there has been rapid growth in the production and consumption of a wide variety of video data due to the popularization of high-quality and relatively low-priced video devices; smartphones and digital cameras have contributed greatly to this. At the same time, about 300 hours of video are uploaded to YouTube every minute. Along with the growing production of video data, new technologies such as video captioning, video question answering, and video-based activity/event detection are emerging every day. From the video input data, human activity detection indicates which activity is contained in the video and locates the regions in the video where the activity occurs. This dissertation conducted an experiment to identify and detect violence with spatial action localization, adapting a public dataset for this purpose: an annotated dataset for general action recognition was adapted for violence detection only.

    Unified Keypoint-based Action Recognition Framework via Structured Keypoint Pooling

    This paper simultaneously addresses three limitations of conventional skeleton-based action recognition: skeleton detection and tracking errors, the poor variety of targeted actions, and person-wise and frame-wise action recognition. A point cloud deep-learning paradigm is introduced to action recognition, and a unified framework with a novel deep neural network architecture called Structured Keypoint Pooling is proposed. The proposed method sparsely aggregates keypoint features in a cascaded manner based on prior knowledge of the data structure inherent in skeletons, such as the instances and frames to which each keypoint belongs, and achieves robustness against input errors. Its less constrained, tracking-free architecture allows time-series keypoints consisting of human skeletons and nonhuman object contours to be efficiently treated as an input 3D point cloud, and extends the variety of targeted actions. Furthermore, we propose a Pooling-Switching Trick inspired by Structured Keypoint Pooling. This trick switches the pooling kernels between the training and inference phases to detect person-wise and frame-wise actions in a weakly supervised manner using only video-level action labels, and enables the training scheme to naturally introduce a novel data augmentation that mixes point clouds extracted from different videos. In the experiments, we comprehensively verify the effectiveness of the proposed method against these limitations, and the method outperforms state-of-the-art skeleton-based action recognition and spatio-temporal action localization methods. Comment: CVPR 2023.
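    The following is a loose PointNet-style sketch of the core idea of pooling keypoints as a point cloud in a structure-aware cascade (keypoints within a frame, then frames within a video). It illustrates the pooling principle only; the paper's actual architecture, input features, and class counts differ, and all names here are assumptions.

    import torch
    import torch.nn as nn

    class StructuredKeypointPooling(nn.Module):
        """PointNet-style sketch: embed each keypoint independently, then
        pool in a cascade that follows the data structure (frame -> video).
        Max pooling makes the aggregate tolerant of noisy or missing points."""
        def __init__(self, d=64, n_classes=10):
            super().__init__()
            self.point_mlp = nn.Sequential(nn.Linear(3, d), nn.ReLU(), nn.Linear(d, d))
            self.frame_mlp = nn.Sequential(nn.Linear(d, d), nn.ReLU())
            self.head = nn.Linear(d, n_classes)

        def forward(self, pts):               # pts: (B, T, K, 3) = (x, y, score) per keypoint
            f = self.point_mlp(pts)           # per-point features
            f = f.max(dim=2).values           # pool over the K keypoints of each frame
            f = self.frame_mlp(f).max(dim=1).values   # pool over frames
            return self.head(f)

    logits = StructuredKeypointPooling()(torch.rand(2, 16, 17, 3))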

    Feature fusion based deep spatiotemporal model for violence detection in videos

    Detecting violent behavior in surveillance videos is essential for public monitoring and security, but it requires constant human observation and attention, which is a challenging task. Autonomous detection of violent activities is essential for continuous, uninterrupted video surveillance. This paper proposes a novel method to detect violent activities in videos using fused spatial feature maps, based on Convolutional Neural Networks (CNN) and Long Short-Term Memory (LSTM) units. The spatial features are extracted through a CNN, and a multi-level spatial feature fusion method combines the spatial feature maps from two equally spaced sequential input video frames to incorporate motion characteristics. Additional residual layer blocks further refine these fused spatial features to increase the classification accuracy of the network. The combined spatial features of the input frames are then fed to LSTM units to learn global temporal information. The output of the network classifies the input video frames as violent or non-violent. Experimental results on three standard benchmark datasets, Hockey Fight, Crowd Violence, and BEHAVE, show that the proposed algorithm recognizes violent actions better in different scenarios and improves performance compared to state-of-the-art methods.
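    A hedged sketch of the described fusion scheme: spatial feature maps from two equally spaced frames are concatenated, refined by a residual block, and the resulting per-pair features are passed to an LSTM. The FusionLSTM name, channel widths, and the frame gap are assumptions, not the paper's configuration.

    import torch
    import torch.nn as nn

    class FusionLSTM(nn.Module):
        """Sketch: fuse spatial feature maps of two frames spaced in time,
        refine them with a residual block, then model time with an LSTM."""
        def __init__(self, ch=64, hid=128):
            super().__init__()
            self.cnn = nn.Sequential(
                nn.Conv2d(3, ch, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(ch, ch, 3, stride=2, padding=1), nn.ReLU(),
            )
            self.res = nn.Sequential(
                nn.Conv2d(2 * ch, 2 * ch, 3, padding=1), nn.ReLU(),
                nn.Conv2d(2 * ch, 2 * ch, 3, padding=1),
            )
            self.lstm = nn.LSTM(2 * ch, hid, batch_first=True)
            self.head = nn.Linear(hid, 2)

        def forward(self, video, gap=2):       # video: (B, T, 3, H, W)
            B, T = video.shape[:2]
            feats = []
            for t in range(0, T - gap, gap):
                a = self.cnn(video[:, t])
                b = self.cnn(video[:, t + gap])        # equally spaced later frame
                fused = torch.cat([a, b], dim=1)       # channel-wise feature fusion
                fused = fused + self.res(fused)        # residual refinement
                feats.append(fused.mean(dim=(2, 3)))   # spatial average pool
            h, _ = self.lstm(torch.stack(feats, dim=1))
            return self.head(h[:, -1])                 # classify from last state

    logits = FusionLSTM()(torch.randn(2, 12, 3, 64, 64))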

    Inflated 3D ConvNet context analysis for violence detection

    According to the Wall Street Journal, one billion surveillance cameras will be deployed around the world by 2021. This amount of information can hardly be managed by humans. Using an Inflated 3D ConvNet as backbone, this paper introduces a novel automatic violence detection approach that outperforms existing state-of-the-art proposals. Most of those proposals include a pre-processing step to focus only on some regions of interest in the scene, i.e., those actually containing a human subject. In this regard, this paper also reports the results of an extensive analysis of whether and how context affects the adopted classifier's performance. The experiments show that context-free footage yields a substantial deterioration of classifier performance (2% to 5%) on publicly available datasets. However, they also demonstrate that performance stabilizes in context-free settings, no matter the level of context restriction applied. Finally, a cross-dataset experiment investigates how well results obtained in a single-collection setting (same dataset used for training and testing) generalize to cross-collection settings (different datasets used for training and testing).
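    To make the setup concrete, the sketch below uses torchvision's r3d_18 as a stand-in for the paper's Inflated 3D ConvNet backbone (I3D weights are not shipped with torchvision) and zeroes out everything outside person bounding boxes to emulate the context-free footage studied in the paper. The mask_context helper and its box format are hypothetical.

    import torch
    from torchvision.models.video import r3d_18

    # A torchvision 3D ConvNet standing in for the I3D backbone,
    # with a two-class head for violent / non-violent clips.
    model = r3d_18(weights=None, num_classes=2)

    def mask_context(clip, boxes):
        """Zero out everything outside the person bounding boxes, emulating
        the 'context-free' footage analyzed in the paper.
        clip: (B, 3, T, H, W); boxes: one (x1, y1, x2, y2) per batch item."""
        masked = torch.zeros_like(clip)
        for i, (x1, y1, x2, y2) in enumerate(boxes):
            masked[i, :, :, y1:y2, x1:x2] = clip[i, :, :, y1:y2, x1:x2]
        return masked

    clip = torch.randn(2, 3, 16, 112, 112)
    logits = model(mask_context(clip, [(10, 10, 90, 100), (0, 0, 112, 112)]))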

    Two-stream Multi-dimensional Convolutional Network for Real-time Violence Detection

    The increasing number of surveillance cameras and security concerns have made automatic violent activity detection from surveillance footage an active area of research. Modern deep learning methods have achieved good accuracy in violence detection and proved successful because of their applicability in intelligent surveillance systems. However, the models are computationally expensive and large because of inefficient feature extraction. This work presents a novel architecture for violence detection called Two-stream Multi-dimensional Convolutional Network (2s-MDCN), which uses RGB frames and optical flow to detect violence. The proposed method extracts temporal and spatial information independently through 1D, 2D, and 3D convolutions. Despite combining multi-dimensional convolutional networks, the models are lightweight and efficient due to reduced channel capacity, yet they learn to extract meaningful spatial and temporal information. Additionally, combining RGB frames and optical flow yields 2.2% higher accuracy than a single RGB stream. Despite their lower complexity, the models obtained state-of-the-art accuracy of 89.7% on the largest violence detection benchmark dataset. Comment: 8 pages, 6 figures.
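    A toy two-stream sketch in the spirit of 2s-MDCN: one lightweight stream per modality (RGB and optical flow), 3D convolutions for short-range motion followed by 2D convolutions, and late fusion of the two streams. The paper's exact 1D/2D/3D arrangement and channel capacities are not reproduced; all names and sizes here are assumptions.

    import torch
    import torch.nn as nn

    class Stream(nn.Module):
        """One lightweight stream: 3D convs for short-range motion, then 2D
        convs per frame; channel counts kept small for efficiency."""
        def __init__(self, in_ch):
            super().__init__()
            self.c3d = nn.Sequential(nn.Conv3d(in_ch, 16, 3, padding=1), nn.ReLU())
            self.c2d = nn.Sequential(nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU())
            self.pool = nn.AdaptiveAvgPool2d(1)

        def forward(self, x):                 # x: (B, C, T, H, W)
            f = self.c3d(x).mean(dim=2)       # collapse time after 3D convs
            return self.pool(self.c2d(f)).flatten(1)

    class TwoStreamNet(nn.Module):
        def __init__(self):
            super().__init__()
            self.rgb, self.flow = Stream(3), Stream(2)
            self.head = nn.Linear(64, 2)

        def forward(self, rgb, flow):
            # Late fusion of the two modalities, as in the two-stream design.
            return self.head(torch.cat([self.rgb(rgb), self.flow(flow)], dim=1))

    logits = TwoStreamNet()(torch.randn(2, 3, 8, 64, 64), torch.randn(2, 2, 8, 64, 64))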

    Video Vision Transformers for Violence Detection

    Law enforcement and city safety are significantly impacted by the detection of violent incidents in surveillance systems. Although modern (smart) cameras are widely available and affordable, such technological solutions are inadequate in most instances. Furthermore, personnel monitoring CCTV recordings frequently react belatedly, potentially leading to catastrophe for people and property. Automated violence detection enabling swift action is therefore crucial. The proposed solution uses a novel end-to-end deep learning-based video vision transformer (ViViT) that can proficiently discern fights, hostile movements, and violent events in video sequences. The study presents a data augmentation strategy to overcome the weaker inductive bias of vision transformers when training on smaller datasets. Detection results can subsequently be sent to the concerned local authority, and the captured video can be analyzed. In comparison to state-of-the-art (SOTA) approaches, the proposed method achieves promising performance on several challenging benchmark datasets.
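    A minimal sketch of the two ingredients the abstract highlights: a ViViT-style tubelet embedding and clip-level augmentation to compensate for the transformer's weak inductive bias on small datasets. The patch size, token dimension, and the specific augment choices are assumptions, not the paper's recipe.

    import torch
    import torch.nn as nn

    class TubeletEmbedding(nn.Module):
        """Minimal ViViT-style tubelet embedding: a 3D conv that splits the
        clip into non-overlapping space-time patches and projects them to tokens."""
        def __init__(self, dim=192, patch=(2, 16, 16)):
            super().__init__()
            self.proj = nn.Conv3d(3, dim, kernel_size=patch, stride=patch)

        def forward(self, clip):              # clip: (B, 3, T, H, W)
            tokens = self.proj(clip)          # (B, dim, T/2, H/16, W/16)
            return tokens.flatten(2).transpose(1, 2)   # (B, n_tokens, dim)

    def augment(clip):
        """Simple clip-level augmentations (flip + jitter) of the kind used to
        compensate for the transformer's weak inductive bias on small datasets."""
        if torch.rand(()) < 0.5:
            clip = clip.flip(-1)                      # horizontal flip
        return (clip + 0.05 * torch.randn_like(clip)).clamp(0, 1)

    tokens = TubeletEmbedding()(augment(torch.rand(2, 3, 16, 224, 224)))
    # The tokens would then feed a standard transformer encoder,
    # e.g. nn.TransformerEncoder, followed by a classification head.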

    A machine learning approach to detect violent behaviour from video

    The automatic classification of violent actions performed by two or more persons is an important task for both societal and scientific purposes. In this paper, we propose a machine learning approach, based on a Support Vector Machine (SVM), to detect whether a human action captured on video is violent or not. Using a pose estimation algorithm, we focus mostly on feature engineering to generate the SVM inputs. In particular, we hand-engineered a set of input features based on keypoints (angles, velocity, and contact detection) and used them, in distinct combinations, to study their effect on violent behavior recognition from video. Overall, excellent classification was achieved by the best performing SVM model, which used keypoint, angle, and contact features computed over a 60-frame input range. Funding: Fundação para a Ciência e a Tecnologia (UID/CEC/00319/2013).
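    A small scikit-learn sketch of this kind of pipeline: hand-engineered features (a joint angle, wrist velocity, and a crude spread proxy standing in for contact detection) computed from pose keypoints over a 60-frame window, fed to an RBF SVM. Keypoint indices follow the COCO convention; the feature choices and toy data are illustrative assumptions, not the paper's exact features.

    import numpy as np
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    def angle(a, b, c):
        """Angle at joint b formed by keypoints a-b-c, in radians."""
        v1, v2 = a - b, c - b
        cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-8)
        return np.arccos(np.clip(cos, -1.0, 1.0))

    def clip_features(kps):
        """kps: (T, K, 2) keypoints for one person over a clip. Returns a mean
        elbow angle, mean wrist speed, and a crude proximity/contact proxy."""
        elbow = np.mean([angle(f[5], f[7], f[9]) for f in kps])  # shoulder-elbow-wrist
        speed = np.linalg.norm(np.diff(kps[:, 9], axis=0), axis=1).mean()  # wrist velocity
        spread = kps.reshape(-1, 2).std()     # proxy for proximity / contact
        return np.array([elbow, speed, spread])

    # Toy data standing in for pose-estimator output on 60-frame clips.
    X = np.stack([clip_features(np.random.rand(60, 17, 2)) for _ in range(40)])
    y = np.random.randint(0, 2, size=40)      # 1 = violent, 0 = non-violent

    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
    clf.fit(X, y)
    print(clf.predict(X[:5]))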