17 research outputs found
Learning to Detect Violent Videos using Convolutional Long Short-Term Memory
Developing a technique for the automatic analysis of surveillance videos in
order to identify the presence of violence is of broad interest. In this work,
we propose a deep neural network for the purpose of recognizing violent videos.
A convolutional neural network is used to extract frame-level features from a
video. The frame-level features are then aggregated using a variant of the long
short-term memory that uses convolutional gates. The convolutional neural
network, along with the convolutional long short-term memory, is capable of
capturing localized spatio-temporal features, which enables the analysis of
local motion taking place in the video. We also propose to use adjacent frame
differences as the input to the model, thereby forcing it to encode the changes
occurring in the video. The performance of the proposed feature extraction
pipeline is evaluated on three standard benchmark datasets in terms of
recognition accuracy. Comparison of the results obtained with state-of-the-art
techniques revealed the promising capability of the proposed method in
recognizing violent videos.
Comment: Accepted in International Conference on Advanced Video and Signal Based Surveillance (AVSS 2017).
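As a concrete illustration of the aggregation step described above, here is a minimal PyTorch sketch of an LSTM cell whose gates are convolutions, driven by adjacent-frame differences as the abstract proposes. The channel counts, kernel size, and clip shape are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """LSTM cell whose gates are 2-D convolutions, so the hidden
    state keeps its spatial layout (a ConvLSTM in sketch form)."""
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        # One convolution computes all four gates at once.
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, h, c):
        i, f, o, g = self.gates(torch.cat([x, h], dim=1)).chunk(4, dim=1)
        c = f.sigmoid() * c + i.sigmoid() * g.tanh()
        h = o.sigmoid() * c.tanh()
        return h, c

# Dummy clip; the model sees differences between adjacent frames.
frames = torch.randn(8, 3, 64, 64)                   # T x C x H x W
cell = ConvLSTMCell(3, 32)
h = torch.zeros(1, 32, 64, 64)
c = torch.zeros_like(h)
for t in range(1, frames.size(0)):
    diff = (frames[t] - frames[t - 1]).unsqueeze(0)  # encodes change only
    h, c = cell(diff, h, c)
```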
Physical Violence Detection System to Prevent Student Mental Health Disorders Based on Deep Learning
Physical violence among students in educational environments occurs frequently and can escalate into criminal acts. Moreover, repeated acts of physical violence can be considered non-verbal bullying, which can harm the victim, causing physical injury, mental health problems, impaired social relationships, and reduced academic performance. Current monitoring of violent acts has a key weakness: supervision by schools is limited. A deep-learning-based physical violence detection system built on an LSTM network is a solution to this problem. In this research, we develop a convolutional neural network to detect acts of violence. The convolutional neural network extracts frame-level features from videos, and these features are then aggregated by a long short-term memory with convolutional gates. Together, the convolutional neural network and the convolutional LSTM can capture localized spatio-temporal features, enabling analysis of local motion in the video. The performance of the proposed feature extraction pipeline is evaluated on standard benchmark datasets in terms of recognition accuracy. A comparison of the results obtained with state-of-the-art techniques reveals the promising capability of the proposed method for recognising violent videos. The trained and tested model will be integrated into a violence detection system that can detect acts of violence occurring in the school environment quickly and easily.
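To illustrate how a trained clip classifier of this kind might be wired into school monitoring, here is a hedged Python sketch of a sliding-window inference loop over a camera feed; the `model` argument, window length, input size, threshold, and the print-based alert are hypothetical placeholders rather than the system described above.

```python
import collections
import cv2                      # OpenCV, for reading the camera stream
import numpy as np
import torch

WINDOW = 16                     # hypothetical clip length fed to the model

def frames_to_clip(frames):
    """Stack adjacent-frame differences into a (1, T-1, C, H, W) tensor."""
    diffs = [b.astype(np.float32) - a.astype(np.float32)
             for a, b in zip(frames, frames[1:])]
    clip = np.stack(diffs).transpose(0, 3, 1, 2) / 255.0
    return torch.from_numpy(clip).unsqueeze(0)

def monitor(model, source=0, threshold=0.5):
    """Classify a sliding window of frames; model returns one logit."""
    cap = cv2.VideoCapture(source)
    buf = collections.deque(maxlen=WINDOW)
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        buf.append(cv2.resize(frame, (224, 224)))
        if len(buf) == WINDOW:
            with torch.no_grad():
                p = torch.sigmoid(model(frames_to_clip(list(buf)))).item()
            if p > threshold:
                print(f"ALERT: possible violence (p={p:.2f})")  # notify staff
    cap.release()
```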
A fully integrated violence detection system using CNN and LSTM
Recently, the number of violence-related incidents in places such as remote roads, pathways, shopping malls, elevators, sports stadiums, and liquor shops has increased drastically, and these incidents are unfortunately often discovered only after it is too late. The aim is to create a complete system that performs real-time video analysis to recognize the presence of violent activity and notify the concerned authority, such as the police department of the corresponding area. Using the deep learning networks CNN and LSTM along with a well-defined system architecture, we have achieved an efficient solution for real-time analysis of video footage, so that the concerned authority can monitor the situation through a mobile application that immediately notifies them of a violent event.
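As a minimal sketch of the CNN + LSTM pattern this abstract describes, the following assumes a torchvision MobileNetV3 backbone as the frame-level feature extractor; the actual backbone, layer sizes, and training details of the system are not specified here.

```python
import torch
import torch.nn as nn
from torchvision import models

class CnnLstm(nn.Module):
    """Per-frame CNN features aggregated by an LSTM into one
    violent / non-violent logit per clip."""
    def __init__(self, hidden=256):
        super().__init__()
        # weights=None keeps the sketch offline; pass "DEFAULT" for ImageNet.
        backbone = models.mobilenet_v3_small(weights=None)
        self.cnn = nn.Sequential(backbone.features,
                                 nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.lstm = nn.LSTM(576, hidden, batch_first=True)  # 576 = backbone dim
        self.head = nn.Linear(hidden, 1)

    def forward(self, clip):                    # clip: (B, T, C, H, W)
        b, t = clip.shape[:2]
        feats = self.cnn(clip.flatten(0, 1)).view(b, t, -1)
        out, _ = self.lstm(feats)
        return self.head(out[:, -1])            # logit at the last step

logit = CnnLstm()(torch.randn(2, 16, 3, 224, 224))   # dummy clip batch
```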
Vision-based Fight Detection from Surveillance Cameras
Vision-based action recognition is one of the most challenging research
topics of computer vision and pattern recognition. A specific application of
it, namely detecting fights from surveillance cameras in public areas, prisons,
etc., is desirable so that these violent incidents can be brought under control quickly.
This paper addresses this research problem and explores LSTM-based approaches
to solve it. An attention layer is also utilized. In addition, a new
dataset is collected, which consists of fight scenes from surveillance camera
videos available on YouTube. This dataset is made publicly available. From the
extensive experiments conducted on Hockey Fight, Peliculas, and the newly
collected fight datasets, it is observed that the proposed approach, which
integrates the Xception model, a Bi-LSTM, and attention, improves the
state-of-the-art accuracy for fight scene classification.
Comment: 6 pages, 5 figures, 4 tables, International Conference on Image Processing Theory, Tools and Applications (IPTA 2019).
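For concreteness, below is a hedged sketch of the Xception + Bi-LSTM + attention combination reported above, assuming the `timm` library's `xception` model as the frame encoder and a simple learned softmax attention over time steps; the paper's exact attention formulation and hyperparameters may differ.

```python
import timm
import torch
import torch.nn as nn

class FightDetector(nn.Module):
    """Xception frame features -> Bi-LSTM -> temporal attention -> score."""
    def __init__(self, hidden=128):
        super().__init__()
        # pretrained=False keeps the sketch offline; timm's xception
        # outputs 2048-dim pooled features when num_classes=0.
        self.backbone = timm.create_model("xception", pretrained=False,
                                          num_classes=0)
        self.bilstm = nn.LSTM(2048, hidden, batch_first=True,
                              bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)    # scores each time step
        self.head = nn.Linear(2 * hidden, 1)

    def forward(self, clip):                    # clip: (B, T, 3, 299, 299)
        b, t = clip.shape[:2]
        feats = self.backbone(clip.flatten(0, 1)).view(b, t, -1)
        states, _ = self.bilstm(feats)
        w = torch.softmax(self.attn(states), dim=1)  # attention over time
        return self.head((w * states).sum(dim=1))
```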
Violence Detection in Social Media: A Review
Social media has become a vital part of humans' day-to-day lives, and different users engage with it differently. With the increased usage of social media, many researchers have investigated its different aspects. Many examples from the recent past show that content on social media can generate violence in the user community. Violence on social media can be categorised into aggression in comments, cyber-bullying, and incidents such as protests and murders. Identifying violent content on social media is a challenging task: posts contain both visual and textual material, and they may carry hidden meanings depending on the user's context and other background information. This paper summarizes the different categories of social media violence and the existing methods for detecting violent content.
Keywords: machine learning, natural language processing, violence, social media, convolutional neural network
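The review itself prescribes no single model, but given its keywords, a Kim-style text CNN is one common baseline for flagging violent posts from their text; the vocabulary size, filter widths, and two-class output below are illustrative assumptions only.

```python
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    """Convolutional text classifier over word embeddings
    (violent vs. non-violent post, in this sketch)."""
    def __init__(self, vocab=30000, dim=128, classes=2):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim, padding_idx=0)
        self.convs = nn.ModuleList(
            [nn.Conv1d(dim, 64, k) for k in (3, 4, 5)])  # n-gram filters
        self.head = nn.Linear(3 * 64, classes)

    def forward(self, tokens):                  # tokens: (B, L) word ids
        x = self.emb(tokens).transpose(1, 2)    # (B, dim, L)
        pooled = [c(x).relu().max(dim=2).values for c in self.convs]
        return self.head(torch.cat(pooled, dim=1))

logits = TextCNN()(torch.randint(1, 30000, (4, 50)))   # dummy token batch
```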
Top-down Attention Recurrent VLAD Encoding for Action Recognition in Videos
Most recent approaches for action recognition from video leverage deep
architectures to encode the video clip into a fixed length representation
vector that is then used for classification. For this to be successful, the
network must be capable of suppressing irrelevant scene background and extracting
the representation from the most discriminative part of the video. Our
contribution builds on the observation that spatio-temporal patterns
characterizing actions in videos are highly correlated with objects and their
location in the video. We propose Top-down Attention Action VLAD (TA-VLAD), a
deep recurrent architecture with built-in spatial attention that performs
temporally aggregated VLAD encoding for action recognition from videos. We
adopt a top-down approach to attention, using class-specific activation maps
obtained from a deep CNN pre-trained for image classification, to weight
appearance features before encoding them into a fixed-length video descriptor
using Gated Recurrent Units. Our method achieves state-of-the-art recognition
accuracy on the HMDB51 and UCF101 benchmarks.
Comment: Accepted to the 17th International Conference of the Italian Association for Artificial Intelligence.
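A simplified sketch of the TA-VLAD idea as described: class activation maps weight the appearance features, each frame is encoded by soft-assignment VLAD, and a GRU aggregates the per-frame descriptors. The dimensions, cluster count, and the assumption that CAMs arrive precomputed from a pre-trained image classifier are all illustrative.

```python
import torch
import torch.nn as nn

class TAVlad(nn.Module):
    """CAM-weighted features -> soft-assignment VLAD per frame -> GRU."""
    def __init__(self, feat_dim=512, clusters=16, classes=51):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(clusters, feat_dim))
        self.assign = nn.Conv2d(feat_dim, clusters, 1)    # soft assignment
        self.gru = nn.GRU(clusters * feat_dim, 1024, batch_first=True)
        self.head = nn.Linear(1024, classes)

    def vlad(self, fmap, cam):
        # fmap: (B, D, H, W) CNN features; cam: (B, 1, H, W) attention map
        fmap = fmap * cam                                 # top-down weighting
        a = self.assign(fmap).flatten(2).softmax(dim=1)   # (B, K, HW)
        x = fmap.flatten(2)                               # (B, D, HW)
        res = x.unsqueeze(1) - self.centers[None, :, :, None]  # residuals
        v = (a.unsqueeze(2) * res).sum(-1)                # (B, K, D)
        return nn.functional.normalize(v.flatten(1), dim=1)

    def forward(self, fmaps, cams):  # (B, T, D, H, W) and (B, T, 1, H, W)
        descs = torch.stack([self.vlad(fmaps[:, i], cams[:, i])
                             for i in range(fmaps.size(1))], dim=1)
        out, _ = self.gru(descs)
        return self.head(out[:, -1])

scores = TAVlad()(torch.randn(2, 4, 512, 7, 7), torch.rand(2, 4, 1, 7, 7))
```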
Attention is All We Need: Nailing Down Object-centric Attention for Egocentric Activity Recognition
In this paper we propose an end-to-end trainable deep neural network model
for egocentric activity recognition. Our model is built on the observation that
egocentric activities are highly characterized by the objects and their
locations in the video. Based on this, we develop a spatial attention mechanism
that enables the network to attend to regions containing objects that are
correlated with the activity under consideration. We learn highly specialized
attention maps for each frame using class-specific activations from a CNN
pre-trained for generic image recognition, and use them for spatio-temporal
encoding of the video with a convolutional LSTM. Our model is trained in a
weakly supervised setting using raw video-level activity-class labels.
Nonetheless, on standard egocentric activity benchmarks our model surpasses the
currently best-performing method, which leverages strong supervision from hand
segmentation and object locations during training, by up to 6 percentage points
in recognition accuracy. We visually analyze attention maps generated by the network,
revealing that the network successfully identifies the relevant objects present
in the video frames, which may explain the strong recognition performance. We
also discuss an extensive ablation analysis regarding the design choices.
Comment: Accepted to BMVC 2018.
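To ground the object-centric attention described above, here is a hedged sketch in which a class activation map, derived from a pre-trained classifier's final linear weights, masks CNN feature maps before a convolutional LSTM encodes them over time; the classifier weights, feature shapes, and chosen class below are dummy placeholders.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Compact convolutional LSTM cell for spatio-temporal encoding."""
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, h, c):
        i, f, o, g = self.gates(torch.cat([x, h], dim=1)).chunk(4, dim=1)
        c = f.sigmoid() * c + i.sigmoid() * g.tanh()
        return o.sigmoid() * c.tanh(), c

def cam_attention(fmap, fc_weight, cls):
    """Class activation map from a linear classifier's weights,
    rescaled to [0, 1] and applied as a spatial attention mask."""
    cam = torch.einsum("bdhw,d->bhw", fmap, fc_weight[cls]).unsqueeze(1)
    cam = cam - cam.amin(dim=(2, 3), keepdim=True)
    cam = cam / cam.amax(dim=(2, 3), keepdim=True).clamp_min(1e-6)
    return fmap * cam                     # attended features

fc_w = torch.randn(1000, 512)             # hypothetical classifier weights
feats = torch.randn(8, 1, 512, 7, 7)      # 8 frames of CNN feature maps
cell = ConvLSTMCell(512, 256)
h = torch.zeros(1, 256, 7, 7)
c = torch.zeros_like(h)
for fmap in feats:                        # encode attended maps over time
    h, c = cell(cam_attention(fmap, fc_w, cls=0), h, c)
```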