Developing a technique for the automatic analysis of surveillance videos in
order to identify the presence of violence is of broad interest. In this work,
we propose a deep neural network for recognizing violent videos.
A convolutional neural network is used to extract frame-level features from a
video. The frame-level features are then aggregated using a variant of the long
short-term memory that uses convolutional gates. The convolutional neural
network, together with the convolutional long short-term memory, is capable of
capturing localized spatio-temporal features, which enables the analysis of
local motion taking place in the video. We also propose to use adjacent frame
differences as the input to the model, thereby forcing it to encode the changes
occurring in the video. The performance of the proposed feature extraction
pipeline is evaluated on three standard benchmark datasets in terms of
recognition accuracy. A comparison of the results with state-of-the-art
techniques shows the promising capability of the proposed method in
recognizing violent videos.

Comment: Accepted at the International Conference on Advanced Video and Signal
Based Surveillance (AVSS 2017)
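
The adjacent-frame-difference input mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state the subtraction order, the data type, or any normalization, so the choices below (next frame minus current frame, float32 output, no scaling) are assumptions, and the function name adjacent_frame_differences is hypothetical rather than taken from the authors' code.

```python
import numpy as np


def adjacent_frame_differences(frames: np.ndarray) -> np.ndarray:
    """Return the differences between consecutive frames of a clip.

    frames: array of shape (T, H, W, C).
    Returns an array of shape (T - 1, H, W, C).
    Assumption: difference = frame[t + 1] - frame[t], without normalization.
    """
    if frames.ndim != 4:
        raise ValueError("expected frames with shape (T, H, W, C)")
    if frames.shape[0] < 2:
        raise ValueError("need at least two frames to form a difference")
    # Cast to float first: subtracting uint8 frames directly would wrap around.
    frames = frames.astype(np.float32)
    return frames[1:] - frames[:-1]


# Toy clip: 4 black frames of 2x2 RGB, with one pixel lit in frames 1 and 2.
video = np.zeros((4, 2, 2, 3), dtype=np.uint8)
video[1, 0, 0] = 255
video[2, 0, 0] = 255

diffs = adjacent_frame_differences(video)
print(diffs.shape)        # (3, 2, 2, 3)
print(diffs[:, 0, 0, 0])  # pixel turns on, stays unchanged, turns off
```

A static pixel yields a zero difference, so only changes between frames reach the network; this matches the stated aim of forcing the model to encode the changes occurring in the video.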