Learning to Detect Violent Videos using Convolutional Long Short-Term Memory
Developing a technique for the automatic analysis of surveillance videos in
order to identify the presence of violence is of broad interest. In this work,
we propose a deep neural network for the purpose of recognizing violent videos.
A convolutional neural network is used to extract frame-level features from a
video. The frame-level features are then aggregated using a variant of the long
short-term memory that uses convolutional gates. The convolutional neural
network, together with the convolutional long short-term memory, is capable of
capturing localized spatio-temporal features which enables the analysis of
local motion taking place in the video. We also propose to use adjacent frame
differences as the input to the model thereby forcing it to encode the changes
occurring in the video. The performance of the proposed feature extraction
pipeline is evaluated on three standard benchmark datasets in terms of
recognition accuracy. Comparison of the results with state-of-the-art
techniques revealed the promising capability of the proposed method in
recognizing violent videos.
Comment: Accepted in International Conference on Advanced Video and Signal
based Surveillance (AVSS 2017
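The adjacent-frame-difference input described in the abstract above can be illustrated with a short sketch. This is not the authors' code: the function name `adjacent_frame_differences`, the `(T, H, W, C)` array layout, and the cast to `float32` are assumptions made for illustration only.

```python
import numpy as np

def adjacent_frame_differences(frames: np.ndarray) -> np.ndarray:
    """Return frame[t+1] - frame[t] for a clip shaped (T, H, W, C).

    Hypothetical helper: the layout and dtype handling are assumptions,
    not details taken from the paper.
    """
    if frames.shape[0] < 2:
        raise ValueError("need at least two frames to form a difference")
    # Cast first so that uint8 pixel values do not wrap around on subtraction.
    frames = frames.astype(np.float32)
    return frames[1:] - frames[:-1]

# Toy clip: 4 frames of 2x2 single-channel pixels, one pixel lit in frame 2.
video = np.zeros((4, 2, 2, 1), dtype=np.uint8)
video[2, 0, 0, 0] = 10
diffs = adjacent_frame_differences(video)
```

A clip of T frames yields T-1 difference frames, so static background cancels out and only the changes between consecutive frames reach the model.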
Learning Robust Object Recognition Using Composed Scenes from Generative Models
Recurrent feedback connections in the mammalian visual system have been
hypothesized to play a role in synthesizing input in the theoretical framework
of analysis by synthesis. The comparison of internally synthesized
representation with that of the input provides a validation mechanism during
perceptual inference and learning. Inspired by these ideas, we proposed that
the synthesis machinery can compose new, unobserved images by imagination to
train the network itself so as to increase the robustness of the system in
novel scenarios. As a proof of concept, we investigated whether images composed
by imagination could help an object recognition system to deal with occlusion,
which is challenging for the current state-of-the-art deep convolutional neural
networks. We fine-tuned a network on images containing objects in various
occlusion scenarios that are imagined or self-generated through a deep
generator network. Trained on imagined occluded scenarios under the object
persistence constraint, our network discovered more subtle and localized image
features that were neglected by the original network for object classification,
obtaining better separability of different object classes in the feature space.
This leads to significant improvement of object recognition under occlusion for
our network relative to the original network trained only on un-occluded
images. In addition to providing practical benefits in object recognition under
occlusion, this work demonstrates that the use of self-generated composition of
visual scenes through the synthesis loop, combined with the object persistence
constraint, can provide opportunities for neural networks to discover new
relevant patterns in the data, and become more flexible in dealing with novel
situations.
Comment: Accepted by 14th Conference on Computer and Robot Visio
Deformable Part Models are Convolutional Neural Networks
Deformable part models (DPMs) and convolutional neural networks (CNNs) are
two widely used tools for visual recognition. They are typically viewed as
distinct approaches: DPMs are graphical models (Markov random fields), while
CNNs are "black-box" non-linear classifiers. In this paper, we show that a DPM
can be formulated as a CNN, thus providing a novel synthesis of the two ideas.
Our construction involves unrolling the DPM inference algorithm and mapping
each step to an equivalent (and at times novel) CNN layer. From this
perspective, it becomes natural to replace the standard image features used in
DPM with a learned feature extractor. We call the resulting model DeepPyramid
DPM and experimentally validate it on PASCAL VOC. DeepPyramid DPM significantly
outperforms DPMs based on histograms of oriented gradients features (HOG) and
slightly outperforms a comparable version of the recently introduced R-CNN
detection system, while running an order of magnitude faster.
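One step of DPM inference that lends itself to a pooling-style layer is the search over part displacements: each part's score map is maximized over nearby placements, minus a deformation cost. The sketch below is a brute-force illustration of that idea, not the paper's layer. The function name `deformation_pooling`, the single isotropic quadratic weight `w_quad`, and the fixed search `radius` are assumptions.

```python
import numpy as np

def deformation_pooling(scores: np.ndarray, w_quad: float = 1.0,
                        radius: int = 2) -> np.ndarray:
    """out[y, x] = max over (dy, dx) of scores[y+dy, x+dx] - w_quad*(dy^2 + dx^2).

    Hypothetical, brute-force sketch; with w_quad = 0 it reduces to
    stride-1 max pooling over a (2*radius + 1)^2 window.
    """
    h, w = scores.shape
    out = np.full((h, w), -np.inf, dtype=np.float64)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            if abs(dy) >= h or abs(dx) >= w:
                continue  # displacement larger than the map: nothing overlaps
            cost = w_quad * (dy * dy + dx * dx)
            # Output positions whose displaced source stays inside the map.
            dst_y = slice(max(0, -dy), min(h, h - dy))
            dst_x = slice(max(0, -dx), min(w, w - dx))
            src_y = slice(max(0, dy), min(h, h + dy))
            src_x = slice(max(0, dx), min(w, w + dx))
            out[dst_y, dst_x] = np.maximum(out[dst_y, dst_x],
                                           scores[src_y, src_x] - cost)
    return out

# Toy part-score map with a single strong response in the centre.
scores = np.zeros((5, 5))
scores[2, 2] = 5.0
pooled = deformation_pooling(scores, w_quad=1.0, radius=2)
plain_max = deformation_pooling(scores, w_quad=0.0, radius=2)
```

Setting the deformation weight to zero recovers ordinary max pooling, which is one way to see why such a step fits alongside standard CNN layers.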
Expanded Parts Model for Semantic Description of Humans in Still Images
We introduce an Expanded Parts Model (EPM) for recognizing human attributes
(e.g. young, short hair, wearing suit) and actions (e.g. running, jumping) in
still images. An EPM is a collection of part templates which are learnt
discriminatively to explain specific scale-space regions in the images (in
human-centric coordinates). This is in contrast to current models, which consist
of relatively few (i.e. a mixture of) 'average' templates. EPM uses only a
subset of the parts to score an image and scores the image sparsely in space,
i.e. it ignores redundant and random background in an image. To learn our
model, we propose an algorithm which automatically mines parts and learns
corresponding discriminative templates together with their respective locations
from a large number of candidate parts. We validate our method on three recent
challenging datasets of human attributes and actions. We obtain convincing
qualitative and state-of-the-art quantitative results on the three datasets.
Comment: Accepted for publication in IEEE Transactions on Pattern Analysis and
Machine Intelligence (TPAMI