A Bag-of-Words Equivalent Recurrent Neural Network for Action Recognition
The traditional bag-of-words approach has found a wide range of applications
in computer vision. The standard pipeline consists of the generation of a visual
vocabulary, the quantization of features into histograms of visual words, and
a classification step, for which a support vector machine with a non-linear
kernel is typically used. Given large amounts of data, however, the
model suffers from a lack of discriminative power. This applies particularly
to action recognition, where the vast amount of video features needs to be
subsampled for unsupervised visual vocabulary generation. Moreover, the kernel
computation can be very expensive on large datasets. In this work, we propose a
recurrent neural network that is equivalent to the traditional bag-of-words
approach but enables discriminative training. The model further allows the
kernel computation to be incorporated directly into the network, solving the
complexity issue and allowing the complete classification system to be
represented within a single network. We evaluate our method on four
recent action recognition benchmarks and show that it outperforms both the
conventional model and sparse coding methods.
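For concreteness, here is a minimal sketch of the conventional pipeline this abstract describes (vocabulary generation, histogram quantization, kernel SVM) using scikit-learn; the descriptor dimension, vocabulary size, and RBF kernel are illustrative assumptions, not the authors' settings:

```python
# Classical bag-of-words pipeline: k-means vocabulary, histogram
# quantization, non-linear kernel SVM. All sizes are toy values.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Toy data: each "video" is a variable-length set of local descriptors.
videos = [rng.normal(size=(rng.integers(50, 100), 16)) for _ in range(20)]
labels = rng.integers(0, 2, size=20)

# 1) Visual vocabulary from the pooled descriptors.
all_desc = np.vstack(videos)
vocab = KMeans(n_clusters=32, n_init=5, random_state=0).fit(all_desc)

# 2) Quantize each video into a normalized histogram of visual words.
def bow_histogram(desc):
    words = vocab.predict(desc)
    hist = np.bincount(words, minlength=32).astype(float)
    return hist / hist.sum()

X = np.array([bow_histogram(v) for v in videos])

# 3) Non-linear kernel SVM; this kernel evaluation is the expensive step
# on large datasets that the proposed network absorbs.
clf = SVC(kernel="rbf", gamma="scale").fit(X, labels)
print(clf.score(X, labels))
```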
Bag of Genres for Video Retrieval
Often, videos are composed of multiple concepts or even genres. For instance,
news videos may contain sports, action, nature, etc. Therefore, encoding the
distribution of such concepts/genres in a compact and effective representation
is a challenging task. To this end, we propose the Bag of Genres
representation, which is based on a visual dictionary defined by a genre
classifier. Each visual word corresponds to a region in the classification
space. The Bag of Genres video vector contains a summary of the activations of
each genre in the video content. We evaluate the proposed method for video
genre retrieval using the dataset of MediaEval Tagging Task of 2012 and for
video event retrieval using the EVVE dataset. Results show that the proposed
method achieves results comparable to or better than state-of-the-art methods,
with the advantage of providing a much more compact representation than
existing features.
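A hedged sketch of how such a genre-activation summary might be formed from per-frame classifier outputs; the max-pooling and L2 normalization choices below are assumptions for illustration, not necessarily the paper's exact construction:

```python
# Summarize per-frame genre classifier activations into one compact
# video descriptor (the Bag-of-Genres idea, simplified).
import numpy as np

def bag_of_genres(frame_scores: np.ndarray) -> np.ndarray:
    """frame_scores: (n_frames, n_genres) softmax outputs of a genre
    classifier; returns a single n_genres-dim summary for the video."""
    summary = frame_scores.max(axis=0)          # peak activation per genre
    return summary / np.linalg.norm(summary)    # L2-normalize for retrieval

rng = np.random.default_rng(1)
scores = rng.dirichlet(np.ones(10), size=200)   # 200 frames, 10 genres
print(bag_of_genres(scores).shape)              # (10,)
```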
Hyper-Fisher Vectors for Action Recognition
In this paper, we propose a novel encoding scheme for recognizing actions in
videos that combines Fisher vector and bag-of-words encodings. The proposed
Hyper-Fisher vector encoding is the sum of local Fisher vectors computed based
on the traditional Bag-of-Words (BoW) encoding. The proposed encoding is thus
simple, yet an effective representation compared to the traditional Fisher
vector encoding. By extensive evaluation on challenging action recognition
datasets, viz., YouTube, Olympic Sports, UCF50 and HMDB51, we show that the
proposed Hyper-Fisher vector encoding improves recognition performance by
around 2-3% compared to the improved Fisher vector encoding. We also perform
experiments showing that the performance of the Hyper-Fisher vector is robust
to the dictionary size of the BoW encoding.
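One possible reading of this construction, sketched below: descriptors are grouped by their BoW word, a Fisher vector is computed per group, and the local Fisher vectors are summed. The first-order-only Fisher vector and all sizes are simplifying assumptions, not the paper's exact formulation:

```python
# Hyper-Fisher-style encoding: sum of local Fisher vectors, one per BoW cell.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

def fisher_vector(desc, gmm):
    """Simplified Fisher vector: gradients w.r.t. GMM means only."""
    q = gmm.predict_proba(desc)                    # (n, K) soft assignments
    diff = desc[:, None, :] - gmm.means_[None]     # (n, K, d)
    fv = (q[..., None] * diff / np.sqrt(gmm.covariances_)[None]).sum(axis=0)
    return fv.ravel() / len(desc)                  # (K * d,)

rng = np.random.default_rng(2)
desc = rng.normal(size=(500, 8))                   # one video's descriptors
bow = KMeans(n_clusters=4, n_init=3, random_state=0).fit_predict(desc)
gmm = GaussianMixture(n_components=5, covariance_type="diag",
                      random_state=0).fit(desc)

# Sum the local Fisher vectors over the BoW cells.
hyper_fv = sum(fisher_vector(desc[bow == w], gmm)
               for w in range(4) if np.any(bow == w))
print(hyper_fv.shape)                              # (40,)
```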
Large-Scale YouTube-8M Video Understanding with Deep Neural Networks
The video classification problem has been studied for many years. The success
of Convolutional Neural Networks (CNNs) in image recognition tasks gives a
powerful incentive for researchers to create more advanced video classification
approaches. Since video has temporal content, Long Short-Term Memory (LSTM)
networks become a handy tool for modeling long-term temporal cues. Both
approaches need a large dataset of input data. In this paper, three models are
presented to address video classification using the recently announced
YouTube-8M large-scale dataset. The first model is based on a frame pooling
approach. The two other models are based on LSTM networks. A Mixture-of-Experts
intermediate layer is used in the third model, allowing model capacity to be
increased without dramatically increasing computation. A set of experiments on
handling imbalanced training data has been conducted.
Comment: 6 pages, 5 figures, 3 tables
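A minimal sketch of the third model's structure (an LSTM over frame features followed by a Mixture-of-Experts output layer), in PyTorch; layer sizes and the softmax gating design are assumptions for illustration:

```python
# LSTM over per-frame features, with a Mixture-of-Experts classifier head:
# several linear experts whose outputs are mixed by a learned gate.
import torch
import torch.nn as nn

class LSTMMoEClassifier(nn.Module):
    def __init__(self, feat_dim=1024, hidden=256, n_experts=4, n_classes=10):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.experts = nn.ModuleList(
            [nn.Linear(hidden, n_classes) for _ in range(n_experts)])
        self.gate = nn.Linear(hidden, n_experts)

    def forward(self, frames):                   # frames: (B, T, feat_dim)
        _, (h, _) = self.lstm(frames)
        h = h[-1]                                # final hidden state (B, hidden)
        gates = torch.softmax(self.gate(h), dim=-1)            # (B, E)
        logits = torch.stack([e(h) for e in self.experts], 1)  # (B, E, C)
        return (gates.unsqueeze(-1) * logits).sum(dim=1)       # (B, C)

model = LSTMMoEClassifier()
out = model(torch.randn(2, 30, 1024))            # 2 videos, 30 frames each
print(out.shape)                                 # torch.Size([2, 10])
```

Adding experts grows capacity roughly linearly in parameters while the per-example cost stays dominated by the shared LSTM, which matches the stated motivation.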
CNN-VWII: An Efficient Approach for Large-Scale Video Retrieval by Image Queries
This paper aims to solve the problem of large-scale video retrieval by a
query image. First, we define the problem of top-k image-to-video query.
Then, we combine the merits of convolutional neural networks (CNNs) and the
Bag of Visual Words (BoVW) module to design a model for extracting and
representing video frame information. To meet the requirements of
large-scale video retrieval, we propose a visual weighted inverted index (VWII)
and a related algorithm to improve the efficiency and accuracy of the
retrieval process. Comprehensive experiments show that our proposed technique
achieves substantial improvements (up to an order of magnitude speed-up) over
state-of-the-art techniques with similar accuracy.
Comment: submitted to Pattern Recognition Letters
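An illustrative sketch of a weighted inverted index over visual words; the tf-idf weighting below is an assumption, and the paper's VWII weighting scheme may differ:

```python
# Inverted index: visual word -> {video_id: weight}, queried by accumulating
# the weights of videos that share words with the query image.
import math
from collections import defaultdict, Counter

def build_vwii(video_words):
    """video_words: {video_id: list of visual word ids over its frames}."""
    index = defaultdict(dict)
    n_videos = len(video_words)
    df = Counter(w for words in video_words.values() for w in set(words))
    for vid, words in video_words.items():
        tf = Counter(words)
        for w, c in tf.items():
            idf = math.log(n_videos / df[w])
            index[w][vid] = (c / len(words)) * idf
    return index

def query(index, query_words, top_k=5):
    scores = defaultdict(float)
    for w in query_words:                          # only postings for the
        for vid, weight in index.get(w, {}).items():  # query's words are read
            scores[vid] += weight
    return sorted(scores.items(), key=lambda kv: -kv[1])[:top_k]

idx = build_vwii({"v1": [1, 2, 2, 3], "v2": [2, 4], "v3": [1, 1, 5]})
print(query(idx, [1, 2]))
```

The speed-up comes from touching only the postings lists of the query's visual words instead of scoring every video.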
Spatiotemporal CNNs for Pornography Detection in Videos
With the increasing use of social networks and mobile devices, the number of
videos posted on the Internet is growing exponentially. Among the inappropriate
contents published on the Internet, pornography is one of the most worrying as
it can be accessed by teens and children. Two spatiotemporal CNNs, VGG-C3D CNN
and ResNet R(2+1)D CNN, were assessed for pornography detection in videos in
the present study. Experimental results using the Pornography-800 dataset
showed that these spatiotemporal CNNs performed better than some
state-of-the-art methods based on bag of visual words and are competitive with
other CNN-based approaches, reaching an accuracy of 95.1%.
Beyond Short Snippets: Deep Networks for Video Classification
Convolutional neural networks (CNNs) have been extensively applied for image
recognition problems giving state-of-the-art results on recognition, detection,
segmentation and retrieval. In this work we propose and evaluate several deep
neural network architectures to combine image information across a video over
longer time periods than previously attempted. We propose two methods capable
of handling full length videos. The first method explores various convolutional
temporal feature pooling architectures, examining the various design choices
which need to be made when adapting a CNN for this task. The second proposed
method explicitly models the video as an ordered sequence of frames. For this
purpose we employ a recurrent neural network that uses Long Short-Term Memory
(LSTM) cells which are connected to the output of the underlying CNN. Our best
networks exhibit significant performance improvements over previously published
results on the Sports-1M dataset (73.1% vs. 60.9%) and the UCF-101 dataset,
both with (88.6% vs. 88.0%) and without (82.6% vs. 72.8%) additional optical
flow information.
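A minimal sketch of the first method's core idea, temporal pooling of per-frame CNN features before classification; max pooling and the feature dimension are illustrative assumptions:

```python
# Pool per-frame CNN features over time, then classify. Because pooling
# collapses the time axis, the model handles videos of any length.
import torch
import torch.nn as nn

class TemporalPoolingClassifier(nn.Module):
    def __init__(self, feat_dim=2048, n_classes=101):
        super().__init__()
        self.fc = nn.Linear(feat_dim, n_classes)

    def forward(self, frame_feats):           # (B, T, feat_dim) CNN features
        pooled, _ = frame_feats.max(dim=1)    # collapse the time axis
        return self.fc(pooled)

model = TemporalPoolingClassifier()
print(model(torch.randn(2, 120, 2048)).shape)  # works for any frame count T
```

The second method replaces the pooling step with an LSTM over the same per-frame features, preserving frame order instead of discarding it.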
Is Bottom-Up Attention Useful for Scene Recognition?
The human visual system employs a selective attention mechanism to understand
the visual world in an efficient manner. In this paper, we show how
computational models of this mechanism can be exploited for the computer vision
application of scene recognition. First, we consider saliency weighting and
saliency pruning, and provide a comparison of the performance of different
attention models in these approaches in terms of classification accuracy.
Pruning can achieve a high degree of computational savings without
significantly sacrificing classification accuracy. In saliency weighting,
however, we found that classification performance does not improve. In
addition, we present a new method to incorporate salient and non-salient
regions for improved classification accuracy. We treat the salient and
non-salient regions separately and combine them using Multiple Kernel Learning.
We evaluate our approach using the UIUC Sports dataset and find that, with a
small training size, our method improves upon the classification accuracy of
the baseline bag-of-features approach.
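A hedged sketch of the saliency pruning step as described: discard local descriptors at weakly salient locations before building the bag of features. The quantile threshold is an assumption, and any attention model could supply the saliency map:

```python
# Saliency pruning: keep only descriptors whose keypoints fall on
# sufficiently salient pixels, reducing encoding cost downstream.
import numpy as np

def prune_by_saliency(keypoints, descriptors, saliency, keep_frac=0.5):
    """keypoints: (n, 2) integer (row, col); saliency: (H, W) in [0, 1]."""
    s = saliency[keypoints[:, 0], keypoints[:, 1]]
    thresh = np.quantile(s, 1.0 - keep_frac)    # keep the top keep_frac
    mask = s >= thresh
    return keypoints[mask], descriptors[mask]

rng = np.random.default_rng(3)
kps = rng.integers(0, 64, size=(200, 2))
desc = rng.normal(size=(200, 128))
sal = rng.random((64, 64))
kept_kps, kept_desc = prune_by_saliency(kps, desc, sal)
print(len(kept_kps), "of", len(kps), "descriptors kept")
```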
Compositional Structure Learning for Action Understanding
The focus of the action understanding literature has predominantly been
classification; however, there are many applications demanding richer action
understanding, such as mobile robotics and video search, which require
solutions to classification, localization and detection. In this paper, we
propose a compositional model that leverages a new mid-level representation
called compositional trajectories and a locally articulated spatiotemporal
deformable parts model (LASTDPM) for full action understanding. Our method is
advantageous in capturing the variable structure of dynamic human activity over
a long range. First, the compositional trajectories capture long-ranging,
frequently co-occurring groups of trajectories in space-time and represent them
in discriminative hierarchies, where human motion is largely separated from
camera motion; second, LASTDPM learns a structured model with multi-layer
deformable parts to capture multiple levels of articulated motion. We implement
our methods and demonstrate state-of-the-art performance on all three problems:
action detection, localization, and recognition.
Comment: 13 pages
Beyond Spatial Pyramid Matching: Space-time Extended Descriptor for Action Recognition
We address the problem of generating video features for action recognition.
The spatial pyramid and its variants have been very popular feature models due
to their success in balancing spatial location encoding and spatial invariance.
Although it seems straightforward to extend spatial pyramid to the temporal
domain (spatio-temporal pyramid), the large spatio-temporal diversity of
unconstrained videos and the resulting significantly higher dimensional
representations make it less appealing. This paper introduces the space-time
extended descriptor, a simple but efficient alternative way to include the
spatio-temporal location into the video features. Instead of only coding motion
information and leaving the spatio-temporal location to be represented at the
pooling stage, location information is used as part of the encoding step. This
method is a much more effective and efficient way of encoding location than
the fixed-grid model, because it avoids the danger of over-committing to
artificial boundaries and its dimensionality is relatively low.
Experimental results on several benchmark datasets show that, despite its
simplicity, this method achieves results comparable to or better than the
spatio-temporal pyramid.
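A minimal sketch of the idea: append normalized (x, y, t) coordinates to each local descriptor at the encoding step, adding only three dimensions rather than multiplying the representation size as a pyramid would; the scaling and weight factor are assumptions for illustration:

```python
# Space-time extended descriptor: concatenate the normalized location onto
# each motion descriptor so location is encoded, not handled by pooling grids.
import numpy as np

def space_time_extend(desc, locs, extent, weight=1.0):
    """desc: (n, d) descriptors; locs: (n, 3) (x, y, t); extent: (W, H, T)."""
    norm_loc = locs / np.asarray(extent, dtype=float)   # scale into [0, 1]
    return np.hstack([desc, weight * norm_loc])         # (n, d + 3)

rng = np.random.default_rng(4)
d = rng.normal(size=(1000, 96))                  # e.g. local motion descriptors
xyt = rng.integers(0, [320, 240, 150], size=(1000, 3))
ext = space_time_extend(d, xyt, extent=(320, 240, 150))
print(ext.shape)                                 # (1000, 99): only +3 dims
```

Compare this with a spatio-temporal pyramid, which multiplies the final representation size by the number of grid cells; here the dimensionality grows by three regardless of how finely location matters.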