Log-Euclidean Bag of Words for Human Action Recognition
Representing videos by densely extracted local space-time features has
recently become a popular approach for analysing actions. In this paper, we
tackle the problem of categorising human actions by devising Bag of Words (BoW)
models based on covariance matrices of spatio-temporal features, with the
features formed from histograms of optical flow. Since covariance matrices form
a special type of Riemannian manifold, the space of Symmetric Positive Definite
(SPD) matrices, non-Euclidean geometry should be taken into account while
discriminating between covariance matrices. To this end, we propose to embed
SPD manifolds into Euclidean spaces via a diffeomorphism and extend the BoW
approach to its Riemannian version. The proposed BoW approach takes into
account the manifold geometry of SPD matrices during the generation of the
codebook and histograms. Experiments on challenging human action datasets show
that the proposed method obtains notable improvements in discrimination
accuracy, in comparison to several state-of-the-art methods.
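The abstract does not spell out the embedding, so the following is only a minimal sketch of one common way to realise a log-Euclidean Bag of Words: each covariance matrix is mapped to a vector through the matrix logarithm, and a codebook is then built with ordinary k-means in that vector space. The random feature matrices stand in for histograms of optical flow, and the helper names (`spd_log`, `log_euclidean_vector`, `kmeans`, `bow_histogram`) and all parameter values are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def spd_log(S):
    """Matrix logarithm of a symmetric positive definite matrix via eigendecomposition."""
    w, V = np.linalg.eigh(S)
    return (V * np.log(w)) @ V.T

def log_euclidean_vector(S):
    """Map an SPD matrix to a Euclidean vector: the upper triangle of log(S).
    Off-diagonal entries are scaled by sqrt(2) so that the vector's Euclidean
    norm equals the Frobenius norm of log(S)."""
    L = spd_log(S)
    iu = np.triu_indices(L.shape[0])
    scale = np.where(iu[0] == iu[1], 1.0, np.sqrt(2.0))
    return L[iu] * scale

def kmeans(X, k, iters=20, seed=0):
    """Plain Lloyd's k-means, used here to build the codebook."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centres[j] = X[labels == j].mean(axis=0)
    return centres

def bow_histogram(vectors, codebook):
    """Assign each embedded descriptor to its nearest codeword and
    return an L1-normalised histogram of codeword counts."""
    d = np.linalg.norm(vectors[:, None, :] - codebook[None, :, :], axis=2)
    assign = d.argmin(axis=1)
    hist = np.bincount(assign, minlength=len(codebook)).astype(float)
    return hist / hist.sum()

# Stand-in data: covariance matrices of random 4-dimensional features.
rng = np.random.default_rng(0)

def random_cov(dim=4, n=50):
    F = rng.normal(size=(n, dim))
    return np.cov(F, rowvar=False) + 1e-6 * np.eye(dim)

covs = [random_cov() for _ in range(30)]
X = np.stack([log_euclidean_vector(C) for C in covs])
codebook = kmeans(X, k=5)
hist = bow_histogram(X, codebook)
```

In this sketch a d x d covariance matrix becomes a vector of length d(d+1)/2, after which any Euclidean clustering or classification tool can be applied unchanged.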
LEARNING SALIENCY FOR HUMAN ACTION RECOGNITION
PhD
When we look at a visual stimulus, certain areas stand out
from their neighbouring areas and immediately grab our attention. A map that
identifies such areas is called a visual saliency map. As humans can easily recognize
actions when watching videos, having their saliency maps available might be beneficial
for a fully automated action recognition system. In this thesis we look into ways of
learning to predict visual saliency and how to use the learned saliency for action
recognition.
In the first phase, as opposed to approaches that use manually designed
features for saliency prediction, we propose several multilayer architectures for learning
saliency features. First, we learn first-layer features in a two-layer architecture using
an unsupervised learning algorithm. Second, we learn second-layer features in a two-layer
architecture using supervision from recorded human gaze fixations. Third, we
use a deep architecture that learns features at all layers using only supervision from
recorded human gaze fixations.
We show that the saliency prediction results we obtain are better than those
obtained by approaches that use manually designed features. We also show that
using supervision at higher levels yields better saliency prediction results, i.e. the
second approach outperforms the first, and the third outperforms the second.
In the second phase we focus on how saliency can be used to localize areas that will
be used for action classification. In contrast to manually designed action features,
such as HOG/HOF, we learn the features using a fully supervised deep learning
architecture. We show that our features in combination with the predicted saliency
(from the first phase) outperform manually designed features. We further develop
an SVM framework that uses the predicted saliency and learned action features to
both localize (in terms of bounding boxes) and classify the actions. We use saliency
prediction as an additional cost in the SVM training and testing procedure when
inferring the bounding box locations. We show that the approach in which the saliency
cost is added yields better action recognition results than the approach in which the
cost is not added. The improvement is larger when the cost is added both in training
and testing, rather than just in testing.
Evolving weighting schemes for the Bag of Visual Words
The Bag of Visual Words (BoVW) is an established representation in computer vision. Taking inspiration from text mining, this representation has proved to be very effective in many domains. However, in most cases, standard term-weighting schemes are adopted (e.g., term-frequency or TF-IDF). It remains an open question whether alternative weighting schemes could boost the performance of methods based on BoVW. More importantly, it is unknown whether effective weighting schemes can be learned automatically from scratch. This paper sheds some light on both of these unknowns. On the one hand, we report an evaluation of the weighting schemes that are most common in text mining but rarely used in computer vision tasks. On the other hand, we propose an evolutionary algorithm capable of automatically learning weighting schemes for computer vision problems. We report empirical results of an extensive study on several computer vision problems. Results show the usefulness of the proposed method.
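For readers unfamiliar with the standard schemes mentioned above, here is a minimal sketch of one common TF-IDF variant applied to a matrix of visual-word counts. It illustrates only the baseline weighting, not the evolutionary algorithm, whose details the abstract does not give. The function name `tf_idf` and the toy count matrix are assumptions made for illustration.

```python
import numpy as np

def tf_idf(counts):
    """Weight a (n_images, n_visual_words) matrix of raw visual-word counts.
    TF is the count normalised by the image's total number of words;
    IDF is log(n_images / number of images containing the word)."""
    counts = np.asarray(counts, dtype=float)
    tf = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1.0)
    df = (counts > 0).sum(axis=0)
    idf = np.log(counts.shape[0] / np.maximum(df, 1))
    return tf * idf

# Toy example: 3 images, 3 visual words.
counts = [[2, 0, 1],
          [1, 1, 0],
          [3, 0, 0]]
weighted = tf_idf(counts)
```

In the toy matrix the first visual word occurs in every image, so its IDF is zero and it contributes nothing after weighting, whereas the rarer second and third words are emphasised.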