Towards practical automated human action recognition
University of Technology, Sydney. Faculty of Engineering and Information Technology.
Modern video surveillance requires addressing high-level concepts such as humans' actions and
activities. Automated human action recognition is an interesting research area, as well as one of the
main trends in the automated video surveillance industry. The typical goal of action recognition is that
of labelling an image sequence (video) using one out of a set of action labels. In general, it requires
the extraction of a feature set from the relevant video, followed by the classification of the extracted
features. Despite the many approaches for feature set extraction and classification proposed to date,
some barriers to practical action recognition still exist. We argue that recognition accuracy, speed,
robustness and the required hardware are the main factors to build a practical human action
recognition system to be run on a typical PC for a real-time video surveillance application. For
example, a computationally-heavy set of measurements may prevent practical implementation on
common platforms.
The main focus of this thesis is to tackle the main difficulties and propose solutions towards a
practical action recognition system. The main outstanding difficulties that we have addressed in this
thesis include 1) initialisation issues with model training; 2) feature sets of limited computational
weight suitable for real-time applications; 3) model robustness to outliers; and 4) pending issues with
the standardisation of software interfaces. In the following, we provide a description of our
contributions to the resolution of these issues.
Amongst the different approaches for classifying actions, graphical models such as
the hidden Markov model (HMM) have been widely exploited by many researchers. Such models
include observation probabilities which are generally modelled by mixtures of Gaussian components.
When learning an HMM by way of Expectation-Maximisation (EM) algorithms, arbitrary choices
must be made for their initial parameters. The initial choices have a major impact on the parameters at
convergence and, in turn, on the recognition accuracy. This dependence forces us to repeat training
with different initial parameters until satisfactory cross-validation accuracy is attained. Such a process
is overall empirical and time consuming.
We argue that one-off initialisation can offer a better trade-off between training time and accuracy,
and as one of the main contributions of this thesis, we propose two methods for deterministic
initialisation of the Gaussian components' centres. The first method is a time segmentation-based
approach which divides each training sequence into the requested number of clusters (product of the
number of HMM states and the number of Gaussian components in each state) in the time domain.
Then, clusters' centres are averaged among all the training sequences to compute the initial centre for
each Gaussian component. The second approach is a histogram-based approach which tries to
initialise the components' centres with the more popular values among the training data in terms of
density (similar to mode-seeking approaches). The histogram-based approach is performed
incrementally, considering one feature at a time. Either centre initialisation approach is followed by
dispatching the resulting Gaussian components onto HMM states. The reference component
dispatching method assigns components to states in arbitrary order. In contrast, we propose two
more informed methods that aim to place components with closer centres in the same state,
which can improve the correct recognition rate.
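The time segmentation-based initialisation can be sketched as follows. This is an illustrative reading of the approach, not the thesis's exact procedure: the function name, the use of equal contiguous splits, and the averaging order are our own choices.

```python
import numpy as np

def time_segmentation_init(sequences, n_states, n_components):
    """Deterministic centre initialisation (a sketch of the time-segmentation idea).

    Each training sequence (a T_i x D array) is split into
    n_states * n_components contiguous segments along the time axis;
    the segment means are then averaged across all training sequences,
    giving one initial centre per Gaussian component.
    """
    k = n_states * n_components            # total number of components
    per_seq_means = []
    for seq in sequences:
        # split the frames of this sequence into k contiguous time segments
        segments = np.array_split(seq, k, axis=0)
        per_seq_means.append([seg.mean(axis=0) for seg in segments])
    # average the j-th segment mean over all training sequences
    centres = np.mean(per_seq_means, axis=0)   # shape: (k, D)
    return centres.reshape(n_states, n_components, -1)
```

Because no random choice is involved, the initialisation is one-off: running it twice on the same data yields identical centres.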
Experiments over three human action video datasets (Weizmann [1], MuHAVi [2] and Hollywood
[3]) show that our proposed deterministic initialisation methods are capable of achieving accuracy
above the average of repeated random initialisations (by about 1 to 3 per cent in a 6-run random
initialisation experiment) and comparable to the best. At the same time, one-off deterministic initialisation can save
the required training time substantially compared to repeated random initialisations, e.g. up to 83% in
the case of 6 runs of random initialisation. The proposed methods are general as they naturally extend
to other models where observation densities are conditioned on discrete latent variables, such as
dynamic Bayesian networks (DBNs) and switching models.
As another contribution, we propose a simple and computationally lightweight feature set, named
sectorial extreme points, which requires only 1.6 ms per frame for extraction on a reference PC. We
believe a lightweight feature set is more appropriate for the task of action recognition in real-time
surveillance applications with the usual requirement of processing 25 frames per second (PAL video
rate). The proposed feature set represents the coordinates of the extreme points in the contour of a
subject's foreground mask. Various experiments demonstrate the strength of the proposed feature set in
terms of classification accuracy, outperforming similar feature sets such as the star skeleton [4] (by
more than 3%) and the well-known projection histograms (by up to 7%).
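As an illustration of this flavour of feature (not the thesis's exact construction), extreme points on a silhouette contour could be picked per angular sector around the centroid. The sector count, tie-breaking, and empty-sector fallback below are all our own assumptions.

```python
import numpy as np

def sectorial_extreme_points(contour, n_sectors=8):
    """Illustrative sketch of a sectorial extreme-point feature.

    `contour` is an (N, 2) array of (x, y) points on the subject's
    foreground-mask boundary.  The plane around the centroid is divided
    into `n_sectors` equal angular sectors, and the contour point
    farthest from the centroid within each sector is kept.
    """
    centroid = contour.mean(axis=0)
    rel = contour - centroid
    angles = np.arctan2(rel[:, 1], rel[:, 0])            # in (-pi, pi]
    dists = np.hypot(rel[:, 0], rel[:, 1])
    sector = ((angles + np.pi) / (2 * np.pi) * n_sectors).astype(int) % n_sectors
    feats = np.zeros((n_sectors, 2))
    for s in range(n_sectors):
        idx = np.nonzero(sector == s)[0]
        if idx.size:                                     # sector may be empty
            feats[s] = contour[idx[np.argmax(dists[idx])]]
        else:
            feats[s] = centroid                          # fallback: centroid itself
    return feats.ravel()                                 # fixed-length descriptor
```

The appeal of such a descriptor is its fixed, small dimensionality (2 × n_sectors values per frame) and the trivial per-frame cost, consistent with the real-time motivation stated above.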
Another main issue in density modelling of the extracted features is the outlier problem. The
extraction of human features from videos is often inaccurate and prone to outliers. Such outliers can
severely affect density modelling when the Gaussian distribution is used as the model since it is short-tailed
and highly sensitive to outliers. Hence, outliers can affect the classification accuracy of the
HMM-based action recognition approaches that exploit Gaussian distribution as the base component.
In contrast, the Student's t-distribution is more robust to outliers thanks to its longer tail and can be
exploited for density modelling to improve the recognition rate in the presence of abnormal data. As
another main contribution, we present an HMM which uses mixtures of t-distributions as observation
probabilities and apply it for the recognition task. The conducted experiments over the Weizmann and
MuHAVi datasets with various feature sets report a remarkable improvement of up to 9% in
classification accuracy when using an HMM with mixtures of t-distributions instead of mixtures of
Gaussians. Using our own proposed sectorial extreme points feature set, we have achieved the
maximum possible classification accuracy (100%) over the Weizmann dataset. This achievement
should be considered jointly with the fact that we have used a lightweight feature set.
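The robustness argument can be seen numerically with a minimal comparison of the Gaussian and Student's t log-densities at an outlying point. The choice of ν = 3 degrees of freedom is purely illustrative, not taken from the thesis.

```python
import math

def gauss_logpdf(x, mu=0.0, sigma=1.0):
    # Log-density of a univariate Gaussian
    z = (x - mu) / sigma
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - 0.5 * z * z

def student_t_logpdf(x, mu=0.0, sigma=1.0, nu=3.0):
    # Log-density of a location-scale Student's t: for small nu its tail
    # decays polynomially rather than exponentially
    z = (x - mu) / sigma
    return (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
            - 0.5 * math.log(nu * math.pi * sigma ** 2)
            - (nu + 1) / 2 * math.log1p(z * z / nu))

# An observation lying 10 standard deviations from the mean:
outlier = 10.0
print(gauss_logpdf(outlier))      # ≈ -50.9: the Gaussian all but rules it out
print(student_t_logpdf(outlier))  # ≈ -8.1: the t keeps non-negligible density
```

In an EM fit, a single such outlier contributes an enormous negative log-likelihood under the Gaussian and therefore drags the component parameters towards it, whereas the t-distribution effectively down-weights it.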
On a different ground, and from the implementation viewpoint, surveillance software for
automated human action recognition requires portability over a variety of platforms, from servers to
mobile devices. The current products mainly target low level video analysis tasks, e.g. video
annotation, instead of higher level ones, such as action recognition. Therefore, we explore the
potential of the MPEG-7 standard to provide a standard interface platform (through descriptors and
architectures) for human action recognition from surveillance cameras. As the last contribution of this
work, we present two novel MPEG-7 descriptors, one symbolic and the other feature-based, alongside
two different architectures: the server-intensive, which is more suitable for "thin" client devices, such
as PDAs, and the client-intensive, which is more appropriate for "thick" clients, such as desktops. We
evaluate the proposed descriptors and architectures by way of a scenario analysis.
We believe that through the four contributions of this thesis, human action recognition systems
have become more practical. While some contributions are specific to generative models such as the
HMM, other contributions are more general and can be exploited with other classification approaches.
We acknowledge that the entire area of human action recognition is progressing at an enormous pace,
and that other outstanding issues are being resolved by research groups world-wide. We hope that the
reader will enjoy the content of this work.
Human action recognition based on key postures
University of Technology, Sydney. Faculty of Engineering and Information Technology.
Human motion analysis has attracted considerable interest in the computer vision area
due to the large number of potential applications and its inherent complexity. Currently,
human motion analysis is at an early stage. Its final aim is to generate an
easily understood, high-level semantic description of a given scene. Human action
recognition is an important step to the final aim of human motion analysis.
Human Detection
Human detection is part of the field of human motion analysis, and the thesis
addresses it first. The thesis proposes a method that uses a histogram of angles to
discriminate pedestrians from vehicles. This method is motivated by the
observation that humans are non-rigid objects: an angle formed by the centroid point and
two bottom points of a human changes periodically, while the same angle for a vehicle is
relatively static. In this part, the thesis also presents an approach to detect humans in
static images, which uses human geometric features
to fulfill the task.
Human Action Recognition
The thesis then focuses on human action recognition, proposing what will be
called a key-posture-based human action recognition approach. As is well known,
human actions can be well described by a few important postures (called key postures)
which are significantly different from each other, while all other postures can be
clustered to these key postures. Therefore, these key postures can be used to represent
and to infer the corresponding human action. The benefit of using key postures to
represent human action is reduced computational complexity. The thesis proposes
two methods for human action recognition based on key postures: one based on shape
features and the other based on Radon transforms. Both methods follow three steps to achieve action recognition.
These steps are video processing, key posture extraction and action recognition.
A two-step approach is proposed to extract key postures from preprocessed action
video. These two steps are coarse selection and fine selection. Feature extraction and
representation are discussed in both steps. After key postures are extracted from a
video, key posture sequences are used to represent human actions. Each key posture
sequence is regarded as an action template. In order to compare two action sequences,
Dynamic Time Warping (DTW) is applied to determine the distance between the two
action sequences.
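A standard DTW distance between two feature sequences can be sketched as follows. The Euclidean frame-to-frame cost is our assumption; the thesis may use a different local cost or step pattern.

```python
import numpy as np

def dtw_distance(seq_a, seq_b):
    """Dynamic Time Warping distance between two posture sequences.

    seq_a, seq_b: arrays of shape (T, D), one feature vector (e.g. a
    key-posture descriptor) per frame.  Returns the cost of the optimal
    monotonic alignment between the two sequences.
    """
    n, m = len(seq_a), len(seq_b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(seq_a[i - 1] - seq_b[j - 1])
            # extend the cheapest of: diagonal match, insertion, deletion
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```

Because the alignment is elastic in time, a sequence and a time-stretched copy of it (the same postures held for more frames) obtain a distance of zero, which is exactly the property needed when actions are performed at different speeds.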
In the second method, in order to obtain key postures, Radon transforms are
extracted from the preprocessed silhouettes. Then, an unsupervised
cluster analysis is applied to the Radon transforms to identify the key postures
for each sequence. Such key postures are used in the subsequent training and testing
procedure. Several benchmark classifiers are used in this work for action learning and
classification.
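The unsupervised clustering step can be sketched with a minimal k-means over precomputed per-frame Radon feature vectors. The deterministic evenly spaced seeding, iteration count, and function name are our own illustrative choices, and the subsequent benchmark-classifier step is not shown.

```python
import numpy as np

def key_postures_kmeans(features, k, n_iter=50):
    """Minimal k-means sketch: cluster per-frame Radon feature vectors
    and return the k cluster centres as the sequence's key postures.

    features: (T, D) array, one flattened Radon-transform vector per frame.
    """
    # seed centres with k evenly spaced frames (deterministic, for illustration)
    idx = np.linspace(0, len(features) - 1, k).astype(int)
    centres = features[idx].astype(float)
    for _ in range(n_iter):
        # assign every frame to its nearest centre
        d = np.linalg.norm(features[:, None, :] - centres[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # move each centre to the mean of its assigned frames
        for c in range(k):
            if np.any(labels == c):
                centres[c] = features[labels == c].mean(axis=0)
    return centres, labels
```

The returned centres play the role of key postures, and the per-frame labels turn each video into a discrete key-posture sequence that the benchmark classifiers can then be trained on.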
Author's Publications
This thesis covers the research conducted by the author while undertaking
the degree. Most of the results have been published in research papers in refereed
publications, which are listed under the Author's Publications for the Doctor of Philosophy (PhD) degree.
STV-based Video Feature Processing for Action Recognition
In comparison to still image-based processes, video features can provide rich and intuitive information about dynamic events occurring over a period of time, such as human actions, crowd behaviours, and other subject pattern changes. Although substantial progress has been made in the last decade on image processing, with successful applications in face matching and object recognition, video-based event detection still remains one of the most difficult challenges in computer vision research due to its complex continuous or discrete input signals, arbitrary dynamic feature definitions, and often ambiguous analytical methods. In this paper, a Spatio-Temporal Volume (STV) and region intersection (RI) based 3D shape-matching method is proposed to facilitate the definition and recognition of human actions recorded in videos. The distinctive characteristics and the performance gain of the devised approach stem from a coefficient factor-boosted 3D region intersection and matching mechanism developed in this research. This paper also reports the investigation into techniques for efficient STV data filtering, which reduces the amount of voxels (volumetric pixels) that need to be processed in each operational cycle of the implemented system. The encouraging features and improvements in operational performance registered in the experiments are discussed at the end.
Action Recognition in Videos: from Motion Capture Labs to the Web
This paper presents a survey of human action recognition approaches based on
visual data recorded from a single video camera. We propose an organizing
framework which puts in evidence the evolution of the area, with techniques
moving from heavily constrained motion capture scenarios towards more
challenging, realistic, "in the wild" videos. The proposed organization is
based on the representation used as input for the recognition task, emphasizing
the hypotheses assumed and thus the constraints imposed on the type of video
that each technique is able to address. Making the hypotheses and
constraints explicit renders the framework particularly useful to select a method, given
an application. Another advantage of the proposed organization is that it
allows categorizing newest approaches seamlessly with traditional ones, while
providing an insightful perspective of the evolution of the action recognition
task up to now. That perspective is the basis for the discussion in the end of
the paper, where we also present the main open issues in the area.
Comment: Preprint submitted to CVIU, survey paper, 46 pages, 2 figures, 4
tables
Efficient and effective human action recognition in video through motion boundary description with a compact set of trajectories
Human action recognition (HAR) is at the core of human-computer interaction and video scene understanding. However, achieving effective HAR in an unconstrained environment is still a challenging task. To that end, trajectory-based video representations are currently widely used. Despite the promising levels of effectiveness achieved by these approaches, problems regarding computational complexity and the presence of redundant trajectories still need to be addressed in a satisfactory way. In this paper, we propose a method for trajectory rejection, reducing the number of redundant trajectories without degrading the effectiveness of HAR. Furthermore, to realize efficient optical flow estimation prior to trajectory extraction, we integrate a method for dynamic frame skipping. Experiments with four publicly available human action datasets show that the proposed approach outperforms state-of-the-art HAR approaches in terms of effectiveness, while simultaneously mitigating the computational complexity.
An original framework for understanding human actions and body language by using deep neural networks
The evolution of both fields of Computer Vision (CV) and Artificial Neural Networks (ANNs) has allowed the development of efficient automatic systems for the analysis of people's behaviour.
By studying hand movements it is possible to recognize gestures, often used by people to communicate information in a non-verbal way.
These gestures can also be used to control or interact with devices without physically touching them. In particular, sign language and semaphoric hand gestures are the two foremost areas of interest due to their importance in Human-Human Communication (HHC) and Human-Computer Interaction (HCI), respectively.
The processing of body movements, meanwhile, plays a key role in the action recognition and affective computing fields. The former is essential to understand how people act in an environment, while the latter tries to interpret people's emotions based on their poses and movements;
both are essential tasks in many computer vision applications, including event recognition and video surveillance.
In this Ph.D. thesis, an original framework for understanding Actions and body language is presented. The framework is composed of three main modules: in the first one, a Long Short Term Memory Recurrent Neural Networks (LSTM-RNNs) based method for the Recognition of Sign Language and Semaphoric Hand Gestures is proposed; the second module presents a solution based on 2D skeleton and two-branch stacked LSTM-RNNs for action recognition in video sequences; finally, in the last module, a solution for basic non-acted emotion recognition by using 3D skeleton and Deep Neural Networks (DNNs) is provided.
The performances of RNN-LSTMs are explored in depth, due to their ability to model the long term contextual information of temporal sequences, making them suitable for analysing body movements.
All the modules were tested using challenging datasets that are well known in the state of the art, showing remarkable results compared to current methods in the literature.