Human extremity detection and its applications in action detection and recognition
It has been shown that the locations of internal body joints are sufficient visual cues to characterize human motion. In this dissertation I propose that the locations of human extremities, including the head, hands and feet, provide a powerful approximation to internal body motion. I propose the detection of precise extremities from contours obtained from image segmentation or contour tracking. Junctions of the medial axis of the contour are selected as stars. Contour points with a locally maximal distance to the various stars are chosen as candidate extremities. All candidates are filtered by cues including proximity to other candidates, visibility to stars, and robustness to noise-smoothing parameters. I present applications of precise extremities for fast human action detection and recognition. Environment-specific features are built from precise extremities and fed into a block-based hidden Markov model to decode the fence-climbing action from continuous videos. Precise extremities are grouped into stable contacts when the same extremity does not move for a certain duration. Such stable contacts are used to decompose a long continuous video into shorter pieces. Each piece is associated with certain motion features to form primitive motion units. In this way the sequence is abstracted into more meaningful segments, and a search strategy is used to detect the fence-climbing action. Moreover, I propose the histogram of extremities as a general posture descriptor. It is tested in a hidden Markov model based framework for action recognition. I further propose the detection of probable extremities from raw images without any segmentation. Modeling the extremity as an image patch instead of a single point on the contour helps overcome the segmentation difficulty and increases detection robustness. I represent the extremity patches with histograms of oriented gradients. Detection is achieved by window-based image scanning.
To reduce the computational load, I adopt the integral histograms technique without sacrificing accuracy. The result is a probability map in which each pixel denotes the probability of the surrounding patch forming the specific class of extremity. With a probable-extremity map, I propose the histogram of probable extremities as another general posture descriptor. It is tested on several datasets, and the results are compared with those of precise extremities to show the superiority of probable extremities.
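The core of the precise-extremity step above (contour points with a locally maximal distance to a star point) can be sketched as follows. This is an illustrative reconstruction, not the dissertation's implementation: the function name, the circular neighbourhood test and the `min_separation` parameter are assumptions, and the dissertation's additional filtering (proximity, visibility, noise robustness) is omitted.

```python
import numpy as np

def candidate_extremities(contour, star, min_separation=5):
    """Return indices of contour points whose distance to the star point is a
    local maximum along the contour.

    contour: (N, 2) array of (x, y) points, ordered along the contour.
    star: (2,) reference point (e.g. a medial-axis junction).
    min_separation: half-width (in contour indices) of the circular
    neighbourhood used for the local-maximum test.
    """
    d = np.linalg.norm(contour - star, axis=1)
    n = len(d)
    candidates = []
    for i in range(n):
        # Compare against a circular neighbourhood along the contour.
        window = [d[(i + k) % n] for k in range(-min_separation, min_separation + 1)]
        if d[i] == max(window) and d[i] > 0:
            candidates.append(i)
    return candidates
```

On a star-shaped contour the candidates are exactly the tips; in practice they would then be filtered by the cues listed above.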
Towards practical automated human action recognition
University of Technology, Sydney. Faculty of Engineering and Information Technology.
Modern video surveillance requires addressing high-level concepts such as humans' actions and
activities. Automated human action recognition is an interesting research area, as well as one of the
main trends in the automated video surveillance industry. The typical goal of action recognition is that
of labelling an image sequence (video) using one out of a set of action labels. In general, it requires
the extraction of a feature set from the relevant video, followed by the classification of the extracted
features. Despite the many approaches for feature set extraction and classification proposed to date,
some barriers to practical action recognition still exist. We argue that recognition accuracy, speed,
robustness and the required hardware are the main factors to build a practical human action
recognition system to be run on a typical PC for a real-time video surveillance application. For
example, a computationally-heavy set of measurements may prevent practical implementation on
common platforms.
The main focus of this thesis is challenging the main difficulties and proposing solutions towards a
practical action recognition system. The main outstanding difficulties that we have challenged in this
thesis include 1) initialisation issues with model training; 2) feature sets of limited computational
weight suitable for real-time application; 3) model robustness to outliers; and 4) pending issues with
the standardisation of software interfaces. In the following, we provide a description of our
contributions to the resolution of these issues.
Amongst the different classification approaches for classifying actions, graphical models such as
the hidden Markov model (HMM) have been widely exploited by many researchers. Such models
include observation probabilities which are generally modelled by mixtures of Gaussian components.
When learning an HMM by way of Expectation-Maximisation (EM) algorithms, arbitrary choices
must be made for their initial parameters. The initial choices have a major impact on the parameters at
convergence and, in turn, on the recognition accuracy. This dependence forces us to repeat training
with different initial parameters until satisfactory cross-validation accuracy is attained. Such a process
is overall empirical and time consuming.
We argue that one-off initialisation can offer a better trade-off between training time and accuracy,
and as one of the main contributions of this thesis, we propose two methods for deterministic
initialisation of the Gaussian components' centres. The first method is a time segmentation-based
approach which divides each training sequence into the requested number of clusters (product of the
number of HMM states and the number of Gaussian components in each state) in the time domain.
Then, clusters' centres are averaged among all the training sequences to compute the initial centre for
each Gaussian component. The second approach is a histogram-based approach which tries to
initialise the components' centres with the more popular values among the training data in terms of
density (similar to mode seeking approaches). The histogram-based approach is performed
incrementally, considering each feature at a time. Either centre initialisation approach is followed by
dispatching the resulting Gaussian components onto HMM states. The reference component
dispatching method exploits the arbitrary order for dispatching. In contrast, we again propose two
more intelligent methods based on the effort to put components with closer centres in the same state
which can improve the correct recognition rate.
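The first (time segmentation-based) initialisation method described above can be sketched as follows; the function name, the use of `np.array_split` for the temporal division, and the in-order dispatching of components onto states are illustrative assumptions rather than the thesis implementation.

```python
import numpy as np

def time_segmentation_init(sequences, n_states, n_components):
    """Deterministic initialisation of Gaussian component centres by time
    segmentation.

    sequences: list of (T_i, D) arrays of per-frame feature vectors.
    Returns an (n_states, n_components, D) array of initial centres.
    """
    k = n_states * n_components      # requested number of clusters
    d = sequences[0].shape[1]
    centres = np.zeros((k, d))
    for seq in sequences:
        # Split the sequence into k contiguous temporal segments and take
        # the mean feature vector of each segment.
        segments = np.array_split(seq, k)
        centres += np.array([s.mean(axis=0) for s in segments])
    # Average the segment centres over all training sequences.
    centres /= len(sequences)
    # Dispatch consecutive components onto states in temporal order.
    return centres.reshape(n_states, n_components, d)
```

Because the division is purely temporal, the result is fully deterministic: running it twice on the same training set yields identical initial centres, which is precisely what removes the need for repeated random restarts.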
Experiments over three human action video datasets (Weizmann [1], MuHAVi [2] and Hollywood
[3]) prove that our proposed deterministic initialisation methods are capable of achieving accuracy
above the average of repeated random initialisations (by about 1 to 3 per cent in a six-run
experiment) and comparable to the best. At the same time, one-off deterministic initialisation can save
the required training time substantially compared to repeated random initialisations, e.g. up to 83% in
the case of 6 runs of random initialisation. The proposed methods are general as they naturally extend
to other models where observation densities are conditioned on discrete latent variables, such as
dynamic Bayesian networks (DBNs) and switching models.
As another contribution, we propose a simple and computationally lightweight feature set, named
sectorial extreme points, which requires only 1.6 ms per frame for extraction on a reference PC. We
believe a lightweight feature set is more appropriate for the task of action recognition in real-time
surveillance applications with the usual requirement of processing 25 frames per second (PAL video
rate). The proposed feature set represents the coordinates of the extreme points in the contour of a
subject's foreground mask. The various experiments prove the strength of the proposed feature set in
terms of classification accuracy, compared to similar feature sets, such as the star skeleton [4] (by
more than 3%) and the well-known projection histograms (up to 7%).
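A plausible reading of the sectorial extreme points feature is sketched below: for each angular sector around the silhouette centroid, keep the contour point farthest from the centroid. The function name, the sector partitioning and the centroid fallback for empty sectors are illustrative assumptions, not the thesis implementation.

```python
import numpy as np

def sectorial_extreme_points(contour, n_sectors=8):
    """For each angular sector around the contour centroid, return the
    contour point farthest from the centroid.

    contour: (N, 2) array of foreground-mask contour points.
    Returns an (n_sectors, 2) array; empty sectors fall back to the centroid.
    """
    centroid = contour.mean(axis=0)
    rel = contour - centroid
    angles = np.arctan2(rel[:, 1], rel[:, 0])            # in (-pi, pi]
    # Map each angle to a sector index in [0, n_sectors).
    sector = ((angles + np.pi) / (2 * np.pi) * n_sectors).astype(int) % n_sectors
    dist = np.linalg.norm(rel, axis=1)
    feats = np.tile(centroid, (n_sectors, 1))
    for s in range(n_sectors):
        mask = sector == s
        if mask.any():
            feats[s] = contour[mask][np.argmax(dist[mask])]
    return feats
```

Only one pass over the contour is needed, which is consistent with the very low per-frame extraction cost reported above.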
Another main issue in density modelling of the extracted features is the outlier problem. The
extraction of human features from videos is often inaccurate and prone to outliers. Such outliers can
severely affect density modelling when the Gaussian distribution is used as the model since it is short-tailed
and highly sensitive to outliers. Hence, outliers can affect the classification accuracy of the
HMM-based action recognition approaches that exploit Gaussian distribution as the base component.
In contrast, the Student' s t-distribution is more robust to outliers thanks to its longer tail and can be
exploited for density modelling to improve the recognition rate in the presence of abnormal data. As
another main contribution, we present an HMM which uses mixtures of t-distributions as observation
probabilities and apply it for the recognition task. The conducted experiments over the Weizmann and
MuHAVi datasets with various feature sets report a remarkable improvement of up to 9% in
classification accuracy by using HMM with mixtures of t-distributions instead of mixture of
Gaussians. Using our own proposed sectorial extreme points feature set, we have achieved the
maximum possible classification accuracy (100%) over the Weizmann dataset. This achievement
should be considered jointly with the fact that we have used a lightweight feature set.
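The "longer tail" argument above can be made concrete by comparing log-densities: at an outlier many standard deviations from the mean, the Gaussian log-density falls off quadratically while the Student's t falls off only logarithmically. The sketch below implements both log-densities from their standard formulas (a univariate location-scale t; the mixture-of-t HMM itself is not reproduced here).

```python
import math

def gauss_logpdf(x, mu, sigma):
    """Log-density of a Gaussian with mean mu and standard deviation sigma."""
    return -0.5 * math.log(2 * math.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)

def student_t_logpdf(x, mu, sigma, nu):
    """Log-density of a location-scale Student's t with nu degrees of freedom."""
    z = (x - mu) / sigma
    return (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
            - 0.5 * math.log(nu * math.pi * sigma**2)
            - (nu + 1) / 2 * math.log(1 + z**2 / nu))
```

For a point ten standard deviations out, the Gaussian log-density is around -51 while a t with three degrees of freedom stays near -8, so a single outlier frame penalises the t-based observation model far less during both training and decoding.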
On a different ground, and from the implementation viewpoint, surveillance software for
automated human action recognition requires portability over a variety of platforms, from servers to
mobile devices. The current products mainly target low level video analysis tasks, e.g. video
annotation, instead of higher level ones, such as action recognition. Therefore, we explore the
potential of the MPEG-7 standard to provide a standard interface platform (through descriptors and
architectures) for human action recognition from surveillance cameras. As the last contribution of this
work, we present two novel MPEG-7 descriptors, one symbolic and the other feature-based, alongside
two different architectures: the server-intensive, which is more suitable for "thin" client devices, such
as PDAs, and the client-intensive, which is more appropriate for "thick" clients, such as desktops. We
evaluate the proposed descriptors and architectures by way of a scenario analysis.
We believe that through the four contributions of this thesis, human action recognition systems
have become more practical. While some contributions are specific to generative models such as the
HMM, other contributions are more general and can be exploited with other classification approaches.
We acknowledge that the entire area of human action recognition is progressing at an enormous pace,
and that other outstanding issues are being resolved by research groups world-wide. We hope that the
reader will enjoy the content of this work.
Multi-Action Recognition via Stochastic Modelling of Optical Flow and Gradients
In this paper we propose a novel approach to multi-action recognition that
performs joint segmentation and classification. This approach models each
action using a Gaussian mixture using robust low-dimensional action features.
Segmentation is achieved by performing classification on overlapping temporal
windows, which are then merged to produce the final result. This approach is
considerably less complicated than previous methods which use dynamic
programming or computationally expensive hidden Markov models (HMMs). Initial
experiments on a stitched version of the KTH dataset show that the proposed
approach achieves an accuracy of 78.3%, outperforming a recent HMM-based
approach which obtained 71.2%.
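The window-then-merge scheme described above can be sketched as follows. The function name, the majority-vote merging rule and the assumption that per-frame, per-action log-likelihoods are already available (in the paper they come from per-action Gaussian mixtures over low-dimensional features) are all illustrative.

```python
from collections import Counter

def segment_by_windows(frame_scores, win_len=4, stride=2):
    """Classify overlapping temporal windows, then merge them into a
    per-frame labelling by majority vote.

    frame_scores: list of dicts, one per frame, mapping each action name to
    that frame's log-likelihood under the action's model.
    Returns one action label per frame.
    """
    n = len(frame_scores)
    actions = list(frame_scores[0])
    votes = [Counter() for _ in range(n)]
    # Window start positions; append a final window so every frame is covered.
    starts = list(range(0, max(n - win_len, 0) + 1, stride))
    if starts[-1] != max(n - win_len, 0):
        starts.append(max(n - win_len, 0))
    for start in starts:
        window = frame_scores[start:start + win_len]
        # Window label: the action with the highest summed log-likelihood.
        label = max(actions, key=lambda a: sum(f[a] for f in window))
        for t in range(start, min(start + win_len, n)):
            votes[t][label] += 1
    # Merge: per-frame majority vote over all windows covering the frame.
    return [v.most_common(1)[0][0] for v in votes]
```

Segment boundaries then fall wherever the per-frame winning label changes, which is how joint segmentation and classification emerge from a single pass, with no dynamic programming or HMM decoding.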
Continuous Interaction with a Virtual Human
Attentive Speaking and Active Listening require that a Virtual Human be capable of simultaneous perception/interpretation and production of communicative behavior. A Virtual Human should be able to signal its attitude and attention while it is listening to its interaction partner, and be able to attend to its interaction partner while it is speaking – and modify its communicative behavior on-the-fly based on what it perceives from its partner. This report presents the results of a four-week summer project that was part of eNTERFACE’10. The project resulted in progress on several aspects of continuous interaction such as scheduling and interrupting multimodal behavior, automatic classification of listener responses, generation of response eliciting behavior, and models for appropriate reactions to listener responses. A pilot user study was conducted with ten participants. In addition, the project yielded a number of deliverables that are released for public access.
Human activity recognition for the use in intelligent spaces
The aim of this Graduation Project is to develop a generic, biologically inspired activity recognition system for use in intelligent spaces. Intelligent spaces form the context for this project. The goal is to develop a working prototype that can learn and recognize human activities from a limited training set in all kinds of spaces and situations. For testing purposes, the office environment is chosen as the subject intelligent space. The purpose of the intelligent space, in this case the office, is left out of the scope of the project; the scope is limited to the perceptive system of the intelligent space. The notion is that the prototype should not be bound to a specific space, but should be a generic perceptive system able to cope with any given space within the built environment. Since no two spaces are the same, developing a prototype without any domain knowledge in which it can learn and recognize activities is the main challenge of this project. In all layers of the prototype, the data processing is kept as abstract and low-level as possible to keep it as generic as possible. This is done by using local features, scale-invariant descriptors and hidden Markov models for pattern recognition. The novel aspect of the prototype is that it combines structure as well as motion features in one system, making it able to train and recognize a variety of activities in a variety of situations. Everything from rhythmic, expressive actions with a simple cyclic pattern to activities where the movement is subtle and complex, like typing and reading, can be trained and recognized. The prototype has been tested on two very different data sets. The first set was shot in a controlled environment in which simple actions were performed.
The second set was shot in a normal office where daily office activities were captured and categorized afterwards. The prototype has given some promising results, proving it can cope with very different spaces, actions and activities.
Robust density modelling using the student's t-distribution for human action recognition
The extraction of human features from videos is often inaccurate and prone to outliers. Such outliers can severely affect density modelling when the Gaussian distribution is used as the model since it is highly sensitive to outliers. The Gaussian distribution is also often used as base component of graphical models for recognising human actions in the videos (hidden Markov model and others) and the presence of outliers can significantly affect the recognition accuracy. In contrast, the Student's t-distribution is more robust to outliers and can be exploited to improve the recognition rate in the presence of abnormal data. In this paper, we present an HMM which uses mixtures of t-distributions as observation probabilities and show how experiments over two well-known datasets (Weizmann, MuHAVi) reported a remarkable improvement in classification accuracy. © 2011 IEEE
Multi-view human action recognition using 2D motion templates based on MHIs and their HOG description
In this study, a new multi-view human action recognition approach is proposed by exploiting low-dimensional motion information of actions. Before feature extraction, pre-processing steps are performed to remove noise from silhouettes, incurred due to imperfect, but realistic segmentation. Two-dimensional motion templates based on the motion history image (MHI) are computed for each view/action video. Histograms of oriented gradients (HOGs) are used as an efficient description of the MHIs, which are classified using a nearest neighbour (NN) classifier. As compared with existing approaches, the proposed method has three advantages: (i) it does not require a fixed camera setup during the training and testing stages, hence missing camera views can be tolerated, (ii) it has lower memory and bandwidth requirements, and hence (iii) it is computationally efficient, which makes it suitable for real-time action recognition. As far as the authors know, this is the first report of results on the MuHAVi-uncut dataset, which has a large number of action categories and a large set of camera views with noisy silhouettes, and which can be used by future workers as a baseline to improve on. Experimentation on multi-view data with this dataset gives a high accuracy rate of 95.4% using the leave-one-sequence-out cross-validation technique and compares well to similar state-of-the-art approaches.
Sergio A Velastin acknowledges the Chilean National Science and Technology Council (CONICYT) for its funding under grant CONICYT-Fondecyt Regular no. 1140209 ("OBSERVE"). He is currently funded by the Universidad Carlos III de Madrid, the European Union’s Seventh Framework Programme for research, technological development and demonstration under grant agreement nº 600371, el Ministerio de Economía y Competitividad (COFUND2013-51509) and Banco Santander.
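The MHI template at the heart of the approach above is a simple per-pixel recursion: pixels moving in the current frame are set to the maximum intensity, and all others decay. The sketch below shows one update step; the function name and the `tau`/`delta` values are illustrative, and the motion mask is assumed to come from, e.g., a thresholded frame difference of the denoised silhouettes.

```python
import numpy as np

def update_mhi(mhi, motion_mask, tau=255, delta=32):
    """One update step of a motion history image (MHI).

    mhi: (H, W) float array, the running motion history.
    motion_mask: (H, W) boolean array marking pixels moving in this frame.
    Moving pixels are set to tau; all others decay by delta, clipped at 0.
    """
    return np.where(motion_mask, float(tau), np.maximum(mhi - delta, 0.0))
```

Repeating this over a clip yields an image whose brightness encodes recency of motion, from which the HOG descriptor is then computed.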