425 research outputs found
Spatiotemporal visual analysis of human actions
In this dissertation we propose four methods for the recognition of human activities. In all four of
them, the representation of the activities is based on spatiotemporal features that are automatically
detected at areas where there is a significant amount of independent motion, that is, motion that is
due to ongoing activities in the scene. We propose the use of spatiotemporal salient points as features
throughout this dissertation. The algorithms presented, however, can be used with any kind of features,
as long as the latter are well localized and have a well-defined area of support in space and time. We
introduce the utilized spatiotemporal salient points in the first method presented in this dissertation.
By extending previous work on spatial saliency, we measure the variations in the information content of
pixel neighborhoods both in space and time, and detect the points at the locations and scales for which
this information content is locally maximized. In this way, an activity is represented as a collection of
spatiotemporal salient points. We propose an iterative linear space-time warping technique in order
to align the representations in space and time and propose to use Relevance Vector Machines (RVM)
in order to classify each example into an action category. In the second method proposed in this
dissertation we propose to enhance the acquired representations of the first method. More specifically,
we propose to track each detected point in time, and create representations based on sets of trajectories,
where each trajectory expresses how the information engulfed by each salient point evolves over time.
In order to deal with imperfect localization of the detected points, we augment the observation model
of the tracker with background information, acquired using a fully automatic background estimation
algorithm. In this way, the tracker favors solutions that contain a large number of foreground pixels.
In addition, we perform experiments where the tracked templates are localized on specific parts of the
body, like the hands and the head, and we further augment the tracker’s observation model using a
human skin color model. Finally, we use a variant of the Longest Common Subsequence algorithm
(LCSS) in order to acquire a similarity measure between the resulting trajectory representations, and
RVMs for classification. In the third method that we propose, we assume that neighboring salient
points follow a similar motion. This is in contrast to the previous method, where each salient point was
tracked independently of its neighbors. More specifically, we propose to extract a novel set of visual
descriptors that are based on geometrical properties of three-dimensional piece-wise polynomials. The
latter are fitted on the spatiotemporal locations of salient points that fall within local spatiotemporal
neighborhoods, and are assumed to follow a similar motion. The extracted descriptors are invariant in
translation and scaling in space-time. Coupling the neighborhood dimensions to the scale at which the
corresponding spatiotemporal salient points are detected ensures the latter. The descriptors that are
extracted across the whole dataset are subsequently clustered in order to create a codebook, which is
used in order to represent the overall motion of the subjects within small temporal windows.Finally,we use boosting in order to select the most discriminative of these windows for each class, and RVMs for
classification. The fourth and last method addresses the joint problem of localization and recognition
of human activities depicted in unsegmented image sequences. Its main contribution is the use of an
implicit representation of the spatiotemporal shape of the activity, which relies on the spatiotemporal
localization of characteristic ensembles of spatiotemporal features. The latter are localized around
automatically detected salient points. Evidence for the spatiotemporal localization of the activity
is accumulated in a probabilistic spatiotemporal voting scheme. During training, we use boosting in
order to create codebooks of characteristic feature ensembles for each class. Subsequently, we construct
class-specific spatiotemporal models, which encode where in space and time each codeword ensemble
appears in the training set. During testing, each activated codeword ensemble casts probabilistic
votes concerning the spatiotemporal localization of the activity, according to the information stored
during training. We use a Mean Shift Mode estimation algorithm in order to extract the most probable
hypotheses from each resulting voting space. Each hypothesis corresponds to a spatiotemporal volume
which potentially engulfs the activity, and is verified by performing action category classification with
an RVM classifier
Recognising activities by jointly modelling actions and their effects
With the rapid increase in adoption of consumer technologies, including inexpensive
but powerful hardware, robotics appears poised at the cusp of widespread deployment
in human environments. A key barrier that still prevents this is the machine understanding
and interpretation of human activity, through a perceptual medium such as
computer vision, or RBG-D sensing such as with the Microsoft Kinect sensor.
This thesis contributes novel video-based methods for activity recognition. Specifically,
the focus is on activities that involve interactions between the human user and
objects in the environment. Based on streams of poses and object tracking, machine
learning models are provided to recognize various of these interactions. The thesis
main contributions are (1) a new model for interactions that explicitly learns the
human-object relationships through a latent distributed representation, (2) a practical
framework for labeling chains of manipulation actions in temporally extended activities
and (3) an unsupervised sequence segmentation technique that relies on slow feature
analysis and spectral clustering.
These techniques are validated by experiments with publicly available data sets,
such as the Cornell CAD-120 activity corpus which is one of the most extensive publicly
available such data sets that is also annotated with ground truth information. Our
experiments demonstrate the advantages of the proposed methods, over and above state
of the art alternatives from the recent literature on sequence classifiers
Multiple Instance Learning: A Survey of Problem Characteristics and Applications
Multiple instance learning (MIL) is a form of weakly supervised learning
where training instances are arranged in sets, called bags, and a label is
provided for the entire bag. This formulation is gaining interest because it
naturally fits various problems and allows to leverage weakly labeled data.
Consequently, it has been used in diverse application fields such as computer
vision and document classification. However, learning from bags raises
important challenges that are unique to MIL. This paper provides a
comprehensive survey of the characteristics which define and differentiate the
types of MIL problems. Until now, these problem characteristics have not been
formally identified and described. As a result, the variations in performance
of MIL algorithms from one data set to another are difficult to explain. In
this paper, MIL problem characteristics are grouped into four broad categories:
the composition of the bags, the types of data distribution, the ambiguity of
instance labels, and the task to be performed. Methods specialized to address
each category are reviewed. Then, the extent to which these characteristics
manifest themselves in key MIL application areas are described. Finally,
experiments are conducted to compare the performance of 16 state-of-the-art MIL
methods on selected problem characteristics. This paper provides insight on how
the problem characteristics affect MIL algorithms, recommendations for future
benchmarking and promising avenues for research
Recommended from our members
Physical Activity Classification with Conditional Random Fields
In this thesis we develop methods for classifying physical activity using accelerometer recordings. We cast this as a problem of classification in time series with moderate to high dimensional observations at each time point. Specifically, we observe a vector of summary statistics of the accelerometer signal at each point in time, and we wish to use these observations to estimate the type and intensity of physical activity the individual engaged in as it changes over time.
Our methods are based on Conditional Random Fields, which allow us to capture temporal dependence in an individual’s physical activity type without requiring us to model the distribution of the observed features at each point in time. We develop three novel estimation strategies for Conditional Random Fields, evaluate their performance on classification tasks through simulation studies and demonstrate their use in applications with real physical activity data sets
Sampling - Variational Auto Encoder - Ensemble: In the Quest of Explainable Artificial Intelligence
Explainable Artificial Intelligence (XAI) models have recently attracted a
great deal of interest from a variety of application sectors. Despite
significant developments in this area, there are still no standardized methods
or approaches for understanding AI model outputs. A systematic and cohesive
framework is also increasingly necessary to incorporate new techniques like
discriminative and generative models to close the gap. This paper contributes
to the discourse on XAI by presenting an empirical evaluation based on a novel
framework: Sampling - Variational Auto Encoder (VAE) - Ensemble Anomaly
Detection (SVEAD). It is a hybrid architecture where VAE combined with ensemble
stacking and SHapley Additive exPlanations are used for imbalanced
classification. The finding reveals that combining ensemble stacking, VAE, and
SHAP can. not only lead to better model performance but also provide an easily
explainable framework. This work has used SHAP combined with Permutation
Importance and Individual Conditional Expectations to create a powerful
interpretability of the model. The finding has an important implication in the
real world, where the need for XAI is paramount to boost confidence in AI
applications.Comment: 8 pages, 10 figures, IEEE conference (IEIT 2023
- …