
    Recognising and localising human actions

    Human action recognition in challenging video data is becoming an increasingly important research area. Given the growing number of cameras and robots pointing their lenses at humans, the need for automatic recognition of human actions arises, promising Google-style video search and automatic video summarisation/description. Furthermore, for any autonomous robotic system to interact with humans, it must first be able to understand and quickly react to human actions. Although the best action classification methods aggregate features from the entire video clip in which the action unfolds, this global representation may include irrelevant scene context and movements which are shared amongst multiple action classes. For example, a waving action may be performed whilst walking; however, if the walking movement appears in distinct action classes, then it should not be included in training a waving movement classifier. For this reason, we propose an action classification framework in which more discriminative action subvolumes are learned in a weakly supervised setting, owing to the difficulty of manually labelling massive video datasets. The learned models are used to simultaneously classify video clips and to localise actions to a given space-time subvolume. Each subvolume is cast as a bag-of-features (BoF) instance in a multiple-instance-learning framework, which in turn is used to learn its class membership. We demonstrate quantitatively that even with single fixed-sized subvolumes, the classification performance of our proposed algorithm is superior to our BoF baseline on the majority of performance measures, and shows promise for space-time action localisation on the most challenging video datasets. Exploiting spatio-temporal structure in the video should also improve results, just as deformable part models have proven highly successful in object recognition.
However, whereas objects have clear boundaries, which means we can easily define a ground truth for initialisation, 3D space-time actions are inherently ambiguous and expensive to annotate in large datasets. Thus, it is desirable to adapt pictorial star models to action datasets without location annotation, and to features invariant to changes in pose such as bag-of-features and Fisher vectors, rather than low-level HoG. We therefore propose local deformable spatial bag-of-features (LDSBoF), in which local discriminative regions are split into a fixed grid of parts that are allowed to deform in both space and time at test-time. In our experimental evaluation we demonstrate that by using local, deformable space-time action parts, we are able to achieve very competitive classification performance, whilst being able to localise actions even in the most challenging video datasets. A recent trend in action recognition is towards larger and more challenging datasets, an increasing number of action classes and larger visual vocabularies. For the global classification of human action video clips, the bag-of-visual-words pipeline is currently the best performing. However, the strategies chosen to sample features and construct a visual vocabulary are critical to performance, in fact often dominating performance. Thus, we provide a critical evaluation of various approaches to building a vocabulary and show that good practices do have a significant impact. By subsampling and partitioning features strategically, we are able to achieve state-of-the-art results on 5 major action recognition datasets using relatively small visual vocabularies. Another promising approach to recognising human actions first encodes the action sequence via a generative dynamical model. However, using classical distances for their classification does not necessarily deliver good results.
Therefore we propose a general framework for learning distance functions between dynamical models, given a training set of labelled videos. The optimal distance function is selected from a family of 'pullback' distances, induced by a parametrised mapping of the space of models. We focus here on hidden Markov models and their model space, and show how pullback distance learning greatly improves action recognition performance with respect to base distances. Finally, the action classification systems that use a single global representation for each video clip are tailored for offline batch classification benchmarks. For human-robot interaction, however, current systems fall short, either because they can only detect one human action per video frame, or because they assume the video is available ahead of time. In this work we propose an online human action detection system that can incrementally detect multiple concurrent space-time actions. In this way, it becomes possible to learn new action classes on-the-fly, allowing multiple people to actively teach and interact with a robot.
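
The bag-of-features representation referred to in this abstract can be illustrated with a minimal sketch. It assumes local descriptors have already been extracted from a clip or subvolume and that a visual vocabulary is available; the function name `bof_histogram`, the toy 2-D descriptors and the two-word vocabulary are hypothetical and are not taken from the thesis.

```python
def bof_histogram(descriptors, vocabulary):
    """Quantise descriptors to their nearest visual word and return a
    normalised histogram of word counts (a bag-of-features vector)."""
    counts = [0] * len(vocabulary)
    for d in descriptors:
        # Index of the visual word with the smallest squared Euclidean distance.
        nearest = min(
            range(len(vocabulary)),
            key=lambda k: sum((a - b) ** 2 for a, b in zip(d, vocabulary[k])),
        )
        counts[nearest] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * len(vocabulary)
    return [c / total for c in counts]


# Hypothetical toy data: two visual words, four local descriptors.
vocabulary = [(0.0, 0.0), (10.0, 10.0)]
descriptors = [(1.0, 0.0), (0.0, 1.0), (9.0, 10.0), (1.0, 1.0)]
histogram = bof_histogram(descriptors, vocabulary)
```

In a multiple-instance setting, each subvolume of a clip would contribute one such histogram as an instance; how the instances are scored and combined is specific to the thesis and is not shown here.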

    Metric learning for Parkinsonian identification from IMU gait measurements

    Diagnosis of people with mild Parkinson’s symptoms is difficult. Nevertheless, variations in gait pattern can be utilised for this purpose, when measured via Inertial Measurement Units (IMUs). Human gait, however, possesses a high degree of variability across individuals, and is subject to numerous nuisance factors. Therefore, off-the-shelf Machine Learning techniques may fail to classify it with the accuracy required in clinical trials. In this paper we propose a novel framework in which IMU gait measurement sequences sampled during a 10 metre walk are first encoded as hidden Markov models (HMMs) to extract their dynamics and provide a fixed-length representation. Given sufficient training samples, the distance between HMMs which optimises classification performance is learned and employed in a classical Nearest Neighbour classifier. Our tests demonstrate how this technique achieves accuracy of 85.51% over 156 people with Parkinson’s with a representative range of severity and 424 typically developed adults, which is the top performance achieved so far over a cohort of such size, based on single measurement outcomes. The method displays the potential for further improvement and a wider application to distinguish other conditions.

    Gesture Modeling by Hanklet-based Hidden Markov Model

    In this paper we propose a novel approach for gesture modeling. We aim at decomposing a gesture into sub-trajectories that are the output of a sequence of atomic linear time invariant (LTI) systems, and we use a Hidden Markov Model to model the transitions from one LTI system to another. For this purpose, we represent the human body motion in a temporal window as a set of body joint trajectories that we assume are the output of an LTI system. We describe the set of trajectories in a temporal window by the corresponding Hankel matrix (Hanklet), which embeds the observability matrix of the LTI system that produced it. We train a set of HMMs (one for each gesture class) with a discriminative approach. To account for the sharing of body motion templates we allow the HMMs to share the same state space. We demonstrate by means of experiments on two publicly available datasets that, even when considering only the trajectories of the 3D joints, our method achieves state-of-the-art accuracy while competing well with methods that employ more complex models and feature representations.
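
As a rough illustration of the Hankel matrix described in this abstract, the sketch below builds one from a single scalar trajectory inside a temporal window. This is a simplification: the paper works with sets of body joint trajectories, and any normalisation it applies is not reproduced. The function name `hankel_matrix` and the `num_rows` parameter are hypothetical.

```python
def hankel_matrix(trajectory, num_rows):
    """Build a Hankel matrix from a scalar trajectory.

    Entry (i, j) equals trajectory[i + j], so every anti-diagonal is constant.
    """
    num_cols = len(trajectory) - num_rows + 1
    if num_rows < 1 or num_cols < 1:
        raise ValueError("num_rows must be between 1 and len(trajectory)")
    return [
        [trajectory[i + j] for j in range(num_cols)]
        for i in range(num_rows)
    ]


# Hypothetical temporal window of five samples of one joint coordinate.
window = [1.0, 2.0, 3.0, 4.0, 5.0]
H = hankel_matrix(window, num_rows=3)
# H has three rows; each row is the window shifted by one sample.
```

For vector-valued samples (for example, stacked 3D joint positions), the same construction would yield a block Hankel matrix with one block per sample.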

    Data-free metrics for Dirichlet and generalized Dirichlet mixture-based HMMs - A practical study.

    Approaches to designing metrics between hidden Markov models (HMMs) can be divided into two classes: data-based and parameter-based. The latter has the clear advantage of being deterministic and faster, but only very few similarity measures that can be applied to mixture-based HMMs have been proposed so far. Most of these metrics apply to discrete or Gaussian HMMs and, to the best of our knowledge, no comparative study has been conducted. With the recent development of HMMs based on the Dirichlet and generalized Dirichlet distributions for proportional data modeling, we propose to design three new parametric similarity measures between these HMMs. Extensive experiments on synthetic data show the reliability of these new measures where the existing ones fail to give the expected results when some parameters vary. Illustrations on real data show the clustering capability of these measures and their potential applications.

    Combinatorial optimisation for arterial image segmentation.

    Cardiovascular disease is one of the leading causes of mortality in the western world. Many imaging modalities have been used to diagnose cardiovascular diseases. However, each has different forms of noise and artifacts that make the medical image analysis field important and challenging. This thesis is concerned with developing fully automatic segmentation methods for cross-sectional coronary arterial imaging, in particular intra-vascular ultrasound and optical coherence tomography, by incorporating prior and tracking information without any user intervention, to effectively overcome various image artifacts and occlusions. Combinatorial optimisation methods are proposed to solve the segmentation problem in polynomial time. A node-weighted directed graph is constructed so that the vessel border delineation is considered as computing a minimum closed set. A set of complementary edge and texture features is extracted. Single and double interface segmentation methods are introduced. Novel optimisation of the boundary energy function is proposed based on a supervised classification method. A shape prior model is incorporated into the segmentation framework based on global and local information through the energy function design and graph construction. A combination of cross-sectional segmentation and longitudinal tracking is proposed using the Kalman filter and the hidden Markov model. The border is parameterised using radial basis functions. The Kalman filter is used to adapt the inter-frame constraints between every two consecutive frames to obtain coherent temporal segmentation. An HMM-based border tracking method is also proposed in which the emission probability is derived from both the classification-based cost function and the shape prior model. The optimal sequence of the hidden states is computed using the Viterbi algorithm.
Both qualitative and quantitative results on thousands of images show superior performance of the proposed methods compared to a number of state-of-the-art segmentation methods.
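
The Viterbi step mentioned in this abstract can be sketched generically as follows. This is a textbook log-domain implementation on a toy two-state HMM, not the thesis's border tracker: the state names, observation symbols and all probabilities are hypothetical, whereas in the thesis the emission probabilities are derived from the classification-based cost function and the shape prior model.

```python
import math


def viterbi(observations, states, log_start, log_trans, log_emit):
    """Return the most likely hidden state sequence and its log-probability."""
    if not observations:
        raise ValueError("observations must not be empty")
    # best[t][s]: log-probability of the best path that ends in state s at time t.
    best = [{s: log_start[s] + log_emit[s][observations[0]] for s in states}]
    back = [{}]  # back[t][s]: predecessor of s on that best path
    for obs in observations[1:]:
        prev = best[-1]
        scores, pointers = {}, {}
        for s in states:
            p = max(states, key=lambda r: prev[r] + log_trans[r][s])
            pointers[s] = p
            scores[s] = prev[p] + log_trans[p][s] + log_emit[s][obs]
        best.append(scores)
        back.append(pointers)
    # Backtrack from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for pointers in reversed(back[1:]):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[-1][last]


def to_log(table):
    """Convert a dict (or nested dict) of probabilities to log-probabilities."""
    return {
        k: to_log(v) if isinstance(v, dict) else math.log(v)
        for k, v in table.items()
    }


# Hypothetical toy model with two hidden states and two observation symbols.
states = ["A", "B"]
log_start = to_log({"A": 0.5, "B": 0.5})
log_trans = to_log({"A": {"A": 0.7, "B": 0.3}, "B": {"A": 0.3, "B": 0.7}})
log_emit = to_log({"A": {"x": 0.9, "y": 0.1}, "B": {"x": 0.1, "y": 0.9}})

path, log_prob = viterbi(["x", "x", "y", "y"], states, log_start, log_trans, log_emit)
```

Working with log-probabilities avoids numerical underflow on long sequences, which matters when tracking across many consecutive frames.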

    Real-time activity recognition by discerning qualitative relationships between randomly chosen visual features

    In this paper, we present a novel method to explore semantically meaningful visual features and identify the discriminative spatiotemporal relationships between them for real-time activity recognition. Our approach infers human activities using continuous egocentric (first-person-view) videos of object manipulations in an industrial setup. In order to achieve this goal, we propose a random forest that unifies randomization, discriminative relationship mining and a Markov temporal structure. Discriminative relationship mining helps us to model relations that distinguish different activities, while randomization allows us to handle the large feature space and prevents over-fitting. The Markov temporal structure provides temporally consistent decisions during testing. The proposed random forest uses a discriminative Markov decision tree, where every nonterminal node is a discriminative classifier and the Markov structure is applied at leaf nodes. The proposed approach outperforms the state-of-the-art methods on a new challenging video dataset of assembling a pump system.

    Latent Topic Text Representation Learning on Statistical Manifolds

    The explosive growth of text data requires effective methods to represent and classify these texts. Many text learning methods have been proposed, such as statistics-based methods, semantic similarity methods, and deep learning methods. The statistics-based methods focus on comparing the substructure of text, which ignores the semantic similarity between different words. Semantic similarity methods learn a text representation by training word embeddings and representing text as the average vector of all words. However, these methods cannot capture the topic diversity of words and texts clearly. Recently, deep learning methods such as CNNs and RNNs have been studied. However, the vanishing gradient problem and the time complexity of parameter selection limit their applications. In this paper, we propose a novel and efficient text learning framework, named Latent Topic Text Representation Learning. Our method aims to provide an effective text representation and text measurement with latent topics. With the assumption that words on the same topic follow a Gaussian distribution, texts are represented as a mixture of topics, i.e., a Gaussian mixture model. Our framework is able to effectively measure text distance to perform text categorization tasks by leveraging statistical manifolds. Experimental results on text representation and classification, and topic coherence demonstrate the effectiveness of the proposed method.

    Non-Gaussian data modeling with hidden Markov models

    In 2015, 2.5 quintillion bytes of data were generated daily worldwide, of which 90% were unstructured data that do not follow any pre-defined model. These data can be found in a great variety of formats, among them texts, images, audio tracks, and videos. With appropriate techniques, this massive amount of data is a goldmine from which one can extract a variety of meaningful embedded information. Among those techniques, machine learning algorithms allow multiple processing possibilities, from compact data representation, to data clustering, classification, analysis, and synthesis, to the detection of outliers. Data modeling is the first step for performing any of these tasks, and the accuracy and reliability of this initial step are thus crucial for subsequently building up a complete data processing framework. The principal motivation behind my work is the over-use of the Gaussian assumption for data modeling in the literature. Though this assumption is probably the best to make when no information about the data to be modeled is available, in most cases studying a few data properties would make other distributions a better assumption. In this thesis, I focus on proportional data that are most commonly known in the form of histograms and that naturally arise in a number of situations such as in bag-of-words methods. These data are non-Gaussian and their modeling with distributions belonging to the Dirichlet family, which have common properties, is expected to be more accurate. The models I focus on are the hidden Markov models, well-known for their capabilities to easily handle dynamic ordered multivariate data. They have been shown to be very effective in numerous fields for various applications for the last 30 years and especially became a cornerstone in speech processing. Despite their extensive use in almost all computer vision areas, they are still mainly suited for Gaussian data modeling.
I propose here to theoretically derive different approaches for learning hidden Markov models based on mixtures of Dirichlet, generalized Dirichlet, and Beta-Liouville distributions, and on mixed data, and for applying them to real-world situations. Expectation-Maximization and variational learning approaches are studied and compared over several data sets, specifically for the task of detecting and localizing unusual events. Hybrid HMMs are proposed to model mixed data with the goal of detecting changes in satellite images corrupted by different noises. Finally, several parametric distances for comparing Dirichlet and generalized Dirichlet-based HMMs are proposed and extensively tested for assessing their robustness. My experimental results show situations in which such models are worth using, but also reveal their strengths and limitations.