425 research outputs found

    Spatiotemporal visual analysis of human actions

    No full text
    In this dissertation we propose four methods for the recognition of human activities. In all four of them, the representation of the activities is based on spatiotemporal features that are automatically detected at areas where there is a significant amount of independent motion, that is, motion that is due to ongoing activities in the scene. We propose the use of spatiotemporal salient points as features throughout this dissertation. The algorithms presented, however, can be used with any kind of features, as long as the latter are well localized and have a well-defined area of support in space and time. We introduce the utilized spatiotemporal salient points in the first method presented in this dissertation. By extending previous work on spatial saliency, we measure the variations in the information content of pixel neighborhoods both in space and time, and detect the points at the locations and scales for which this information content is locally maximized. In this way, an activity is represented as a collection of spatiotemporal salient points. We propose an iterative linear space-time warping technique in order to align the representations in space and time and propose to use Relevance Vector Machines (RVM) in order to classify each example into an action category. In the second method proposed in this dissertation we propose to enhance the acquired representations of the first method. More specifically, we propose to track each detected point in time, and create representations based on sets of trajectories, where each trajectory expresses how the information engulfed by each salient point evolves over time. In order to deal with imperfect localization of the detected points, we augment the observation model of the tracker with background information, acquired using a fully automatic background estimation algorithm. In this way, the tracker favors solutions that contain a large number of foreground pixels. In addition, we perform experiments where the tracked templates are localized on specific parts of the body, like the hands and the head, and we further augment the tracker’s observation model using a human skin color model. Finally, we use a variant of the Longest Common Subsequence algorithm (LCSS) in order to acquire a similarity measure between the resulting trajectory representations, and RVMs for classification. In the third method that we propose, we assume that neighboring salient points follow a similar motion. This is in contrast to the previous method, where each salient point was tracked independently of its neighbors. More specifically, we propose to extract a novel set of visual descriptors that are based on geometrical properties of three-dimensional piece-wise polynomials. The latter are fitted on the spatiotemporal locations of salient points that fall within local spatiotemporal neighborhoods, and are assumed to follow a similar motion. The extracted descriptors are invariant in translation and scaling in space-time. Coupling the neighborhood dimensions to the scale at which the corresponding spatiotemporal salient points are detected ensures the latter. The descriptors that are extracted across the whole dataset are subsequently clustered in order to create a codebook, which is used in order to represent the overall motion of the subjects within small temporal windows.Finally,we use boosting in order to select the most discriminative of these windows for each class, and RVMs for classification. The fourth and last method addresses the joint problem of localization and recognition of human activities depicted in unsegmented image sequences. Its main contribution is the use of an implicit representation of the spatiotemporal shape of the activity, which relies on the spatiotemporal localization of characteristic ensembles of spatiotemporal features. The latter are localized around automatically detected salient points. Evidence for the spatiotemporal localization of the activity is accumulated in a probabilistic spatiotemporal voting scheme. During training, we use boosting in order to create codebooks of characteristic feature ensembles for each class. Subsequently, we construct class-specific spatiotemporal models, which encode where in space and time each codeword ensemble appears in the training set. During testing, each activated codeword ensemble casts probabilistic votes concerning the spatiotemporal localization of the activity, according to the information stored during training. We use a Mean Shift Mode estimation algorithm in order to extract the most probable hypotheses from each resulting voting space. Each hypothesis corresponds to a spatiotemporal volume which potentially engulfs the activity, and is verified by performing action category classification with an RVM classifier

    Recognising activities by jointly modelling actions and their effects

    Get PDF
    With the rapid increase in adoption of consumer technologies, including inexpensive but powerful hardware, robotics appears poised at the cusp of widespread deployment in human environments. A key barrier that still prevents this is the machine understanding and interpretation of human activity, through a perceptual medium such as computer vision, or RBG-D sensing such as with the Microsoft Kinect sensor. This thesis contributes novel video-based methods for activity recognition. Specifically, the focus is on activities that involve interactions between the human user and objects in the environment. Based on streams of poses and object tracking, machine learning models are provided to recognize various of these interactions. The thesis main contributions are (1) a new model for interactions that explicitly learns the human-object relationships through a latent distributed representation, (2) a practical framework for labeling chains of manipulation actions in temporally extended activities and (3) an unsupervised sequence segmentation technique that relies on slow feature analysis and spectral clustering. These techniques are validated by experiments with publicly available data sets, such as the Cornell CAD-120 activity corpus which is one of the most extensive publicly available such data sets that is also annotated with ground truth information. Our experiments demonstrate the advantages of the proposed methods, over and above state of the art alternatives from the recent literature on sequence classifiers

    Multiple Instance Learning: A Survey of Problem Characteristics and Applications

    Full text link
    Multiple instance learning (MIL) is a form of weakly supervised learning where training instances are arranged in sets, called bags, and a label is provided for the entire bag. This formulation is gaining interest because it naturally fits various problems and allows to leverage weakly labeled data. Consequently, it has been used in diverse application fields such as computer vision and document classification. However, learning from bags raises important challenges that are unique to MIL. This paper provides a comprehensive survey of the characteristics which define and differentiate the types of MIL problems. Until now, these problem characteristics have not been formally identified and described. As a result, the variations in performance of MIL algorithms from one data set to another are difficult to explain. In this paper, MIL problem characteristics are grouped into four broad categories: the composition of the bags, the types of data distribution, the ambiguity of instance labels, and the task to be performed. Methods specialized to address each category are reviewed. Then, the extent to which these characteristics manifest themselves in key MIL application areas are described. Finally, experiments are conducted to compare the performance of 16 state-of-the-art MIL methods on selected problem characteristics. This paper provides insight on how the problem characteristics affect MIL algorithms, recommendations for future benchmarking and promising avenues for research

    Sampling - Variational Auto Encoder - Ensemble: In the Quest of Explainable Artificial Intelligence

    Full text link
    Explainable Artificial Intelligence (XAI) models have recently attracted a great deal of interest from a variety of application sectors. Despite significant developments in this area, there are still no standardized methods or approaches for understanding AI model outputs. A systematic and cohesive framework is also increasingly necessary to incorporate new techniques like discriminative and generative models to close the gap. This paper contributes to the discourse on XAI by presenting an empirical evaluation based on a novel framework: Sampling - Variational Auto Encoder (VAE) - Ensemble Anomaly Detection (SVEAD). It is a hybrid architecture where VAE combined with ensemble stacking and SHapley Additive exPlanations are used for imbalanced classification. The finding reveals that combining ensemble stacking, VAE, and SHAP can. not only lead to better model performance but also provide an easily explainable framework. This work has used SHAP combined with Permutation Importance and Individual Conditional Expectations to create a powerful interpretability of the model. The finding has an important implication in the real world, where the need for XAI is paramount to boost confidence in AI applications.Comment: 8 pages, 10 figures, IEEE conference (IEIT 2023
    • …
    corecore