182,790 research outputs found

    Persistent Homology of Attractors For Action Recognition

    Full text link
    In this paper, we propose a novel framework for dynamical analysis of human actions from 3D motion capture data using topological data analysis. We model human actions using the topological features of the attractor of the dynamical system. We reconstruct the phase-space of time series corresponding to actions using time-delay embedding, and compute the persistent homology of the phase-space reconstruction. In order to better represent the topological properties of the phase-space, we incorporate the temporal adjacency information when computing the homology groups. The persistence of these homology groups encoded using persistence diagrams are used as features for the actions. Our experiments with action recognition using these features demonstrate that the proposed approach outperforms other baseline methods.Comment: 5 pages, Under review in International Conference on Image Processin

    Multi-Label Zero-Shot Human Action Recognition via Joint Latent Ranking Embedding

    Get PDF
    Human action recognition refers to automatic recognizing human actions from a video clip. In reality, there often exist multiple human actions in a video stream. Such a video stream is often weakly-annotated with a set of relevant human action labels at a global level rather than assigning each label to a specific video episode corresponding to a single action, which leads to a multi-label learning problem. Furthermore, there are many meaningful human actions in reality but it would be extremely difficult to collect/annotate video clips regarding all of various human actions, which leads to a zero-shot learning scenario. To the best of our knowledge, there is no work that has addressed all the above issues together in human action recognition. In this paper, we formulate a real-world human action recognition task as a multi-label zero-shot learning problem and propose a framework to tackle this problem in a holistic way. Our framework holistically tackles the issue of unknown temporal boundaries between different actions for multi-label learning and exploits the side information regarding the semantic relationship between different human actions for knowledge transfer. Consequently, our framework leads to a joint latent ranking embedding for multi-label zero-shot human action recognition. A novel neural architecture of two component models and an alternate learning algorithm are proposed to carry out the joint latent ranking embedding learning. Thus, multi-label zero-shot recognition is done by measuring relatedness scores of action labels to a test video clip in the joint latent visual and semantic embedding spaces. We evaluate our framework with different settings, including a novel data split scheme designed especially for evaluating multi-label zero-shot learning, on two datasets: Breakfast and Charades. The experimental results demonstrate the effectiveness of our framework.Comment: 27 pages, 10 figures and 7 tables. Technical report submitted to a journal. More experimental results/references were added and typos were correcte

    Memory-Augmented Temporal Dynamic Learning for Action Recognition

    Full text link
    Human actions captured in video sequences contain two crucial factors for action recognition, i.e., visual appearance and motion dynamics. To model these two aspects, Convolutional and Recurrent Neural Networks (CNNs and RNNs) are adopted in most existing successful methods for recognizing actions. However, CNN based methods are limited in modeling long-term motion dynamics. RNNs are able to learn temporal motion dynamics but lack effective ways to tackle unsteady dynamics in long-duration motion. In this work, we propose a memory-augmented temporal dynamic learning network, which learns to write the most evident information into an external memory module and ignore irrelevant ones. In particular, we present a differential memory controller to make a discrete decision on whether the external memory module should be updated with current feature. The discrete memory controller takes in the memory history, context embedding and current feature as inputs and controls information flow into the external memory module. Additionally, we train this discrete memory controller using straight-through estimator. We evaluate this end-to-end system on benchmark datasets (UCF101 and HMDB51) of human action recognition. The experimental results show consistent improvements on both datasets over prior works and our baselines.Comment: The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19

    SwingBot: Learning Physical Features from In-hand Tactile Exploration for Dynamic Swing-up Manipulation

    Full text link
    Several robot manipulation tasks are extremely sensitive to variations of the physical properties of the manipulated objects. One such task is manipulating objects by using gravity or arm accelerations, increasing the importance of mass, center of mass, and friction information. We present SwingBot, a robot that is able to learn the physical features of a held object through tactile exploration. Two exploration actions (tilting and shaking) provide the tactile information used to create a physical feature embedding space. With this embedding, SwingBot is able to predict the swing angle achieved by a robot performing dynamic swing-up manipulations on a previously unseen object. Using these predictions, it is able to search for the optimal control parameters for a desired swing-up angle. We show that with the learned physical features our end-to-end self-supervised learning pipeline is able to substantially improve the accuracy of swinging up unseen objects. We also show that objects with similar dynamics are closer to each other on the embedding space and that the embedding can be disentangled into values of specific physical properties.Comment: IROS 202

    REPRESENTATION LEARNING FOR ACTION RECOGNITION

    Get PDF
    The objective of this research work is to develop discriminative representations for human actions. The motivation stems from the fact that there are many issues encountered while capturing actions in videos like intra-action variations (due to actors, viewpoints, and duration), inter-action similarity, background motion, and occlusion of actors. Hence, obtaining a representation which can address all the variations in the same action while maintaining discrimination with other actions is a challenging task. In literature, actions have been represented either using either low-level or high-level features. Low-level features describe the motion and appearance in small spatio-temporal volumes extracted from a video. Due to the limited space-time volume used for extracting low-level features, they are not able to account for viewpoint and actor variations or variable length actions. On the other hand, high-level features handle variations in actors, viewpoints, and duration but the resulting representation is often high-dimensional which introduces the curse of dimensionality. In this thesis, we propose new representations for describing actions by combining the advantages of both low-level and high-level features. Specifically, we investigate various linear and non-linear decomposition techniques to extract meaningful attributes in both high-level and low-level features. In the first approach, the sparsity of high-level feature descriptors is leveraged to build action-specific dictionaries. Each dictionary retains only the discriminative information for a particular action and hence reduces inter-action similarity. Then, a sparsity-based classification method is proposed to classify the low-rank representation of clips obtained using these dictionaries. We show that this representation based on dictionary learning improves the classification performance across actions. Also, a few of the actions consist of rapid body deformations that hinder the extraction of local features from body movements. Hence, we propose to use a dictionary which is trained on convolutional neural network (CNN) features of the human body in various poses to reliably identify actors from the background. Particularly, we demonstrate the efficacy of sparse representation in the identification of the human body under rapid and substantial deformation. In the first two approaches, sparsity-based representation is developed to improve discriminability using class-specific dictionaries that utilize action labels. However, developing an unsupervised representation of actions is more beneficial as it can be used to both recognize similar actions and localize actions. We propose to exploit inter-action similarity to train a universal attribute model (UAM) in order to learn action attributes (common and distinct) implicitly across all the actions. Using maximum aposteriori (MAP) adaptation, a high-dimensional super action-vector (SAV) for each clip is extracted. As this SAV contains redundant attributes of all other actions, we use factor analysis to extract a novel lowvi dimensional action-vector representation for each clip. Action-vectors are shown to suppress background motion and highlight actions of interest in both trimmed and untrimmed clips that contributes to action recognition without the help of any classifiers. It is observed during our experiments that action-vector cannot effectively discriminate between actions which are visually similar to each other. Hence, we subject action-vectors to supervised linear embedding using linear discriminant analysis (LDA) and probabilistic LDA (PLDA) to enforce discrimination. Particularly, we show that leveraging complimentary information across action-vectors using different local features followed by discriminative embedding provides the best classification performance. Further, we explore non-linear embedding of action-vectors using Siamese networks especially for fine-grained action recognition. A visualization of the hidden layer output in Siamese networks shows its ability to effectively separate visually similar actions. This leads to better classification performance than linear embedding on fine-grained action recognition. All of the above approaches are presented on large unconstrained datasets with hundreds of examples per action. However, actions in surveillance videos like snatch thefts are difficult to model because of the diverse variety of scenarios in which they occur and very few labeled examples. Hence, we propose to utilize the universal attribute model (UAM) trained on large action datasets to represent such actions. Specifically, we show that there are similarities between certain actions in the large datasets with snatch thefts which help in extracting a representation for snatch thefts using the attributes from the UAM. This representation is shown to be effective in distinguishing snatch thefts from regular actions with high accuracy.In summary, this thesis proposes both supervised and unsupervised approaches for representing actions which provide better discrimination than existing representations. The first approach presents a dictionary learning based sparse representation for effective discrimination of actions. Also, we propose a sparse representation for the human body based on dictionaries in order to recognize actions with rapid body deformations. In the next approach, a low-dimensional representation called action-vector for unsupervised action recognition is presented. Further, linear and non-linear embedding of action-vectors is proposed for addressing inter-action similarity and fine-grained action recognition, respectively. Finally, we propose a representation for locating snatch thefts among thousands of regular interactions in surveillance videos

    Combining Text and Image Knowledge with GANs for Zero-Shot Action Recognition in Videos

    Get PDF
    The recognition of actions in videos is an active research area in machine learning, relevant to multiple domains such as health monitoring, security and social media analysis. Zero-Shot Action Recognition (ZSAR) is a challenging problem in which models are trained to identify action classes that have not been seen during the training process. According to the literature, the most promising ZSAR approaches make use of Generative Adversarial Networks (GANs). GANs can synthesise visual embeddings for unseen classes conditioned on either textual information or images related to the class labels. In this paper, we propose a Dual-GAN approach based on the VAEGAN model to prove that the fusion of visual and textual-based knowledge sources is an effective way to improve ZSAR performance. We conduct empirical ZSAR experiments of our approach on the UCF101 dataset. We apply the following embedding fusion methods for combining text-driven and image-driven information: averaging, summation, maximum, and minimum. Our best result from Dual-GAN model is achieved with the maximum embedding fusion approach that results in an average accuracy of 46.37%, which is improved by 5.37% at least compared to the leading approaches
    corecore