294,937 research outputs found

    Neural Information Processing Techniques for Skeleton-Based Action Recognition

    Get PDF
    Human action recognition is one of the core research problems in human-centered computing and computer vision. It lays the technical foundations for a wide range of applications, such as human-robot interaction, virtual reality, and sports analysis. Recently, skeleton-based action recognition, a subarea of action recognition, has been rapidly gaining attention. The task is to recognize actions from sequences of human articulation points. Compared with other data modalities, 3D human skeleton representations have many desirable characteristics, including succinctness, robustness, and racial impartiality. Current research on skeleton-based action recognition concentrates primarily on designing new spatial and temporal neural network operators to extract action features more thoroughly. In this thesis, by contrast, we aim to propose methods that can be compatibly equipped with existing approaches; that is, we seek to collaboratively strengthen current algorithms rather than compete with them. To this end, we propose five techniques and one large-scale human skeleton dataset.

    First, we fuse higher-order spatial features, in the form of angular encodings, into modern architectures to robustly capture the relationships between joints and body parts. Many skeleton-based action recognizers are confused by actions with similar motion trajectories; the proposed angular features robustly capture these relationships, achieving new state-of-the-art accuracy on two large benchmarks, NTU60 and NTU120, while employing fewer parameters and less run time.

    Second, we design two temporal accessories that help existing skeleton-based action recognizers capture motion patterns more richly. Specifically, the two proposed modules alleviate the adverse influence of signal noise and guide networks to explicitly capture the sequence's chronological order. These accessories enable a simple skeleton-based action recognizer to achieve new state-of-the-art (SOTA) accuracy on two large benchmark datasets.

    Third, we devise a new form of graph neural network as a potential backbone for extracting topological information from skeletonized human sequences. The proposed graph neural network learns relative positions between the nodes within a graph, substantially improving performance on various synthetic and real-world graph datasets while enjoying stable scalability.

    Fourth, we propose an information-theoretic technique to address imbalanced datasets, i.e., datasets whose distribution of class labels is non-uniform. The proposed method improves classification accuracy when the training dataset is imbalanced. Our result also provides an alternative view: neural network classifiers are mutual information estimators.

    Fifth, we present a neural crowdsourcing method for correcting human annotation errors. When annotating skeleton-based actions, human annotators may not reach a unanimous action category because skeleton motion trajectories of different actions can be ambiguous. The proposed method helps unify the differing annotations into a single label.

    Sixth, we collect ANUBIS, a large-scale human skeleton dataset, for benchmarking existing methods and for defining new problems on the path to commercializing skeleton-based action recognition. Using ANUBIS, we evaluate the performance of current skeleton-based action recognizers.

    At the end of this thesis, we summarize the proposed methods and pose four technical problems that may need to be solved before skeleton-based action recognition can be commercialized in practice.
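
    The angular encoding in the first contribution lends itself to a compact illustration. Below is a minimal sketch, not the thesis's exact formulation: for each hypothetical triple of joint indices (a, j, b), it computes the cosine of the angle at joint j formed by the segments toward the two reference joints a and b.

```python
import numpy as np

def angular_features(joints, triples):
    """joints: (J, 3) array of 3D joint positions.
    triples: list of (a, j, b) joint indices.
    Returns, per triple, the cosine of the angle at joint j
    formed by the segments j->a and j->b."""
    feats = np.zeros(len(triples))
    for i, (a, j, b) in enumerate(triples):
        v1 = joints[a] - joints[j]
        v2 = joints[b] - joints[j]
        denom = np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-8
        feats[i] = np.dot(v1, v2) / denom
    return feats

# Hypothetical example: elbow angle from shoulder(0), elbow(1), wrist(2).
joints = np.array([[0.0, 0.0, 0.0], [0.3, -0.2, 0.0], [0.55, 0.05, 0.0]])
print(angular_features(joints, [(0, 1, 2)]))
```

    Per-frame angular values like these can be stacked alongside the raw joint coordinates as extra input channels to an existing recognizer.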

    Discriminatively Trained Latent Ordinal Model for Video Classification

    Full text link
    We study the problem of video classification for facial analysis and human action recognition. We propose a novel weakly supervised learning method that models a video as a sequence of automatically mined, discriminative sub-events (e.g., the onset and offset phases for "smile", or running and jumping for "high jump"). The proposed model is inspired by recent work on Multiple Instance Learning and latent SVM/HCRF; it extends such frameworks to approximately model the ordinal aspect of videos. We obtain consistent improvements over relevant competitive baselines on four challenging, publicly available video-based facial analysis datasets for predicting expression, clinical pain, and intent in dyadic conversations, as well as on three challenging human action datasets. We also validate the method with qualitative results and show that they largely support the intuitions behind the method.
    Comment: Paper accepted in IEEE TPAMI. arXiv admin note: substantial text overlap with arXiv:1604.0150
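
    The ordinal constraint is the part a short sketch can make concrete. Assuming per-frame scores for K sub-events are already available (the learned scoring model itself is omitted), dynamic programming finds the best assignment that keeps the sub-events in chronological order:

```python
import numpy as np

def best_ordered_subevents(scores):
    """scores: (T, K) array; scores[t, k] rates frame t as sub-event k.
    Returns the best total score over assignments that pick one frame
    per sub-event with the chosen frames in chronological order."""
    T, K = scores.shape
    dp = np.full((T + 1, K + 1), -np.inf)
    dp[:, 0] = 0.0  # zero sub-events placed: score 0
    for t in range(1, T + 1):
        for k in range(1, K + 1):
            # Either frame t-1 is unused, or it hosts sub-event k-1.
            dp[t, k] = max(dp[t - 1, k],
                           dp[t - 1, k - 1] + scores[t - 1, k - 1])
    return dp[T, K]

# Toy example: 5 frames, 2 ordered sub-events (e.g. onset then offset).
print(best_ordered_subevents(np.array(
    [[0.1, 0.0], [0.9, 0.2], [0.3, 0.1], [0.2, 0.8], [0.0, 0.3]])))
```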

    From Traditional to Modern: Domain Adaptation for Action Classification in Short Social Video Clips

    Full text link
    Short internet video clips such as Vines exhibit a significantly wilder distribution than traditional video datasets. In this paper, we focus on the problem of unsupervised action classification in wild Vines using traditional labeled datasets. To this end, we use a simple domain adaptation strategy based on data augmentation. We utilise the semantic word2vec space as a common subspace into which to embed video features from both the labeled source domain and the unlabelled target domain. Our method incrementally augments the labeled source with target samples and iteratively modifies the embedding function to bring the source and target distributions together. Additionally, we utilise a multi-modal representation that incorporates the noisy semantic information available in the form of hash-tags. We show the effectiveness of this simple adaptation technique on a test set of Vines and achieve notable improvements in performance.
    Comment: 9 pages, GCPR, 201
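
    The incremental-augmentation loop can be sketched independently of the word2vec embedding. In the minimal version below, a plain logistic-regression classifier stands in for the paper's embedding function, features are assumed to be precomputed, and all names and parameters are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_training_adaptation(Xs, ys, Xt, rounds=5, top_frac=0.2):
    """Xs, ys: labeled source features/labels; Xt: unlabeled target features.
    Each round, pseudo-label the most confident target samples, add them
    to the training set, and refit the classifier."""
    X, y, remaining = Xs, ys, Xt
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    for _ in range(rounds):
        if len(remaining) == 0:
            break
        proba = clf.predict_proba(remaining)
        conf = proba.max(axis=1)
        k = max(1, int(top_frac * len(remaining)))
        idx = np.argsort(-conf)[:k]  # most confident target samples
        pseudo = clf.classes_[proba[idx].argmax(axis=1)]
        X = np.vstack([X, remaining[idx]])
        y = np.concatenate([y, pseudo])
        remaining = np.delete(remaining, idx, axis=0)
        clf = LogisticRegression(max_iter=1000).fit(X, y)
    return clf
```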

    Do less and achieve more: Training CNNs for action recognition utilizing action images from the Web

    Full text link
    Recently, attempts have been made to collect millions of videos to train Convolutional Neural Network (CNN) models for action recognition in videos. However, curating such large-scale video datasets requires immense human labor, and training CNNs on millions of videos demands huge computational resources. In contrast, collecting action images from the Web is much easier, and training on images requires much less computation. In addition, labeled web images tend to contain discriminative action poses, which highlight discriminative portions of a video's temporal progression. Through extensive experiments, we explore whether web action images can be utilized to train better CNN models for action recognition in videos. We collect 23.8K manually filtered images from the Web that depict the 101 actions in the UCF101 action video dataset. We show that utilizing web action images along with videos in training yields significant performance boosts for CNN models. We also investigate the scalability of the process by leveraging unfiltered crawled web images for UCF101 and ActivityNet. Using unfiltered images, we achieve performance improvements on par with using filtered images, which means we can further reduce annotation labor and easily scale up to larger problems. We also shed light on an artifact of fine-tuning CNN models that reduces the effective parameters of the CNN, and show that using web action images can significantly alleviate this problem.
    https://arxiv.org/pdf/1512.07155v1.pdf
    First author draft
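
    As a rough sketch of the training setup the abstract describes, the following PyTorch snippet mixes video frames and web action images into one training set. The directory paths are hypothetical, and a ResNet-18 stands in for whichever CNN the paper uses.

```python
import torch
from torch.utils.data import ConcatDataset, DataLoader
from torchvision import datasets, transforms, models

tfm = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])
# Hypothetical ImageFolder-style directories: one class folder per action.
video_frames = datasets.ImageFolder("data/ucf101_frames", transform=tfm)
web_images = datasets.ImageFolder("data/web_action_images", transform=tfm)
loader = DataLoader(ConcatDataset([video_frames, web_images]),
                    batch_size=64, shuffle=True)

model = models.resnet18(num_classes=101)  # 101 UCF101 action classes
opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
loss_fn = torch.nn.CrossEntropyLoss()

model.train()
for images, labels in loader:  # one epoch over the mixed dataset
    opt.zero_grad()
    loss = loss_fn(model(images), labels)
    loss.backward()
    opt.step()
```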

    A robust and efficient video representation for action recognition

    Get PDF
    This paper introduces a state-of-the-art video representation and applies it to efficient action recognition and detection. We first propose to improve the popular dense trajectory features by explicit camera motion estimation. More specifically, we extract feature point matches between frames using SURF descriptors and dense optical flow. The matches are used to estimate a homography with RANSAC. To improve the robustness of homography estimation, a human detector is employed to remove outlier matches on the human body, since human motion is not constrained by the camera. Trajectories consistent with the homography are considered to be due to camera motion and are thus removed. We also use the homography to cancel out camera motion from the optical flow, which yields significant improvements in the motion-based HOF and MBH descriptors. We further explore the recent Fisher vector as an alternative feature encoding to the standard bag-of-words histogram, and consider different ways to include spatial layout information in these encodings. We present a large and varied set of evaluations, considering (i) classification of short basic actions on six datasets, (ii) localization of such actions in feature-length movies, and (iii) large-scale recognition of complex events. We find that our improved trajectory features significantly outperform previous dense trajectories, and that Fisher vectors are superior to bag-of-words encodings for video recognition tasks. In all three tasks, we show substantial improvements over the state-of-the-art results.
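
    The camera-motion cancellation step can be sketched with OpenCV. In this minimal version, ORB matches stand in for the paper's SURF features, the human-detector masking of outlier matches is omitted, and Farneback flow stands in for the dense optical flow:

```python
import cv2
import numpy as np

def camera_compensated_flow(prev_gray, next_gray):
    """Returns optical flow with the homography-induced (camera) motion
    subtracted, leaving the residual foreground motion."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    # Feature matches between the two frames (ORB stands in for SURF).
    orb = cv2.ORB_create()
    kp1, des1 = orb.detectAndCompute(prev_gray, None)
    kp2, des2 = orb.detectAndCompute(next_gray, None)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(des1, des2)
    src = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    # Robust homography via RANSAC, as in the improved-trajectory pipeline.
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)
    # Flow that a pure camera motion (H) would induce at every pixel.
    h, w = prev_gray.shape
    xs, ys = np.meshgrid(np.arange(w), np.arange(h))
    grid = np.stack([xs, ys], axis=-1).astype(np.float32)
    warped = cv2.perspectiveTransform(grid.reshape(-1, 1, 2), H)
    cam_flow = warped.reshape(h, w, 2) - grid
    return flow - cam_flow
```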