141 research outputs found

    Discovery and recognition of motion primitives in human activities

    Get PDF
    We present a novel framework for the automatic discovery and recognition of motion primitives in videos of human activities. Given the 3D pose of a human in a video, human motion primitives are discovered by optimizing the `motion flux', a quantity which captures the motion variation of a group of skeletal joints. A normalization of the primitives is proposed in order to make them invariant with respect to a subject anatomical variations and data sampling rate. The discovered primitives are unknown and unlabeled and are unsupervisedly collected into classes via a hierarchical non-parametric Bayes mixture model. Once classes are determined and labeled they are further analyzed for establishing models for recognizing discovered primitives. Each primitive model is defined by a set of learned parameters. Given new video data and given the estimated pose of the subject appearing on the video, the motion is segmented into primitives, which are recognized with a probability given according to the parameters of the learned models. Using our framework we build a publicly available dataset of human motion primitives, using sequences taken from well-known motion capture datasets. We expect that our framework, by providing an objective way for discovering and categorizing human motion, will be a useful tool in numerous research fields including video analysis, human inspired motion generation, learning by demonstration, intuitive human-robot interaction, and human behavior analysis

    Video characterization based on activity clustering

    Get PDF

    A preliminary study of micro-gestures:dataset collection and analysis with multi-modal dynamic networks

    Get PDF
    Abstract. Micro-gestures (MG) are gestures that people performed spontaneously during communication situations. A preliminary exploration of Micro-Gesture is made in this thesis. By collecting recorded sequences of body gestures in a spontaneous state during games, a MG dataset is built through Kinect V2. A novel term ‘micro-gesture’ is proposed by analyzing the properties of MG dataset. Implementations of two sets of neural network architectures are achieved for micro-gestures segmentation and recognition task, which are the DBN-HMM model and the 3DCNN-HMM model for skeleton data and RGB-D data respectively. We also explore a method for extracting neutral states used in the HMM structure by detecting the activity level of the gesture sequences. The method is simple to derive and implement, and proved to be effective. The DBN-HMM and 3DCNN-HMM architectures are evaluated on MG dataset and optimized for the properties of micro-gestures. Experimental results show that we are able to achieve micro-gesture segmentation and recognition with satisfied accuracy with these two models. The work we have done about the micro-gestures in this thesis also explores a new research path for gesture recognition. Therefore, we believe that our work could be widely used as a baseline for future research on micro-gestures

    Exploring the Applicability of Low‑Shot Learning in Mining Software Repositories

    Get PDF
    Background: Despite the well-documented and numerous recent successes of deep learning, the application of standard deep architectures to many classification problems within empirical software engineering remains problematic due to the large volumes of labeled data required for training. Here we make the argument that, for some problems, this hurdle can be overcome by taking advantage of low-shot learning in combination with simpler deep architectures that reduce the total number of parameters that need to be learned. Findings: We apply low-shot learning to the task of classifying UML class and sequence diagrams from Github, and demonstrate that surprisingly good performance can be achieved by using only tens or hundreds of examples for each category when paired with an appropriate architecture. Using a large, off-the-shelf architecture, on the other hand, doesn’t perform beyond random guessing even when trained on thousands of samples. Conclusion: Our findings suggest that identifying problems within empirical software engineering that lend themselves to low-shot learning could accelerate the adoption of deep learning algorithms within the empirical software engineering community

    Articulated motion discovery using pairs of trajectories

    Get PDF
    We propose an unsupervised approach for discovering characteristic motion patterns in videos of highly artic-ulated objects performing natural, unscripted behaviors, such as tigers in the wild. We discover consistent patterns in a bottom-up manner by analyzing the relative displace-ments of large numbers of ordered trajectory pairs through time, such that each trajectory is attached to a different moving part on the object. The pairs of trajectories de-scriptor relies entirely on motion and is more discriminative than state-of-the-art features that employ single trajecto-ries. Our method generates temporal video intervals, each automatically trimmed to one instance of the discovered behavior, and clusters them by type (e.g., running, turn-ing head, drinking water). We present experiments on two datasets: dogs from YouTube-Objects and a new dataset of National Geographic tiger videos. Results confirm that our proposed descriptor outperforms existing appearance- and trajectory-based descriptors (e.g., HOG and DTFs) on both datasets and enables us to segment unconstrained animal video into intervals containing single behaviors. 1

    Learning Hierarchical Representations For Video Analysis Using Deep Learning

    Get PDF
    With the exponential growth of the digital data, video content analysis (e.g., action, event recognition) has been drawing increasing attention from computer vision researchers. Effective modeling of the objects, scenes, and motions is critical for visual understanding. Recently there has been a growing interest in the bio-inspired deep learning models, which has shown impressive results in speech and object recognition. The deep learning models are formed by the composition of multiple non-linear transformations of the data, with the goal of yielding more abstract and ultimately more useful representations. The advantages of the deep models are three fold: 1) They learn the features directly from the raw signal in contrast to the hand-designed features. 2) The learning can be unsupervised, which is suitable for large data where labeling all the data is expensive and unpractical. 3) They learn a hierarchy of features one level at a time and the layerwise stacking of feature extraction, this often yields better representations. However, not many deep learning models have been proposed to solve the problems in video analysis, especially videos “in a wild”. Most of them are either dealing with simple datasets, or limited to the low-level local spatial-temporal feature descriptors for action recognition. Moreover, as the learning algorithms are unsupervised, the learned features preserve generative properties rather than the discriminative ones which are more favorable in the classification tasks. In this context, the thesis makes two major contributions. First, we propose several formulations and extensions of deep learning methods which learn hierarchical representations for three challenging video analysis tasks, including complex event recognition, object detection in videos and measuring action similarity. The proposed methods are extensively demonstrated for each work on the state-of-the-art challenging datasets. Besides learning the low-level local features, higher level representations are further designed to be learned in the context of applications. The data-driven concept representations and sparse representation of the events are learned for complex event recognition; the representations for object body parts iii and structures are learned for object detection in videos; and the relational motion features and similarity metrics between video pairs are learned simultaneously for action verification. Second, in order to learn discriminative and compact features, we propose a new feature learning method using a deep neural network based on auto encoders. It differs from the existing unsupervised feature learning methods in two ways: first it optimizes both discriminative and generative properties of the features simultaneously, which gives our features a better discriminative ability. Second, our learned features are more compact, while the unsupervised feature learning methods usually learn a redundant set of over-complete features. Extensive experiments with quantitative and qualitative results on the tasks of human detection and action verification demonstrate the superiority of our proposed models

    Deep Architectures for Visual Recognition and Description

    Get PDF
    In recent times, digital media contents are inherently of multimedia type, consisting of the form text, audio, image and video. Several of the outstanding computer Vision (CV) problems are being successfully solved with the help of modern Machine Learning (ML) techniques. Plenty of research work has already been carried out in the field of Automatic Image Annotation (AIA), Image Captioning and Video Tagging. Video Captioning, i.e., automatic description generation from digital video, however, is a different and complex problem altogether. This study compares various existing video captioning approaches available today and attempts their classification and analysis based on different parameters, viz., type of captioning methods (generation/retrieval), type of learning models employed, the desired output description length generated, etc. This dissertation also attempts to critically analyze the existing benchmark datasets used in various video captioning models and the evaluation metrics for assessing the final quality of the resultant video descriptions generated. A detailed study of important existing models, highlighting their comparative advantages as well as disadvantages are also included. In this study a novel approach for video captioning on the Microsoft Video Description (MSVD) dataset and Microsoft Video-to-Text (MSR-VTT) dataset is proposed using supervised learning techniques to train a deep combinational framework, for achieving better quality video captioning via predicting semantic tags. We develop simple shallow CNN (2D and 3D) as feature extractors, Deep Neural Networks (DNNs and Bidirectional LSTMs (BiLSTMs) as tag prediction models and Recurrent Neural Networks (RNNs) (LSTM) model as the language model. The aim of the work was to provide an alternative narrative to generating captions from videos via semantic tag predictions and deploy simpler shallower deep model architectures with lower memory requirements as solution so that it is not very memory extensive and the developed models prove to be stable and viable options when the scale of the data is increased. This study also successfully employed deep architectures like the Convolutional Neural Network (CNN) for speeding up automation process of hand gesture recognition and classification of the sign languages of the Indian classical dance form, ‘Bharatnatyam’. This hand gesture classification is primarily aimed at 1) building a novel dataset of 2D single hand gestures belonging to 27 classes that were collected from (i) Google search engine (Google images), (ii) YouTube videos (dynamic and with background considered) and (iii) professional artists under staged environment constraints (plain backgrounds). 2) exploring the effectiveness of CNNs for identifying and classifying the single hand gestures by optimizing the hyperparameters, and 3) evaluating the impacts of transfer learning and double transfer learning, which is a novel concept explored for achieving higher classification accuracy

    Robust Feature-Based Automated Multi-View Human Action Recognition System

    Full text link
    © 2013 IEEE. Automated human action recognition has the potential to play an important role in public security, for example, in relation to the multiview surveillance videos taken in public places, such as train stations or airports. This paper compares three practical, reliable, and generic systems for multiview video-based human action recognition, namely, the nearest neighbor classifier, Gaussian mixture model classifier, and the nearest mean classifier. To describe the different actions performed in different views, view-invariant features are proposed to address multiview action recognition. These features are obtained by extracting the holistic features from different temporal scales which are modeled as points of interest which represent the global spatial-temporal distribution. Experiments and cross-data testing are conducted on the KTH, WEIZMANN, and MuHAVi datasets. The system does not need to be retrained when scenarios are changed which means the trained database can be applied in a wide variety of environments, such as view angle or background changes. The experiment results show that the proposed approach outperforms the existing methods on the KTH and WEIZMANN datasets

    Behavior Discovery and Alignment of Articulated Object Classes from Unstructured Video

    Get PDF
    We propose an automatic system for organizing the content of a collection of unstructured videos of an articulated object class (e.g. tiger, horse). By exploiting the recurring motion patterns of the class across videos, our system: 1) identifies its characteristic behaviors; and 2) recovers pixel-to-pixel alignments across different instances. Our system can be useful for organizing video collections for indexing and retrieval. Moreover, it can be a platform for learning the appearance or behaviors of object classes from Internet video. Traditional supervised techniques cannot exploit this wealth of data directly, as they require a large amount of time-consuming manual annotations. The behavior discovery stage generates temporal video intervals, each automatically trimmed to one instance of the discovered behavior, clustered by type. It relies on our novel motion representation for articulated motion based on the displacement of ordered pairs of trajectories (PoTs). The alignment stage aligns hundreds of instances of the class to a great accuracy despite considerable appearance variations (e.g. an adult tiger and a cub). It uses a flexible Thin Plate Spline deformation model that can vary through time. We carefully evaluate each step of our system on a new, fully annotated dataset. On behavior discovery, we outperform the state-of-the-art Improved DTF descriptor. On spatial alignment, we outperform the popular SIFT Flow algorithm.Comment: 19 pages, 19 figure, 3 tables. arXiv admin note: substantial text overlap with arXiv:1411.788
    • 

    corecore