
    An Ensemble of Knowledge Sharing Models for Dynamic Hand Gesture Recognition

    The focus of this paper is dynamic gesture recognition in the context of the interaction between humans and machines. We propose a model consisting of two sub-networks: a transformer and an ordered-neuron long short-term memory (ON-LSTM) based recurrent neural network (RNN). Each sub-network is trained to perform gesture recognition using only skeleton joints. Since each sub-network extracts different types of features due to the difference in architecture, knowledge can be shared between the sub-networks. Through knowledge distillation, the features and predictions from each sub-network are fused into a new fusion classifier. In addition, a cyclical learning rate can be used to generate a series of models that are combined in an ensemble, in order to yield a more generalizable prediction. The proposed ensemble of knowledge-sharing models achieves an overall accuracy of 86.11% using only skeleton information, as tested on the Dynamic Hand Gesture-14/28 dataset. (Accepted at the International Joint Conference on Neural Networks.)
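
    A minimal sketch of the two-branch knowledge-sharing idea, assuming PyTorch. A plain nn.LSTM stands in for the ON-LSTM (which is not in the standard library), the fusion classifier operates on the concatenated branch features, and a soft-target distillation term ties the fusion head to each branch's predictions (one plausible direction for the knowledge transfer). All layer sizes, joint counts, and the loss weighting are illustrative guesses, not the paper's configuration; the cyclical-learning-rate snapshot ensemble is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_JOINTS, COORDS, NUM_CLASSES, DIM = 22, 3, 14, 128  # illustrative sizes

class TransformerBranch(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Linear(NUM_JOINTS * COORDS, DIM)
        layer = nn.TransformerEncoderLayer(d_model=DIM, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(DIM, NUM_CLASSES)

    def forward(self, x):                       # x: (batch, frames, joints*coords)
        feats = self.encoder(self.embed(x)).mean(dim=1)  # temporal average pooling
        return feats, self.head(feats)

class RecurrentBranch(nn.Module):                # stand-in for the ON-LSTM branch
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(NUM_JOINTS * COORDS, DIM, batch_first=True)
        self.head = nn.Linear(DIM, NUM_CLASSES)

    def forward(self, x):
        out, _ = self.lstm(x)
        feats = out[:, -1]                       # last hidden state as clip feature
        return feats, self.head(feats)

class FusionClassifier(nn.Module):
    """Classifies from the concatenated features of both sub-networks."""
    def __init__(self):
        super().__init__()
        self.head = nn.Linear(2 * DIM, NUM_CLASSES)

    def forward(self, f1, f2):
        return self.head(torch.cat([f1, f2], dim=-1))

def distillation_loss(student_logits, teacher_logits, T=2.0):
    # Soft-target KL divergence, the usual knowledge-distillation objective.
    return F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1).detach(),
                    reduction="batchmean") * T * T

x = torch.randn(8, 30, NUM_JOINTS * COORDS)     # 8 clips, 30 frames of joints
y = torch.randint(0, NUM_CLASSES, (8,))
t_branch, r_branch, fusion = TransformerBranch(), RecurrentBranch(), FusionClassifier()
f1, p1 = t_branch(x)
f2, p2 = r_branch(x)
p_fused = fusion(f1, f2)
loss = (F.cross_entropy(p_fused, y)
        + distillation_loss(p_fused, p1) + distillation_loss(p_fused, p2))
loss.backward()
```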

    Real-time human action recognition using raw depth video-based recurrent neural networks

    This work proposes and compares two different approaches for real-time human action recognition (HAR) from raw depth video sequences. Both proposals are based on the convolutional long short-term memory unit (ConvLSTM), with differences in the architecture and in how long-term dependencies are learned. The former uses a video-length-adaptive input data generator (stateless), whereas the latter exploits the stateful ability of general recurrent neural networks, applied here to the particular case of HAR. The stateful property allows the model to accumulate discriminative patterns from previous frames without compromising computer memory. Furthermore, since the proposal uses only depth information, HAR is carried out while preserving the privacy of people in the scene, since their identities cannot be recognized. Both neural networks have been trained and tested using the large-scale NTU RGB+D dataset. Experimental results show that the proposed models achieve competitive recognition accuracies at lower computational cost than state-of-the-art methods, and prove that, in the particular case of videos, the rarely used stateful mode of recurrent neural networks significantly improves the accuracy obtained with the standard mode. The recognition accuracies obtained are 75.26% (CS) and 75.45% (CV) for the stateless model, with an average time consumption of 0.21 s per video, and 80.43% (CS) and 79.91% (CV) with 0.89 s for the stateful one. Funding: Agencia Estatal de Investigación; Universidad de Alcalá.
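
    A minimal sketch of the stateless vs. stateful distinction, assuming PyTorch. An nn.LSTM over flattened per-frame depth features stands in for the paper's ConvLSTM, and the chunk and feature sizes are illustrative, not the authors' values. The key point is that the stateful variant carries the hidden state across fixed-size chunks (detaching it each step so memory stays bounded), letting discriminative patterns accumulate across the whole video.

```python
import torch
import torch.nn as nn

FEAT, HIDDEN, CHUNK = 256, 128, 16
lstm = nn.LSTM(FEAT, HIDDEN, batch_first=True)

video = torch.randn(1, 96, FEAT)              # one long depth video, 96 frames

# Stateless: the whole (length-adaptive) sequence is processed in one pass,
# and the hidden state is discarded between videos.
out, _ = lstm(video)
stateless_clip_feature = out[:, -1]

# Stateful: the video is fed in fixed-size chunks while the hidden state is
# carried over (and detached, so memory does not grow with video length).
state = None
for t in range(0, video.size(1), CHUNK):
    out, state = lstm(video[:, t:t + CHUNK], state)
    state = tuple(s.detach() for s in state)  # truncate backprop, keep the pattern
stateful_clip_feature = out[:, -1]
```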

    High Level Learning Using the Temporal Features of Human Demonstrated Sequential Tasks

    Modelling human-led demonstrations of high-level sequential tasks is fundamental to a number of practical inference applications, including vision-based policy learning and activity recognition. Demonstrations of these tasks are captured as videos with long durations and similar spatial contents. Learning from this data is challenging since inference cannot be conducted solely on spatial feature presence and must instead consider how spatial features play out across time. To be successful, these temporal representations must generalize to variations in the duration of activities and be able to capture relationships between events expressed across the scale of an entire video. Contemporary deep learning architectures that represent time (convolutional and recurrent neural networks) do not address these concerns. Representations learned by these models describe temporal features in terms of fixed durations such as minutes, seconds, and frames. They are also developed sequentially and must use unreasonably large models to capture temporal features expressed at scale. Probabilistic temporal models have been successful in representing the temporal information of videos in a duration-invariant manner that is robust to scale; however, this has only been accomplished through the use of user-defined spatial features. Such abstractions make unrealistic assumptions about the content being expressed in these videos and the quality of the perception model, and they also limit the potential applications of trained models. To that end, I present D-ITR-L, a temporal wrapper that extends the spatial features extracted from a typical CNN architecture and transforms them into temporal features. D-ITR-L-derived temporal features are duration invariant and can identify temporal relationships between events at the scale of a full video. Validation of this claim is conducted through various vision-based policy learning and action recognition settings. Additionally, these studies show that challenging visual domains such as human-led demonstrations of high-level sequential tasks can be effectively represented when using a D-ITR-L-based model.
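
    A heavily hedged sketch of the duration-invariance idea the abstract describes; the internals of D-ITR-L itself are not given here, so this illustrates the general interval-temporal-relation approach instead. Per-frame feature traces (e.g., from a CNN backbone) are binarized into activation intervals, and each pair of intervals is labeled with an Allen-style relation, which depends only on the ordering of interval endpoints, never on absolute durations or frame rates. All names and thresholds below are hypothetical.

```python
import numpy as np

def to_interval(trace, thresh=0.5):
    """Return (start, end) of the first activation interval of a feature trace."""
    on = np.flatnonzero(trace > thresh)
    return (int(on[0]), int(on[-1])) if on.size else None

def allen_relation(a, b):
    """A few of Allen's interval relations between intervals a and b."""
    if a[1] < b[0]:
        return "before"
    if a[0] == b[0] and a[1] == b[1]:
        return "equals"
    if a[0] < b[0] and a[1] > b[1]:
        return "contains"
    if a[0] < b[0] <= a[1] < b[1]:
        return "overlaps"
    return "other"

# Two hypothetical per-frame feature activations from a perception model.
slow = np.array([0, 1, 1, 1, 1, 1, 0, 0])   # e.g., "reach for object", long
fast = np.array([0, 0, 0, 1, 1, 0, 0, 0])   # e.g., "rotate wrist", brief, inside it
print(allen_relation(to_interval(slow), to_interval(fast)))   # -> contains
# Stretching both traces to any length preserves the relation: the encoding is
# duration invariant, which is the property the abstract attributes to D-ITR-L.
```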

    3DFCNN: real-time action recognition using 3D deep neural networks with raw depth information

    This work describes an end-to-end approach for real-time human action recognition from raw depth image sequences. The proposal is based on a 3D fully convolutional neural network, named 3DFCNN, which automatically encodes spatio-temporal patterns from raw depth sequences. The described 3D-CNN allows action classification from the spatially and temporally encoded information of depth sequences. The use of depth data ensures that action recognition is carried out while protecting people's privacy, since their identities cannot be recognized from these data. The proposed 3DFCNN has been optimized to reach good accuracy while working in real time. It has then been evaluated and compared with other state-of-the-art systems on three widely used public datasets with different characteristics, demonstrating that 3DFCNN outperforms all the non-DNN-based state-of-the-art methods, with a maximum accuracy of 83.6%, and obtains results comparable to the DNN-based approaches while maintaining a much lower computational cost of 1.09 seconds, which significantly increases its applicability in real-world environments. Funding: Agencia Estatal de Investigación; Universidad de Alcalá.
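
    A minimal sketch of a 3D fully convolutional classifier over raw depth clips, assuming PyTorch; layer counts and channel widths are illustrative guesses, not the published 3DFCNN configuration. Being fully convolutional (a 1x1x1 convolution plus global pooling instead of flatten + linear layers) is what lets such a network handle varying clip sizes and keep the parameter count, and hence inference cost, low.

```python
import torch
import torch.nn as nn

class Tiny3DFCN(nn.Module):
    def __init__(self, num_classes=60):       # NTU RGB+D has 60 action classes
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),                  # halves time, height, and width
            nn.Conv3d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),
        )
        # Fully convolutional head: 1x1x1 convolution instead of a linear layer.
        self.classifier = nn.Conv3d(32, num_classes, kernel_size=1)

    def forward(self, x):                     # x: (batch, 1, frames, H, W) depth clip
        x = self.classifier(self.features(x))
        return x.mean(dim=(2, 3, 4))          # global average pool -> class logits

clip = torch.randn(2, 1, 16, 64, 64)          # two 16-frame raw depth clips
print(Tiny3DFCN()(clip).shape)                # torch.Size([2, 60])
```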

    Cubaneo In Latin Piano: A Parametric Approach To Gesture, Texture, And Motivic Variation

    Over the past century of recorded evidence, Cuban popular music has undergone great stylistic changes, especially regarding the piano tumbao. Hybridity in the Cuban/Latin context has taken place on different levels and to varying extents, involving instruments, genres, melody, harmony, rhythm, and musical structures. This hybridity has involved melding, fusing, borrowing, repurposing, adopting, adapting, and substituting. But quantifying and pinpointing these processes has been difficult because each variable or parameter embodies a history and a walking archive of sonic aesthetics. In an attempt to classify and quantify the precise parameters involved in hybridity, this dissertation presents a paradigmatic model, organizing music into vocabularies, repertories, and abstract procedures. Cuba's pianistic vocabularies are used very interactively, depending on genre, composite ensemble texture, vocal timbre, performing venue, and personal taste. These vocabularies include melodic phrases, harmonic progressions, rhythmic cells, and variation schemes that replace repetition with methodical elaboration of the piano tumbao as a main theme. These pianistic vocabularies comprise what we actually hear. Repertories, such as pre-composed songs, ensemble arrangements, and open-ended montuno and solo sections, situate and contextualize what we hear in real-life musical performances. Abstract procedures are the thoughts, aesthetics, intentions, and parametric rules governing what Cuban/Latin pianists consider possible. Abstract procedures alter vocabularies by displacing, expanding, contracting, recombining, permuting, and layering them. As Cuba's popular musics find homes in its musical diaspora (the United States, Latin America, and Europe), Cuban pianists have sought to differentiate their craft from global salsa and Latin jazz pianists. Expanding the piano's gestural/textural vocabulary beyond pre-Revolutionary traditions and performance practices, the timba piano tumbao is a powerful marker of Cuban identity and musical pride, transcending national borders and cultural boundaries.