
    Hierarchical long short-term memory for action recognition based on 3D skeleton joints from Kinect sensor

    Action recognition has been used in a wide range of applications such as human-computer interaction, intelligent video surveillance systems, video summarization, and robotics. Recognizing actions is important for intelligent agents to understand, learn from, and interact with their environment. Recent technology that allows the acquisition of RGB+D and 3D skeleton data, together with advances in deep learning models, has significantly increased the performance of action recognition models. In this research, a hierarchical Long Short-Term Memory (LSTM) network is proposed to recognize actions based on 3D skeleton joints from a Kinect sensor. The model takes the 3D axes of the skeleton joints and groups the joints in each axis into parts, namely the spine, left and right arm, left and right hand, and left and right leg. To fit the hierarchically structured layers of the LSTM, the parts are concatenated into spine, arms, hands, and legs, and then concatenated into a body. The model fuses the per-axis body representations into a single final body representation, which is fed to the final layer to classify the action. Performance is measured using cross-view and cross-subject evaluation, achieving accuracies of 0.854 and 0.837, respectively, on 10 action classes of the NTU RGB+D dataset.
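    As a rough illustration of the hierarchy this abstract describes, the PyTorch sketch below runs one LSTM per body part and a body-level LSTM over the concatenated part features. The part names, joint counts, layer sizes, and the single-level fusion (the paper additionally fuses per-axis body representations) are all assumptions for illustration, not the authors' exact architecture.

        import torch
        import torch.nn as nn

        class HierarchicalLSTM(nn.Module):
            # Illustrative joint grouping; the paper's exact split may differ.
            PARTS = {"spine": 4, "left_arm": 3, "right_arm": 3,
                     "left_hand": 2, "right_hand": 2, "left_leg": 3, "right_leg": 3}

            def __init__(self, hidden=64, num_classes=10):
                super().__init__()
                # One LSTM per part; per-frame input is joints_in_part * 3 coordinates.
                self.part_lstms = nn.ModuleDict({
                    name: nn.LSTM(n * 3, hidden, batch_first=True)
                    for name, n in self.PARTS.items()})
                # Body-level LSTM fuses the concatenated part features over time.
                self.body_lstm = nn.LSTM(hidden * len(self.PARTS), hidden, batch_first=True)
                self.classifier = nn.Linear(hidden, num_classes)

            def forward(self, parts):
                # parts: dict part name -> tensor (batch, time, joints_in_part * 3)
                feats = [self.part_lstms[name](parts[name])[0] for name in self.PARTS]
                out, _ = self.body_lstm(torch.cat(feats, dim=-1))
                return self.classifier(out[:, -1])   # classify from the last time step

        model = HierarchicalLSTM()
        batch = {name: torch.randn(2, 20, n * 3) for name, n in HierarchicalLSTM.PARTS.items()}
        print(model(batch).shape)   # torch.Size([2, 10])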

    Two-Stage Human Activity Recognition Using 2D-ConvNet

    There is a huge requirement for continuous intelligent monitoring systems for human activity recognition in various domains such as public places, automated teller machines, and the healthcare sector. The increasing demand for automatic recognition of human activity in these sectors, and the need to reduce the cost of manual surveillance, have motivated the research community towards deep learning techniques for designing and developing smart monitoring systems for the recognition of human activities. Because of the low cost, high resolution, and easy availability of surveillance cameras, the authors developed a new two-stage intelligent framework for the detection and recognition of human activity types inside the premises. This paper introduces a novel framework to recognize single-limb and multi-limb human activities using a Convolutional Neural Network. In the first stage, single-limb and multi-limb activities are separated. Next, these separated single-limb and multi-limb activities are recognized using sequence classification. For training and validation of the framework, the UTKinect-Action dataset was used, containing 199 action sequences performed by 10 users. The framework achieves an overall accuracy of 97.88% in real-time recognition of the activity sequences.
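    To make the two-stage idea concrete, here is a minimal PyTorch sketch in which a small per-frame 2D-ConvNet feeds a stage-1 gate (single-limb vs. multi-limb) and a GRU-based sequence classifier with one head per activity group. The layer sizes, gating strategy, and class counts are assumptions; the paper's actual networks are not specified here.

        import torch
        import torch.nn as nn

        class TwoStageRecognizer(nn.Module):
            def __init__(self, feat=128, n_single=5, n_multi=5):
                super().__init__()
                # Per-frame 2D-ConvNet feature extractor (sizes are illustrative).
                self.backbone = nn.Sequential(
                    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
                    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, feat), nn.ReLU())
                self.limb_gate = nn.Linear(feat, 2)           # stage 1: single- vs multi-limb
                self.gru = nn.GRU(feat, feat, batch_first=True)
                self.single_head = nn.Linear(feat, n_single)  # stage 2: sequence classification
                self.multi_head = nn.Linear(feat, n_multi)

            def forward(self, clip):                          # clip: (batch, time, 3, H, W)
                b, t = clip.shape[:2]
                feats = self.backbone(clip.flatten(0, 1)).view(b, t, -1)
                gate = self.limb_gate(feats.mean(dim=1))      # clip-level stage-1 decision
                seq, _ = self.gru(feats)
                last = seq[:, -1]
                return gate, self.single_head(last), self.multi_head(last)

        gate, single, multi = TwoStageRecognizer()(torch.randn(2, 8, 3, 64, 64))
        # Route each clip to the head selected by the stage-1 gate.
        labels = [s.argmax().item() if g.argmax() == 0 else m.argmax().item()
                  for g, s, m in zip(gate, single, multi)]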

    3DFCNN: real-time action recognition using 3D deep neural networks with raw depth information

    This work describes an end-to-end approach for real-time human action recognition from raw depth image sequences. The proposal is based on a 3D fully convolutional neural network, named 3DFCNN, which automatically encodes spatio-temporal patterns from raw depth sequences. The described 3D-CNN allows action classification from the spatially and temporally encoded information of depth sequences. The use of depth data ensures that action recognition is carried out while protecting people's privacy, since their identities cannot be recognized from these data. The proposed 3DFCNN has been optimized to reach good accuracy while working in real time. It has been evaluated and compared with other state-of-the-art systems on three widely used public datasets with different characteristics, demonstrating that 3DFCNN outperforms all the non-DNN-based state-of-the-art methods with a maximum accuracy of 83.6%, and obtains results comparable to the DNN-based approaches while maintaining a much lower computational cost of 1.09 seconds, which significantly increases its applicability in real-world environments.
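    The following PyTorch sketch shows what a 3D fully convolutional classifier over raw depth clips can look like: only Conv3d layers, with a 1x1x1 convolution plus global average pooling in place of dense layers. The channel widths, network depth, clip size, and class count are assumptions and do not reproduce the published 3DFCNN.

        import torch
        import torch.nn as nn

        class Simple3DFCNN(nn.Module):
            """Fully convolutional: no dense layers; a 1x1x1 conv acts as the classifier."""
            def __init__(self, num_classes=12):
                super().__init__()
                self.features = nn.Sequential(
                    nn.Conv3d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool3d(2),
                    nn.Conv3d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool3d(2),
                    nn.Conv3d(32, 64, 3, padding=1), nn.ReLU())
                self.head = nn.Conv3d(64, num_classes, 1)   # 1x1x1 classification conv

            def forward(self, x):
                # x: (batch, 1, frames, height, width) raw depth clip
                y = self.head(self.features(x))
                return y.mean(dim=(2, 3, 4))                # global average pool -> logits

        clip = torch.randn(2, 1, 16, 64, 64)
        print(Simple3DFCNN()(clip).shape)   # torch.Size([2, 12])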

    Deep Learning for Dense Interpretation of Video: Survey of Various Approach, Challenges, Datasets and Metrics

    Video interpretation has garnered considerable attention in the computer vision and natural language processing fields due to the rapid expansion of video data and the increasing demand for applications such as intelligent video search, automated video subtitling, and assistance for visually impaired individuals. However, video interpretation presents greater challenges because it involves both temporal and spatial information. While deep learning models for images, text, and audio have made significant progress, efforts have recently focused on developing deep networks for video interpretation. A thorough evaluation of current research is necessary to provide insights for future endeavors, considering the myriad techniques, datasets, features, and evaluation criteria available in the video domain. This study offers a survey of recent advancements in deep learning for dense video interpretation, addressing various datasets and the challenges they present, as well as key features in video interpretation. Additionally, it provides a comprehensive overview of the latest deep learning models in video interpretation, which have been instrumental in activity identification and video description or captioning. The paper compares the performance of several deep learning models in this field on specific metrics. Finally, the study summarizes future trends and directions in video interpretation.

    Human activity detection and action recognition in videos using convolutional neural networks

    Human activity recognition from video scenes has become a significant area of research in the field of computer vision applications. Action recognition is one of the most challenging problems in video analysis, and it finds applications in human-computer interaction, anomalous activity detection, crowd monitoring, and patient monitoring. Several approaches have been presented for human activity recognition using machine learning techniques. The main aim of this work is to detect and track human activity, and to classify actions, on two publicly available video databases. In this work, a novel approach to feature extraction from video sequences combining the Scale-Invariant Feature Transform (SIFT) and optical flow computation is used, where shape, gradient, and orientation features are also incorporated for robust feature formulation. Tracking of human activity in the video is implemented using a Gaussian Mixture Model. A Convolutional Neural Network based classification approach is used for training and testing on the databases. Activity recognition performance is evaluated on two public datasets, namely the Weizmann dataset and the Kungliga Tekniska Högskolan (KTH) dataset, with action recognition accuracies of 98.43% and 94.96%, respectively. Experimental and comparative studies show that the proposed approach outperforms state-of-the-art techniques.
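    A hedged OpenCV sketch of this kind of classical front end is shown below: a Gaussian Mixture background model isolates the moving person, SIFT supplies shape/gradient/orientation keypoints, and Farnebäck dense optical flow supplies motion cues. The video filename is hypothetical, and the paper's exact parameter choices and its CNN classifier stage are omitted.

        import cv2

        # Hypothetical input clip; the file name is an assumption.
        cap = cv2.VideoCapture("walking.avi")
        bg_model = cv2.createBackgroundSubtractorMOG2()   # Gaussian Mixture background model
        sift = cv2.SIFT_create()
        prev_gray = None

        while True:
            ok, frame = cap.read()
            if not ok:
                break
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

            # GMM foreground mask localizes/tracks the moving person.
            mask = bg_model.apply(frame)

            # SIFT keypoints and descriptors, restricted to the foreground mask,
            # give the shape/gradient/orientation part of the feature vector.
            keypoints, descriptors = sift.detectAndCompute(gray, mask)

            # Dense Farnebäck optical flow between consecutive frames gives motion cues.
            if prev_gray is not None:
                flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                                    0.5, 3, 15, 3, 5, 1.2, 0)
                mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])  # magnitude/direction
            prev_gray = gray

        cap.release()
        # The combined SIFT + flow features would then be fed to the CNN classifier (omitted).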

    Human Action Recognition by Learning Spatio-Temporal Features with Deep Neural Networks

    Human action recognition plays a crucial role in various applications, including video surveillance, human-computer interaction, and activity analysis. This paper presents a study on human action recognition leveraging a CNN-LSTM architecture with an attention model. The proposed approach aims to capture both spatial and temporal information from videos in order to recognize human actions. We utilize the UCF-101 and UCF-50 datasets, which are widely used benchmark datasets for action recognition. The UCF-101 dataset consists of 101 action classes, while the UCF-50 dataset comprises 50 action classes, both encompassing diverse human activities. Our CNN-LSTM model integrates a CNN as the feature extractor to capture spatial information from video frames. Subsequently, the extracted features are fed into an LSTM network to capture temporal dependencies and sequence information. To enhance the discriminative power of the model, an attention model is incorporated to improve the activation patterns and highlight relevant features. Furthermore, the study provides insights into the importance of leveraging both spatial and temporal information for accurate action recognition. The findings highlight the efficacy of the CNN-LSTM architecture with an attention model in capturing meaningful patterns in video sequences and improving action recognition accuracy.
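    As a compact illustration of the CNN-LSTM-with-attention pattern this abstract describes, the PyTorch sketch below scores each LSTM time step with a learned attention weight and classifies the attention-weighted context vector. The tiny CNN, hidden size, and additive attention form are assumptions, not the authors' exact model.

        import torch
        import torch.nn as nn

        class CNNLSTMAttention(nn.Module):
            """CNN per frame -> LSTM over time -> attention-weighted pooling."""
            def __init__(self, num_classes=101, hidden=128):
                super().__init__()
                self.cnn = nn.Sequential(
                    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
                    nn.AdaptiveAvgPool2d(1), nn.Flatten())
                self.lstm = nn.LSTM(32, hidden, batch_first=True)
                self.attn = nn.Linear(hidden, 1)       # scores each time step
                self.classifier = nn.Linear(hidden, num_classes)

            def forward(self, clip):                   # clip: (batch, time, 3, H, W)
                b, t = clip.shape[:2]
                feats = self.cnn(clip.flatten(0, 1)).view(b, t, -1)
                seq, _ = self.lstm(feats)              # (batch, time, hidden)
                weights = torch.softmax(self.attn(seq), dim=1)   # attention over time
                context = (weights * seq).sum(dim=1)   # weighted sum of hidden states
                return self.classifier(context)

        logits = CNNLSTMAttention()(torch.randn(2, 8, 3, 64, 64))   # shape (2, 101)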