14 research outputs found

    Action recognition based on joint trajectory maps using convolutional neural networks

    Get PDF
    Recently, Convolutional Neural Networks (ConvNets) have shown promising performance in many computer vision tasks, especially image-based recognition. How to effectively use ConvNets for video-based recognition is still an open problem. In this paper, we propose a compact, effective yet simple method to encode the spatiotemporal information carried in 3D skeleton sequences into multiple 2D images, referred to as Joint Trajectory Maps (JTM); ConvNets are then adopted to exploit the discriminative features for real-time human action recognition. The proposed method has been evaluated on three public benchmarks, i.e., the MSRC-12 Kinect gesture dataset (MSRC-12), the G3D dataset and the UTD multimodal human action dataset (UTD-MHAD), and achieved state-of-the-art results.
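
    As a rough illustration of the encoding idea sketched in this abstract, the snippet below projects the joints of a 3D skeleton sequence onto a 2D plane and draws each joint's trajectory into an image, with colour encoding temporal order; the array layout, the orthographic XY projection and the fixed canvas size are assumptions for illustration, not the authors' exact JTM formulation.

    import numpy as np
    import cv2

    def joint_trajectory_map(skeleton, size=224):
        # skeleton: (T, J, 3) array of 3D joint positions over T frames (assumed layout)
        T, J, _ = skeleton.shape
        xy = skeleton[..., :2]                            # orthographic projection onto XY
        mins, maxs = xy.min(axis=(0, 1)), xy.max(axis=(0, 1))
        xy = (xy - mins) / (maxs - mins + 1e-6)           # normalise to [0, 1]
        pix = (xy * (size - 1)).astype(int)
        canvas = np.zeros((size, size, 3), np.uint8)
        for t in range(T - 1):
            hue = int(179 * t / max(T - 1, 1))            # hue encodes time
            bgr = cv2.cvtColor(np.uint8([[[hue, 255, 255]]]), cv2.COLOR_HSV2BGR)[0, 0]
            color = tuple(int(c) for c in bgr)
            for j in range(J):
                p1 = tuple(int(v) for v in pix[t, j])
                p2 = tuple(int(v) for v in pix[t + 1, j])
                cv2.line(canvas, p1, p2, color, 1)        # trajectory segment of joint j
        return canvas                                     # 2D image to feed a ConvNet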

    Action Classification in Human Robot Interaction Cells in Manufacturing

    Get PDF
    Action recognition has become a prerequisite for fluent Human-Robot Interaction (HRI) because of the high degree of flexibility in human movement. With the improvements in machine learning algorithms, robots are gradually transitioning into more human-populated areas. However, HRI systems require robots to possess sufficient cognitive capabilities. Action recognition algorithms need massive training datasets, structural information about objects in the environment, and models that are inexpensive in terms of computational complexity. In addition, many such algorithms are trained on datasets derived from daily activities, and algorithms trained on non-industrial datasets may perform poorly when models are deployed and actions are validated in an industrial context. This study proposes a lightweight deep learning model for classifying low-level actions in an assembly setting. The model is based on optical-flow feature extraction and MobileNetV2-SSD action classification, and is trained and assessed on a dataset of actual industrial activities. The experimental outcomes show that the presented method does not require extensive preprocessing and is therefore promising for action recognition for mutual performance monitoring in real-world HRI applications. The test results show 80% accuracy for the low-level RGB action classes. The study's primary objective is to generate experimental results that may be used as a reference for future HRI algorithms based on the InHard dataset.
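
    The abstract describes a pipeline of optical-flow features followed by a MobileNetV2-based classifier; the sketch below shows the general shape of such a pipeline using OpenCV's Farneback flow and a torchvision MobileNetV2 with a replaced head. The SSD detection stage, the InHard-specific preprocessing, the class count and the flow-stacking scheme are all assumptions, not the authors' implementation.

    import cv2
    import numpy as np
    import torch
    import torchvision

    def flow_frame(prev_gray, gray):
        # Farneback dense optical flow between two grayscale frames -> (H, W, 2) field
        return cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)

    num_classes = 13                        # assumed number of low-level action classes
    model = torchvision.models.mobilenet_v2(weights="IMAGENET1K_V1")
    model.classifier[1] = torch.nn.Linear(model.last_channel, num_classes)

    def classify(flow):                     # flow: (H, W, 2) float array from flow_frame
        mag = np.linalg.norm(flow, axis=2, keepdims=True)
        x = np.concatenate([flow, mag], axis=2).astype(np.float32)   # 3 input channels
        x = cv2.resize(x, (224, 224)).transpose(2, 0, 1)
        x = torch.from_numpy(x).unsqueeze(0)
        return model(x).softmax(dim=1)      # class probabilities for this flow frame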

    Shifting Perspective to See Difference: A Novel Multi-View Method for Skeleton based Action Recognition

    Get PDF
    Skeleton-based human action recognition is a longstanding challenge due to its complex dynamics, and some fine-grained details of these dynamics play a vital role in classification. Existing work largely focuses on designing incremental neural networks with more complicated adjacency matrices to capture the details of joint relationships. However, such methods still have difficulty distinguishing actions that have broadly similar motion patterns but belong to different categories. Interestingly, we found that subtle differences in motion patterns can be significantly amplified, and become easy for an observer to distinguish, through specific view directions, a property that has not been fully explored before. Drastically different from previous work, we boost performance by proposing a conceptually simple yet effective multi-view strategy that recognizes actions from a collection of dynamic view features. Specifically, we design a novel Skeleton-Anchor Proposal (SAP) module that contains a multi-head structure to learn a set of views. For feature learning under different views, we introduce a novel Angle Representation to transform the actions under different views and feed the transformations into the baseline model. Our module can work seamlessly with existing action classification models. Incorporated with baseline models, our SAP module exhibits clear performance gains on many challenging benchmarks. Moreover, comprehensive experiments show that our model consistently outperforms the state of the art and remains effective and robust, especially when dealing with corrupted data. Related code will be available at https://github.com/ideal-idea/SAP.
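
    A minimal way to picture the multi-view strategy is a wrapper that proposes a few viewing directions, transforms the skeleton into each view and fuses the predictions of a shared baseline classifier, as sketched below. Parameterising the views as learnable rotations about the vertical axis and averaging logits are simplifying assumptions; the paper's Skeleton-Anchor Proposal module and Angle Representation are more elaborate.

    import torch
    import torch.nn as nn

    class MultiViewWrapper(nn.Module):
        def __init__(self, baseline, num_views=4):
            super().__init__()
            self.baseline = baseline                        # any skeleton classifier
            self.view_angles = nn.Parameter(torch.randn(num_views) * 0.1)

        def rotate_y(self, x, theta):
            # x: (N, T, J, 3) skeleton sequence; rotate all joints about the y axis
            c, s = torch.cos(theta), torch.sin(theta)
            zero, one = torch.zeros_like(c), torch.ones_like(c)
            rot = torch.stack([torch.stack([c, zero, s]),
                               torch.stack([zero, one, zero]),
                               torch.stack([-s, zero, c])])
            return x @ rot.T

        def forward(self, x):
            logits = [self.baseline(self.rotate_y(x, a)) for a in self.view_angles]
            return torch.stack(logits).mean(dim=0)          # fuse the views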

    Zero-shot Skeleton-based Action Recognition via Mutual Information Estimation and Maximization

    Full text link
    Zero-shot skeleton-based action recognition aims to recognize actions of unseen categories after training on data of seen categories. The key is to build the connection between the visual and semantic spaces from seen to unseen classes. Previous studies have primarily focused on encoding sequences into a single feature vector and subsequently mapping the features to an identical anchor point within the embedding space. Their performance is hindered by 1) ignoring global visual/semantic distribution alignment, which limits the ability to capture the true interdependence between the two spaces, and 2) neglecting temporal information, since frame-wise features with rich action clues are directly pooled into a single feature vector. We propose a new zero-shot skeleton-based action recognition method via mutual information (MI) estimation and maximization. Specifically, 1) we maximize the MI between the visual and semantic spaces for distribution alignment, and 2) we leverage temporal information when estimating the MI by encouraging the MI to increase as more frames are observed. Extensive experiments on three large-scale skeleton action datasets confirm the effectiveness of our method. Accepted by ACM MM 2023. Code: https://github.com/YujieOuO/SMIE.
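
    One standard way to maximise mutual information between two embedding spaces is a contrastive lower bound such as InfoNCE, sketched below for paired skeleton and class-text features. Whether SMIE uses this particular estimator, this temperature, or this pairing scheme is an assumption; the abstract only states that MI between the visual and semantic spaces is estimated and maximised.

    import torch
    import torch.nn.functional as F

    def info_nce(visual, semantic, temperature=0.1):
        # visual: (N, D) skeleton features; semantic: (N, D) class-text embeddings,
        # where row i of both tensors describes the same action category.
        v = F.normalize(visual, dim=1)
        s = F.normalize(semantic, dim=1)
        logits = v @ s.T / temperature                  # (N, N) similarity matrix
        targets = torch.arange(v.size(0), device=v.device)
        # minimising this contrastive loss maximises a lower bound on the MI
        return F.cross_entropy(logits, targets)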

    Ske2Grid: Skeleton-to-Grid Representation Learning for Action Recognition

    Full text link
    This paper presents Ske2Grid, a new representation learning framework for improved skeleton-based action recognition. In Ske2Grid, we define a regular convolution operation upon a novel grid representation of the human skeleton, which is a compact image-like grid patch constructed and learned through three novel designs. Specifically, we propose a graph-node index transform (GIT) to construct a regular grid patch by assigning the nodes in the skeleton graph one by one to the desired grid cells. To ensure that GIT is a bijection and to enrich the expressiveness of the grid representation, an up-sampling transform (UPT) is learned to interpolate the skeleton graph nodes so as to fill the grid patch completely. To address the case in which a one-step UPT is too aggressive, and to further exploit the representation capability of the grid patch with increasing spatial size, a progressive learning strategy (PLS) is proposed which decouples the UPT into multiple steps and aligns them to multiple paired GITs through a compact cascaded design learned progressively. We construct networks upon prevailing graph convolution networks and conduct experiments on six mainstream skeleton-based action recognition datasets. Experiments show that our Ske2Grid significantly outperforms existing GCN-based solutions under different benchmark settings, without bells and whistles. Published at ICML 2023. Code and models are available at https://github.com/OSVAI/Ske2Grid.
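
    The core idea, mapping the joints of a skeleton graph into the cells of a compact image-like grid so that ordinary 2D convolutions apply, can be sketched as below. The fixed 5x5 grid, the random node-to-cell assignment and the single convolution are placeholders for illustration; the paper's learned GIT, up-sampling transform and progressive learning strategy are not reproduced here.

    import torch
    import torch.nn as nn

    class SkeletonToGrid(nn.Module):
        def __init__(self, num_joints=25, channels=64, grid=(5, 5)):
            super().__init__()
            assert grid[0] * grid[1] == num_joints        # bijective joint-to-cell map
            self.grid = grid
            self.register_buffer("assign", torch.randperm(num_joints))
            self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

        def forward(self, x):
            # x: (N, C, J) per-joint features for one frame (or pooled over time)
            n, c, j = x.shape
            patch = torch.zeros_like(x)
            patch[:, :, self.assign] = x                  # place each joint in its grid cell
            patch = patch.view(n, c, *self.grid)          # reshape cells into an H x W patch
            return self.conv(patch)                       # regular convolution on the grid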

    AssemblyHands: Towards Egocentric Activity Understanding via 3D Hand Pose Estimation

    Full text link
    We present AssemblyHands, a large-scale benchmark dataset with accurate 3D hand pose annotations, to facilitate the study of egocentric activities with challenging hand-object interactions. The dataset includes synchronized egocentric and exocentric images sampled from the recent Assembly101 dataset, in which participants assemble and disassemble take-apart toys. To obtain high-quality 3D hand pose annotations for the egocentric images, we develop an efficient pipeline, where we use an initial set of manual annotations to train a model to automatically annotate a much larger dataset. Our annotation model uses multi-view feature fusion and an iterative refinement scheme, and achieves an average keypoint error of 4.20 mm, which is 85% lower than the error of the original annotations in Assembly101. AssemblyHands provides 3.0M annotated images, including 490K egocentric images, making it the largest existing benchmark dataset for egocentric 3D hand pose estimation. Using this data, we develop a strong single-view baseline for 3D hand pose estimation from egocentric images. Furthermore, we design a novel action classification task to evaluate predicted 3D hand poses. Our study shows that having higher-quality hand poses directly improves the ability to recognize actions. Published at CVPR 2023. Project page: https://assemblyhands.github.io
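
    The 4.20 mm figure quoted above is an average keypoint error; a common way to compute such a metric is the mean per-joint Euclidean distance, sketched below. The 21-keypoint hand layout and millimetre units are assumptions, and this is not the AssemblyHands evaluation code.

    import numpy as np

    def mean_keypoint_error(pred, gt):
        # pred, gt: (N, 21, 3) arrays of predicted / ground-truth 3D hand keypoints (mm)
        return float(np.linalg.norm(pred - gt, axis=-1).mean())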

    Learning action recognition model from depth and skeleton videos

    Get PDF
    Depth sensors open up possibilities of dealing with the human action recognition problem by providing 3D human skeleton data and depth images of the scene. Analysis of human actions based on 3D skeleton data has become popular recently, due to its robustness and view-invariant representation. However, the skeleton alone is insufficient to distinguish actions which involve human-object interactions. In this paper, we propose a deep model which efficiently models human-object interactions and intra-class variations under viewpoint changes. First, a human body-part model is introduced to transfer the depth appearances of body-parts to a shared view-invariant space. Second, an end-to-end learning framework is proposed which is able to effectively combine the view-invariant body-part representation from skeletal and depth images, and learn the relations between the human body-parts and the environmental objects, the interactions between different human body-parts, and the temporal structure of human actions. We have evaluated the performance of our proposed model against 15 existing techniques on two large benchmark human action recognition datasets including NTU RGB+D and UWA3DII. The experimental results show that our technique provides a significant improvement over state-of-the-art methods.
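
    A highly simplified view of the model described here is a two-stream network that encodes a skeleton descriptor and a depth body-part descriptor separately and fuses them for classification, as sketched below. The actual paper learns body-part and object relations end to end; the two linear encoders, the late fusion and the 60-class head are assumptions for illustration only.

    import torch
    import torch.nn as nn

    class TwoStreamFusion(nn.Module):
        def __init__(self, skel_dim, depth_dim, hidden=256, num_classes=60):
            super().__init__()
            self.skel_enc = nn.Sequential(nn.Linear(skel_dim, hidden), nn.ReLU())
            self.depth_enc = nn.Sequential(nn.Linear(depth_dim, hidden), nn.ReLU())
            self.head = nn.Linear(2 * hidden, num_classes)

        def forward(self, skel_feat, depth_feat):
            # skel_feat: (N, skel_dim) view-invariant skeleton descriptor
            # depth_feat: (N, depth_dim) pooled depth body-part descriptor
            fused = torch.cat([self.skel_enc(skel_feat), self.depth_enc(depth_feat)], dim=1)
            return self.head(fused)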

    Cross View Action Recognition

    Get PDF
    Cross View Action Recognition (CVAR) appraises a system's ability to recognise actions from viewpoints that are unfamiliar to the system. State-of-the-art methods that train on large amounts of training data rely on variation in the training data itself to increase their ability to tackle viewpoint changes. Therefore, these methods not only require a large-scale dataset of appropriate classes for the application every time they train, but also a correspondingly large amount of computational power for the training process, leading to high costs in terms of time, effort, funds and electrical energy. In this thesis, we propose a methodological pipeline that tackles changes in viewpoint while training on small datasets and employing sustainable amounts of resources. Our method passes optical flow input through one stream of a pre-trained model, used as-is, to obtain a feature. Thereafter, this feature is used to train a custom-designed classifier that promotes view-invariant properties. Our method only uses video information as input, in contrast to another set of methods that approach CVAR by using depth or pose input at the expense of increased sensor costs. We present a number of comparative analyses that aided the design of the pipeline, further assessing the contribution of each component in the pipeline. The technique can also be adapted to existing, trained classifiers with minimal fine-tuning, as this work demonstrates by comparing classifiers including shallow classifiers, deep pre-trained classifiers and our proposed classifier trained from scratch. Additionally, we present a set of qualitative results that further our understanding of the relationship between viewpoints in the feature space.
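
    The low-resource pipeline described in this abstract, optical flow passed through a frozen pre-trained stream followed by a small trainable classifier, can be sketched as follows. The choice of ResNet-18 as the frozen stream, the single linear classifier and the class count are assumptions used purely to illustrate the setup, not the thesis's actual architecture.

    import torch
    import torch.nn as nn
    import torchvision

    backbone = torchvision.models.resnet18(weights="IMAGENET1K_V1")
    backbone.fc = nn.Identity()             # keep the 512-d feature, drop the ImageNet head
    backbone.eval()
    for p in backbone.parameters():
        p.requires_grad = False             # the pre-trained stream is used as-is

    classifier = nn.Linear(512, 10)         # small trainable classifier (10 assumed classes)

    def forward(flow_batch):                # flow_batch: (N, 3, 224, 224) flow images
        with torch.no_grad():
            feats = backbone(flow_batch)    # frozen feature extraction
        return classifier(feats)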

    Human Action Recognition and Monitoring in Ambient Assisted Living Environments

    Get PDF
    Population ageing is set to become one of the most significant challenges of the 21st century, with implications for almost all sectors of society. Especially in developed countries, governments should immediately implement policies and solutions to facilitate the needs of an increasingly older population. Ambient Intelligence (AmI), and in particular the area of Ambient Assisted Living (AAL), offers a feasible response, allowing the creation of human-centric smart environments that are sensitive and responsive to the needs and behaviours of the user. In such a scenario, understanding what a human being is doing, whether and how he/she is interacting with specific objects, or whether abnormal situations are occurring is critical. This thesis is focused on two related research areas of AAL: the development of innovative vision-based techniques for human action recognition and the remote monitoring of users' behaviour in smart environments. The former topic is addressed through different approaches based on data extracted from RGB-D sensors. A first algorithm exploiting skeleton joint orientations is proposed. This approach is extended through a multi-modal strategy that includes the RGB channel to define a number of temporal images capable of describing the time evolution of actions. Finally, the concept of template co-updating for action recognition is introduced: exploiting different data categories (e.g., skeleton and RGB information) improves the effectiveness of template updating through co-updating techniques. The action recognition algorithms have been evaluated on CAD-60 and CAD-120, achieving results comparable with the state of the art. Moreover, due to the lack of datasets including skeleton joint orientations, a new benchmark named Office Activity Dataset has been internally acquired and released. Regarding the second topic, the goal is to provide a detailed implementation strategy for a generic Internet of Things monitoring platform that could be used for checking users' behaviour in AmI/AAL contexts.
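
    As a small illustration of the first ingredient mentioned above, skeleton joint orientations, the sketch below turns one skeleton frame into a feature vector of unit bone-direction vectors. The bone list is a generic assumed example rather than the sensor's actual joint map, and the thesis additionally uses richer orientation data, temporal images and template co-updating, which are not shown.

    import numpy as np

    BONES = [(0, 1), (1, 2), (2, 3), (1, 4), (1, 5)]      # assumed (parent, child) pairs

    def joint_orientation_features(joints):
        # joints: (J, 3) array of 3D joint positions for one frame
        feats = []
        for parent, child in BONES:
            v = joints[child] - joints[parent]
            feats.append(v / (np.linalg.norm(v) + 1e-8))  # unit bone orientation
        return np.concatenate(feats)                       # per-frame orientation feature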