14 research outputs found
Action recognition based on joint trajectory maps using convolutional neural networks
Recently, Convolutional Neural Networks (ConvNets) have shown promising performance in many computer vision tasks, especially image-based recognition. How to effectively use ConvNets for video-based recognition remains an open problem. In this paper, we propose a compact, effective yet simple method that encodes the spatio-temporal information carried in 3D skeleton sequences into multiple 2D images, referred to as Joint Trajectory Maps (JTMs), and adopts ConvNets to exploit the discriminative features for real-time human action recognition. The proposed method has been evaluated on three public benchmarks, i.e., the MSRC-12 Kinect gesture dataset (MSRC-12), the G3D dataset and the UTD multimodal human action dataset (UTD-MHAD), and achieved state-of-the-art results.
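The trajectory-to-image idea can be sketched as follows. This is our own minimal illustration, not the paper's code: the published JTMs additionally encode motion direction and magnitude with colour, which is omitted here, with pixel intensity standing in for time.

```python
import numpy as np

def joint_trajectory_map(joints_xy, height=64, width=64):
    """Rasterize 2D joint trajectories into a single image: each joint
    position is drawn into the map, with later frames drawn brighter so
    the image captures when each location was visited.

    joints_xy: array of shape (T, J, 2) with coordinates in [0, 1).
    Returns a (height, width) float32 image with values in [0, 1].
    """
    T, J, _ = joints_xy.shape
    img = np.zeros((height, width), dtype=np.float32)
    for t in range(T):
        weight = (t + 1) / T  # temporal position encoded as intensity
        for j in range(J):
            x, y = joints_xy[t, j]
            r = min(int(y * height), height - 1)
            c = min(int(x * width), width - 1)
            img[r, c] = max(img[r, c], weight)
    return img

# toy example: one joint moving along the image diagonal over 10 frames
traj = np.stack([np.linspace(0.1, 0.9, 10)] * 2, axis=1).reshape(10, 1, 2)
jtm = joint_trajectory_map(traj)
```

The resulting patch can be fed to any image ConvNet as an ordinary single-channel input.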
Action Classification in Human Robot Interaction Cells in Manufacturing
Action recognition has become a prerequisite for fluent Human-Robot Interaction (HRI), which demands a high degree of movement flexibility. With improvements in machine learning algorithms, robots are gradually moving into more human-populated areas, and HRI systems require robots to possess sufficient cognition. Action recognition algorithms typically need massive training datasets, structural information about objects in the environment, and models that are inexpensive in terms of computational complexity. In addition, many such algorithms are trained on datasets derived from daily activities; algorithms trained on non-industrial datasets may perform poorly when models are deployed and actions are validated in an industrial context. This study proposes a lightweight deep learning model for classifying low-level actions in an assembly setting. The model is based on optical-flow feature extraction and MobileNetV2-SSD action classification, and is trained and assessed on a dataset of actual industrial activities. The experimental outcomes show that the presented method does not require extensive preprocessing and is therefore promising in terms of the feasibility of action recognition for mutual performance monitoring in real-world HRI applications. The test results show 80% accuracy on low-level RGB action classes. The study's primary objective is to generate experimental results that may serve as a reference for future HRI algorithms based on the InHard dataset.
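As a rough illustration of the optical-flow front end, the sketch below estimates per-block displacement magnitudes with naive exhaustive block matching. Nothing here comes from the paper's implementation; a real pipeline would use a proper dense-flow algorithm (e.g. Farneback) before the MobileNetV2-SSD classifier.

```python
import numpy as np

def flow_magnitude_feature(prev, curr, block=8):
    """Crude motion feature: for each block of the previous frame,
    search a one-pixel window in the current frame for the best match
    and record the displacement magnitude. Returns one value per block.
    """
    H, W = prev.shape
    mags = []
    for r in range(0, H - block + 1, block):
        for c in range(0, W - block + 1, block):
            patch = prev[r:r + block, c:c + block]
            best, best_d = np.inf, (0, 0)
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr <= H - block and 0 <= cc <= W - block:
                        cand = curr[rr:rr + block, cc:cc + block]
                        err = np.abs(patch - cand).sum()
                        if err < best:
                            best, best_d = err, (dr, dc)
            mags.append(np.hypot(*best_d))
    return np.array(mags)

# identical frames: every block should report zero motion
frame = np.arange(32 * 32, dtype=np.float64).reshape(32, 32)
feat = flow_magnitude_feature(frame, frame)
```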
Shifting Perspective to See Difference: A Novel Multi-View Method for Skeleton based Action Recognition
Skeleton-based human action recognition is a longstanding challenge due to its complex dynamics. Some fine-grained details of the dynamics play a vital role in classification. Existing work largely focuses on designing incremental neural networks with ever more complicated adjacency matrices to capture the details of joint relationships. However, such networks still have difficulty distinguishing actions that have broadly similar motion patterns but belong to different categories. Interestingly, we found that the subtle differences in motion patterns can be significantly amplified, and become easy for an observer to distinguish, when seen from suitably chosen view directions, a property that has not been fully explored before. Departing sharply from previous work, we boost performance by proposing a conceptually simple yet effective multi-view strategy that recognizes actions from a collection of dynamic view features. Specifically, we design a novel Skeleton-Anchor Proposal (SAP) module, which contains a multi-head structure to learn a set of views. For feature learning under different views, we introduce a novel angle representation that transforms the actions under different views and feeds the transformations into the baseline model. Our module works seamlessly with existing action classification models. Incorporated with baseline models, our SAP module exhibits clear performance gains on many challenging benchmarks. Moreover, comprehensive experiments show that our model consistently outperforms the state of the art and remains effective and robust, especially when dealing with corrupted data. Related code will be available at https://github.com/ideal-idea/SAP
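The multi-view intuition, that the same skeleton seen from a different direction can separate look-alike actions, can be illustrated with fixed rotations. SAP learns its view anchors and angle features end to end; the fixed rotations in this numpy sketch are only a stand-in.

```python
import numpy as np

def rotate_views(joints, n_views=4):
    """Render the same 3D skeleton under several view directions by
    rotating it about the vertical (y) axis.

    joints: (J, 3) array of joint positions; returns (n_views, J, 3).
    """
    views = []
    for k in range(n_views):
        theta = 2 * np.pi * k / n_views
        c, s = np.cos(theta), np.sin(theta)
        R = np.array([[  c, 0.0,   s],
                      [0.0, 1.0, 0.0],
                      [ -s, 0.0,   c]])
        views.append(joints @ R.T)  # rotate every joint rigidly
    return np.stack(views)

# two-joint toy skeleton; view 0 is the original pose
skel = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
views = rotate_views(skel)
```

Each rotated copy can then be encoded separately and the per-view features pooled for classification.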
Zero-shot Skeleton-based Action Recognition via Mutual Information Estimation and Maximization
Zero-shot skeleton-based action recognition aims to recognize actions of unseen categories after training on data of seen categories. The key is to build a connection between the visual and semantic spaces that transfers from seen to unseen classes. Previous studies have primarily focused on encoding sequences into a single feature vector and then mapping the features to an anchor point in the embedding space. Their performance is hindered by 1) ignoring global visual/semantic distribution alignment, which limits their ability to capture the true interdependence between the two spaces, and 2) neglecting temporal information, since the frame-wise features with rich action clues are directly pooled into a single feature vector. We propose a new zero-shot skeleton-based action recognition method via mutual information (MI) estimation and maximization. Specifically, 1) we maximize the MI between the visual and semantic spaces for distribution alignment, and 2) we leverage temporal information when estimating the MI by encouraging the MI to increase as more frames are observed. Extensive experiments on three large-scale skeleton action datasets confirm the effectiveness of our method. Code: https://github.com/YujieOuO/SMIE
Comment: Accepted by ACM MM 202
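A common way to estimate and maximize MI between two embedding spaces is a noise-contrastive (InfoNCE-style) lower bound. The sketch below is a generic illustration of that estimator, not the paper's objective; the temperature value and the pairing convention are our assumptions.

```python
import numpy as np

def infonce_lower_bound(visual, semantic, temperature=0.1):
    """InfoNCE-style lower bound on the mutual information between
    paired embeddings: row i of `visual` and row i of `semantic` are
    assumed to form a positive pair, all other rows are negatives.
    """
    v = visual / np.linalg.norm(visual, axis=1, keepdims=True)
    s = semantic / np.linalg.norm(semantic, axis=1, keepdims=True)
    logits = (v @ s.T) / temperature
    # numerically stable log-softmax over each row
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    n = len(visual)
    # the bound is at most log(n) and grows as positives dominate rows
    return float(np.log(n) + np.diag(log_probs).mean())

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))
aligned = infonce_lower_bound(z, z)                    # matched pairs
mismatched = infonce_lower_bound(z, np.roll(z, 1, axis=0))
```

Maximizing this bound with respect to the encoders pushes the two distributions into alignment, which is the role MI maximization plays here.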
Ske2Grid: Skeleton-to-Grid Representation Learning for Action Recognition
This paper presents Ske2Grid, a new representation learning framework for
improved skeleton-based action recognition. In Ske2Grid, we define a regular
convolution operation upon a novel grid representation of human skeleton, which
is a compact image-like grid patch constructed and learned through three novel
designs. Specifically, we propose a graph-node index transform (GIT) to
construct a regular grid patch through assigning the nodes in the skeleton
graph one by one to the desired grid cells. To ensure that GIT is a bijection
and enrich the expressiveness of the grid representation, an up-sampling
transform (UPT) is learned to interpolate the skeleton graph nodes for filling
the grid patch to the full. To resolve the problem when the one-step UPT is
aggressive and further exploit the representation capability of the grid patch
with increasing spatial size, a progressive learning strategy (PLS) is proposed
which decouples the UPT into multiple steps and aligns them to multiple paired
GITs through a compact cascaded design learned progressively. We construct
networks upon prevailing graph convolution networks and conduct experiments on
six mainstream skeleton-based action recognition datasets. Experiments show
that our Ske2Grid significantly outperforms existing GCN-based solutions under
different benchmark settings, without bells and whistles. Code and models are
available at https://github.com/OSVAI/Ske2Grid
Comment: The paper of Ske2Grid is published at ICML 2023.
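The GIT step, assigning each graph node to a unique grid cell so that ordinary 2D convolutions apply, can be illustrated with a fixed permutation. In Ske2Grid the assignment is learned end to end; the function and argument names below are our own.

```python
import numpy as np

def nodes_to_grid(node_feats, assignment, grid_hw):
    """Place skeleton-graph node features into a regular grid patch.
    Using a permutation keeps the transform a bijection by construction,
    which is the property GIT enforces.

    node_feats: (N, C) features; assignment: permutation of range(N)
    mapping node i to flat grid cell assignment[i]; grid_hw: (H, W)
    with H * W == N (after any up-sampling of the graph, as in UPT).
    """
    H, W = grid_hw
    N, C = node_feats.shape
    assert H * W == N and sorted(assignment) == list(range(N))
    grid = np.zeros((H, W, C), dtype=node_feats.dtype)
    for i, cell in enumerate(assignment):
        grid[cell // W, cell % W] = node_feats[i]
    return grid  # ready for ordinary 2D convolution

# 16 nodes with one channel each, placed in reverse order on a 4x4 patch
feats = np.arange(16, dtype=np.float32).reshape(16, 1)
patch = nodes_to_grid(feats, list(range(15, -1, -1)), (4, 4))
```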
AssemblyHands: Towards Egocentric Activity Understanding via 3D Hand Pose Estimation
We present AssemblyHands, a large-scale benchmark dataset with accurate 3D
hand pose annotations, to facilitate the study of egocentric activities with
challenging hand-object interactions. The dataset includes synchronized
egocentric and exocentric images sampled from the recent Assembly101 dataset,
in which participants assemble and disassemble take-apart toys. To obtain
high-quality 3D hand pose annotations for the egocentric images, we develop an
efficient pipeline, where we use an initial set of manual annotations to train
a model to automatically annotate a much larger dataset. Our annotation model
uses multi-view feature fusion and an iterative refinement scheme, and achieves
an average keypoint error of 4.20 mm, which is 85% lower than the error of the
original annotations in Assembly101. AssemblyHands provides 3.0M annotated
images, including 490K egocentric images, making it the largest existing
benchmark dataset for egocentric 3D hand pose estimation. Using this data, we
develop a strong single-view baseline of 3D hand pose estimation from
egocentric images. Furthermore, we design a novel action classification task to
evaluate predicted 3D hand poses. Our study shows that having higher-quality
hand poses directly improves the ability to recognize actions.
Comment: CVPR 2023. Project page: https://assemblyhands.github.io
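The reported 4.20 mm figure is an average keypoint error. A generic implementation of that metric (our own sketch, not the authors' evaluation code; units depend on the annotations) looks like:

```python
import numpy as np

def mean_keypoint_error(pred, gt):
    """Average Euclidean distance between predicted and ground-truth
    keypoints for one frame.

    pred, gt: (K, 3) arrays of 3D keypoints in the same coordinates.
    """
    return float(np.linalg.norm(pred - gt, axis=1).mean())

gt = np.zeros((21, 3))                  # 21 hand joints at the origin
pred = gt + np.array([3.0, 0.0, 4.0])   # constant 5-unit offset
err = mean_keypoint_error(pred, gt)
```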
Learning action recognition model from depth and skeleton videos
Depth sensors open up possibilities of dealing with the human action recognition problem by providing 3D human skeleton data and depth images of the scene. Analysis of human actions based on 3D skeleton data has become popular recently, due to its robustness and view-invariant representation. However, the skeleton alone is insufficient to distinguish actions which involve human-object interactions. In this paper, we propose a deep model which efficiently models human-object interactions and intra-class variations under viewpoint changes. First, a human body-part model is introduced to transfer the depth appearances of body-parts to a shared view-invariant space. Second, an end-to-end learning framework is proposed which is able to effectively combine the view-invariant body-part representation from skeletal and depth images, and learn the relations between the human body-parts and the environmental objects, the interactions between different human body-parts, and the temporal structure of human actions. We have evaluated the performance of our proposed model against 15 existing techniques on two large benchmark human action recognition datasets, NTU RGB+D and UWA3DII. The experimental results show that our technique provides a significant improvement over state-of-the-art methods.
Cross View Action Recognition
Cross View Action Recognition (CVAR) appraises a system's ability to recognise actions from viewpoints that are unfamiliar to the system. State-of-the-art methods that train on large amounts of data rely on variation in the training data itself to increase their ability to handle viewpoint changes. Therefore, these methods not only require a large-scale dataset of appropriate classes for the application every time they train, but also a correspondingly large amount of computational power for the training process, leading to high costs in terms of time, effort, funds and electrical energy. In this thesis, we propose a methodological pipeline that tackles change in viewpoint while training on small datasets and employing sustainable amounts of resources. Our method uses the optical flow input with a stream of a pre-trained model as-is to obtain a feature. Thereafter, this feature is used to train a custom-designed classifier that promotes view-invariant properties. Our method uses only video information as input, in contrast to another set of methods that approach CVAR by using depth or pose input at the expense of increased sensor costs. We present a number of comparative analyses that aided the design of the pipeline, further assessing the power of each component in the pipeline. The technique can also be adapted to existing, trained classifiers with minimal fine-tuning, as this work demonstrates by comparing classifiers including shallow classifiers, deep pre-trained classifiers and our proposed classifier trained from scratch. Additionally, we present a set of qualitative results that promote our understanding of the relationship between viewpoints in the feature space.
XXXII Cycle - Computer Science and Systems Engineering (Informatica). Goyal, Gaurv
Human Action Recognition and Monitoring in Ambient Assisted Living Environments
Population ageing is set to become one of the most significant challenges of the 21st century, with implications for almost all sectors of society. Especially in developed countries, governments should immediately implement policies and solutions to facilitate the needs of an increasingly older population. Ambient Intelligence (AmI) and in particular the area of Ambient Assisted Living (AAL) offer a feasible response, allowing the creation of human-centric smart environments that are sensitive and responsive to the needs and behaviours of the user.
In such a scenario, understanding what a human being is doing, if and how he/she is interacting with specific objects, or whether abnormal situations are occurring, is critical.
This thesis is focused on two related research areas of AAL: the development of innovative vision-based techniques for human action recognition and the remote monitoring of users' behaviour in smart environments.
The former topic is addressed through different approaches based on data extracted from RGB-D sensors.
A first algorithm exploiting skeleton joints orientations is proposed. This approach is extended through a multi-modal strategy that includes the RGB channel to define a number of temporal images, capable of describing the time evolution of actions.
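A minimal stand-in for joint-orientation features of the kind the first algorithm exploits can be sketched as follows. The thesis works with orientations provided by the sensor; the bone list and the direction-vector derivation here are purely illustrative.

```python
import numpy as np

def bone_orientations(joints, bones):
    """Unit direction vectors for each bone in one skeleton frame.

    joints: (J, 3) joint positions; bones: list of (parent, child)
    index pairs. Returns a (len(bones), 3) array of unit vectors.
    """
    dirs = []
    for parent, child in bones:
        v = joints[child] - joints[parent]
        dirs.append(v / np.linalg.norm(v))  # normalize away bone length
    return np.stack(dirs)

# tiny 3-joint chain: hip -> spine -> head along the y axis
joints = np.array([[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 2.5, 0.0]])
orients = bone_orientations(joints, [(0, 1), (1, 2)])
```

Stacking such per-frame orientation features over time gives exactly the kind of temporal description the multi-modal extension turns into images.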
Finally, the concept of template co-updating for action recognition is introduced: exploiting different data categories (e.g., skeleton and RGB information) improves the effectiveness of template updating through co-updating techniques.
The action recognition algorithms have been evaluated on CAD-60 and CAD-120, achieving results comparable with the state-of-the-art. Moreover, due to the lack of datasets including skeleton joints orientations, a new benchmark named Office Activity Dataset has been internally acquired and released.
Regarding the second topic, the goal is to provide a detailed implementation strategy for a generic Internet of Things monitoring platform that could be used for checking users' behaviour in AmI/AAL contexts.