138 research outputs found
Attention in Convolutional LSTM for Gesture Recognition
Convolutional long short-term memory (LSTM) networks have been widely used for action/gesture recognition, and different attention mechanisms have also been embedded into the LSTM or the convolutional LSTM (ConvLSTM) networks. Based on the previous gesture recognition architectures which combine the threedimensional convolution neural network (3DCNN) and ConvLSTM, this paper explores the effects of attention mechanism in ConvLSTM. Several variants of ConvLSTM are evaluated: (a) Removing the convolutional structures of the three gates in ConvLSTM, (b) Applying the attention mechanism on the input of ConvLSTM, (c) Reconstructing the input and (d) output gates respectively with the modified channel-wise attention mechanism. The evaluation results demonstrate that the spatial convolutions in the three gates scarcely contribute to the spatiotemporal feature fusion, and the attention mechanisms embedded into the input and output gates cannot improve the feature fusion. In other words, ConvLSTM mainly contributes to the temporal fusion along with the recurrent steps to learn the long-term spatiotemporal features, when taking as input the spatial or spatiotemporal features. On this basis, a new variant of LSTM is derived, in which the convolutional structures are only embedded into the input-to-state transition of LSTM. The code of the LSTM variants is publicly available2
Regional Attention with Architecture-Rebuilt 3D Network for RGB-D Gesture Recognition
Human gesture recognition has drawn much attention in the area of computer
vision. However, the performance of gesture recognition is always influenced by
some gesture-irrelevant factors like the background and the clothes of
performers. Therefore, focusing on the regions of hand/arm is important to the
gesture recognition. Meanwhile, a more adaptive architecture-searched network
structure can also perform better than the block-fixed ones like Resnet since
it increases the diversity of features in different stages of the network
better. In this paper, we propose a regional attention with
architecture-rebuilt 3D network (RAAR3DNet) for gesture recognition. We replace
the fixed Inception modules with the automatically rebuilt structure through
the network via Neural Architecture Search (NAS), owing to the different shape
and representation ability of features in the early, middle, and late stage of
the network. It enables the network to capture different levels of feature
representations at different layers more adaptively. Meanwhile, we also design
a stackable regional attention module called dynamic-static Attention (DSA),
which derives a Gaussian guidance heatmap and dynamic motion map to highlight
the hand/arm regions and the motion information in the spatial and temporal
domains, respectively. Extensive experiments on two recent large-scale RGB-D
gesture datasets validate the effectiveness of the proposed method and show it
outperforms state-of-the-art methods. The codes of our method are available at:
https://github.com/zhoubenjia/RAAR3DNet.Comment: Accepted by AAAI 202
Deep Multi-Model Fusion for Human Activity Recognition Using Evolutionary Algorithms
Machine recognition of the human activities is an active research area in computer vision. In previous study, either one or two types of modalities have been used to handle this task. However, the grouping of maximum information improves the recognition accuracy of human activities. Therefore, this paper proposes an automatic human activity recognition system through deep fusion of multi-streams along with decision-level score optimization using evolutionary algorithms on RGB, depth maps and 3d skeleton joint information. Our proposed approach works in three phases, 1) space-time activity learning using two 3D Convolutional Neural Network (3DCNN) and a Long Sort Term Memory (LSTM) network from RGB, Depth and skeleton joint positions 2) Training of SVM using the activities learned from previous phase for each model and score generation using trained SVM 3) Score fusion and optimization using two Evolutionary algorithm such as Genetic algorithm (GA) and Particle Swarm Optimization (PSO) algorithm. The proposed approach is validated on two 3D challenging datasets, MSRDailyActivity3D and UTKinectAction3D. Experiments on these two datasets achieved 85.94% and 96.5% accuracies, respectively. The experimental results show the usefulness of the proposed representation. Furthermore, the fusion of different modalities improves recognition accuracies rather than using one or two types of information and obtains the state-of-art results
Error Action Recognition on Playing The Erhu Musical Instrument Using Hybrid Classification Method with 3D-CNN and LSTM
Erhu is a stringed instrument originating from China. In playing this instrument, there are rules on how to position the player's body and hold the instrument correctly. Therefore, a system is needed that can detect every movement of the Erhu player. This study will discuss action recognition on video using the 3DCNN and LSTM methods. The 3D Convolutional Neural Network method is a method that has a CNN base. To improve the ability to capture every information stored in every movement, combining an LSTM layer in the 3D-CNN model is necessary. LSTM is capable of handling the vanishing gradient problem faced by RNN. This research uses RGB video as a dataset, and there are three main parts in preprocessing and feature extraction. The three main parts are the body, erhu pole, and bow. To perform preprocessing and feature extraction, this study uses a body landmark to perform preprocessing and feature extraction on the body segment. In contrast, the erhu and bow segments use the Hough Lines algorithm. Furthermore, for the classification process, we propose two algorithms, namely, traditional algorithm and deep learning algorithm. These two-classification algorithms will produce an error message output from every movement of the erhu player
End-to-End Multiview Gesture Recognition for Autonomous Car Parking System
The use of hand gestures can be the most intuitive human-machine interaction medium.
The early approaches for hand gesture recognition used device-based methods. These
methods use mechanical or optical sensors attached to a glove or markers, which hinders
the natural human-machine communication. On the other hand, vision-based methods are
not restrictive and allow for a more spontaneous communication without the need of an
intermediary between human and machine. Therefore, vision gesture recognition has been
a popular area of research for the past thirty years.
Hand gesture recognition finds its application in many areas, particularly the automotive
industry where advanced automotive human-machine interface (HMI) designers are
using gesture recognition to improve driver and vehicle safety. However, technology advances
go beyond active/passive safety and into convenience and comfort. In this context,
one of America’s big three automakers has partnered with the Centre of Pattern Analysis
and Machine Intelligence (CPAMI) at the University of Waterloo to investigate expanding
their product segment through machine learning to provide an increased driver convenience
and comfort with the particular application of hand gesture recognition for autonomous
car parking.
In this thesis, we leverage the state-of-the-art deep learning and optimization techniques
to develop a vision-based multiview dynamic hand gesture recognizer for self-parking system.
We propose a 3DCNN gesture model architecture that we train on a publicly available
hand gesture database. We apply transfer learning methods to fine-tune the pre-trained
gesture model on a custom-made data, which significantly improved the proposed system
performance in real world environment. We adapt the architecture of the end-to-end solution
to expand the state of the art video classifier from a single image as input (fed by
monocular camera) to a multiview 360 feed, offered by a six cameras module. Finally, we
optimize the proposed solution to work on a limited resources embedded platform (Nvidia
Jetson TX2) that is used by automakers for vehicle-based features, without sacrificing the
accuracy robustness and real time functionality of the system
- …