481,248 research outputs found
Human Action Recognition Using Multi-Stream Fusion and Hybrid Deep Neural Networks
Action Recognition in videos is a topic of interest in the area of computer vision, due to potential applications such as multimedia indexing and surveillance in public areas. In this research, we first propose spatial and temporal Convolutional Neural Network (CNNs), based on transfer learning using ResNet101, GoogleNet and VGG16, for undertaking human action recognition. Besides that, hybrid networks such as CNNRecurrent Neural Network (RNN) models are also exploited as encoder-decoder architectures for video action classification. In particular, different types of RNNs such as Long Short-Term Memory (LSTM), Bidirectional-LSTM (BiLSTM), Gated Recurrent Unit (GRU), and Bidirectional-GRU (BiGRU), are exploited as the decoders for action recognition. To further enhance performance, diverse aggregation networks of CNN and CNN-RNN models are implemented. Specifically, an Average Fusion method is used to integrate spatial and temporal CNNs trained on images, as well as CNN-RNN trained on videos, where the final classification is formed by combining Softmax scores of these models via a late fusion. A total of 22 models (1 motion CNN, 3 spatial CNNs, 12 CNN-RNNs and 6 fusion networks) are implemented which are evaluated using UCF11, UCF50, and UCF101 datasets for performance comparison. The empirical results indicate the significant efficiency of Average Fusion of multiple Spatial-CNNs with one Motion-CNN, and ResNet101-BiGRU, among all the networks for undertaking realistic video action recognition
Spatial-temporal Transformer-guided Diffusion based Data Augmentation for Efficient Skeleton-based Action Recognition
Recently, skeleton-based human action has become a hot research topic because
the compact representation of human skeletons brings new blood to this research
domain. As a result, researchers began to notice the importance of using RGB or
other sensors to analyze human action by extracting skeleton information.
Leveraging the rapid development of deep learning (DL), a significant number of
skeleton-based human action approaches have been presented with fine-designed
DL structures recently. However, a well-trained DL model always demands
high-quality and sufficient data, which is hard to obtain without costing high
expenses and human labor. In this paper, we introduce a novel data augmentation
method for skeleton-based action recognition tasks, which can effectively
generate high-quality and diverse sequential actions. In order to obtain
natural and realistic action sequences, we propose denoising diffusion
probabilistic models (DDPMs) that can generate a series of synthetic action
sequences, and their generation process is precisely guided by a
spatial-temporal transformer (ST-Trans). Experimental results show that our
method outperforms the state-of-the-art (SOTA) motion generation approaches on
different naturality and diversity metrics. It proves that its high-quality
synthetic data can also be effectively deployed to existing action recognition
models with significant performance improvement
Multimodal fusion for audio-image and video action recognition
Multimodal Human Action Recognition (MHAR) is an important research topic in computer vision and event recognition fields. In this work, we address the problem of MHAR by developing a novel audio-image and video fusion-based deep learning framework that we call Multimodal Audio-Image and Video Action Recognizer (MAiVAR). We extract temporal information using image representations of audio signals and spatial information from video modality with the help of Convolutional Neutral Networks (CNN)-based feature extractors and fuse these features to recognize respective action classes. We apply a high-level weights assignment algorithm for improving audio-visual interaction and convergence. This proposed fusion-based framework utilizes the influence of audio and video feature maps and uses them to classify an action. Compared with state-of-the-art audio-visual MHAR techniques, the proposed approach features a simpler yet more accurate and more generalizable architecture, one that performs better with different audio-image representations. The system achieves an accuracy 87.9% and 79.0% on UCF51 and Kinetics Sounds datasets, respectively. All code and models for this paper will be available at https://tinyurl.com/4ps2ux6n
Latent Semantic Learning with Structured Sparse Representation for Human Action Recognition
This paper proposes a novel latent semantic learning method for extracting
high-level features (i.e. latent semantics) from a large vocabulary of abundant
mid-level features (i.e. visual keywords) with structured sparse
representation, which can help to bridge the semantic gap in the challenging
task of human action recognition. To discover the manifold structure of
midlevel features, we develop a spectral embedding approach to latent semantic
learning based on L1-graph, without the need to tune any parameter for graph
construction as a key step of manifold learning. More importantly, we construct
the L1-graph with structured sparse representation, which can be obtained by
structured sparse coding with its structured sparsity ensured by novel L1-norm
hypergraph regularization over mid-level features. In the new embedding space,
we learn latent semantics automatically from abundant mid-level features
through spectral clustering. The learnt latent semantics can be readily used
for human action recognition with SVM by defining a histogram intersection
kernel. Different from the traditional latent semantic analysis based on topic
models, our latent semantic learning method can explore the manifold structure
of mid-level features in both L1-graph construction and spectral embedding,
which results in compact but discriminative high-level features. The
experimental results on the commonly used KTH action dataset and unconstrained
YouTube action dataset show the superior performance of our method.Comment: The short version of this paper appears in ICCV 201
Multimodal Hierarchical Dirichlet Process-based Active Perception
In this paper, we propose an active perception method for recognizing object
categories based on the multimodal hierarchical Dirichlet process (MHDP). The
MHDP enables a robot to form object categories using multimodal information,
e.g., visual, auditory, and haptic information, which can be observed by
performing actions on an object. However, performing many actions on a target
object requires a long time. In a real-time scenario, i.e., when the time is
limited, the robot has to determine the set of actions that is most effective
for recognizing a target object. We propose an MHDP-based active perception
method that uses the information gain (IG) maximization criterion and lazy
greedy algorithm. We show that the IG maximization criterion is optimal in the
sense that the criterion is equivalent to a minimization of the expected
Kullback--Leibler divergence between a final recognition state and the
recognition state after the next set of actions. However, a straightforward
calculation of IG is practically impossible. Therefore, we derive an efficient
Monte Carlo approximation method for IG by making use of a property of the
MHDP. We also show that the IG has submodular and non-decreasing properties as
a set function because of the structure of the graphical model of the MHDP.
Therefore, the IG maximization problem is reduced to a submodular maximization
problem. This means that greedy and lazy greedy algorithms are effective and
have a theoretical justification for their performance. We conducted an
experiment using an upper-torso humanoid robot and a second one using synthetic
data. The experimental results show that the method enables the robot to select
a set of actions that allow it to recognize target objects quickly and
accurately. The results support our theoretical outcomes.Comment: submitte
- …