Multi-Label Zero-Shot Human Action Recognition via Joint Latent Ranking Embedding
Human action recognition refers to automatically recognizing human actions from a
video clip. In reality, there often exist multiple human actions in a video
stream. Such a video stream is often weakly-annotated with a set of relevant
human action labels at a global level rather than assigning each label to a
specific video episode corresponding to a single action, which leads to a
multi-label learning problem. Furthermore, there are many meaningful human
actions in reality, but it would be extremely difficult to collect/annotate
video clips for all of the various human actions, which leads to a zero-shot
learning scenario. To the best of our knowledge, there is no work that has
addressed all the above issues together in human action recognition. In this
paper, we formulate a real-world human action recognition task as a multi-label
zero-shot learning problem and propose a framework to tackle this problem in a
holistic way. Our framework holistically tackles the issue of unknown temporal
boundaries between different actions for multi-label learning and exploits the
side information regarding the semantic relationship between different human
actions for knowledge transfer. Consequently, our framework leads to a joint
latent ranking embedding for multi-label zero-shot human action recognition. A
novel neural architecture of two component models and an alternate learning
algorithm are proposed to carry out the joint latent ranking embedding
learning. Thus, multi-label zero-shot recognition is done by measuring
relatedness scores of action labels to a test video clip in the joint latent
visual and semantic embedding spaces. We evaluate our framework with different
settings, including a novel data split scheme designed especially for
evaluating multi-label zero-shot learning, on two datasets: Breakfast and
Charades. The experimental results demonstrate the effectiveness of our
framework.
Comment: 27 pages, 10 figures and 7 tables. Technical report submitted to a journal. More experimental results/references were added and typos were corrected.
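To make the scoring step concrete, the sketch below illustrates how relatedness scores between a test clip and candidate action labels could be computed in a shared latent space; the projection matrices, dimensionalities and the use of cosine similarity are illustrative assumptions, not the paper's actual model.

```python
# Minimal sketch (not the authors' implementation): scoring action labels for a
# test clip in a joint latent embedding space. Shapes, projections and the use
# of cosine similarity are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical learned projections into a shared d-dimensional latent space.
d_visual, d_semantic, d_latent = 2048, 300, 128
W_visual = rng.standard_normal((d_latent, d_visual)) * 0.01      # visual branch
W_semantic = rng.standard_normal((d_latent, d_semantic)) * 0.01  # semantic branch

def relatedness_scores(clip_features, label_embeddings):
    """Rank action labels by similarity to a clip in the joint latent space."""
    z_clip = W_visual @ clip_features                      # (d_latent,)
    z_labels = label_embeddings @ W_semantic.T             # (n_labels, d_latent)
    z_clip = z_clip / np.linalg.norm(z_clip)
    z_labels = z_labels / np.linalg.norm(z_labels, axis=1, keepdims=True)
    return z_labels @ z_clip                               # cosine relatedness

# Toy example: one clip, five candidate labels (seen or unseen at training time).
clip = rng.standard_normal(d_visual)
labels = rng.standard_normal((5, d_semantic))              # e.g. label word vectors
scores = relatedness_scores(clip, labels)
print("ranked label indices:", np.argsort(scores)[::-1])
```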
Human Action Recognition in Videos Using Transfer Learning
A variety of systems focus on detecting the actions and activities performed by humans, such as video surveillance and health monitoring systems. However, published labelled human action datasets for training supervised machine learning models are limited in number and expensive to produce. Transfer learning can help to address this issue for action recognition by transferring or re-using the knowledge of existing trained models, in combination with minimal training data from the new target domain. Our focus in this paper is an investigation of video feature representations and machine learning algorithms for transfer learning applied to multi-class action recognition in videos. Using four labelled datasets from the human action domain, we apply two SVM-based transfer learning algorithms: the adaptive support vector machine (A-SVM) and the projective model transfer SVM (PMT-SVM). For feature representations, we compare the performance of two widely used video feature representations, space-time interest points (STIP) with Histograms of Oriented Gradients (HOG) and Histograms of Optical Flow (HOF), and improved dense trajectory (iDT), to explore which feature is more suitable for action recognition from videos using transfer learning. Our results show that A-SVM and PMT-SVM can help transfer action knowledge across multiple datasets with limited labelled training data; A-SVM outperforms PMT-SVM when the target dataset is derived from realistic non-lab environments; and iDT shows a greater ability to support transfer learning in action recognition.
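As a rough illustration of SVM-based transfer, the sketch below adapts a source-domain linear SVM to a small target set by feeding the source model's decision score to the target classifier; this is a simplified stand-in for the A-SVM formulation (which constrains a perturbation of the source decision function), and all data in the example is synthetic.

```python
# A simplified sketch of A-SVM-style adaptation with scikit-learn (illustrative
# only; the actual A-SVM objective adds a learned perturbation to the source
# decision function). All datasets here are synthetic stand-ins.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(1)

# Source domain: plenty of labelled examples (e.g. lab-recorded actions).
X_src = rng.standard_normal((500, 64))
y_src = (X_src[:, 0] + 0.1 * rng.standard_normal(500) > 0).astype(int)

# Target domain: only a handful of labelled examples (e.g. realistic videos).
X_tgt = rng.standard_normal((20, 64)) + 0.5
y_tgt = (X_tgt[:, 0] > 0.5).astype(int)

source_svm = LinearSVC().fit(X_src, y_src)

# Adaptation: train a target classifier on features augmented with the source
# model's decision score, so the target model can reuse (or correct) the
# source knowledge instead of starting from scratch.
def augment(X):
    return np.hstack([X, source_svm.decision_function(X)[:, None]])

adapted_svm = LinearSVC().fit(augment(X_tgt), y_tgt)
# Accuracy on the (tiny) target training set, just to show the pipeline runs.
print("target accuracy:", adapted_svm.score(augment(X_tgt), y_tgt))
```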
TransNet: A Transfer Learning-Based Network for Human Action Recognition
Human action recognition (HAR) is a high-level and significant research area
in computer vision due to its ubiquitous applications. The main limitations of
the current HAR models are their complex structures and lengthy training time.
In this paper, we propose a simple yet versatile and effective end-to-end deep
learning architecture, coined as TransNet, for HAR. TransNet decomposes the
complex 3D-CNNs into 2D- and 1D-CNNs, where the 2D- and 1D-CNN components
extract spatial features and temporal patterns in videos, respectively.
Benefiting from its concise architecture, TransNet is readily compatible with
any pretrained state-of-the-art 2D-CNN model from other fields, which can be
transferred to serve the HAR task. In other words, it naturally leverages the
power and success of transfer learning for HAR, bringing substantial advantages in
terms of efficiency and effectiveness. Extensive experimental results and the
comparison with the state-of-the-art models demonstrate the superior
performance of the proposed TransNet in HAR in terms of flexibility, model
complexity, training speed and classification accuracy.
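A minimal sketch of the 2D + 1D decomposition described above is given below; the ResNet-18 backbone, feature dimensionalities and class count are assumptions for illustration, and this is not the authors' TransNet code.

```python
# Illustrative PyTorch sketch of decomposing a 3D-CNN into a 2D-CNN (spatial)
# followed by a 1D-CNN (temporal). Backbone, feature size and number of
# classes are assumptions.
import torch
import torch.nn as nn
from torchvision import models

class TransNetSketch(nn.Module):
    def __init__(self, num_classes=10, feat_dim=512):
        super().__init__()
        # Pretrained 2D backbone, transferable from image classification.
        backbone = models.resnet18(weights="IMAGENET1K_V1")
        self.spatial = nn.Sequential(*list(backbone.children())[:-1])
        # 1D convolution over time to capture temporal patterns.
        self.temporal = nn.Sequential(
            nn.Conv1d(feat_dim, 256, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.classifier = nn.Linear(256, num_classes)

    def forward(self, clips):                        # clips: (B, T, 3, H, W)
        b, t = clips.shape[:2]
        frames = clips.flatten(0, 1)                 # (B*T, 3, H, W)
        feats = self.spatial(frames).flatten(1)      # (B*T, 512) spatial features
        feats = feats.view(b, t, -1).transpose(1, 2) # (B, 512, T) per-frame features
        return self.classifier(self.temporal(feats).squeeze(-1))

x = torch.randn(2, 8, 3, 112, 112)                   # 2 clips of 8 frames
print(TransNetSketch()(x).shape)                      # torch.Size([2, 10])
```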
Hierarchical transfer learning for online recognition of compound actions
Recognising human actions in real time can provide users with a natural user interface (NUI), enabling a range of innovative and immersive applications. A NUI application should not restrict users’ movements; it should allow users to transition between actions in quick succession, which we term compound actions. However, the majority of action recognition researchers have focused on individual actions, so their approaches are limited to recognising single actions or multiple actions that are temporally separated.
This paper proposes a novel online action recognition method for fast detection of compound actions. A key contribution is our hierarchical body model that can be automatically configured to detect actions based on the low-level body parts that are the most discriminative for a particular action. Another key contribution is a transfer learning strategy to allow the tasks of action segmentation and whole body modelling to be performed on a related but simpler dataset, combined with automatic hierarchical body model adaptation on a more complex target dataset.
Experimental results on a challenging and realistic dataset show an improvement in action recognition performance of 16% due to the introduction of our hierarchical transfer learning. The proposed algorithm is fast, with an average latency of just 2 frames (66 ms), and outperforms state-of-the-art action recognition algorithms that are capable of fast online action recognition.
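The idea of configuring the model around the most discriminative low-level body parts could be illustrated roughly as below, where parts are ranked by the cross-validated accuracy of a simple per-part classifier; the part names, synthetic features and selection criterion are assumptions, not the paper's actual procedure.

```python
# Hedged sketch: rank low-level body parts by how discriminative their motion
# features are for a given action, using per-part classifier accuracy as an
# assumed selection criterion. All data is synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
parts = ["left_arm", "right_arm", "left_leg", "right_leg", "torso"]

# Synthetic per-part motion features for 200 frames with binary action labels.
X_parts = {p: rng.standard_normal((200, 6)) for p in parts}
y = rng.integers(0, 2, 200)
X_parts["right_arm"][y == 1] += 1.0   # make one part genuinely discriminative

# Rank parts by cross-validated accuracy of a simple per-part classifier.
scores = {p: cross_val_score(LogisticRegression(max_iter=200), X, y, cv=3).mean()
          for p, X in X_parts.items()}
print(sorted(scores.items(), key=lambda kv: -kv[1]))
```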
SVMDnet: A Novel Framework for Elderly Activity Recognition based on Transfer Learning
Elderly Activity Recognition has become crucial nowadays because the majority of elderly people live alone and are vulnerable. Although several researchers employ machine learning (ML) and deep learning (DL) techniques to recognize elderly actions, relatively little research has focused specifically on transfer learning-based elderly activity recognition. Transfer learning alone is also not sufficient to handle the complexity of HAR-related problems because it is a more general approach. We propose a novel transfer learning-based framework, SVMDnet, in which a pre-trained deep neural network extracts essential action features and a Support Vector Machine (SVM) is used as the classifier. The proposed model is evaluated on the Stanford-40 dataset and a self-made dataset. For the main dataset, older volunteers over the age of 60 were recruited, and their responses were recorded in a uniform environment across 10 kinds of activities. Results from SVMDnet on the two datasets show that our model performs well on both human action recognition and human-object interactions.
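The "pre-trained deep network as feature extractor, SVM as classifier" pipeline can be sketched as follows; the ResNet-50 backbone, input size and toy data are assumptions, and this is not the authors' SVMDnet implementation.

```python
# Minimal sketch of a pretrained CNN feature extractor feeding an SVM
# classifier (illustrative stand-in; backbone and image size are assumptions).
import torch
import torch.nn as nn
from torchvision import models
from sklearn.svm import SVC

backbone = models.resnet50(weights="IMAGENET1K_V2")
feature_extractor = nn.Sequential(*list(backbone.children())[:-1]).eval()

def extract(images):                    # images: (N, 3, 224, 224) tensor
    with torch.no_grad():
        return feature_extractor(images).flatten(1).numpy()   # (N, 2048)

# Toy stand-in data: in practice these would be frames of elderly activities.
X_train = extract(torch.randn(16, 3, 224, 224))
y_train = [0, 1] * 8
clf = SVC(kernel="rbf").fit(X_train, y_train)
print(clf.predict(extract(torch.randn(2, 3, 224, 224))))
```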
Dual viewpoint passenger state classification using 3D CNNs
The rise of intelligent vehicle systems will lead to more human-machine interactions, so there is a need to create a bridge between the system and the actions and behaviours of the people inside the vehicle. In this paper, we propose a dual camera setup to monitor the actions and behaviour of vehicle passengers, and a deep learning architecture which can utilise video data to classify a range of actions. The method incorporates two different views as input to a 3D convolutional network and uses transfer learning from other action recognition data. The performance of this method is evaluated using an in-vehicle dataset, which contains video recordings of people performing a range of common in-vehicle actions. We show that the combination of transfer learning and dual viewpoints in a 3D action recognition network increases the classification accuracy of action classes with distinct poses, e.g. mobile phone use and sleeping, whilst it does not perform as well for actions with small movements, such as talking and eating.
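A simplified sketch of fusing two camera viewpoints in a 3D convolutional network is shown below; the tiny layer sizes, concatenation fusion and class count are illustrative assumptions rather than the paper's architecture.

```python
# Illustrative PyTorch sketch: two 3D-CNN streams, one per camera viewpoint,
# fused by concatenation before classification (sizes are assumptions).
import torch
import torch.nn as nn

class DualView3DCNN(nn.Module):
    def __init__(self, num_classes=8):
        super().__init__()
        def stream():
            return nn.Sequential(
                nn.Conv3d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool3d(1), nn.Flatten(),
            )
        self.view_a, self.view_b = stream(), stream()
        self.classifier = nn.Linear(32, num_classes)

    def forward(self, clip_a, clip_b):              # each: (B, 3, T, H, W)
        fused = torch.cat([self.view_a(clip_a), self.view_b(clip_b)], dim=1)
        return self.classifier(fused)

a = torch.randn(2, 3, 16, 112, 112)                 # e.g. front-facing camera
b = torch.randn(2, 3, 16, 112, 112)                 # e.g. side-facing camera
print(DualView3DCNN()(a, b).shape)                  # torch.Size([2, 8])
```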
Ensembles of Deep Neural Networks for Action Recognition in Still Images
Despite the fact that notable improvements have been made recently in the
field of feature extraction and classification, human action recognition is
still challenging, especially in images, in which, unlike videos, there is no
motion. Thus, the methods proposed for recognizing human actions in videos
cannot be applied to still images. A big challenge in action recognition in
still images is the lack of large enough datasets, which is problematic for
training deep Convolutional Neural Networks (CNNs) due to the overfitting
issue. In this paper, by taking advantage of pre-trained CNNs, we employ the
transfer learning technique to tackle the lack of massive labeled action
recognition datasets. Furthermore, since the last layer of the CNN has
class-specific information, we apply an attention mechanism on the output
feature maps of the CNN to extract more discriminative and powerful features
for classification of human actions. Moreover, we use eight different
pre-trained CNNs in our framework and investigate their performance on the Stanford
40 dataset. Finally, we propose using the Ensemble Learning technique to
enhance the overall accuracy of action classification by combining the
predictions of multiple models. The best setting of our method is able to
achieve 93.17% accuracy on the Stanford 40 dataset.
Comment: 5 pages, 2 figures, 3 tables, Accepted by ICCKE 201
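The two ingredients described above, spatial attention over the CNN's output feature maps and ensembling of multiple pre-trained backbones, could be sketched as below; only two of the eight backbones are shown, and the attention head and averaging scheme are assumptions, not the paper's exact models.

```python
# Hedged sketch: simple spatial attention over CNN feature maps plus averaging
# of softmax outputs from several pretrained backbones (illustration only).
import torch
import torch.nn as nn
from torchvision import models

class AttentionHead(nn.Module):
    """Weights each spatial location of the feature map before pooling."""
    def __init__(self, channels, num_classes=40):
        super().__init__()
        self.attn = nn.Conv2d(channels, 1, kernel_size=1)
        self.fc = nn.Linear(channels, num_classes)

    def forward(self, fmap):                                      # (B, C, H, W)
        weights = torch.softmax(self.attn(fmap).flatten(2), dim=-1)  # (B, 1, H*W)
        pooled = (fmap.flatten(2) * weights).sum(-1)                 # (B, C)
        return self.fc(pooled)

def backbone_features(name):
    m = getattr(models, name)(weights="DEFAULT")
    return nn.Sequential(*list(m.children())[:-2]).eval()  # keep conv feature maps

# Two of the (hypothetically) eight backbones; predictions are averaged.
nets = [(backbone_features("resnet18"), AttentionHead(512)),
        (backbone_features("resnet34"), AttentionHead(512))]

x = torch.randn(2, 3, 224, 224)
probs = torch.stack([torch.softmax(head(feat(x)), dim=1) for feat, head in nets])
print(probs.mean(0).shape)                          # torch.Size([2, 40]) ensemble output
```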
When Kernel Methods meet Feature Learning: Log-Covariance Network for Action Recognition from Skeletal Data
Human action recognition from skeletal data is a hot research topic and
important in many open domain applications of computer vision, thanks to
recently introduced 3D sensors. In the literature, naive methods simply
transfer off-the-shelf techniques from video to the skeletal representation.
However, the current state of the art is contested between two different
paradigms: kernel-based methods and feature learning with (recurrent) neural
networks. Both approaches show strong performance, yet they exhibit heavy, but
complementary, drawbacks. Motivated by this fact, our work aims at combining
the best of the two paradigms by proposing an approach in which a
shallow network is fed with a covariance representation. Our intuition is that,
as long as the dynamics are effectively modeled, there is no need for the
classification network to be deep or recurrent in order to score favorably. We
validate this hypothesis in a broad experimental analysis over 6 publicly
available datasets.
Comment: 2017 IEEE Computer Vision and Pattern Recognition (CVPR) Workshop
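The log-covariance representation fed to a shallow network could look roughly like the sketch below; the joint count, per-frame feature construction and the small MLP are assumptions rather than the paper's exact pipeline.

```python
# Minimal sketch: compute the matrix logarithm of the covariance of per-frame
# skeletal features and feed it to a shallow, non-recurrent classifier
# (illustrative; joint count and network sizes are assumptions).
import numpy as np
import torch
import torch.nn as nn
from scipy.linalg import logm

def log_covariance(skeleton_seq, eps=1e-3):
    """skeleton_seq: (T, D) array of per-frame joint coordinates."""
    cov = np.cov(skeleton_seq, rowvar=False) + eps * np.eye(skeleton_seq.shape[1])
    return logm(cov).real                 # matrix logarithm of the covariance

# Toy sequence: 100 frames of 25 joints in 3D, flattened to D = 75 per frame.
seq = np.random.randn(100, 75)
feat = torch.tensor(log_covariance(seq).flatten(), dtype=torch.float32)

# A shallow classifier on top of the flattened log-covariance features.
mlp = nn.Sequential(nn.Linear(75 * 75, 256), nn.ReLU(), nn.Linear(256, 60))
print(mlp(feat).shape)                    # torch.Size([60]) class scores
```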