Describing Videos by Exploiting Temporal Structure
Recent progress in using recurrent neural networks (RNNs) for image
description has motivated the exploration of their application for video
description. However, while images are static, working with videos requires
modeling their dynamic temporal structure and then properly integrating that
information into a natural language description. In this context, we propose an
approach that successfully takes into account both the local and global
temporal structure of videos to produce descriptions. First, our approach
incorporates a spatio-temporal 3-D convolutional neural network (3-D CNN)
representation of the short temporal dynamics. The 3-D CNN representation is
trained on video action recognition tasks, so as to produce a representation
that is tuned to human motion and behavior. Second, we propose a temporal
attention mechanism that goes beyond local temporal modeling and learns
to automatically select the most relevant temporal segments given the
text-generating RNN. Our approach exceeds the current state of the art on both
BLEU and METEOR metrics on the Youtube2Text dataset. We also present results on
a new, larger and more challenging dataset of paired video and natural language
descriptions.
Comment: Accepted to ICCV15. This version comes with a code release and
supplementary material.
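The temporal attention mechanism described above can be sketched roughly as follows: score each temporal segment feature against the current state of the text-generating RNN, normalize, and form a weighted context vector. All weight matrices, names, and sizes here are illustrative, not the paper's.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def temporal_attention(features, state, W_f, W_h, v):
    # Score every temporal segment against the current decoder state,
    # normalize the scores, and take the weighted sum as the context.
    scores = np.tanh(features @ W_f + state @ W_h) @ v   # (T,)
    weights = softmax(scores)                            # sums to 1
    context = weights @ features                         # (d_f,)
    return context, weights

rng = np.random.default_rng(0)
T, d_f, d_h, d_a = 8, 16, 32, 24                 # toy sizes
features = rng.normal(size=(T, d_f))             # one vector per segment
state = rng.normal(size=d_h)                     # decoder hidden state
context, weights = temporal_attention(
    features, state,
    rng.normal(size=(d_f, d_a)),
    rng.normal(size=(d_h, d_a)),
    rng.normal(size=d_a))
```

The attention weights can be inspected to see which segments the caption generator attends to at each word.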
Representation and recognition of human actions in video
PhD thesis.
Automated human action recognition plays a critical role in the development of human-machine
communication, aiming for a more natural interaction between artificial intelligence and
human society. Recent developments in technology have permitted a shift from a traditional
human action recognition performed in a well-constrained laboratory environment to realistic
unconstrained scenarios. This advancement has given rise to new problems and challenges still
not addressed by the available methods. Thus, the aim of this thesis is to study innovative approaches
that address the challenging problems of human action recognition from video captured
in unconstrained scenarios. To this end, novel action representations, feature selection methods,
fusion strategies and classification approaches are formulated.
More specifically, a novel interest-point-based action representation is first introduced; this
representation seeks to describe actions as clouds of interest points accumulated at different temporal
scales. The idea behind this method is to extract holistic features from the point
clouds and to describe the spatial and temporal action dynamics explicitly and globally. Since
the proposed clouds of points representation exploits alternative and complementary information
compared to the conventional interest points-based methods, a more solid representation is then
obtained by fusing the two representations, adopting a Multiple Kernel Learning strategy. The
validity of the proposed approach in recognising actions from a well-known benchmark dataset is
demonstrated as well as the superior performance achieved by fusing representations.
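The fusion by Multiple Kernel Learning can be illustrated with a minimal sketch: build one kernel per representation and combine them as a convex combination. In full MKL the combination weights are learned jointly with the classifier; here they are fixed, and all feature matrices and weights are made up for illustration.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=0.5):
    # Gram matrix of a Gaussian RBF kernel.
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def fuse_kernels(kernels, betas):
    # Convex combination of base kernels, as in simple MKL variants;
    # in full MKL the betas are optimised jointly with the SVM.
    betas = np.asarray(betas, float)
    betas = betas / betas.sum()
    return sum(b * K for b, K in zip(betas, kernels))

rng = np.random.default_rng(0)
X_cloud = rng.normal(size=(6, 10))   # cloud-of-points features (toy)
X_ip = rng.normal(size=(6, 20))      # interest-point features (toy)
K = fuse_kernels([rbf_kernel(X_cloud, X_cloud), rbf_kernel(X_ip, X_ip)],
                 betas=[0.6, 0.4])
```

The fused Gram matrix `K` can then be fed to any kernel classifier such as an SVM.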
Since the proposed method appears limited by the presence of a dynamic background and fast
camera movements, a novel trajectory-based representation is formulated. Different from interest
points, trajectories can simultaneously retain motion and appearance information even in noisy
and crowded scenarios. Additionally, they can handle drastic camera movements and support
robust region-of-interest estimation. An equally important contribution is the proposed collaborative
feature selection performed to remove redundant and noisy components. In particular, a novel
feature selection method based on Multi-Class Delta Latent Dirichlet Allocation (MC-DLDA)
is introduced. Crucially, to enrich the final action representation, the trajectory representation is
adaptively fused with a conventional interest point representation. The proposed approach is
extensively validated on different datasets, and the reported performance is comparable with
the state of the art. The obtained results also confirm the fundamental contribution of both
collaborative feature selection and adaptive fusion.
Finally, the problem of realistic human action classification in very ambiguous scenarios is
taken into account. In these circumstances, standard feature selection methods and multi-class
classifiers appear inadequate due to: sparse training set, high intra-class variation and inter-class
similarity. Thus, both the feature selection and classification problems need to be redesigned.
The proposed idea is to iteratively decompose the classification task in subtasks and select the
optimal feature set and classifier in accordance with the subtask context. To this end, a cascaded
feature selection and action classification approach is introduced. The proposed cascade aims to
classify actions by exploiting as much information as possible, and at the same time trying to
simplify the multi-class classification in a cascade of binary separations. Specifically, instead of
separating multiple action classes simultaneously, the overall task is automatically divided into
easier binary sub-tasks. Experiments have been carried out using challenging public datasets;
the obtained results demonstrate that with identical action representation, the cascaded classifier
significantly outperforms standard multi-class classifiers.
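The idea of decomposing the multi-class task into a cascade of binary sub-tasks can be sketched as below. This toy uses a nearest-centroid test at every stage in place of the thesis's learned per-stage feature selection and classifiers, which are not specified here.

```python
import numpy as np

class CascadeClassifier:
    # Toy cascade: at each stage, a binary nearest-centroid test separates
    # one class from the remaining ones; if the test fires, the stage's
    # class is output, otherwise the next binary sub-task is tried.

    def fit(self, X, y):
        X, y = np.asarray(X, float), np.asarray(y)
        self.classes_ = sorted(set(y.tolist()))
        self.centroids_ = {c: X[y == c].mean(axis=0) for c in self.classes_}
        return self

    def predict_one(self, x):
        remaining = list(self.classes_)
        while len(remaining) > 1:
            c = remaining[0]
            rest = np.mean([self.centroids_[o] for o in remaining[1:]], axis=0)
            # binary sub-task: class c vs the rest of the remaining classes
            if np.linalg.norm(x - self.centroids_[c]) < np.linalg.norm(x - rest):
                return c
            remaining.pop(0)
        return remaining[0]

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc, 0.1, size=(10, 2))
               for loc in ([0, 0], [5, 5], [10, 10])])   # three toy classes
y = np.repeat([0, 1, 2], 10)
clf = CascadeClassifier().fit(X, y)
```

Each binary split is easier than the full multi-class separation, which is the motivation given in the abstract.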
Spatio-temporal human action detection and instance segmentation in videos
With an exponential growth in the number of video capturing devices and digital video content, automatic video understanding is now at the forefront of computer vision research. This thesis presents a series of models for automatic human action detection in videos and also addresses the space-time action instance segmentation problem. Both action detection and instance segmentation play vital roles in video understanding.
Firstly, we propose a novel human action detection approach based on a frame-level deep feature representation combined with a two-pass dynamic programming approach. The method obtains a frame-level action representation by leveraging recent advances in deep learning based action recognition and object detection methods. To combine the complementary appearance and motion cues, we introduce a new fusion technique which significantly improves the detection performance. Further, we cast temporal action detection as two energy optimisation problems which are solved using the Viterbi algorithm.
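The Viterbi-style temporal labelling can be illustrated with a two-state sketch: each frame is labelled action or background so as to maximise the per-frame scores minus a penalty for label switches. The scores and switch cost below are invented; the thesis's actual energy terms differ.

```python
import numpy as np

def viterbi_segment(frame_scores, switch_cost=2.0):
    # Label each frame as action (1) or background (0) by maximising the sum
    # of per-frame scores minus a penalty for each label switch, via Viterbi.
    T = len(frame_scores)
    dp = np.full((T, 2), -np.inf)      # dp[t, s]: best score ending in state s
    back = np.zeros((T, 2), dtype=int)
    dp[0, 0] = -frame_scores[0]        # background is rewarded for low scores
    dp[0, 1] = frame_scores[0]
    for t in range(1, T):
        for s in (0, 1):
            emit = frame_scores[t] if s == 1 else -frame_scores[t]
            stay = dp[t - 1, s]
            switch = dp[t - 1, 1 - s] - switch_cost
            if stay >= switch:
                dp[t, s], back[t, s] = stay + emit, s
            else:
                dp[t, s], back[t, s] = switch + emit, 1 - s
    path = [int(dp[-1].argmax())]
    for t in range(T - 1, 0, -1):      # backtrack the best label path
        path.append(int(back[t, path[-1]]))
    return path[::-1]

labels = viterbi_segment([-1, -1, 3, 3, 3, -1, -1])
```

The switch cost discourages fragmented detections, yielding contiguous action segments.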
Exploiting a video-level representation further allows the network to learn the inter-frame temporal correspondence between action regions and it is bound to be a more optimal solution to the action detection problem than a frame-level representation. Secondly, we propose a novel deep network architecture which learns a video-level action representation by classifying and regressing 3D region proposals spanning two successive video frames. The proposed model is end-to-end trainable and can be jointly optimised for both proposal generation and action detection objectives in a single training step. We name our new network "AMTnet" (Action Micro-Tube regression Network). We further extend the AMTnet model by incorporating optical flow features to encode motion patterns of actions.
Finally, we address the problem of action instance segmentation in which multiple concurrent actions of the same class may be segmented out of an image sequence. By taking advantage of recent work on action foreground-background segmentation, we are able to associate each action tube with class-specific segmentations.
We demonstrate the performance of our proposed models on challenging action detection benchmarks, achieving new state-of-the-art results across the board and significantly increasing detection speed at test time.
Histogram of oriented rectangles: A new pose descriptor for human action recognition
Cataloged from PDF version of article.
Most of the approaches to human action recognition tend to form complex models which require lots of parameter estimation and computation time. In this study, we show that human actions can be simply represented by pose without dealing with the complex representation of dynamics. Based on this idea, we propose a novel pose descriptor, which we name Histogram-of-Oriented-Rectangles (HOR), for representing and recognizing human actions in videos. We represent each human pose in an action sequence by oriented rectangular patches extracted over the human silhouette. We then form spatial oriented histograms to represent the distribution of these rectangular patches. We make use of several matching strategies to carry the information from the spatial domain described by the HOR descriptor to the temporal domain. These are (i) nearest neighbor classification, which recognizes the actions by matching the descriptors of each frame, (ii) global histogramming, which extends the idea of Motion Energy Image proposed by Bobick and Davis to rectangular patches, (iii) a classifier-based approach using Support Vector Machines, and (iv) adaptation of Dynamic Time Warping to the temporal representation of the HOR descriptor. For cases when the pose descriptor alone is not sufficiently strong, such as differentiating the actions "jogging" and "running", we also incorporate a simple velocity descriptor as a prior to the pose-based classification step. We test our system with different configurations and experiment on two commonly used action datasets: the Weizmann dataset and the KTH dataset. Results show that our method is superior to other methods on the Weizmann dataset with a perfect accuracy rate of 100%, and is comparable to the other methods on the KTH dataset with a very high success rate close to 90%.
These results prove that with a simple and compact representation, we can achieve robust recognition of human actions, compared to complex representations. (C) 2009 Elsevier B.V. All rights reserved.
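The core of the HOR descriptor, a per-grid-cell histogram of rectangle orientations, can be sketched as follows. The grid size, bin count, image size, and the example rectangles are all illustrative choices, not the paper's settings.

```python
import numpy as np

def hor_descriptor(rects, grid=(3, 3), n_bins=4, img_size=(120, 90)):
    # rects: (center_x, center_y, angle_deg) per rectangular patch.
    # Builds an orientation histogram per spatial grid cell, then flattens.
    H = np.zeros(grid + (n_bins,))
    cell_h = img_size[1] / grid[0]           # img_size = (width, height)
    cell_w = img_size[0] / grid[1]
    for cx, cy, angle in rects:
        row = min(int(cy // cell_h), grid[0] - 1)
        col = min(int(cx // cell_w), grid[1] - 1)
        b = int((angle % 180) // (180 / n_bins))   # bins over [0, 180)
        H[row, col, b] += 1
    return H.ravel()

# three toy rectangles extracted over a silhouette
desc = hor_descriptor([(10, 10, 40), (10, 12, 50), (80, 80, 130)])
```

Per-frame descriptors like `desc` are what the four matching strategies (nearest neighbor, global histogramming, SVM, DTW) then compare over time.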
Human Motion Anticipation and Recognition from RGB-D
Predicting and understanding the dynamics of human motion has many applications such as motion synthesis, augmented reality, security, education, reinforcement learning, autonomous vehicles, and many others. In this thesis, we create a novel end-to-end pipeline that can predict multiple future poses from the same input, and, in addition, can classify the entire sequence. Our focus is on the following two aspects of human motion understanding:
Probabilistic human action prediction: Given a sequence of human poses as input, we sample multiple possible future poses from the same input sequence using a new GAN-based network.
Human motion understanding: Given a sequence of human poses as input, we classify the actual action performed in the sequence and improve the classification performance using the representation learned from the prediction network.
We also demonstrate how to improve model training from noisy labels, using facial expression recognition as an example. More specifically, we have 10 taggers to label each input image, and compare four different approaches: majority voting, multi-label learning, probabilistic label drawing, and cross-entropy loss. We show that the traditional majority voting scheme does not perform as well as the last two approaches that fully leverage the label distribution. We shared the enhanced FER+ data set with multiple labels for each face image with the research community (https://github.com/Microsoft/FERPlus).
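The contrast between majority voting and a label-distribution target can be made concrete with a small sketch. The vote counts below are hypothetical; the FER+ label set has 8 emotion classes, which is the only assumption taken from the text.

```python
import numpy as np

def majority_vote_target(tags, n_classes):
    # Collapse the taggers' votes into a single one-hot label.
    counts = np.bincount(tags, minlength=n_classes)
    target = np.zeros(n_classes)
    target[counts.argmax()] = 1.0
    return target

def distribution_target(tags, n_classes):
    # Keep the full label distribution as the cross-entropy target.
    counts = np.bincount(tags, minlength=n_classes)
    return counts / counts.sum()

tags = np.array([2, 2, 2, 2, 5, 5, 5, 0, 0, 3])   # hypothetical votes of 10 taggers
hard = majority_vote_target(tags, n_classes=8)
soft = distribution_target(tags, n_classes=8)
```

The hard target discards the 30% of taggers who saw class 5; training against `soft` keeps that ambiguity, which is what the last two approaches exploit.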
For predicting and understanding of human motion, we propose a novel sequence-to-sequence model trained with an improved version of generative adversarial networks (GAN). Our model, which we call HP-GAN2, learns a probability density function of future human poses conditioned on previous poses. It predicts multiple sequences of possible future human poses, each from the same input sequence but seeded with a different vector z drawn from a random distribution. Moreover, to quantify the quality of the non-deterministic predictions, we simultaneously train a motion-quality-assessment model that learns the probability that a given skeleton pose sequence is a real or fake human motion.
In order to classify the action performed in a video clip, we took two approaches. In the first approach, we train on a sequence of skeleton poses from scratch using random parameter initialization with the same network architecture used in the discriminator of the HP-GAN2 model. For the second approach, we use the discriminator of the HP-GAN2 network, extend it with an action classification branch, and fine-tune the end-to-end model on the classification tasks, since the discriminator in HP-GAN2 learned to differentiate between fake and real human motion. So, our hypothesis is that if the discriminator network can differentiate between synthetic and real skeleton poses, then it has also learned some of the dynamics of real human motion, and that those dynamics are useful in classification as well. We will show through multiple experiments that this is indeed the case.
Therefore, our model learns to predict multiple future sequences of human poses from the same input sequence. We also show that the discriminator learns a general representation of human motion by using the learned features in an action recognition task. Finally, we train a motion-quality-assessment network that measures the probability that a given sequence of poses constitutes valid human motion.
We test our model on two of the largest human pose datasets: NTU RGB+D and Human3.6M. We train on both single and multiple action types. The predictive power of our model for motion estimation is demonstrated by generating multiple plausible futures from the same input and by showing the effect of each of the several loss functions in an ablation study. We also show the advantage of switching from WGAN-GP, which we used in our previous work, to GAN. Furthermore, we show that it takes less than half the number of epochs to train an activity recognition network by using the features learned from the discriminator.
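The sampling scheme, the same observed poses seeded with different random vectors z to produce different futures, can be sketched with a stand-in generator. The network below is a random toy mapping, not the learned HP-GAN2 generator; all names and sizes are illustrative.

```python
import numpy as np

def toy_generator(past_code, z, W_p, W_z):
    # Stand-in for a learned conditional generator: the same past encoding
    # with a different seed z yields a different predicted future.
    return np.tanh(past_code @ W_p + z @ W_z)

rng = np.random.default_rng(1)
d_past, d_z, d_pose = 12, 4, 12
W_p = rng.normal(size=(d_past, d_pose))      # toy weights, not trained
W_z = rng.normal(size=(d_z, d_pose))
past_code = rng.normal(size=d_past)          # encoding of the observed poses
futures = [toy_generator(past_code, rng.normal(size=d_z), W_p, W_z)
           for _ in range(5)]                # 5 plausible futures, one per z
```

In the thesis, a motion-quality-assessment model then scores each sampled future as real or fake human motion.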
Human action recognition based on key postures
University of Technology, Sydney. Faculty of Engineering and Information Technology.
Human motion analysis has gained considerable interest in the computer vision area
due to the large number of potential applications and its inherent complexity. Currently,
human motion analysis is at an early stage. Its final aim is to generate an
easily understood, high-level semantic description of a given scene. Human action
recognition is an important step towards this final aim of human motion analysis.
Human Detection
Human detection is part of the field of human motion analysis. The thesis looks
at human detection. The thesis proposes a method using histogram of angles to
discriminate pedestrians from vehicles. This method is motivated by the
observation that humans are non-rigid objects: an angle formed by the centroid point and
the two bottom points of a human changes periodically, while the angle for a vehicle is
relatively static. In this part, this thesis also presents an approach to detect humans in
static images. The thesis proposes an approach which uses human geometric features
to fulfill the task.
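The angle feature used to separate pedestrians from vehicles can be sketched directly: take the silhouette centroid and the two lowest silhouette points, and measure the angle between them at the centroid. The stick-figure point sets below are invented for illustration.

```python
import numpy as np

def centroid_bottom_angle(points):
    # Angle at the silhouette centroid subtended by the two lowest points
    # (image coordinates: larger y is lower). For a walking person this
    # angle oscillates with the gait; for a vehicle it stays nearly constant.
    pts = np.asarray(points, float)
    centroid = pts.mean(axis=0)
    bottom = pts[np.argsort(pts[:, 1])[-2:]]          # two lowest points
    v1, v2 = bottom[0] - centroid, bottom[1] - centroid
    cos_a = v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return np.degrees(np.arccos(np.clip(cos_a, -1.0, 1.0)))

# toy silhouettes: legs apart vs legs together
apart = [(0.0, 0.0), (-1.0, 2.0), (1.0, 2.0)]
together = [(0.0, 0.0), (-0.2, 2.0), (0.2, 2.0)]
```

A histogram of this angle over time is what the thesis uses to tell the periodic human signal from the flat vehicle signal.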
Human Action Recognition
The thesis focuses on human action recognition. The thesis proposes what will be
called a key-posture-based human action recognition approach. As is well known,
human actions can be well described by a few important postures (called key postures)
which are significantly different from each other and all other postures can be
clustered to these key postures. Therefore, these key postures can be used to represent
and to infer the corresponding human action. The benefit of using key postures to
represent human action is to reduce computational complexity. The thesis proposes
two methods for human action recognition based on key postures: one based on shape
features and the other based on Radon transforms. Both methods follow three steps
to achieve action recognition.
These steps are video processing, key posture extraction and action recognition.
A two-step approach is proposed to extract key postures from preprocessed action
video. These two steps are coarse selection and fine selection. Feature extraction and
representation are discussed in both steps. After key postures are extracted from a
video, key posture sequences are used to represent human actions. Each key posture
sequence is regarded as an action template. In order to compare two action sequences,
Dynamic Time Warping (DTW) is applied to determine the distance between the two
action sequences.
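The DTW comparison of two key-posture sequences can be sketched with the classic dynamic program: `D[i][j]` is the minimal accumulated cost of aligning the first `i` postures of one sequence with the first `j` of the other. The feature vectors here are toy scalars, not actual posture descriptors.

```python
import numpy as np

def dtw_distance(seq_a, seq_b):
    # Classic dynamic time warping between two sequences of feature vectors.
    a = np.asarray(seq_a, float)
    b = np.asarray(seq_b, float)
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])   # local match cost
            D[i, j] = cost + min(D[i - 1, j],            # insertion
                                 D[i, j - 1],            # deletion
                                 D[i - 1, j - 1])        # match
    return D[n, m]
```

Because DTW warps the time axis, two executions of the same action at different speeds still yield a small distance.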
In the second method, in order to obtain key postures, the action sequences are
extracted from the preprocessed silhouettes using Radon transforms. Then, an unsupervised
cluster analysis is applied to Radon transforms to identify the key postures
for each sequence. Such key postures are used in the subsequent training and testing
procedure. Several benchmark classifiers are used in this work for action learning and
classification.
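The unsupervised clustering of per-frame features into key postures can be sketched with plain k-means; the thesis does not specify its clustering algorithm, so this is one reasonable stand-in, with toy features in place of Radon transforms.

```python
import numpy as np

def key_postures(frame_features, k, iters=20):
    # Plain k-means over per-frame features (Radon transforms in the thesis);
    # the k centroids play the role of key postures. Evenly spaced frames
    # seed the clusters, a reasonable init for a temporally ordered sequence.
    X = np.asarray(frame_features, float)
    centers = X[np.linspace(0, len(X) - 1, k).astype(int)].copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(axis=1)             # assign frames to clusters
        for c in range(k):
            if (labels == c).any():
                centers[c] = X[labels == c].mean(axis=0)
    return centers, labels

rng = np.random.default_rng(0)
frames = np.vstack([rng.normal(0.0, 0.1, size=(20, 5)),   # posture A frames
                    rng.normal(4.0, 0.1, size=(20, 5))])  # posture B frames
centers, labels = key_postures(frames, k=2)
```

The resulting centroid sequence per video is what the benchmark classifiers are then trained on.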
Author's Publications
This thesis covers the research results obtained by the author while undertaking the
degree. Most of the results have been published in research papers in refereed
publications, which are listed in the Author's Publications for the degree of Doctor of Philosophy (PhD).
Human Action Recognition Using Deep Probabilistic Graphical Models
Building intelligent systems that are capable of representing or extracting high-level representations from high-dimensional sensory data lies at the core of solving many A.I. related tasks. Human action recognition is an important topic in computer vision that lies in high-dimensional space. Its applications include robotics, video surveillance, human-computer interaction, user interface design, and multi-media video retrieval amongst others.
A number of approaches have been proposed to extract representative features from high-dimensional temporal data, most commonly hard wired geometric or bio-inspired shape context features.
This thesis first demonstrates some ad-hoc hand-crafted rules for effectively encoding motion features, and later elicits a more generic approach incorporating structured feature learning and reasoning, i.e., deep probabilistic graphical models.
The hierarchical dynamic framework first extracts high-level features and then uses the learned representation for estimating emission probabilities to infer action sequences.
We show that better action recognition can be achieved by replacing Gaussian mixture models with Deep Neural Networks that contain many layers of features to predict probability distributions over the states of Markov Models. The framework can be easily extended to include an ergodic state to segment and recognise actions simultaneously.
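Replacing GMMs with a network in this hybrid setup hinges on one conversion: the network outputs state posteriors p(state | frame), while the Markov model's decoder needs emission likelihoods p(frame | state). Dividing the posterior by the state prior gives the likelihood up to a constant, which is all the decoder needs. The logits and priors below are invented for illustration.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def scaled_likelihoods(state_posteriors, state_priors):
    # Hybrid NN/HMM conversion: p(frame | state) is proportional to
    # p(state | frame) / p(state), which suffices for Viterbi decoding.
    return state_posteriors / state_priors

rng = np.random.default_rng(0)
T, n_states = 6, 3
logits = rng.normal(size=(T, n_states))    # stand-in for network outputs
posteriors = softmax(logits)               # p(state | frame), rows sum to 1
priors = np.array([0.5, 0.3, 0.2])         # state frequencies from training data
emissions = scaled_likelihoods(posteriors, priors)
```

Dividing by the prior prevents frequent states from dominating the decoded sequence simply because the network sees them often.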
The first part of the thesis focuses on analysis and applications of hand-crafted features for human action representation and classification. We show that the "hard-coded" concept of the correlogram can incorporate correlations between time-domain sequences, and we further investigate multi-modal inputs, e.g., depth sensor input and its unique traits for action recognition.
The second part of this thesis focuses on marrying probabilistic graphical models with Deep Neural Networks (both Deep Belief Networks and Deep 3D Convolutional Neural Networks) for structured sequence prediction. The proposed Deep Dynamic Neural Network exhibits a general framework for structured 2D data representation and classification. This inspires us to further investigate applying various graphical models to time-variant video sequences.
Human action recognition using distribution of oriented rectangular patches
We describe a "bag-of-rectangles" method for representing and recognizing human actions in videos. In this method, each human pose in an action sequence is represented by oriented rectangular patches extracted over the whole body. Then, spatial oriented histograms are formed to represent the distribution of these rectangular patches. In order to carry the information from the spatial domain described by the bag-of-rectangles descriptor to the temporal domain for recognition of the actions, four different methods are proposed. These are, namely, (i) frame-by-frame voting, which recognizes the actions by matching the descriptors of each frame, (ii) global histogramming, which extends the idea of Motion Energy Image proposed by Bobick and Davis to rectangular patches, (iii) a classifier-based approach using SVMs, and (iv) adaptation of Dynamic Time Warping to the temporal representation of the descriptor. Detailed experiments are carried out on the action dataset of Blank et al. High success rates (100%) prove that with a very simple and compact representation, we can achieve robust recognition of human actions, compared to complex representations. © Springer-Verlag Berlin Heidelberg 2007