Recognising and localising human actions
Human action recognition in challenging video data is becoming an increasingly important research area. Given the growing number of cameras and robots pointing their lenses at humans, the need for automatic recognition of human actions arises, promising Google-style video search and automatic video summarisation/description. Furthermore, for any autonomous robotic system to interact with humans, it must first be able to understand and quickly react to human actions.
Although the best action classification methods aggregate features from the entire video clip in which the action unfolds, this global representation may include irrelevant scene context and movements which are shared amongst multiple action classes. For example, a waving action may be performed whilst walking; however, if the walking movement appears in distinct action classes, then it should not be included in training a waving movement classifier. For this reason, we propose an action classification framework in which more discriminative action subvolumes are learned in a weakly supervised setting, owing to the difficulty of manually labelling massive video datasets. The learned models are used to simultaneously classify video clips and to localise actions to a given space-time subvolume. Each subvolume is cast as a bag-of-features (BoF) instance in a multiple-instance-learning framework, which in turn is used to learn its class membership. We demonstrate quantitatively that even with single fixed-sized subvolumes, the classification performance of our proposed algorithm is superior to our BoF baseline on the majority of performance measures, and shows promise for space-time action localisation on the most challenging video datasets.
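As a rough illustration of the multiple-instance view described above, the sketch below treats a video clip as a bag of subvolume BoF histograms, scores each instance with a linear classifier, and takes the best-scoring subvolume as both the clip-level decision and the localisation. The linear scorer, the max-over-instances rule, and all names (`instance_score`, `classify_and_localise`) are assumptions made for illustration; the abstract does not specify the model, and training of the weights is not shown.

```python
from typing import List, Sequence, Tuple


def instance_score(weights: Sequence[float], bias: float,
                   histogram: Sequence[float]) -> float:
    """Linear score of one subvolume's BoF histogram (assumed scorer)."""
    return sum(w * h for w, h in zip(weights, histogram)) + bias


def classify_and_localise(weights: Sequence[float], bias: float,
                          bag: List[Sequence[float]]) -> Tuple[int, int, float]:
    """Score every subvolume in the bag and keep the best one.

    Returns (clip label in {+1, -1}, index of best subvolume, its score).
    The clip is labelled positive if any subvolume scores above zero,
    which is the usual multiple-instance assumption.
    """
    scores = [instance_score(weights, bias, h) for h in bag]
    best_index = max(range(len(scores)), key=scores.__getitem__)
    best_score = scores[best_index]
    label = 1 if best_score > 0 else -1
    return label, best_index, best_score
```

The index returned for the best subvolume is what would be mapped back to a space-time location in the clip.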
Exploiting spatio-temporal structure in the video should also improve results, just as deformable part models have proven highly successful in object recognition. However, whereas objects have clear boundaries, which makes it easy to define a ground truth for initialisation, 3D space-time actions are inherently ambiguous and expensive to annotate in large datasets. Thus, it is desirable to adapt pictorial star models to action datasets without location annotation, and to features invariant to changes in pose such as bag-of-features and Fisher vectors, rather than low-level HoG. We therefore propose local deformable spatial bag-of-features (LDSBoF), in which local discriminative regions are split into a fixed grid of parts that are allowed to deform in both space and time at test-time. In our experimental evaluation we demonstrate that by using local, deformable space-time action parts, we are able to achieve very competitive classification performance, whilst being able to localise actions even in the most challenging video datasets.
A recent trend in action recognition is towards larger and more challenging datasets, an increasing number of action classes and larger visual vocabularies. For the global classification of human action video clips, the bag-of-visual-words pipeline is currently the best performing. However, the strategies chosen to sample features and construct a visual vocabulary are critical to performance, in fact often dominating performance. Thus, we provide a critical evaluation of various approaches to building a vocabulary and show that good practices do have a significant impact. By subsampling and partitioning features strategically, we are able to achieve state-of-the-art results on 5 major action recognition datasets using relatively small visual vocabularies.
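The two steps this paragraph stresses, sampling features and encoding a clip against a visual vocabulary, can be sketched as follows. This is a minimal sketch under stated assumptions: uniform random subsampling, hard assignment to the nearest visual word, and L1 normalisation are common choices but are not confirmed by the abstract, and the vocabulary is taken as given rather than learned. All function names are hypothetical.

```python
import random
from typing import List, Sequence


def subsample(descriptors: List[Sequence[float]], max_count: int,
              seed: int = 0) -> List[Sequence[float]]:
    """Keep at most max_count descriptors, chosen uniformly at random."""
    if len(descriptors) <= max_count:
        return list(descriptors)
    return random.Random(seed).sample(list(descriptors), max_count)


def nearest_word(descriptor: Sequence[float],
                 vocabulary: List[Sequence[float]]) -> int:
    """Index of the visual word closest to the descriptor (squared L2)."""
    def sqdist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(vocabulary)),
               key=lambda k: sqdist(descriptor, vocabulary[k]))


def bof_histogram(descriptors: List[Sequence[float]],
                  vocabulary: List[Sequence[float]]) -> List[float]:
    """L1-normalised histogram of visual-word assignments for one clip."""
    counts = [0.0] * len(vocabulary)
    for d in descriptors:
        counts[nearest_word(d, vocabulary)] += 1.0
    total = sum(counts)
    return [c / total for c in counts] if total else counts
```

In a full pipeline the vocabulary would typically be obtained by clustering the subsampled descriptors, which is where the sampling strategy affects the result.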
Another promising approach to recognising human actions first encodes the action sequence via a generative dynamical model. However, using classical distances for their classification does not necessarily deliver good results. Therefore we propose a general framework for learning distance functions between dynamical models, given a training set of labelled videos. The optimal distance function is selected among a family of `pullback' ones, induced by a parametrised mapping of the space of models. We focus here on hidden Markov models and their model space, and show how pullback distance learning greatly improves action recognition performance with respect to base distances.
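The idea of a pullback distance can be sketched in a few lines: a mapping is applied to both models, and the base distance is measured between their images. The toy below represents each model as a plain parameter vector, uses a hypothetical family of coordinate-scaling maps, and picks the member with the best leave-one-out nearest-neighbour accuracy on the labelled training set. The vector representation, the scaling family, and the selection criterion are all assumptions for illustration; the abstract does not state which mappings or objective are used for hidden Markov models.

```python
from typing import Callable, List, Sequence

Model = Sequence[float]
Distance = Callable[[Model, Model], float]


def euclidean(a: Model, b: Model) -> float:
    """Base distance between two parameter vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def pullback_distance(mapping: Callable[[Model], Model],
                      base: Distance = euclidean) -> Distance:
    """Distance induced by the mapping: d_F(a, b) = d(F(a), F(b))."""
    return lambda a, b: base(mapping(a), mapping(b))


def scaling_map(weights: Sequence[float]) -> Callable[[Model], Model]:
    """Hypothetical parametrised mapping: rescale each coordinate."""
    return lambda m: [w * x for w, x in zip(weights, m)]


def loo_nn_accuracy(distance: Distance, models: List[Model],
                    labels: List[int]) -> float:
    """Leave-one-out 1-nearest-neighbour accuracy on the training set."""
    correct = 0
    for i, m in enumerate(models):
        others = [j for j in range(len(models)) if j != i]
        nearest = min(others, key=lambda j: distance(m, models[j]))
        correct += int(labels[nearest] == labels[i])
    return correct / len(models)


def select_mapping(candidates: List[Sequence[float]], models: List[Model],
                   labels: List[int]) -> Sequence[float]:
    """Pick the candidate parameters whose pullback distance scores best."""
    return max(candidates,
               key=lambda w: loo_nn_accuracy(
                   pullback_distance(scaling_map(w)), models, labels))
```

With a discrete candidate list this is a simple search; a continuous parametrisation would call for an optimiser instead.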
Finally, the action classification systems that use a single global representation for each video clip are tailored for offline batch classification benchmarks. For human-robot interaction, however, current systems fall short, either because they can only detect one human action per video frame, or because they assume the video is available ahead of time. In this work we propose an online human action detection system that can incrementally detect multiple concurrent space-time actions. In this way, it becomes possible to learn new action classes on-the-fly, allowing multiple people to actively teach and interact with a robot.
Recommended from our members
Semantic Concept Co-Occurrence Patterns for Image Annotation and Retrieval.
Describing visual image contents by semantic concepts is an effective and straightforward way to facilitate various high-level applications. Inferring semantic concepts from low-level pictorial feature analysis is challenging due to the semantic gap problem, while manually labeling concepts is impractical because of the large number of images in both online and offline collections. In this paper, we present a novel approach to automatically generate intermediate image descriptors by exploiting concept co-occurrence patterns in the pre-labeled training set, which makes it possible to depict complex scene images semantically. Our work is motivated by the fact that multiple concepts that frequently co-occur across images form patterns which could provide contextual cues for individual concept inference. We discover the co-occurrence patterns as hierarchical communities by graph modularity maximization in a network whose nodes and edges represent concepts and co-occurrence relationships, respectively. A random walk process working on the inferred concept probabilities with the discovered co-occurrence patterns is applied to acquire the refined concept signature representation. Through experiments in automatic image annotation and semantic image retrieval on several challenging datasets, we demonstrate the effectiveness of the proposed concept co-occurrence patterns as well as the concept signature representation in comparison with state-of-the-art approaches.
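One way to picture the random-walk refinement step is a random walk with restart over the concept co-occurrence graph, where the initial concept probabilities are repeatedly mixed with scores propagated from co-occurring concepts. The sketch below is an assumption-laden toy: the restart formulation, the `alpha` mixing weight, the self-loop for isolated concepts, and the function names are illustrative choices, not details given in the abstract.

```python
from typing import List


def row_normalise(matrix: List[List[float]]) -> List[List[float]]:
    """Turn co-occurrence counts into transition probabilities.

    A concept with no co-occurrences gets a self-loop so that its
    score is carried over unchanged (an illustrative choice).
    """
    result = []
    for i, row in enumerate(matrix):
        total = sum(row)
        if total > 0:
            result.append([v / total for v in row])
        else:
            result.append([1.0 if j == i else 0.0 for j in range(len(row))])
    return result


def refine(initial: List[float], cooccurrence: List[List[float]],
           alpha: float = 0.5, steps: int = 20) -> List[float]:
    """Random walk with restart over the co-occurrence graph.

    Each step propagates scores along co-occurrence edges and mixes the
    result with the initial probabilities using weight alpha.
    """
    transition = row_normalise(cooccurrence)
    n = len(initial)
    current = list(initial)
    for _ in range(steps):
        propagated = [sum(current[i] * transition[i][j] for i in range(n))
                      for j in range(n)]
        current = [alpha * p + (1 - alpha) * p0
                   for p, p0 in zip(propagated, initial)]
    return current
```

A weakly detected concept that co-occurs with a strongly detected one ends up with a higher refined score, which is the contextual cue the abstract describes.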
Beyond Gaussian Pyramid: Multi-skip Feature Stacking for Action Recognition
Most state-of-the-art action feature extractors involve differential operators, which act as highpass filters and tend to attenuate low-frequency action information. This attenuation introduces bias to the resulting features and generates ill-conditioned feature matrices. The Gaussian Pyramid has been used as a feature enhancing technique that encodes scale-invariant characteristics into the feature space in an attempt to deal with this attenuation. However, at the core of the Gaussian Pyramid is a convolutional smoothing operation, which makes it incapable of generating new features at coarse scales. In order to address this problem, we propose a novel feature enhancing technique called Multi-skIp Feature Stacking (MIFS), which stacks features extracted using a family of differential filters parameterized with multiple time skips and encodes shift-invariance into the frequency space. MIFS compensates for information lost from using differential operators by recapturing information at coarse scales. This recaptured information allows us to match actions at different speeds and ranges of motion. We prove that MIFS enhances the learnability of differential-based features exponentially. The resulting feature matrices from MIFS have much smaller condition numbers and variances than those from conventional methods. Experimental results show significantly improved performance on challenging action recognition and event detection tasks. Specifically, our method exceeds the state of the art on the Hollywood2, UCF101 and UCF50 datasets and is comparable to the state of the art on the HMDB51 and Olympics Sports datasets. MIFS can also be used as a speedup strategy for feature extraction with minimal or no accuracy cost.
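The stacking idea can be sketched with the simplest possible differential filter: a difference between frames that are `skip` steps apart, computed for several skips and concatenated into one feature set. This is only a toy under stated assumptions; plain frame differencing stands in for the real motion descriptors, and the function names and default skips are hypothetical.

```python
from typing import List, Sequence


def skip_differences(frames: List[Sequence[float]],
                     skip: int) -> List[List[float]]:
    """Temporal differences between frames that are `skip` steps apart."""
    return [[b - a for a, b in zip(frames[t], frames[t + skip])]
            for t in range(len(frames) - skip)]


def multi_skip_stack(frames: List[Sequence[float]],
                     skips: Sequence[int] = (1, 2, 3)) -> List[List[float]]:
    """Stack the difference features from every time skip into one set."""
    stacked: List[List[float]] = []
    for skip in skips:
        stacked.extend(skip_differences(frames, skip))
    return stacked
```

Larger skips respond to slower motion that a one-frame difference barely registers, which is the coarse-scale information the abstract says is recaptured.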
Scale coding a bag of words for real-time video-based action recognition.
Master's degree. University of KwaZulu-Natal, Durban. Abstract available in PDF.