3,040 research outputs found
Deep Learning for Action and Gesture Recognition in Image Sequences: A Survey
Interest in automatic action and gesture recognition has grown considerably in the last few years. This is due in part to the large number of application domains for this type of technology. As in many other computer vision areas, deep learning based methods have quickly become a reference methodology for obtaining state-of-the-art performance in both tasks. This chapter is a survey of current deep learning based methodologies for action and gesture recognition in sequences of images. The survey reviews both fundamental and cutting edge methodologies reported in the last few years. We introduce a taxonomy that summarizes important aspects of deep learning for approaching both tasks. Details of the proposed architectures, fusion strategies, main datasets, and competitions are reviewed. Also, we summarize and discuss the main works proposed so far with particular interest on how they treat the temporal dimension of data, their highlighting features, and opportunities and challenges for future research. To the best of our knowledge this is the first survey in the topic. We foresee this survey will become a reference in this ever dynamic field of research
A Multi-class Classification Strategy for Fisher Scores: Application to Signer Independent Sign Language Recognition
Fisher kernels combine the powers of discriminative and generative classifiers by mapping the variable-length sequences to a new fixed length feature space, called the Fisher score space. The mapping is based on a single generative model and the classifier is intrinsically binary. We propose a multi-class classification strategy that applies a multi-class classification on each Fisher score space and combines the decisions of multi-class classifiers. We experimentally show that the Fisher scores of one class provide discriminative information for the other classes as well. We compare several multi-class classification strategies for Fisher scores generated from the hidden Markov models of sign sequences. The proposed multi-class classification strategy increases the classification accuracy in comparison with the state of the art strategies based on combining binary classifiers. To reduce the computational complexity of the Fisher score extraction and the training phases, we also propose a score space selection method and show that, similar or even higher accuracies can be obtained by using only a subset of the score spaces. Based on the proposed score space selection method, a signer adaptation technique is also presented that does not require any re-training
Learning a Family of Detectors
Object detection and recognition are important problems in computer vision. The challenges of these problems come from the presence of noise, background clutter, large within class variations of the object class and limited training data. In addition, the computational complexity in the recognition process is also a concern in practice. In this thesis, we propose one approach to handle the problem of detecting an object class that exhibits large within-class variations, and a second approach to speed up the classification processes.
In the first approach, we show that foreground-background classification (detection) and within-class classification of the foreground class (pose estimation) can be jointly solved with using a multiplicative form of two kernel functions. One kernel measures similarity for foreground-background classification. The other kernel accounts for latent factors that control within-class variation and implicitly enables feature sharing among foreground training samples. For applications where explicit parameterization of the within-class states is unavailable, a nonparametric formulation of the kernel can be constructed with a proper foreground distance/similarity measure. Detector training is accomplished via standard Support Vector Machine learning. The resulting detectors are tuned to specific variations in the foreground class. They also serve to evaluate hypotheses of the foreground state. When the image masks for foreground objects are provided in training, the detectors can also produce object segmentation. Methods for generating a representative sample set of detectors are proposed that can enable efficient detection and tracking. In addition, because individual detectors verify hypotheses of foreground state, they can also be incorporated in a tracking-by-detection frame work to recover foreground state in image sequences. To run the detectors efficiently at the online stage, an input-sensitive speedup strategy is proposed to select the most relevant detectors quickly. The proposed approach is tested on data sets of human hands, vehicles and human faces. On all data sets, the proposed approach achieves improved detection accuracy over the best competing approaches.
In the second part of the thesis, we formulate a filter-and-refine scheme to speed up recognition processes. The binary outputs of the weak classifiers in a boosted detector are used to identify a small number of candidate foreground state hypotheses quickly via Hamming distance or weighted Hamming distance. The approach is evaluated in three applications: face recognition on the face recognition grand challenge version 2 data set, hand shape detection and parameter estimation on a hand data set, and vehicle detection and estimation of the view angle on a multi-pose vehicle data set. On all data sets, our approach is at least five times faster than simply evaluating all foreground state hypotheses with virtually no loss in classification accuracy
Empirical Validation of Objective Functions in Feature Selection Based on Acceleration Motion Segmentation Data
Recent change in evaluation criteria from accuracy alone to trade-off with time delay has inspired multivariate energy-based approaches in motion segmentation using acceleration. The essence of multivariate approaches lies in the construction of highly dimensional energy and requires feature subset selection in machine learning. Due to fast process, filter methods are preferred; however, their poorer estimate is of the main concerns. This paper aims at empirical validation of three objective functions for filter approaches, Fisher discriminant ratio, multiple correlation (MC), and mutual information (MI), through two subsequent experiments. With respect to 63 possible subsets out of 6 variables for acceleration motion segmentation, three functions in addition to a theoretical measure are compared with two wrappers, k-nearest neighbor and Bayes classifiers in general statistics and strongly relevant variable identification by social network analysis. Then four kinds of new proposed multivariate energy are compared with a conventional univariate approach in terms of accuracy and time delay. Finally it appears that MC and MI are acceptable enough to match the estimate of two wrappers, and multivariate approaches are justified with our analytic procedures
BSL-1K: Scaling up co-articulated sign language recognition using mouthing cues
Recent progress in fine-grained gesture and action classification, and
machine translation, point to the possibility of automated sign language
recognition becoming a reality. A key stumbling block in making progress
towards this goal is a lack of appropriate training data, stemming from the
high complexity of sign annotation and a limited supply of qualified
annotators. In this work, we introduce a new scalable approach to data
collection for sign recognition in continuous videos. We make use of
weakly-aligned subtitles for broadcast footage together with a keyword spotting
method to automatically localise sign-instances for a vocabulary of 1,000 signs
in 1,000 hours of video. We make the following contributions: (1) We show how
to use mouthing cues from signers to obtain high-quality annotations from video
data - the result is the BSL-1K dataset, a collection of British Sign Language
(BSL) signs of unprecedented scale; (2) We show that we can use BSL-1K to train
strong sign recognition models for co-articulated signs in BSL and that these
models additionally form excellent pretraining for other sign languages and
benchmarks - we exceed the state of the art on both the MSASL and WLASL
benchmarks. Finally, (3) we propose new large-scale evaluation sets for the
tasks of sign recognition and sign spotting and provide baselines which we hope
will serve to stimulate research in this area.Comment: Appears in: European Conference on Computer Vision 2020 (ECCV 2020).
28 page
A robust and efficient video representation for action recognition
This paper introduces a state-of-the-art video representation and applies it
to efficient action recognition and detection. We first propose to improve the
popular dense trajectory features by explicit camera motion estimation. More
specifically, we extract feature point matches between frames using SURF
descriptors and dense optical flow. The matches are used to estimate a
homography with RANSAC. To improve the robustness of homography estimation, a
human detector is employed to remove outlier matches from the human body as
human motion is not constrained by the camera. Trajectories consistent with the
homography are considered as due to camera motion, and thus removed. We also
use the homography to cancel out camera motion from the optical flow. This
results in significant improvement on motion-based HOF and MBH descriptors. We
further explore the recent Fisher vector as an alternative feature encoding
approach to the standard bag-of-words histogram, and consider different ways to
include spatial layout information in these encodings. We present a large and
varied set of evaluations, considering (i) classification of short basic
actions on six datasets, (ii) localization of such actions in feature-length
movies, and (iii) large-scale recognition of complex events. We find that our
improved trajectory features significantly outperform previous dense
trajectories, and that Fisher vectors are superior to bag-of-words encodings
for video recognition tasks. In all three tasks, we show substantial
improvements over the state-of-the-art results
- …