366 research outputs found

    Grassmann Learning for Recognition and Classification

    Get PDF
    Computational performance associated with high-dimensional data is a common challenge for real-world classification and recognition systems. Subspace learning has received considerable attention as a means of finding an efficient low-dimensional representation that leads to better classification and efficient processing. A Grassmann manifold is a space that promotes smooth surfaces, where points represent subspaces and the relationship between points is defined by a mapping of an orthogonal matrix. Grassmann learning involves embedding high dimensional subspaces and kernelizing the embedding onto a projection space where distance computations can be effectively performed. In this dissertation, Grassmann learning and its benefits towards action classification and face recognition in terms of accuracy and performance are investigated and evaluated. Grassmannian Sparse Representation (GSR) and Grassmannian Spectral Regression (GRASP) are proposed as Grassmann inspired subspace learning algorithms. GSR is a novel subspace learning algorithm that combines the benefits of Grassmann manifolds with sparse representations using least squares loss §¤1-norm minimization for improved classification. GRASP is a novel subspace learning algorithm that leverages the benefits of Grassmann manifolds and Spectral Regression in a framework that supports high discrimination between classes and achieves computational benefits by using manifold modeling and avoiding eigen-decomposition. The effectiveness of GSR and GRASP is demonstrated for computationally intensive classification problems: (a) multi-view action classification using the IXMAS Multi-View dataset, the i3DPost Multi-View dataset, and the WVU Multi-View dataset, (b) 3D action classification using the MSRAction3D dataset and MSRGesture3D dataset, and (c) face recognition using the ATT Face Database, Labeled Faces in the Wild (LFW), and the Extended Yale Face Database B (YALE). Additional contributions include the definition of Motion History Surfaces (MHS) and Motion Depth Surfaces (MDS) as descriptors suitable for activity representations in video sequences and 3D depth sequences. An in-depth analysis of Grassmann metrics is applied on high dimensional data with different levels of noise and data distributions which reveals that standardized Grassmann kernels are favorable over geodesic metrics on a Grassmann manifold. Finally, an extensive performance analysis is made that supports Grassmann subspace learning as an effective approach for classification and recognition

    Unsupervised object candidate discovery for activity recognition

    Get PDF
    Die automatische Interpretation menschlicher Bewegungsabläufe auf Basis von Videos ist ein wichtiger Bestandteil vieler Anwendungen im Bereich des Maschinellen Sehens, wie zum Beispiel Mensch-Roboter Interaktion, Videoüberwachung, und inhaltsbasierte Analyse von Multimedia Daten. Anders als die meisten Ansätze auf diesem Gebiet, die hauptsächlich auf die Klassifikation von einfachen Aktionen, wie Aufstehen, oder Gehen ausgerichtet sind, liegt der Schwerpunkt dieser Arbeit auf der Erkennung menschlicher Aktivitäten, d.h. komplexer Aktionssequenzen, die meist Interaktionen des Menschen mit Objekten beinhalten. Gemäß der Aktionsidentifikationstheorie leiten menschliche Aktivitäten ihre Bedeutung nicht nur von den involvierten Bewegungsmustern ab, sondern vor allem vom generellen Kontext, in dem sie stattfinden. Zu diesen kontextuellen Informationen gehören unter anderem die Gesamtheit aller vorher furchgeführter Aktionen, der Ort an dem sich die aktive Person befindet, sowie die Menge der Objekte, die von ihr manipuliert werden. Es ist zum Beispiel nicht möglich auf alleiniger Basis von Bewegungsmustern und ohne jeglicher Miteinbeziehung von Objektwissen zu entschieden ob eine Person, die ihre Hand zum Mund führt gerade etwas isst oder trinkt, raucht, oder bloß die Lippen abwischt. Die meisten Arbeiten auf dem Gebiet der computergestützten Aktons- und Aktivitätserkennung ignorieren allerdings jegliche durch den Kontext bedingte Informationen und beschränken sich auf die Identifikation menschlicher Aktivitäten auf Basis der beobachteten Bewegung. Wird jedoch Objektwissen für die Klassifikation miteinbezogen, so geschieht dies meist unter Zuhilfenahme von überwachten Detektoren, für deren Einrichtung widerum eine erhebliche Menge an Trainingsdaten erforderlich ist. Bedingt durch die hohen zeitlichen Kosten, die die Annotation dieser Trainingsdaten mit sich bringt, wird das Erweitern solcher Systeme, zum Beispiel durch das Hinzufügen neuer Typen von Aktionen, zum eigentlichen Flaschenhals. Ein weiterer Nachteil des Hinzuziehens von überwacht trainierten Objektdetektoren, ist deren Fehleranfälligkeit, selbst wenn die verwendeten Algorithmen dem neuesten Stand der Technik entsprechen. Basierend auf dieser Beobachtung ist das Ziel dieser Arbeit die Leistungsfähigkeit computergestützter Aktivitätserkennung zu verbessern mit Hilfe der Hinzunahme von Objektwissen, welches im Gegensatz zu den bisherigen Ansätzen ohne überwachten Trainings gewonnen werden kann. Wir Menschen haben die bemerkenswerte Fähigkeit selektiv die Aufmerksamkeit auf bestimmte Regionen im Blickfeld zu fokussieren und gleichzeitig nicht relevante Regionen auszublenden. Dieser kognitive Prozess erlaubt es uns unsere beschränkten Bewusstseinsressourcen unbewusst auf Inhalte zu richten, die anschließend durch das Gehirn ausgewertet werden. Zum Beispiel zur Interpretation visueller Muster als Objekte eines bestimmten Typs. Die Regionen im Blickfeld, die unsere Aufmerksamkeit unbewusst anziehen werden als Proto-Objekte bezeichnet. Sie sind definiert als unbestimmte Teile des visuellen Informationsspektrums, die zu einem späteren Zeitpunkt durch den Menschen als tatsächliche Objekte wahrgenommen werden können, wenn er seine Aufmerksamkeit auf diese richtet. Einfacher ausgedrückt: Proto-Objekte sind Kandidaten für Objekte, oder deren Bestandteile, die zwar lokalisiert aber noch nicht identifiziert wurden. Angeregt durch die menschliche Fähigkeit solche visuell hervorstechenden (salienten) Regionen zuverlässig vom Hintergrund zu unterscheiden, haben viele Wissenschaftler Methoden entwickelt, die es erlauben Proto-Objekte zu lokalisieren. Allen diesen Algorithmen ist gemein, dass möglichst wenig statistisches Wissens über tatsächliche Objekte vorausgesetzt wird. Visuelle Aufmerksamkeit und Objekterkennung sind sehr eng miteinander vernküpfte Prozesse im visuellen System des Menschen. Aus diesem Grund herrscht auf dem Gebiet des Maschinellen Sehens ein reges Interesse an der Integration beider Konzepte zur Erhöhung der Leistung aktueller Bilderkennungssysteme. Die im Rahmen dieser Arbeit entwickelten Methoden gehen in eine ähnliche Richtung: wir demonstrieren, dass die Lokalisation von Proto-Objekten es erlaubt Objektkandidaten zu finden, die geeignet sind als zusätzliche Modalität zu dienen für die bewegungsbasierte Erkennung menschlicher Aktivitäten. Die Grundlage dieser Arbeit bildet dabei ein sehr effizienter Algorithmus, der die visuelle Salienz mit Hilfe von quaternionenbasierten DCT Bildsignaturen approximiert. Zur Extraktion einer Menge geeigneter Objektkandidaten (d.h. Proto-Objekten) aus den resultierenden Salienzkarten, haben wir eine Methode entwickelt, die den kognitiven Mechanismus des Inhibition of Return implementiert. Die auf diese Weise gewonnenen Objektkandidaten nutzen wir anschliessend in Kombination mit state-of-the-art Bag-of-Words Methoden zur Merkmalsbeschreibung von Bewegungsmustern um komplexe Aktivitäten des täglichen Lebens zu klassifizieren. Wir evaluieren das im Rahmen dieser Arbeit entwickelte System auf diversen häufig genutzten Benchmark-Datensätzen und zeigen experimentell, dass das Miteinbeziehen von Proto-Objekten für die Aktivitätserkennung zu einer erheblichen Leistungssteigerung führt im Vergleich zu rein bewegungsbasierten Ansätzen. Zudem demonstrieren wir, dass das vorgestellte System bei der Erkennung menschlicher Aktivitäten deutlich weniger Fehler macht als eine Vielzahl von Methoden, die dem aktuellen Stand der Technik entsprechen. Überraschenderweise übertrifft unser System leistungsmäßig sogar Verfahren, die auf Objektwissen aufbauen, welches von überwacht trainierten Detektoren, oder manuell erstellten Annotationen stammt. Benchmark-Datensätze sind ein sehr wichtiges Mittel zum quantitativen Vergleich von computergestützten Mustererkennungsverfahren. Nach einer Überprüfung aller öffentlich verfügbaren, relevanten Benchmarks, haben wir jedoch festgestellt, dass keiner davon geeignet war für eine detaillierte Evaluation von Methoden zur Erkennung komplexer, menschlicher Aktivitäten. Aus diesem Grund bestand ein Teil dieser Arbeit aus der Konzeption und Aufnahme eines solchen Datensatzes, des KIT Robo-kitchen Benchmarks. Wie der Name vermuten lässt haben wir uns dabei für ein Küchenszenario entschieden, da es ermöglicht einen großen Umfang an Aktivitäten des täglichen Lebens einzufangen, von denen viele Objektmanipulationen enthalten. Um eine möglichst umfangreiche Menge natürlicher Bewegungen zu erhalten, wurden die Teilnehmer während der Aufnahmen kaum eingeschränkt in der Art und Weise wie die diversen Aktivitäten auszuführen sind. Zu diesem Zweck haben wir den Probanden nur die Art der auszuführenden Aktivität mitgeteilt, sowie wo die benötigten Gegenstände zu finden sind, und ob die jeweilige Tätigkeit am Küchentisch oder auf der Arbeitsplatte auszuführen ist. Dies hebt KIT Robo-kitchen deutlich hervor gegenüber den meisten existierenden Datensätzen, die sehr unrealistisch gespielte Aktivitäten enthalten, welche unter Laborbedingungen aufgenommen wurden. Seit seiner Veröffentlichung wurde der resultierende Benchmark mehrfach verwendet zur Evaluation von Algorithmen, die darauf abzielen lang andauerne, realistische, komplexe, und quasi-periodische menschliche Aktivitäten zu erkennen

    REPRESENTATION LEARNING FOR ACTION RECOGNITION

    Get PDF
    The objective of this research work is to develop discriminative representations for human actions. The motivation stems from the fact that there are many issues encountered while capturing actions in videos like intra-action variations (due to actors, viewpoints, and duration), inter-action similarity, background motion, and occlusion of actors. Hence, obtaining a representation which can address all the variations in the same action while maintaining discrimination with other actions is a challenging task. In literature, actions have been represented either using either low-level or high-level features. Low-level features describe the motion and appearance in small spatio-temporal volumes extracted from a video. Due to the limited space-time volume used for extracting low-level features, they are not able to account for viewpoint and actor variations or variable length actions. On the other hand, high-level features handle variations in actors, viewpoints, and duration but the resulting representation is often high-dimensional which introduces the curse of dimensionality. In this thesis, we propose new representations for describing actions by combining the advantages of both low-level and high-level features. Specifically, we investigate various linear and non-linear decomposition techniques to extract meaningful attributes in both high-level and low-level features. In the first approach, the sparsity of high-level feature descriptors is leveraged to build action-specific dictionaries. Each dictionary retains only the discriminative information for a particular action and hence reduces inter-action similarity. Then, a sparsity-based classification method is proposed to classify the low-rank representation of clips obtained using these dictionaries. We show that this representation based on dictionary learning improves the classification performance across actions. Also, a few of the actions consist of rapid body deformations that hinder the extraction of local features from body movements. Hence, we propose to use a dictionary which is trained on convolutional neural network (CNN) features of the human body in various poses to reliably identify actors from the background. Particularly, we demonstrate the efficacy of sparse representation in the identification of the human body under rapid and substantial deformation. In the first two approaches, sparsity-based representation is developed to improve discriminability using class-specific dictionaries that utilize action labels. However, developing an unsupervised representation of actions is more beneficial as it can be used to both recognize similar actions and localize actions. We propose to exploit inter-action similarity to train a universal attribute model (UAM) in order to learn action attributes (common and distinct) implicitly across all the actions. Using maximum aposteriori (MAP) adaptation, a high-dimensional super action-vector (SAV) for each clip is extracted. As this SAV contains redundant attributes of all other actions, we use factor analysis to extract a novel lowvi dimensional action-vector representation for each clip. Action-vectors are shown to suppress background motion and highlight actions of interest in both trimmed and untrimmed clips that contributes to action recognition without the help of any classifiers. It is observed during our experiments that action-vector cannot effectively discriminate between actions which are visually similar to each other. Hence, we subject action-vectors to supervised linear embedding using linear discriminant analysis (LDA) and probabilistic LDA (PLDA) to enforce discrimination. Particularly, we show that leveraging complimentary information across action-vectors using different local features followed by discriminative embedding provides the best classification performance. Further, we explore non-linear embedding of action-vectors using Siamese networks especially for fine-grained action recognition. A visualization of the hidden layer output in Siamese networks shows its ability to effectively separate visually similar actions. This leads to better classification performance than linear embedding on fine-grained action recognition. All of the above approaches are presented on large unconstrained datasets with hundreds of examples per action. However, actions in surveillance videos like snatch thefts are difficult to model because of the diverse variety of scenarios in which they occur and very few labeled examples. Hence, we propose to utilize the universal attribute model (UAM) trained on large action datasets to represent such actions. Specifically, we show that there are similarities between certain actions in the large datasets with snatch thefts which help in extracting a representation for snatch thefts using the attributes from the UAM. This representation is shown to be effective in distinguishing snatch thefts from regular actions with high accuracy.In summary, this thesis proposes both supervised and unsupervised approaches for representing actions which provide better discrimination than existing representations. The first approach presents a dictionary learning based sparse representation for effective discrimination of actions. Also, we propose a sparse representation for the human body based on dictionaries in order to recognize actions with rapid body deformations. In the next approach, a low-dimensional representation called action-vector for unsupervised action recognition is presented. Further, linear and non-linear embedding of action-vectors is proposed for addressing inter-action similarity and fine-grained action recognition, respectively. Finally, we propose a representation for locating snatch thefts among thousands of regular interactions in surveillance videos

    Learning Discriminative Feature Representations for Visual Categorization

    Get PDF
    Learning discriminative feature representations has attracted a great deal of attention due to its potential value and wide usage in a variety of areas, such as image/video recognition and retrieval, human activities analysis, intelligent surveillance and human-computer interaction. In this thesis we first introduce a new boosted key-frame selection scheme for action recognition. Specifically, we propose to select a subset of key poses for the representation of each action via AdaBoost and a new classifier, namely WLNBNN, is then developed for final classification. The experimental results of the proposed method are 0.6% - 13.2% better than previous work. After that, a domain-adaptive learning approach based on multiobjective genetic programming (MOGP) has been developed for image classification. In this method, a set of primitive 2-D operators are randomly combined to construct feature descriptors through the MOGP evolving and then evaluated by two objective fitness criteria, i.e., the classification error and the tree complexity. Later, the (near-)optimal feature descriptor can be obtained. The proposed approach can achieve 0.9% ∼ 25.9% better performance compared with state-of-the-art methods. Moreover, effective dimensionality reduction algorithms have also been widely used for obtaining better representations. In this thesis, we have proposed a novel linear unsupervised algorithm, termed Discriminative Partition Sparsity Analysis (DPSA), explicitly considering different probabilistic distributions that exist over the data points, simultaneously preserving the natural locality relationship among the data. All these above methods have been systematically evaluated on several public datasets, showing their accurate and robust performance (0.44% - 6.69% better than the previous) for action and image categorization. Targeting efficient image classification , we also introduce a novel unsupervised framework termed evolutionary compact embedding (ECE) which can automatically learn the task-specific binary hash codes. It is regarded as an optimization algorithm which combines the genetic programming (GP) and a boosting trick. The experimental results manifest ECE significantly outperform others by 1.58% - 2.19% for classification tasks. In addition, a supervised framework, bilinear local feature hashing (BLFH), has also been proposed to learn highly discriminative binary codes on the local descriptors for large-scale image similarity search. We address it as a nonconvex optimization problem to seek orthogonal projection matrices for hashing, which can successfully preserve the pairwise similarity between different local features and simultaneously take image-to-class (I2C) distances into consideration. BLFH produces outstanding results (0.017% - 0.149% better) compared to the state-of-the-art hashing techniques

    Video content analysis for intelligent forensics

    Get PDF
    The networks of surveillance cameras installed in public places and private territories continuously record video data with the aim of detecting and preventing unlawful activities. This enhances the importance of video content analysis applications, either for real time (i.e. analytic) or post-event (i.e. forensic) analysis. In this thesis, the primary focus is on four key aspects of video content analysis, namely; 1. Moving object detection and recognition, 2. Correction of colours in the video frames and recognition of colours of moving objects, 3. Make and model recognition of vehicles and identification of their type, 4. Detection and recognition of text information in outdoor scenes. To address the first issue, a framework is presented in the first part of the thesis that efficiently detects and recognizes moving objects in videos. The framework targets the problem of object detection in the presence of complex background. The object detection part of the framework relies on background modelling technique and a novel post processing step where the contours of the foreground regions (i.e. moving object) are refined by the classification of edge segments as belonging either to the background or to the foreground region. Further, a novel feature descriptor is devised for the classification of moving objects into humans, vehicles and background. The proposed feature descriptor captures the texture information present in the silhouette of foreground objects. To address the second issue, a framework for the correction and recognition of true colours of objects in videos is presented with novel noise reduction, colour enhancement and colour recognition stages. The colour recognition stage makes use of temporal information to reliably recognize the true colours of moving objects in multiple frames. The proposed framework is specifically designed to perform robustly on videos that have poor quality because of surrounding illumination, camera sensor imperfection and artefacts due to high compression. In the third part of the thesis, a framework for vehicle make and model recognition and type identification is presented. As a part of this work, a novel feature representation technique for distinctive representation of vehicle images has emerged. The feature representation technique uses dense feature description and mid-level feature encoding scheme to capture the texture in the frontal view of the vehicles. The proposed method is insensitive to minor in-plane rotation and skew within the image. The capability of the proposed framework can be enhanced to any number of vehicle classes without re-training. Another important contribution of this work is the publication of a comprehensive up to date dataset of vehicle images to support future research in this domain. The problem of text detection and recognition in images is addressed in the last part of the thesis. A novel technique is proposed that exploits the colour information in the image for the identification of text regions. Apart from detection, the colour information is also used to segment characters from the words. The recognition of identified characters is performed using shape features and supervised learning. Finally, a lexicon based alignment procedure is adopted to finalize the recognition of strings present in word images. Extensive experiments have been conducted on benchmark datasets to analyse the performance of proposed algorithms. The results show that the proposed moving object detection and recognition technique superseded well-know baseline techniques. The proposed framework for the correction and recognition of object colours in video frames achieved all the aforementioned goals. The performance analysis of the vehicle make and model recognition framework on multiple datasets has shown the strength and reliability of the technique when used within various scenarios. Finally, the experimental results for the text detection and recognition framework on benchmark datasets have revealed the potential of the proposed scheme for accurate detection and recognition of text in the wild

    Crowd Scene Analysis in Video Surveillance

    Get PDF
    There is an increasing interest in crowd scene analysis in video surveillance due to the ubiquitously deployed video surveillance systems in public places with high density of objects amid the increasing concern on public security and safety. A comprehensive crowd scene analysis approach is required to not only be able to recognize crowd events and detect abnormal events, but also update the innate learning model in an online, real-time fashion. To this end, a set of approaches for Crowd Event Recognition (CER) and Abnormal Event Detection (AED) are developed in this thesis. To address the problem of curse of dimensionality, we propose a video manifold learning method for crowd event analysis. A novel feature descriptor is proposed to encode regional optical flow features of video frames, where adaptive quantization and binarization of the feature code are employed to improve the discriminant ability of crowd motion patterns. Using the feature code as input, a linear dimensionality reduction algorithm that preserves both the intrinsic spatial and temporal properties is proposed, where the generated low-dimensional video manifolds are conducted for CER and AED. Moreover, we introduce a framework for AED by integrating a novel incremental and decremental One-Class Support Vector Machine (OCSVM) with a sliding buffer. It not only updates the model in an online fashion with low computational cost, but also adapts to concept drift by discarding obsolete patterns. Furthermore, the framework has been improved by introducing Multiple Incremental and Decremental Learning (MIDL), kernel fusion, and multiple target tracking, which leads to more accurate and faster AED. In addition, we develop a framework for another video content analysis task, i.e., shot boundary detection. Specifically, instead of directly assessing the pairwise difference between consecutive frames over time, we propose to evaluate a divergence measure between two OCSVM classifiers trained on two successive frame sets, which is more robust to noise and gradual transitions such as fade-in and fade-out. To speed up the processing procedure, the two OCSVM classifiers are updated online by the MIDL proposed for AED. Extensive experiments on five benchmark datasets validate the effectiveness and efficiency of our approaches in comparison with the state of the art

    Joint optimization of manifold learning and sparse representations for face and gesture analysis

    Get PDF
    Face and gesture understanding algorithms are powerful enablers in intelligent vision systems for surveillance, security, entertainment, and smart spaces. In the future, complex networks of sensors and cameras may disperse directions to lost tourists, perform directory lookups in the office lobby, or contact the proper authorities in case of an emergency. To be effective, these systems will need to embrace human subtleties while interacting with people in their natural conditions. Computer vision and machine learning techniques have recently become adept at solving face and gesture tasks using posed datasets in controlled conditions. However, spontaneous human behavior under unconstrained conditions, or in the wild, is more complex and is subject to considerable variability from one person to the next. Uncontrolled conditions such as lighting, resolution, noise, occlusions, pose, and temporal variations complicate the matter further. This thesis advances the field of face and gesture analysis by introducing a new machine learning framework based upon dimensionality reduction and sparse representations that is shown to be robust in posed as well as natural conditions. Dimensionality reduction methods take complex objects, such as facial images, and attempt to learn lower dimensional representations embedded in the higher dimensional data. These alternate feature spaces are computationally more efficient and often more discriminative. The performance of various dimensionality reduction methods on geometric and appearance based facial attributes are studied leading to robust facial pose and expression recognition models. The parsimonious nature of sparse representations (SR) has successfully been exploited for the development of highly accurate classifiers for various applications. Despite the successes of SR techniques, large dictionaries and high dimensional data can make these classifiers computationally demanding. Further, sparse classifiers are subject to the adverse effects of a phenomenon known as coefficient contamination, where for example variations in pose may affect identity and expression recognition. This thesis analyzes the interaction between dimensionality reduction and sparse representations to present a unified sparse representation classification framework that addresses both issues of computational complexity and coefficient contamination. Semi-supervised dimensionality reduction is shown to mitigate the coefficient contamination problems associated with SR classifiers. The combination of semi-supervised dimensionality reduction with SR systems forms the cornerstone for a new face and gesture framework called Manifold based Sparse Representations (MSR). MSR is shown to deliver state-of-the-art facial understanding capabilities. To demonstrate the applicability of MSR to new domains, MSR is expanded to include temporal dynamics. The joint optimization of dimensionality reduction and SRs for classification purposes is a relatively new field. The combination of both concepts into a single objective function produce a relation that is neither convex, nor directly solvable. This thesis studies this problem to introduce a new jointly optimized framework. This framework, termed LGE-KSVD, utilizes variants of Linear extension of Graph Embedding (LGE) along with modified K-SVD dictionary learning to jointly learn the dimensionality reduction matrix, sparse representation dictionary, sparse coefficients, and sparsity-based classifier. By injecting LGE concepts directly into the K-SVD learning procedure, this research removes the support constraints K-SVD imparts on dictionary element discovery. Results are shown for facial recognition, facial expression recognition, human activity analysis, and with the addition of a concept called active difference signatures, delivers robust gesture recognition from Kinect or similar depth cameras

    Human Motion Analysis for Efficient Action Recognition

    Get PDF
    Automatic understanding of human actions is at the core of several application domains, such as content-based indexing, human-computer interaction, surveillance, and sports video analysis. The recent advances in digital platforms and the exponential growth of video and image data have brought an urgent quest for intelligent frameworks to automatically analyze human motion and predict their corresponding action based on visual data and sensor signals. This thesis presents a collection of methods that targets human action recognition using different action modalities. The first method uses the appearance modality and classifies human actions based on heterogeneous global- and local-based features of scene and humanbody appearances. The second method harnesses 2D and 3D articulated human poses and analyizes the body motion using a discriminative combination of the parts’ velocities, locations, and correlations histograms for action recognition. The third method presents an optimal scheme for combining the probabilistic predictions from different action modalities by solving a constrained quadratic optimization problem. In addition to the action classification task, we present a study that compares the utility of different pose variants in motion analysis for human action recognition. In particular, we compare the recognition performance when 2D and 3D poses are used. Finally, we demonstrate the efficiency of our pose-based method for action recognition in spotting and segmenting motion gestures in real time from a continuous stream of an input video for the recognition of the Italian sign gesture language
    corecore