11 research outputs found

    Comparative Evaluation of Action Recognition Methods via Riemannian Manifolds, Fisher Vectors and GMMs: Ideal and Challenging Conditions

    We present a comparative evaluation of various techniques for action recognition while keeping as many variables as possible controlled. We employ two categories of Riemannian manifolds: symmetric positive definite (SPD) matrices and linear subspaces. For both categories we use their corresponding nearest-neighbour classifiers, kernels, and recent kernelised sparse representations. We compare against traditional action recognition techniques based on Gaussian mixture models (GMMs) and Fisher vectors (FVs). We evaluate these techniques under ideal conditions, as well as their sensitivity to more challenging conditions (variations in scale and translation). Despite recent advancements for handling manifolds, the manifold-based techniques obtain the lowest performance, and their kernel representations are more unstable under challenging conditions. The FV approach obtains the highest accuracy under ideal conditions and best handles moderate scale and translation changes.
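    As a rough illustration of one of the compared pipelines, the sketch below implements nearest-neighbour classification on the SPD manifold with the log-Euclidean metric. This is a minimal assumed setup (covariance descriptors, a single metric choice); the paper compares several classifiers, kernels, and sparse representations not shown here.

    ```python
    import numpy as np
    from scipy.linalg import logm  # principal matrix logarithm

    def cov_descriptor(features, eps=1e-6):
        """SPD covariance descriptor from per-frame features (T x d).
        A small ridge keeps the matrix strictly positive definite."""
        C = np.cov(features, rowvar=False)
        return C + eps * np.eye(C.shape[0])

    def log_euclidean_dist(A, B):
        """Log-Euclidean distance between two SPD matrices."""
        return np.linalg.norm(logm(A) - logm(B), ord="fro")

    def nn_classify(query, train_mats, train_labels):
        """Nearest-neighbour classifier on the SPD manifold."""
        dists = [log_euclidean_dist(query, M) for M in train_mats]
        return train_labels[int(np.argmin(dists))]
    ```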

    Object and action detection methods using MOSSE filters

    2012 Fall. Includes bibliographical references. In this thesis we explore the application of the Minimum Output Sum of Squared Error (MOSSE) filter to object detection in images and action detection in video. We exploit the properties of the Fourier transform for computing correlations in two and three dimensions. We perform a comprehensive examination of the shape parameters of the desired target response and determine values that optimize filter performance for specific objects and actions. In addition, we propose the Gaussian Iterative Response (GIR) algorithm and the Multi-Sigma Geometric Mean method to improve the MOSSE filter response on test signals. New detection criteria are also investigated and shown to boost detection accuracy on two well-known datasets.
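    For context, a minimal 2-D sketch of MOSSE filter training and detection follows. The desired-response sigma is the shape parameter the thesis examines; the 3-D (video) case replaces fft2/ifft2 with fftn/ifftn. Variable names and the regulariser value are illustrative assumptions.

    ```python
    import numpy as np

    def gaussian_response(shape, center, sigma):
        """Desired correlation output: a 2-D Gaussian peaked at the target
        centre. sigma is the shape parameter tuned per object/action."""
        ys, xs = np.indices(shape)
        return np.exp(-((xs - center[1])**2 + (ys - center[0])**2) / (2 * sigma**2))

    def train_mosse(images, centers, sigma, lam=1e-2):
        """Train a MOSSE filter in the Fourier domain from training crops."""
        num = np.zeros(images[0].shape, dtype=complex)
        den = np.zeros(images[0].shape, dtype=complex)
        for img, c in zip(images, centers):
            F = np.fft.fft2(img)
            G = np.fft.fft2(gaussian_response(img.shape, c, sigma))
            num += G * np.conj(F)   # numerator:   sum_i G_i . conj(F_i)
            den += F * np.conj(F)   # denominator: sum_i F_i . conj(F_i)
        return num / (den + lam)    # H* with regulariser lambda

    def detect(img, H_conj):
        """Correlate a test image with the filter; the peak marks the hit."""
        resp = np.real(np.fft.ifft2(np.fft.fft2(img) * H_conj))
        return np.unravel_index(np.argmax(resp), resp.shape), resp
    ```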

    A template based approach for human action recognition

    Visual analysis of human movements concerns the understanding of human activities from image sequences. The goal of action/gesture recognition is to identify the label that corresponds to an action or gesture performed by a human in a sequence of images. To solve this problem, researchers have proposed solutions ranging from object recognition techniques to techniques drawn from speech recognition, face recognition, and models of brain function. The techniques presented in this thesis belong to a family of methods that condense a video sequence into a template retaining the information needed for action/gesture classification with standard object recognition techniques. In the first stage of this thesis, we have proposed a view-based temporal template approach for action/gesture representation based on tensors. The templates are computed from three different projections, considering a video sequence as a third-order tensor. We compute each projection from the fibers of the tensor using simple functions, and we have studied which function and which feature extractor/descriptor are most suitable for projecting the template from the tensor. Using public datasets, we have tested five simple functions for projecting the fibers: supremum (max), mean, standard deviation, skewness, and kurtosis. We have also studied the performance of four feature extractors/descriptors: PHOW, LIOP, HOG, and SMFs. Using more complex datasets, we have assessed the most suitable feature representation for our templates (bag of words or Fisher vectors) and the complementarity among the features computed from each simple function (max, mean, standard deviation, kurtosis, and skewness). Finally, we have studied the complementarity with Improved Dense Trajectories (IDT), a highly successful technique. The experiments have shown that the standard deviation function and the PHOW extractor/descriptor are the most suitable for our templates. The results have also shown that our three-projection templates outperform most state-of-the-art techniques on the more complex datasets when we combine the templates with the Fisher vector representation. The features extracted by each simple function are complementary to one another, and adding them to HOG, HOF, and MBH improves the performance of IDT. Derived from this thesis, we have also presented another view-based temporal template approach for action recognition, obtained from a Radon transform projection, that allows the temporal segmentation of human actions in real time. First, we propose a generalization of the R transform that makes it possible to adapt the transform to the problem to be solved. We have studied the performance of three functions, namely max, mean, and standard deviation, for pre-segmented human action recognition on a public dataset, and we have compared the results against the traditional R transform. The results have shown that the max function obtains the best performance when applied to the Radon transform and that our technique outperforms many state-of-the-art techniques in action recognition. In a second stage, we have modified the classifier to adapt it to the temporal segmentation of human actions. To assess the performance, we have concatenated actions from the Weizmann and Hollywood datasets and measured the method's ability to identify the individual actions.
    The experiments have shown that our technique outperforms state-of-the-art techniques on the Weizmann dataset for non-pre-segmented human actions.
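    A minimal sketch of the three-projection idea described above, assuming a grey-scale video stored as a T x H x W array; the subsequent feature extraction (PHOW, LIOP, HOG, SMFs) and Fisher vector encoding are not shown.

    ```python
    import numpy as np

    def tensor_templates(video, fn=np.std):
        """Collapse the fibers of a third-order video tensor (T x H x W)
        along each mode with a simple function, giving three 2-D templates.
        fn may be np.max, np.mean, np.std, or scipy.stats.skew/kurtosis."""
        front = fn(video, axis=0)  # H x W: each pixel summarised over time
        side  = fn(video, axis=1)  # T x W: each column summarised over rows
        top   = fn(video, axis=2)  # T x H: each row summarised over columns
        return front, side, top

    # Example: standard-deviation templates for a random 60-frame clip.
    video = np.random.rand(60, 120, 160)
    front, side, top = tensor_templates(video, fn=np.std)
    ```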

    Dictionaries and manifolds for face recognition across illumination, aging and quantization

    Over the past decades, many face recognition algorithms have been proposed. The face recognition problem under controlled environments has been well studied and is largely solved. In unconstrained environments, however, the performance of face recognition methods can still be significantly affected by factors such as illumination, pose, resolution, occlusion, and aging. In this thesis, we look into the problem of face recognition across these variations and under quantization. We present a face recognition algorithm based on simultaneous sparse approximations under varying illumination and pose, with dictionaries learned for each class. A novel test image is projected onto the span of the atoms in each learned dictionary, and the resulting residual vectors are used for classification. An image relighting technique based on pose-robust albedo estimation generates multiple frontal images of the same person under variable lighting. As a result, the proposed algorithm can recognize human faces with high accuracy even when only a single image or very few images per person are provided for training. The efficiency of the proposed method is demonstrated on publicly available databases, and it is shown to perform significantly better than many competing face recognition algorithms. Recognizing facial images across aging remains an open problem. We approach it by studying the growth of facial shapes. Building on recent advances in landmark extraction and statistical techniques for landmark-based shape analysis, we show that well-defined shape spaces and their associated geometry yield significant performance improvements in face verification. Toward this end, we propose to model facial shapes as points on a Grassmann manifold; the face verification problem is then formulated as a classification problem on this manifold. We further propose a relative craniofacial growth model, based on the science of craniofacial anthropometry, and integrate it with the Grassmann manifold representation and an SVM classifier. Experiments show that the proposed method mitigates the variations caused by the aging process and thus effectively improves the performance of open-set face verification across aging. In applications such as document understanding, only binary face images may be available as inputs to a face recognition algorithm. We investigate the effects of quantization on several classical face recognition algorithms, studying the performance of PCA and multiple exemplar discriminant analysis (MEDA) on quantized images and on binary images modified by distance and Box-Cox transforms. We propose a dictionary-based method for reconstructing grey-scale facial images from quantized facial images. Two dictionaries with low mutual coherence are learned for the grey-scale and quantized training images, respectively, using a modified K-SVD method. A linear transform between the sparse vectors of quantized images and those of grey-scale images is estimated from the training data. In the testing stage, a grey-scale image is reconstructed from the quantized image using the transform matrix and the normalized dictionaries, and the identities of the reconstructed grey-scale images are determined using the dictionary-based face recognition (DFR) algorithm.
    Experimental results show that the reconstructed images are similar to the original grey-scale images and that face recognition performance on the quantized images is comparable to that on grey-scale images. Online social networks and social media are growing rapidly, and it is interesting to study their impact on computer vision algorithms. We address the problem of automated face recognition on a social network using a loopy belief propagation framework. The proposed approach propagates the identities of faces in photos across social graphs, and we characterize its performance in terms of structural properties of the given social network. We propose a distance metric, defined using face recognition results, for detecting hidden connections. The performance of the proposed method is analyzed with respect to graph structure, scalability, node degree, correction of labeling errors, and discovery of hidden connections. The results demonstrate that the constraints imposed by the social network have the potential to improve the performance of face recognition methods, and that it is possible to discover hidden connections in a social network based on face recognition.
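    The residual-based classification step described above can be sketched as follows, assuming the per-class dictionaries are already learned (the thesis learns them with K-SVD-style methods and relighting-based augmentation, omitted here). Plain least-squares projection stands in for the simultaneous sparse approximation step, so this is an illustrative simplification.

    ```python
    import numpy as np

    def residual_classify(y, dictionaries):
        """Assign y the label whose dictionary spans it best: project y
        onto the span of each class's atoms and keep the smallest residual.
        dictionaries: dict mapping label -> (d x k) matrix of atoms."""
        best_label, best_res = None, np.inf
        for label, D in dictionaries.items():
            coef, *_ = np.linalg.lstsq(D, y, rcond=None)  # projection coefficients
            res = np.linalg.norm(y - D @ coef)            # residual to span(D)
            if res < best_res:
                best_label, best_res = label, res
        return best_label
    ```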

    Exploring sparsity, self-similarity, and low rank approximation in action recognition, motion retrieval, and action spotting

    This thesis consists of four major parts. In the first part (Chapters 1-2), we present an overview, the motivation, and the contributions of our work, and extensively survey the current literature on six related topics. In the second part (Chapters 3-7), we explore the concept of self-similarity in two challenging scenarios, namely action recognition and motion retrieval. We build three-dimensional volume representations for both scenarios and devise effective techniques that produce compact representations encoding the internal dynamics of the data. In the third part (Chapter 8), we explore the challenging action spotting problem and propose a feature-independent unsupervised framework that spots actions effectively in various real situations, even under heavily perturbed conditions. The final part (Chapter 9) is dedicated to conclusions and future work. For action recognition, we introduce a generic method that does not depend on one particular type of input feature vector. We make three main contributions: (i) we introduce the concept of the Joint Self-Similarity Volume (Joint SSV) for modeling dynamical systems, and show that a new optimized rank-1 tensor approximation of the Joint SSV yields compact low-dimensional descriptors that very accurately preserve the dynamics of the original system, e.g., an action video sequence; (ii) the descriptor vectors derived from the optimized rank-1 approximation make it possible to recognize actions without explicitly aligning action sequences with varying speeds of execution or different frame rates; (iii) the method is generic and can be applied with different low-level features such as silhouettes, histograms of oriented gradients (HOG), etc., and hence does not require explicit tracking of features in the space-time volume. Our experimental results on five public datasets demonstrate that our method produces very good results and outperforms many baseline methods. For action recognition on incomplete videos, we ask whether incomplete videos, which are often discarded, carry useful information for action recognition, and if so, how such a mixed collection of video data (complete versus incomplete, labeled versus unlabeled) can be represented in a unified manner. We propose a novel framework for handling incomplete videos in action classification and make three main contributions: (i) we cast action classification for a mixture of complete and incomplete data as a semi-supervised learning problem over labeled and unlabeled data; (ii) we introduce a two-step approach to convert the mixed input data into a uniform compact representation; (iii) exhaustively scrutinizing 280 configurations, we show experimentally on two benchmarks we created that, even when the videos are extremely sparse and incomplete, useful information can still be recovered from them, and unknown actions can be classified by a graph-based semi-supervised learning framework. For motion retrieval, we present a framework that allows flexible and efficient retrieval of motion capture data from huge databases. The method first converts an action sequence into a self-similarity matrix (SSM), based on the notion of self-similarity. This conversion of the motion sequences into compact, low-rank subspace representations greatly reduces the spatiotemporal dimensionality of the sequences.
    The SSMs are then used to construct order-3 tensors, and we propose a low-rank decomposition scheme that converts the motion sequence volumes into compact lower-dimensional representations without losing the nonlinear dynamics of the motion manifold. Thus, unlike existing linear dimensionality reduction methods that distort the motion manifold and lose critical, discriminative components, the proposed method performs well even when inter-class differences are small or intra-class differences are large. In addition, the method allows efficient retrieval and does not require time-alignment of the motion sequences. We evaluate the retrieval framework on the CMU mocap dataset under two experimental settings, both demonstrating very good retrieval rates. For action spotting, our framework does not depend on any specific feature (e.g., HOG/HOF, STIP, silhouettes, bag-of-words) and requires no human localization, segmentation, or framewise tracking. This is achieved by treating the problem holistically as one of extracting the internal dynamics of video cuboids, modeling them in their natural form as multilinear tensors. To extract their internal dynamics, we devise a novel Two-Phase Decomposition (TP-Decomp) of a tensor that generates very compact and discriminative representations, robust even to heavily perturbed data. Technically, a Rank-based Tensor Core Pyramid (Rank-TCP) descriptor is generated by combining multiple tensor cores under multiple ranks, allowing video cuboids to be represented in a hierarchical tensor pyramid. The problem then reduces to template matching, which we solve efficiently using two boosting strategies: (i) to reduce the search space, we filter the dense trajectory cloud extracted from the target video; (ii) to boost matching speed, we match in an iterative coarse-to-fine manner. Experiments on five benchmarks show that our method outperforms the current state of the art under various challenging conditions. We also created a challenging dataset, Heavily Perturbed Video Arrays (HPVA), to validate the robustness of our framework under heavily perturbed situations.
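    Two building blocks recur throughout this work: the self-similarity matrix and a rank-1 approximation. The sketch below shows both in their simplest 2-D form; the thesis works with 3-D Joint SSV volumes and an optimized rank-1 tensor approximation, so this is an illustrative simplification under assumed per-frame feature vectors.

    ```python
    import numpy as np

    def self_similarity_matrix(frames):
        """SSM of a sequence: entry (i, j) is the distance between the
        feature vectors of frames i and j. frames: (T x d) array."""
        diff = frames[:, None, :] - frames[None, :, :]
        return np.linalg.norm(diff, axis=2)  # T x T, symmetric, zero diagonal

    def rank1_approx(M):
        """Best rank-1 approximation via the leading singular triplet;
        a 2-D stand-in for the thesis's rank-1 *tensor* step."""
        U, s, Vt = np.linalg.svd(M, full_matrices=False)
        return s[0] * np.outer(U[:, 0], Vt[0])
    ```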

    Action analysis and video summarisation to efficiently manage and interpret video data


    Deep representation learning for action recognition : a dissertation presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Computer Science at Massey University, Auckland, New Zealand

    Figures 2.2 through 2.7 and 2.9 through 2.11 were removed for copyright reasons. Figures 2.8 and 2.12 through 2.16 are licensed on the arXiv repository under a Creative Commons Attribution licence (https://arxiv.org/help/license). This research focuses on deep representation learning for human action recognition using emerging deep learning techniques on RGB and skeleton data. The output of such techniques is a parameterised hierarchical model representing the knowledge learnt from the training dataset, similar to the knowledge stored in our brain, which is learned from experience. Currently, a computer's ability to perform such abstraction is far behind the human level, perhaps owing to the complexity of processing spatio-temporal knowledge. A discriminative spatio-temporal representation of human actions is the key to human action recognition systems. Different feature encoding approaches and different learning models can lead to quite different performance, and at present no approach can accurately model the cognitive processing of human actions. This thesis presents several novel approaches that allow computers to learn discriminative, compact, and representative spatio-temporal features for human action recognition from multiple input features, aiming to enhance the performance of automated human action recognition systems. The input features for the proposed approaches are derived from signals captured by a depth camera, e.g., RGB video and skeleton data. In this thesis, I develop several geometric features and propose the following models for action recognition: CVR-CNN, SKB-TCN, Multi-Stream CNN, and STN. These models are inspired by the visual attention mechanisms inherently present in human beings. I also discuss the performance of the geometric features I developed alongside the proposed models. Superior experimental results for the proposed geometric features and models are obtained and verified on several benchmark human action recognition datasets. On the most challenging benchmark dataset, NTU RGB+D, the accuracy obtained surpasses that of the existing RNN-based and ST-GCN models. This study provides a deeper understanding of the spatio-temporal representation of human actions, with significant implications for explaining the inner workings of deep learning models when learning patterns from time series data. The findings can lay a solid foundation for further developments and guide future studies of human actions.
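    As one hypothetical example of a geometric feature computable from skeleton data (the thesis's specific geometric features are not detailed in the abstract), pairwise joint distances for a single frame can be computed as below.

    ```python
    import numpy as np

    def joint_distance_features(skeleton):
        """Pairwise joint distances for one skeleton frame (J x 3 array of
        3-D joint coordinates), flattened from the upper triangle.
        Illustrative only; not the thesis's actual geometric features."""
        d = np.linalg.norm(skeleton[:, None, :] - skeleton[None, :, :], axis=2)
        iu = np.triu_indices(skeleton.shape[0], k=1)
        return d[iu]
    ```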

    Exploring geometrical structures in high-dimensional computer vision data

    In computer vision, objects such as local features, images, and video sequences are often represented as high-dimensional data points, although it is commonly believed that low-dimensional geometrical structures underlie the data. This low-dimensional geometric information enables a better understanding of high-dimensional data sets and is useful for solving computer vision problems. In this thesis, geometrical structures are investigated from different perspectives according to different computer vision applications. For spectral clustering, the distribution of data points in a local region is summarised by a covariance matrix, which is interpreted through the Mahalanobis distance. For action recognition, we extract subspace information for each action class, and a query video sequence is labeled using its distance to the subspaces of the corresponding video classes. Three new algorithms are introduced for hashing-based approximate nearest neighbour (ANN) search: NOKMeans relaxes the orthogonality condition on the encoding functions of previous quantisation-error-based methods by representing data points in a new feature space; Auto-JacoBin uses a robust auto-encoder model to preserve the geometric information of the original space in the binary codes; and AGreedy assigns to any set of encoding functions a score reflecting how well order information is preserved in local regions, with an alternating greedy method used to find a locally optimal solution. Geometric information has the potential to bring better solutions to computer vision problems. As shown in our experiments, the benefits include higher clustering accuracy, reduced computation for recognising actions in videos, and better retrieval performance for ANN problems.
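    A minimal sketch of the subspace-based labeling described above: fit a low-dimensional subspace per action class and label a query by its smallest projection residual. The PCA-style fitting and the dimension k are assumptions; the thesis's exact construction may differ.

    ```python
    import numpy as np

    def fit_class_subspaces(class_data, k):
        """Per class: mean vector plus top-k principal directions.
        class_data: dict mapping label -> (n_i x d) matrix of samples."""
        models = {}
        for label, X in class_data.items():
            mu = X.mean(axis=0)
            _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
            models[label] = (mu, Vt[:k].T)  # d-vector, d x k orthonormal basis
        return models

    def classify_by_subspace(q, models):
        """Label a query vector by the class subspace with the smallest
        reconstruction residual."""
        best_label, best_res = None, np.inf
        for label, (mu, B) in models.items():
            r = q - mu
            res = np.linalg.norm(r - B @ (B.T @ r))  # distance to the subspace
            if res < best_res:
                best_label, best_res = label, res
        return best_label
    ```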