    Log-Euclidean Bag of Words for Human Action Recognition

    Representing videos by densely extracted local space-time features has recently become a popular approach for analysing actions. In this paper, we tackle the problem of categorising human actions by devising Bag of Words (BoW) models based on covariance matrices of spatio-temporal features, with the features formed from histograms of optical flow. Since covariance matrices form a special type of Riemannian manifold, the space of Symmetric Positive Definite (SPD) matrices, non-Euclidean geometry should be taken into account when discriminating between covariance matrices. To this end, we propose to embed SPD manifolds into Euclidean spaces via a diffeomorphism and extend the BoW approach to its Riemannian version. The proposed BoW approach takes into account the manifold geometry of SPD matrices during the generation of the codebook and histograms. Experiments on challenging human action datasets show that the proposed method obtains notable improvements in discrimination accuracy, in comparison to several state-of-the-art methods.
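
    As a rough illustration of the pipeline described above, the following Python sketch maps covariance matrices to a Euclidean space with the matrix logarithm (one common diffeomorphism for SPD matrices, assumed here because of the Log-Euclidean wording in the title), learns a codebook with plain k-means, and builds a normalised histogram. The function names, the sqrt(2) weighting of off-diagonal entries and the hard assignment are illustrative assumptions, not details taken from the paper.

```python
import numpy as np


def spd_log(mat):
    """Matrix logarithm of an SPD matrix via its eigendecomposition."""
    eigvals, eigvecs = np.linalg.eigh(mat)
    return (eigvecs * np.log(eigvals)) @ eigvecs.T


def log_euclidean_vector(mat):
    """Map an SPD matrix to a Euclidean vector: log, then half-vectorise.

    Off-diagonal entries are scaled by sqrt(2) so that the Euclidean norm of
    the vector equals the Frobenius norm of the log-matrix (an assumption of
    this sketch).
    """
    log_mat = spd_log(mat)
    rows, cols = np.triu_indices(log_mat.shape[0])
    weights = np.where(rows == cols, 1.0, np.sqrt(2.0))
    return weights * log_mat[rows, cols]


def assign(points, centres):
    """Index of the nearest centre (Euclidean distance) for every point."""
    dists = np.linalg.norm(points[:, None, :] - centres[None, :, :], axis=2)
    return dists.argmin(axis=1)


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means in the log-mapped space; returns the codebook."""
    rng = np.random.default_rng(seed)
    centres = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        labels = assign(points, centres)
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # keep the old centre if a cluster is empty
                centres[j] = members.mean(axis=0)
    return centres


def bow_histogram(spd_mats, centres):
    """Normalised histogram of codeword assignments for one video."""
    vecs = np.stack([log_euclidean_vector(m) for m in spd_mats])
    labels = assign(vecs, centres)
    hist = np.bincount(labels, minlength=len(centres)).astype(float)
    return hist / hist.sum()
```

    In this sketch, the codebook would be learned once from the log-mapped covariance matrices of the training videos, and each video would then be represented by the histogram returned by bow_histogram.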

    Sparse Coding on Symmetric Positive Definite Manifolds using Bregman Divergences

    This paper introduces sparse coding and dictionary learning for Symmetric Positive Definite (SPD) matrices, which are often used in machine learning, computer vision and related areas. Unlike traditional sparse coding schemes that work in vector spaces, in this paper we discuss how SPD matrices can be described by a sparse combination of dictionary atoms, where the atoms are also SPD matrices. We propose to perform sparse coding by embedding the space of SPD matrices into Hilbert spaces through two types of Bregman matrix divergences. This not only leads to an efficient way of performing sparse coding, but also to an online and iterative scheme for dictionary learning. We apply the proposed methods to several computer vision tasks where images are represented by region covariance matrices. Our proposed algorithms outperform state-of-the-art methods on a wide range of classification tasks, including face recognition, action recognition, material classification and texture categorization.
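
    The abstract does not name the two divergences. As an assumption, the Python sketch below uses the Stein (symmetrised LogDet) and Jeffrey divergences, two symmetric divergences of Bregman type that are commonly used for SPD matrices, and turns each into a Gaussian-like kernel. The parameter beta and all function names are hypothetical, and whether such a kernel is positive definite depends on the choice of beta, which the sketch does not check.

```python
import numpy as np


def logdet(mat):
    """Log-determinant of a positive definite matrix."""
    sign, value = np.linalg.slogdet(mat)
    if sign <= 0:
        raise ValueError("matrix is not positive definite")
    return value


def stein_divergence(x, y):
    """S(X, Y) = log det((X + Y) / 2) - 0.5 * log det(X Y)."""
    return logdet((x + y) / 2.0) - 0.5 * (logdet(x) + logdet(y))


def jeffrey_divergence(x, y):
    """J(X, Y) = 0.5 * tr(X^-1 Y) + 0.5 * tr(Y^-1 X) - n."""
    n = x.shape[0]
    return (0.5 * np.trace(np.linalg.solve(x, y))
            + 0.5 * np.trace(np.linalg.solve(y, x)) - n)


def divergence_kernel(x, y, divergence, beta=1.0):
    """Kernel value exp(-beta * divergence(X, Y)); beta is an assumed knob."""
    return np.exp(-beta * divergence(x, y))
```

    Both divergences are zero when the two matrices coincide and grow as they move apart, so the kernel value is 1 for identical matrices and decays towards 0 otherwise.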

    Representing visual appearance by video Brownian covariance descriptor for human action recognition

    Second-order Temporal Pooling for Action Recognition

    Deep learning models for video-based action recognition usually generate features for short clips (consisting of a few frames); such clip-level features are aggregated to video-level representations by computing statistics on these features. Typically, zeroth-order (max) or first-order (average) statistics are used. In this paper, we explore the benefits of using second-order statistics. Specifically, we propose a novel end-to-end learnable feature aggregation scheme, dubbed temporal correlation pooling, that generates an action descriptor for a video sequence by capturing the similarities between the temporal evolution of clip-level CNN features computed across the video. Such a descriptor, while being computationally cheap, also naturally encodes the co-activations of multiple CNN features, thereby providing a richer characterization of actions than their first-order counterparts. We also propose higher-order extensions of this scheme by computing correlations after embedding the CNN features in a reproducing kernel Hilbert space. We provide experiments on benchmark datasets such as HMDB-51 and UCF-101, fine-grained datasets such as MPII Cooking activities and JHMDB, as well as the recent Kinetics-600. Our results demonstrate the advantages of higher-order pooling schemes that, when combined with hand-crafted features (as is standard practice), achieve state-of-the-art accuracy. Comment: Accepted in the International Journal of Computer Vision (IJCV).
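
    To make the difference between the pooling orders concrete, here is a minimal Python sketch. It assumes the clip-level features of one video are stacked in an array of shape (number of clips, feature dimension) and uses an uncentred second-order statistic; the end-to-end learning and the kernelised higher-order extensions mentioned in the abstract are not reproduced, and the function names are hypothetical.

```python
import numpy as np


def zeroth_order_pool(clip_feats):
    """Max pooling over clips: one value per feature."""
    return clip_feats.max(axis=0)


def first_order_pool(clip_feats):
    """Average pooling over clips: one value per feature."""
    return clip_feats.mean(axis=0)


def temporal_correlation_pool(clip_feats):
    """Second-order pooling: similarity between feature trajectories.

    Entry (i, j) of the d x d matrix is the mean over clips of the product of
    features i and j. The matrix is symmetric, so only its upper triangle is
    returned as the descriptor.
    """
    num_clips = clip_feats.shape[0]
    corr = clip_feats.T @ clip_feats / num_clips
    rows, cols = np.triu_indices(corr.shape[0])
    return corr[rows, cols]
```

    For example, the two-clip sequences [[1, 0], [0, 1]] and [[1, 1], [0, 0]] have identical max-pooled and average-pooled descriptors, but their second-order descriptors differ because only the second sequence has the two features firing together.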

    Extrinsic methods for coding and dictionary learning on Grassmann manifolds

    Sparsity-based representations have recently led to notable results in various visual recognition tasks. In a separate line of research, Riemannian manifolds have been shown to be useful for dealing with features and models that do not lie in Euclidean spaces. With the aim of building a bridge between the two realms, we address the problem of sparse coding and dictionary learning in Grassmann manifolds, i.e., the space of linear subspaces. To this end, we propose to embed Grassmann manifolds into the space of symmetric matrices by an isometric mapping. This in turn enables us to extend two sparse coding schemes to Grassmann manifolds. Furthermore, we propose an algorithm for learning a Grassmann dictionary, atom by atom. Lastly, to handle non-linearity in data, we extend the proposed Grassmann sparse coding and dictionary learning algorithms through embedding into higher dimensional Hilbert spaces. Experiments on several classification tasks (gender recognition, gesture classification, scene analysis, face recognition, action recognition and dynamic texture classification) show that the proposed approaches achieve considerable improvements in discrimination accuracy, in comparison to state-of-the-art methods such as kernelized Affine Hull Method and graph-embedding Grassmann discriminant analysis.
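
    The abstract does not spell out the mapping. A standard choice, assumed in the Python sketch below, is the projection embedding that sends the subspace spanned by an orthonormal basis U to the symmetric matrix U U^T. The function names are hypothetical, and sparse coding itself is not shown.

```python
import numpy as np


def orthonormal_basis(data):
    """Orthonormal basis (n x p) for the column space of an n x p matrix."""
    q, _ = np.linalg.qr(data)
    return q


def projection_embedding(basis):
    """Map a subspace to the symmetric projection matrix U U^T.

    The result depends only on the subspace, not on the basis chosen for it.
    """
    return basis @ basis.T


def projection_distance(basis_a, basis_b):
    """Frobenius distance between the embedded subspaces."""
    diff = projection_embedding(basis_a) - projection_embedding(basis_b)
    return np.linalg.norm(diff, ord="fro")
```

    Because the embedded points are ordinary symmetric matrices, Euclidean tools such as linear combinations and Frobenius norms can be applied to them, which is what makes an extrinsic treatment of coding possible.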

    Study of visual descriptors for obstacle detection in skiing videos shot with a point-of-view camera

    This work studies visual descriptors for identifying simple objects or obstacles in skiing videos. To that end, machine learning techniques were used to develop a software prototype that relies on a test set created specifically for this study. To our knowledge, techniques of this kind had not previously been applied to this field, so a database of images taken from a first-person perspective had to be created. The project shows that, for certain descriptors, good results are obtained, reaching up to 90% accuracy in recognising the object classes that were defined. The report covers an analysis of the state of the art, which summarises a range of devices for monitoring physical activity (wristbands, smartwatches, the cloud of applications that provide extended services to these devices...). The state of the art also summarises the main articles related to this study and on which it builds, covering both vision techniques based on visual descriptors and learning techniques. Next, the system architecture is presented, with a global overview and a description of the visual descriptors used: SURF (Speeded-Up Robust Features), HOG (Histogram of Oriented Gradients), HOF (Histogram of Optical Flow) and MBH (Motion Boundary Histogram). The system prototype section then presents the full implementation in a structured way. Finally, the results are presented with their evaluation, along with the project management and its conclusions.

    Fast and accurate image and video analysis on Riemannian manifolds

    Human action recognition under Log-Euclidean Riemannian metric

    This paper presents a new action recognition approach based on local spatio-temporal features. The main contributions of our approach are twofold. First, a new local spatio-temporal feature is proposed to represent the cuboids detected in video sequences. Specifically, the descriptor utilizes the covariance matrix to capture the self-correlation information of the low-level features within each cuboid. Since covariance matrices do not lie in a Euclidean space, the Log-Euclidean Riemannian metric is used to measure the distance between covariance matrices. Second, the Earth Mover’s Distance (EMD) is used for matching any pair of video sequences. In contrast to the widely used Euclidean distance, EMD achieves more robust performance in matching histograms/distributions with different sizes. Experimental results on two datasets demonstrate the effectiveness of the proposed approach.
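
    The first contribution can be sketched in Python as follows: a covariance descriptor is computed from the low-level feature vectors inside a cuboid, and two descriptors are compared with the Log-Euclidean distance, the Frobenius norm of the difference of their matrix logarithms. The array layout, the small regulariser eps and the function names are assumptions of this sketch; the EMD matching step is not shown.

```python
import numpy as np


def covariance_descriptor(features, eps=1e-6):
    """Covariance of the low-level features in one cuboid.

    features: array of shape (num_samples, dim), one row per pixel/voxel.
    eps adds a small multiple of the identity so the result stays positive
    definite (an assumption of this sketch).
    """
    centred = features - features.mean(axis=0)
    cov = centred.T @ centred / (features.shape[0] - 1)
    return cov + eps * np.eye(features.shape[1])


def spd_log(mat):
    """Matrix logarithm of an SPD matrix via its eigendecomposition."""
    eigvals, eigvecs = np.linalg.eigh(mat)
    return (eigvecs * np.log(eigvals)) @ eigvecs.T


def log_euclidean_distance(cov_a, cov_b):
    """Frobenius norm of log(A) - log(B)."""
    return np.linalg.norm(spd_log(cov_a) - spd_log(cov_b), ord="fro")
```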