165 research outputs found

    Second-order Temporal Pooling for Action Recognition

    Full text link
    Deep learning models for video-based action recognition usually generate features for short clips (consisting of a few frames); such clip-level features are aggregated to video-level representations by computing statistics on these features. Typically zero-th (max) or the first-order (average) statistics are used. In this paper, we explore the benefits of using second-order statistics. Specifically, we propose a novel end-to-end learnable feature aggregation scheme, dubbed temporal correlation pooling that generates an action descriptor for a video sequence by capturing the similarities between the temporal evolution of clip-level CNN features computed across the video. Such a descriptor, while being computationally cheap, also naturally encodes the co-activations of multiple CNN features, thereby providing a richer characterization of actions than their first-order counterparts. We also propose higher-order extensions of this scheme by computing correlations after embedding the CNN features in a reproducing kernel Hilbert space. We provide experiments on benchmark datasets such as HMDB-51 and UCF-101, fine-grained datasets such as MPII Cooking activities and JHMDB, as well as the recent Kinetics-600. Our results demonstrate the advantages of higher-order pooling schemes that when combined with hand-crafted features (as is standard practice) achieves state-of-the-art accuracy.Comment: Accepted in the International Journal of Computer Vision (IJCV

    Discriminative Video Representation Learning

    Get PDF
    Representation learning is a fundamental research problem in the area of machine learning, refining the raw data to discover representations needed for various applications. However, real-world data, particularly video data, is neither mathematically nor computationally convenient to process due to its semantic redundancy and complexity. Video data, as opposed to images, includes temporal correlation and motion dynamics, but the ground truth label is normally limited to category labels, which makes the video representation learning a challenging problem. To this end, this thesis addresses the problem of video representation learning, specifically discriminative video representation learning, which focuses on capturing useful data distributions and reliable feature representations improving the performance of varied downstream tasks. We argue that neither all frames in one video nor all dimensions in one feature vector are useful and should be equally treated for video representation learning. Based on this argument, several novel algorithms are investigated in this thesis under multiple application scenarios, such as action recognition, action detection and one-class video anomaly detection. These proposed video representation learning methods produce discriminative video features in both deep and non-deep learning setups. Specifically, they are presented in the form of: 1) an early fusion layer that adopts a temporal ranking SVM formulation, agglomerating several optical flow images from consecutive frames into a novel compact representation, named as dynamic optical flow images; 2) an intermediate feature aggregation layer that applies weakly-supervised contrastive learning techniques, learning discriminative video representations via contrasting positive and negative samples from a sequence; 3) a new formulation for one-class feature learning that learns a set of discriminative subspaces with orthonormal hyperplanes to flexibly bound the one-class data distribution using Riemannian optimisation methods. We provide extensive experiments to gain intuitions into why the learned representations are discriminative and useful. All the proposed methods in this thesis are evaluated on standard publicly available benchmarks, demonstrating state-of-the-art performance

    Low Computational Cost Machine Learning: Random Projections and Polynomial Kernels

    Get PDF
    [EN] According to recent reports, over the course of 2018, the volume of data generated, captured and replicated globally was 33 Zettabytes (ZB), and it is expected to reach 175 ZB by the year 2025. Managing this impressive increase in the volume and variety of data represents a great challenge, but also provides organizations with a precious opportunity to support their decision-making processes with insights and knowledge extracted from massive collections of data and to automate tasks leading to important savings. In this context, the field of machine learning has attracted a notable level of attention, and recent breakthroughs in the area have enabled the creation of predictive models of unprecedented accuracy. However, with the emergence of new computational paradigms, the field is now faced with the challenge of creating more efficient models, capable of running on low computational power environments while maintaining a high level of accuracy. This thesis focuses on the design and evaluation of new algorithms for the generation of useful data representations, with special attention to the scalability and efficiency of the proposed solutions. In particular, the proposed methods make an intensive use of randomization in order to map data samples to the feature spaces of polynomial kernels and then condensate the useful information present in those feature spaces into a more compact representation. The resulting algorithmic designs are easy to implement and require little computational power to run. As a consequence, they are perfectly suited for applications in environments where computational resources are scarce and data needs to be analyzed with little delay. The two major contributions of this thesis are: (1) we present and evaluate efficient and data-independent algorithms that perform Random Projections from the feature spaces of polynomial kernels of different degrees and (2) we demonstrate how these techniques can be used to accelerate machine learning tasks where polynomial interaction features are used, focusing particularly on bilinear models in deep learning
    corecore