20,762 research outputs found

    A Study on Unsupervised Dictionary Learning and Feature Encoding for Action Classification

    Many efforts have been devoted to developing alternatives to traditional vector quantization in the image domain, such as sparse coding and soft-assignment. These approaches can be split into a dictionary learning phase and a feature encoding phase, which are often closely connected. In this paper, we investigate the effects of these phases by separating them for video-based action classification. We compare several dictionary learning methods and feature encoding schemes through extensive experiments on the KTH and HMDB51 datasets. Experimental results indicate that sparse coding performs consistently better than the other encoding methods on the large, complex dataset (HMDB51) and is robust to the choice of dictionary. On the small, simple dataset (KTH), with less variation, all the encoding strategies perform competitively. In addition, we note that the strength of sophisticated encoding approaches comes not from their corresponding dictionaries but from the encoding mechanisms themselves, so randomly selected exemplars can serve as dictionaries for video-based action classification.
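
    As a rough illustration of the two phases, the sketch below (Python with NumPy and scikit-learn; all sizes, the random data, and the choice of LASSO-based sparse coding are illustrative assumptions, not the paper's exact setup) separates dictionary learning from feature encoding and includes the random-exemplar dictionary the abstract mentions.

        import numpy as np
        from sklearn.cluster import KMeans
        from sklearn.decomposition import sparse_encode

        # Toy local descriptors (stand-ins for real video features), shape (n, d).
        X = np.random.randn(1000, 64)

        # Dictionary phase: a k-means codebook, or simply random exemplars,
        # which the paper reports can work comparably well.
        K = 128
        D_kmeans = KMeans(n_clusters=K, n_init=4).fit(X).cluster_centers_
        D_random = X[np.random.choice(len(X), K, replace=False)]

        # Encoding phase: sparse coding (LASSO) versus hard vector quantization.
        codes_sc = sparse_encode(X, D_random, algorithm='lasso_lars', alpha=0.1)
        nearest = np.argmin(((X[:, None] - D_kmeans[None]) ** 2).sum(-1), axis=1)
        codes_vq = np.eye(K)[nearest]          # one-hot hard assignment

        # Video-level representation: pool the codes over all descriptors.
        video_feature = np.abs(codes_sc).sum(axis=0)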

    Kernel Coding: General Formulation and Special Cases

    Representing images by compact codes has proven beneficial for many visual recognition tasks. Most existing techniques, however, perform this coding step directly in image feature space, where the distributions of the different classes are typically entangled. In contrast, here we study the problem of performing coding in a high-dimensional Hilbert space, where the classes are expected to be more easily separable. To this end, we introduce a general coding formulation that encompasses the most popular techniques, such as bag of words, sparse coding, and locality-based coding, and show how this formulation and its special cases can be kernelized. Importantly, we address several aspects of learning in our general formulation, such as kernel learning, dictionary learning, and supervised kernel coding. Our experimental evaluation on several visual recognition tasks demonstrates the benefits of performing coding in Hilbert space, and in particular of jointly learning the kernel, the dictionary, and the classifier.
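
    For concreteness, one simple special case of coding in a Hilbert space admits a closed form: with a kernel k and an l2 penalty on the code c, minimizing the RKHS reconstruction error k(x,x) - 2 c^T k_D(x) + c^T K_DD c + lam ||c||^2 gives c = (K_DD + lam I)^{-1} k_D(x). The sketch below (NumPy; the RBF kernel, the regularizer, and all sizes are illustrative assumptions, not the paper's general formulation) implements exactly this ridge-regularized variant.

        import numpy as np

        def rbf(A, B, gamma=0.5):
            # RBF kernel matrix between the rows of A and the rows of B.
            d2 = ((A[:, None] - B[None]) ** 2).sum(-1)
            return np.exp(-gamma * d2)

        def kernel_codes(X, D, lam=0.1, gamma=0.5):
            # Ridge-regularized coding in the RKHS; closed-form solution
            # c = (K_DD + lam I)^{-1} k_D(x) for every descriptor x.
            K_DD = rbf(D, D, gamma)
            K_DX = rbf(D, X, gamma)
            return np.linalg.solve(K_DD + lam * np.eye(len(D)), K_DX).T

        X = np.random.randn(500, 32)                     # descriptors
        D = X[np.random.choice(500, 64, replace=False)]  # dictionary atoms
        C = kernel_codes(X, D)                           # (500, 64) kernel codes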

    Bag of Visual Words and Fusion Methods for Action Recognition: Comprehensive Study and Good Practice

    Video-based action recognition is one of the important and challenging problems in computer vision research. The Bag of Visual Words (BoVW) model with local features has become the most popular method, obtaining state-of-the-art performance on several realistic datasets, such as HMDB51, UCF50, and UCF101. BoVW is a general pipeline for constructing a global representation from a set of local features, composed mainly of five steps: (i) feature extraction, (ii) feature pre-processing, (iii) codebook generation, (iv) feature encoding, and (v) pooling and normalization. Many efforts have been made in each step independently and in different scenarios, yet their combined effect on action recognition remains unknown. Meanwhile, video data exhibit different views of visual patterns, such as static appearance and motion dynamics, and multiple descriptors are usually extracted to represent these different views. Many feature fusion methods have been developed in other areas, but their influence on action recognition has not been investigated before. This paper provides a comprehensive study of all the steps in BoVW and of different fusion methods, and uncovers good practices for producing a state-of-the-art action recognition system. Specifically, we explore two kinds of local features, ten encoding methods, eight pooling and normalization strategies, and three fusion methods. We conclude that every step is crucial to the final recognition rate. Furthermore, based on our comprehensive study, we propose a simple yet effective representation, called the hybrid representation, which exploits the complementarity of different BoVW frameworks and local descriptors. Using this representation, we obtain state-of-the-art results on three challenging datasets: HMDB51 (61.1%), UCF50 (92.3%), and UCF101 (87.9%).
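
    The five steps can be made concrete with a minimal sketch (NumPy/scikit-learn; the descriptor dimensionality, codebook size, hard-assignment encoding, and power-l2 normalization are illustrative choices, only a few of the many variants the paper compares):

        import numpy as np
        from sklearn.cluster import KMeans
        from sklearn.decomposition import PCA

        def bovw_video(descriptors, codebook, pca):
            # (ii) pre-processing: PCA-whiten the local descriptors.
            Z = pca.transform(descriptors)
            # (iv) encoding: hard-assign each descriptor to its nearest word.
            words = np.argmin(((Z[:, None] - codebook[None]) ** 2).sum(-1), 1)
            hist = np.bincount(words, minlength=len(codebook)).astype(float)
            # (v) pooling and normalization: power- then L2-normalization.
            hist = np.sqrt(hist)
            return hist / (np.linalg.norm(hist) + 1e-12)

        # (i) feature extraction: random vectors stand in for real local features.
        train_desc = np.random.randn(5000, 96)
        pca = PCA(n_components=64, whiten=True).fit(train_desc)
        # (iii) codebook generation on the whitened training descriptors.
        codebook = KMeans(n_clusters=256, n_init=4).fit(
            pca.transform(train_desc)).cluster_centers_
        video_repr = bovw_video(np.random.randn(800, 96), codebook, pca)

    Fusion of multiple descriptors can then be as simple as concatenating the per-descriptor representations (early fusion) or averaging per-descriptor classifier scores (late fusion).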

    Online Unsupervised Feature Learning for Visual Tracking

    Feature encoding with respect to an over-complete dictionary learned by unsupervised methods, followed by spatial pyramid pooling and linear classification, has exhibited powerful strength in various vision applications. Here we propose to use this feature learning pipeline for visual tracking. Tracking is implemented by tracking-by-detection, and the resulting framework is very simple yet effective. First, online dictionary learning is used to build a dictionary that captures the appearance changes of the tracking target as well as the background. Given a test image window, we extract local image patches from it and encode each patch with respect to the dictionary. The encoded features are then pooled over a spatial pyramid to form an aggregated feature vector. Finally, a simple linear classifier is trained on these features. Our experiments show that the proposed tracker, albeit simple, outperforms all the state-of-the-art tracking methods that we have tested. Moreover, we evaluate the performance of different dictionary learning and feature encoding methods within the proposed tracking framework and analyse the impact of each component in the tracking scenario. We also demonstrate the flexibility of feature learning by plugging it into Hare et al.'s tracking method; the outcome is, to our knowledge, the best tracker reported to date, combining the advantages of both feature learning and structured output prediction.
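
    The pooling stage of this pipeline is easy to sketch. The function below (NumPy; the grid levels, code dimensionality, and random patch positions are illustrative assumptions) max-pools patch codes over a spatial pyramid to form the aggregated window feature that the linear classifier scores.

        import numpy as np

        def spatial_pyramid_pool(codes, xy, levels=(1, 2, 4)):
            # codes: (n, K) encoded patches; xy: (n, 2) patch centers in [0, 1).
            # Max-pool inside each cell of a 1x1, 2x2 and 4x4 grid, then
            # concatenate the pooled vectors of all cells.
            pooled = []
            for g in levels:
                cell = np.minimum((xy * g).astype(int), g - 1)
                idx = cell[:, 0] * g + cell[:, 1]
                for c in range(g * g):
                    m = idx == c
                    pooled.append(codes[m].max(0) if m.any()
                                  else np.zeros(codes.shape[1]))
            return np.concatenate(pooled)

        codes = np.abs(np.random.randn(200, 128))   # encoded local patches
        xy = np.random.rand(200, 2)                 # normalized patch positions
        feat = spatial_pyramid_pool(codes, xy)      # (1 + 4 + 16) * 128 dims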

    Bag of Attributes for Video Event Retrieval

    In this paper, we present the Bag-of-Attributes (BoA) model for video representation, aimed at video event retrieval. The BoA model is based on a semantic feature space for representing videos, resulting in high-level video feature vectors. To create the semantic (attribute) space, we train a classifier on a labeled image dataset, obtaining a classification model that can be understood as a high-level codebook. This model maps low-level frame vectors into high-level vectors (e.g., classifier probability scores). We then apply pooling operations over the frame vectors to create the final bag of attributes for the video. In the BoA representation, each dimension corresponds to one category (or attribute) of the semantic space. Other interesting properties are compactness, flexibility regarding the classifier, and the ability to encode multiple semantic concepts in a single video representation. Our experiments used the semantic space created by a deep convolutional neural network (OverFeat) pre-trained on the 1000 object categories of ImageNet: OverFeat classified each video frame, and max pooling combined the frame vectors into the BoA representation for the video. Results using BoA outperformed the baselines with statistical significance on the task of video event retrieval using the EVVE dataset.
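
    Given per-frame classifier scores, the BoA vector itself is a one-liner. In the sketch below (NumPy; random Dirichlet vectors stand in for the real OverFeat probabilities) max pooling over frames yields the video-level attribute vector, which can then be compared across videos with, e.g., cosine similarity for retrieval.

        import numpy as np

        def bag_of_attributes(frame_probs):
            # frame_probs: (n_frames, n_classes) classifier scores per frame,
            # e.g. softmax outputs of a CNN pre-trained on ImageNet classes.
            # Max pooling over frames gives one attribute vector per video.
            return frame_probs.max(axis=0)

        # Hypothetical stand-in for per-frame classification scores.
        probs = np.random.dirichlet(np.ones(1000), size=120)   # 120 frames
        boa = bag_of_attributes(probs)                         # (1000,) BoA vector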

    Are visual dictionaries generalizable?

    Mid-level features based on visual dictionaries are today a cornerstone of systems for the classification and retrieval of images. These state-of-the-art representations depend crucially on the choice of a codebook (visual dictionary), which is usually derived from the dataset. In general-purpose, dynamic image collections (e.g., the Web), one cannot have the entire collection at hand in order to extract a representative dictionary. However, based on the hypothesis that the dictionary reflects only the diversity of low-level appearances and does not capture semantics, we argue that a dictionary based on a small subset of the data, or even on an entirely different dataset, can produce a good representation, provided that the chosen images span a diverse enough portion of the low-level feature space. Our experiments confirm that hypothesis, opening the opportunity to greatly alleviate the burden of generating the codebook and confirming the feasibility of employing visual dictionaries in large-scale dynamic environments.
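
    A minimal sketch of the idea (NumPy/scikit-learn; the subset fraction and codebook size are illustrative assumptions): cluster only a small random subset of the descriptors, on the hypothesis that the resulting codebook spans the low-level feature space as well as one built from the full collection.

        import numpy as np
        from sklearn.cluster import KMeans

        def codebook_from_subset(descriptors, k=256, subset_frac=0.05, seed=0):
            # Build the visual dictionary from a small random subset of the
            # collection (it could equally come from a different dataset).
            rng = np.random.default_rng(seed)
            n = len(descriptors)
            size = max(k, int(subset_frac * n))
            sub = descriptors[rng.choice(n, size, replace=False)]
            return KMeans(n_clusters=k, n_init=4).fit(sub).cluster_centers_

        all_desc = np.random.randn(100_000, 64)    # stand-in descriptors
        D = codebook_from_subset(all_desc)         # codebook from ~5% of the data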

    Local Similarities, Global Coding: An Algorithm for Feature Coding and its Applications

    Data coding, as a building block of several image processing algorithms, has received great attention recently. Indeed, the importance of the locality assumption in coding approaches has been studied in numerous works, and several methods have been proposed based on this concept. We probe this assumption and claim that measuring the similarity between a data point and a more global set of anchor points does not necessarily weaken the coding method, as long as the underlying structure of the anchor points is taken into account. Based on this observation, we propose to capture this underlying structure by assuming a random walker over the anchor points. We show that our method is a fast approximate learning algorithm based on the diffusion map kernel. Experiments on various datasets show that making different state-of-the-art coding algorithms aware of this structure boosts them in different learning tasks.
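
    A rough sketch of the random-walker idea (NumPy; the affinity kernel, diffusion time, and the way the codes are smoothed are illustrative assumptions, not the paper's exact algorithm): build an affinity matrix over the anchor points, row-normalize it into a transition matrix, and diffuse the codes so that anchors close in the underlying structure share coding mass.

        import numpy as np

        def diffusion_smoothed_codes(codes, anchors, t=2, gamma=0.5):
            # Random walker over the anchor points: affinities -> transition
            # matrix P -> diffusion with P^t, a rough stand-in for the
            # diffusion-map kernel referred to in the abstract.
            d2 = ((anchors[:, None] - anchors[None]) ** 2).sum(-1)
            W = np.exp(-gamma * d2)
            P = W / W.sum(axis=1, keepdims=True)
            return codes @ np.linalg.matrix_power(P, t).T

        codes = np.abs(np.random.randn(300, 64))   # local codes over 64 anchors
        anchors = np.random.randn(64, 16)
        global_codes = diffusion_smoothed_codes(codes, anchors)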

    Crowd Counting via Weighted VLAD on Dense Attribute Feature Maps

    Crowd counting is an important task in computer vision with many applications in video surveillance. Although the regression-based framework has achieved great improvements for crowd counting, how to improve the discriminative power of the image representation remains an open problem. Conventional holistic features used in crowd counting often fail to capture semantic attributes and spatial cues of the image. In this paper, we propose integrating semantic information into the learning of locality-aware feature sets for accurate crowd counting. First, with the help of a convolutional neural network (CNN), the original pixel space is mapped onto a dense attribute feature map, where each dimension of the pixel-wise feature indicates the probabilistic strength of a certain semantic class. Then, locality-aware features (LAF), built on the idea of spatial pyramids over neighboring patches, are proposed to exploit more spatial context and local information. Finally, the traditional VLAD encoding method is extended to a more generalized form in which diverse coefficient weights are taken into consideration. Experimental results validate the effectiveness of the presented method.
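
    The generalized encoding step can be sketched as follows (NumPy; the uniform weights, codebook, and data are illustrative assumptions, whereas the paper learns diverse coefficient weights): standard VLAD accumulates residuals to the nearest center, and the weighted form simply scales each residual.

        import numpy as np

        def weighted_vlad(features, centers, weights=None):
            # features: (n, d) pixel-wise attribute vectors; centers: (K, d).
            # weights=None recovers plain VLAD.
            n, d = features.shape
            K = len(centers)
            w = np.ones(n) if weights is None else weights
            nearest = np.argmin(
                ((features[:, None] - centers[None]) ** 2).sum(-1), 1)
            V = np.zeros((K, d))
            for k in range(K):
                m = nearest == k
                V[k] = (w[m, None] * (features[m] - centers[k])).sum(axis=0)
            V = np.sign(V) * np.sqrt(np.abs(V))    # power normalization
            return (V / (np.linalg.norm(V) + 1e-12)).ravel()

        F = np.random.rand(4096, 20)   # dense attribute map, flattened
        C = np.random.rand(16, 20)     # codebook of attribute prototypes
        vlad = weighted_vlad(F, C)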

    Generic Image Classification Approaches Excel on Face Recognition

    The main finding of this work is that the standard image classification pipeline, which consists of dictionary learning, feature encoding, spatial pyramid pooling, and linear classification, outperforms all state-of-the-art face recognition methods on the tested benchmark datasets (AR, Extended Yale B, the challenging FERET, and LFW-a). This surprising and prominent result suggests that advances in generic image classification can be directly applied to improve face recognition systems; in other words, face recognition may not need to be viewed as a separate object classification problem. While a large body of recent residual-based face recognition methods focuses on developing complex dictionary learning algorithms, in this work we show that a dictionary of randomly extracted patches (even from non-face images) can achieve very promising results within the image classification pipeline. That is, the choice of dictionary learning method may not be important. Instead, we find that learning multiple dictionaries using different low-level image features often improves the final classification accuracy. Our proposed face recognition approach offers the best reported results on the widely used face recognition benchmark datasets. In particular, on the challenging FERET and LFW-a datasets, we improve the best accuracies reported in the literature by about 20% and 30%, respectively.
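
    The random-patch dictionary is straightforward to construct. The sketch below (NumPy; the patch size, dictionary size, and stand-in images are illustrative assumptions) extracts, mean-subtracts, and contrast-normalizes random patches; the resulting atoms then feed the usual encoding and spatial pyramid pooling stages.

        import numpy as np

        def random_patch_dictionary(images, patch=8, k=256, seed=0):
            # Dictionary of randomly extracted patches (even from non-face
            # images), which the paper finds competitive with learned ones.
            rng = np.random.default_rng(seed)
            atoms = []
            for _ in range(k):
                img = images[rng.integers(len(images))]
                y = rng.integers(img.shape[0] - patch)
                x = rng.integers(img.shape[1] - patch)
                a = img[y:y + patch, x:x + patch].ravel().astype(float)
                a -= a.mean()
                atoms.append(a / (np.linalg.norm(a) + 1e-12))
            return np.stack(atoms)         # (k, patch * patch) atom matrix

        images = [np.random.rand(64, 64) for _ in range(100)]  # stand-in images
        D = random_patch_dictionary(images)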

    A Bag-of-Words Equivalent Recurrent Neural Network for Action Recognition

    The traditional bag-of-words approach has found a wide range of applications in computer vision. The standard pipeline consists of the generation of a visual vocabulary, the quantization of the features into histograms of visual words, and a classification step, for which a support vector machine in combination with a non-linear kernel is usually used. Given large amounts of data, however, the model suffers from a lack of discriminative power. This applies particularly to action recognition, where the vast number of video features needs to be subsampled for unsupervised visual vocabulary generation. Moreover, the kernel computation can be very expensive on large datasets. In this work, we propose a recurrent neural network that is equivalent to the traditional bag-of-words approach but enables discriminative training. The model further allows the kernel computation to be incorporated directly into the network, solving the complexity issue and allowing the complete classification system to be represented within a single network. We evaluate our method on four recent action recognition benchmarks and show that it outperforms both the conventional model and sparse coding methods.
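
    One way to see the equivalence (a minimal sketch in NumPy; the soft-assignment update and all sizes are illustrative assumptions, not the paper's exact architecture): let the hidden state accumulate a soft assignment of each feature to the visual words, so that h_T / T is a soft bag-of-words histogram, and both the vocabulary and the classifier become ordinary weight matrices that could be trained discriminatively by backpropagation.

        import numpy as np

        def softmax(z):
            e = np.exp(z - z.max())
            return e / e.sum()

        def bow_rnn_forward(X, W_assign, W_cls):
            # Recurrent bag-of-words: the hidden state sums soft assignments
            # of each local feature to the visual words, so h / len(X) is a
            # soft BoW histogram; W_cls is the linear classifier on top.
            h = np.zeros(W_assign.shape[0])
            for x in X:                        # X: (T, d) feature sequence
                h = h + softmax(W_assign @ x)  # recurrent histogram update
            return W_cls @ (h / len(X))

        X = np.random.randn(50, 64)          # 50 local features of a video
        W_assign = np.random.randn(128, 64)  # 128 "visual words"
        W_cls = np.random.randn(10, 128)     # 10 action classes
        scores = bow_rnn_forward(X, W_assign, W_cls)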