506 research outputs found

    MoWLD: a robust motion image descriptor for violence detection

    © 2015, Springer Science+Business Media New York. Automatic violence detection from video is a hot topic for many video surveillance applications. However, there has been little success in designing an algorithm that can detect violence in surveillance videos with high performance. Existing methods typically apply the Bag-of-Words (BoW) model on local spatiotemporal descriptors. However, traditional spatiotemporal features are not discriminative enough, and the BoW model roughly assigns each feature vector to only one visual word and therefore ignores the spatial relationships among the features. To tackle these problems, in this paper we propose a novel Motion Weber Local Descriptor (MoWLD) in the spirit of the well-known WLD and make it a powerful and robust descriptor for motion images. We extend the WLD spatial descriptions by adding a temporal component to the appearance descriptor, which implicitly captures local motion information as well as low-level image appearance information. To eliminate redundant and irrelevant features, non-parametric Kernel Density Estimation (KDE) is employed on the MoWLD descriptor. In order to obtain more discriminative features, we adopt a sparse coding and max pooling scheme to further process the selected MoWLDs. Experimental results on three benchmark datasets demonstrate the superiority of the proposed approach over the state of the art.
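
The max pooling step mentioned in the abstract above can be illustrated with a minimal sketch. It assumes the sparse codes of one video are already available as a matrix with one row per selected descriptor and one column per dictionary atom; the function name `max_pool_codes` and the use of absolute values are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def max_pool_codes(codes):
    """Pool a (n_descriptors, n_atoms) matrix of sparse codes into one vector.

    Assumption: pooling takes the largest absolute response per dictionary atom.
    """
    codes = np.asarray(codes, dtype=float)
    # Column-wise maximum: one value per dictionary atom.
    return np.abs(codes).max(axis=0)
```

The pooled vector has a fixed length (the dictionary size) regardless of how many descriptors a video produces, which is what lets it feed a standard classifier.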

    Learned Spatio-Temporal Texture Descriptors for RGB-D Human Action Recognition

    Due to the recent arrival of Kinect, action recognition with depth images has attracted wide attention from researchers, and various descriptors have been proposed, among which Local Binary Patterns (LBP) texture descriptors possess the property of appearance invariance. However, LBP and its variants are mostly hand-designed, demanding strong prior knowledge from engineers, and are not discriminative enough for recognition tasks. To this end, this paper develops compact spatio-temporal texture descriptors, i.e. 3D-compact LBP (3D-CLBP) and local depth patterns (3D-CLDP), for color and depth videos in the light of compact binary face descriptor learning in face recognition. Extensive experiments performed on three standard datasets, 3D Online Action, MSR Action Pairs and MSR Daily Activity 3D, demonstrate that our method is superior to most comparative methods in terms of performance and can capture spatial-temporal texture cues in videos.
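
For readers unfamiliar with the hand-designed baseline that the abstract above contrasts against, here is a minimal sketch of the classic 8-neighbour LBP on a 2D image. It is not the learned 3D-CLBP or 3D-CLDP descriptor; the function names and the clockwise bit order are assumptions chosen for illustration.

```python
import numpy as np

def lbp_code(patch):
    """Return the 8-bit LBP code of the centre pixel of a 3x3 patch."""
    patch = np.asarray(patch)
    center = patch[1, 1]
    # Neighbours visited clockwise from the top-left corner (an assumed, fixed order).
    offsets = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
    bits = [1 if patch[r, c] >= center else 0 for r, c in offsets]
    # The i-th neighbour contributes bit i of the code.
    return sum(bit << i for i, bit in enumerate(bits))

def lbp_histogram(image):
    """Histogram of LBP codes over all interior pixels of a 2D image."""
    image = np.asarray(image)
    height, width = image.shape
    hist = np.zeros(256, dtype=np.int64)
    for r in range(1, height - 1):
        for c in range(1, width - 1):
            hist[lbp_code(image[r - 1:r + 2, c - 1:c + 2])] += 1
    return hist
```

Because each bit only records whether a neighbour is at least as bright as the centre, the code is unchanged by any monotonic brightness change, which is the invariance property the abstract refers to.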

    Enhanced CNN for image denoising

    Owing to the flexible architectures of deep convolutional neural networks (CNNs), CNNs are successfully used for image denoising. However, they suffer from the following drawbacks: (i) deep network architectures are very difficult to train; (ii) deeper networks face the challenge of performance saturation. In this study, the authors propose a novel method called enhanced convolutional neural denoising network (ECNDNet). Specifically, they use residual learning and batch normalisation techniques to address the problem of training difficulties and accelerate the convergence of the network. In addition, dilated convolutions are used in the proposed network to enlarge the context information and reduce the computational cost. Extensive experiments demonstrate that the ECNDNet outperforms the state-of-the-art methods for image denoising. Comment: CAAI Transactions on Intelligence Technology[J], 201
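
The claim above that dilated convolutions enlarge the context at no extra parameter cost can be checked with simple arithmetic. The sketch below computes the receptive field of a stack of stride-1 convolutions; the function name and the dilation rates used in the comparison are illustrative assumptions, not the ECNDNet configuration.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field (in pixels, one axis) of stacked stride-1 convolutions.

    Each layer with dilation d widens the field by (kernel_size - 1) * d.
    """
    field = 1
    for dilation in dilations:
        field += (kernel_size - 1) * dilation
    return field

# Three ordinary 3x3 layers versus three layers with assumed dilations 1, 2, 4.
plain = receptive_field(3, [1, 1, 1])
dilated = receptive_field(3, [1, 2, 4])
```

Both stacks use the same number of weights per layer, yet the dilated stack covers roughly twice the context in each dimension, so fewer layers are needed to reach a given receptive field.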

    Convolutional Sparse Kernel Network for Unsupervised Medical Image Analysis

    The availability of large-scale annotated image datasets and recent advances in supervised deep learning methods enable the end-to-end derivation of representative image features that can impact a variety of image analysis problems. Such supervised approaches, however, are difficult to implement in the medical domain, where large volumes of labelled data are difficult to obtain due to the complexity of manual annotation and inter- and intra-observer variability in label assignment. We propose a new convolutional sparse kernel network (CSKN), a hierarchical unsupervised feature learning framework that addresses the challenge of learning representative visual features in medical image analysis domains where there is a lack of annotated training data. Our framework has three contributions: (i) we extend kernel learning to identify and represent invariant features across image sub-patches in an unsupervised manner; (ii) we initialise our kernel learning with a layer-wise pre-training scheme that leverages the sparsity inherent in medical images to extract initial discriminative features; (iii) we adapt a multi-scale spatial pyramid pooling (SPP) framework to capture subtle geometric differences between learned visual features. We evaluated our framework in medical image retrieval and classification on three public datasets. Our results show that our CSKN had better accuracy than other conventional unsupervised methods and comparable accuracy to methods that used state-of-the-art supervised convolutional neural networks (CNNs). Our findings indicate that our unsupervised CSKN provides an opportunity to leverage unannotated big data in medical imaging repositories. Comment: Accepted by Medical Image Analysis (with a new title 'Convolutional Sparse Kernel Network for Unsupervised Medical Image Analysis'). The manuscript is available from the following link (https://doi.org/10.1016/j.media.2019.06.005)
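
The multi-scale spatial pyramid pooling named in contribution (iii) above can be sketched generically as follows. This is a plain SPP over a feature map, assuming max pooling and pyramid levels of 1x1, 2x2 and 4x4 cells; the function name and these settings are assumptions, and the adapted variant used in CSKN may differ.

```python
import numpy as np

def spatial_pyramid_pool(feature_map, levels=(1, 2, 4)):
    """Max-pool a (height, width, channels) map over grids of n x n cells.

    Assumes height and width are at least max(levels) so no cell is empty.
    """
    fmap = np.asarray(feature_map, dtype=float)
    height, width, _ = fmap.shape
    pooled = []
    for n in levels:
        # Integer cell boundaries that split each axis into n parts.
        row_edges = np.linspace(0, height, n + 1).astype(int)
        col_edges = np.linspace(0, width, n + 1).astype(int)
        for i in range(n):
            for j in range(n):
                cell = fmap[row_edges[i]:row_edges[i + 1],
                            col_edges[j]:col_edges[j + 1]]
                pooled.append(cell.max(axis=(0, 1)))
    # One channel-length vector per cell, concatenated coarse to fine.
    return np.concatenate(pooled)
```

The output length depends only on the number of cells and channels, so images of different sizes map to vectors of the same length while the finer levels still retain coarse spatial layout.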

    REPRESENTATION LEARNING FOR ACTION RECOGNITION

    The objective of this research work is to develop discriminative representations for human actions. The motivation stems from the fact that many issues are encountered while capturing actions in videos, such as intra-action variations (due to actors, viewpoints, and duration), inter-action similarity, background motion, and occlusion of actors. Hence, obtaining a representation which can address all the variations in the same action while maintaining discrimination from other actions is a challenging task. In the literature, actions have been represented using either low-level or high-level features. Low-level features describe the motion and appearance in small spatio-temporal volumes extracted from a video. Due to the limited space-time volume used for extracting low-level features, they are not able to account for viewpoint and actor variations or variable-length actions. On the other hand, high-level features handle variations in actors, viewpoints, and duration, but the resulting representation is often high-dimensional, which introduces the curse of dimensionality. In this thesis, we propose new representations for describing actions by combining the advantages of both low-level and high-level features. Specifically, we investigate various linear and non-linear decomposition techniques to extract meaningful attributes in both high-level and low-level features. In the first approach, the sparsity of high-level feature descriptors is leveraged to build action-specific dictionaries. Each dictionary retains only the discriminative information for a particular action and hence reduces inter-action similarity. Then, a sparsity-based classification method is proposed to classify the low-rank representation of clips obtained using these dictionaries. We show that this representation based on dictionary learning improves the classification performance across actions.
    Also, a few of the actions consist of rapid body deformations that hinder the extraction of local features from body movements. Hence, we propose to use a dictionary which is trained on convolutional neural network (CNN) features of the human body in various poses to reliably identify actors from the background. Particularly, we demonstrate the efficacy of sparse representation in the identification of the human body under rapid and substantial deformation. In the first two approaches, sparsity-based representation is developed to improve discriminability using class-specific dictionaries that utilize action labels. However, developing an unsupervised representation of actions is more beneficial as it can be used to both recognize similar actions and localize actions. We propose to exploit inter-action similarity to train a universal attribute model (UAM) in order to learn action attributes (common and distinct) implicitly across all the actions. Using maximum a posteriori (MAP) adaptation, a high-dimensional super action-vector (SAV) for each clip is extracted. As this SAV contains redundant attributes of all other actions, we use factor analysis to extract a novel low-dimensional action-vector representation for each clip. Action-vectors are shown to suppress background motion and highlight actions of interest in both trimmed and untrimmed clips, which contributes to action recognition without the help of any classifiers. It is observed during our experiments that action-vectors cannot effectively discriminate between actions which are visually similar to each other. Hence, we subject action-vectors to supervised linear embedding using linear discriminant analysis (LDA) and probabilistic LDA (PLDA) to enforce discrimination. Particularly, we show that leveraging complementary information across action-vectors using different local features followed by discriminative embedding provides the best classification performance.
    Further, we explore non-linear embedding of action-vectors using Siamese networks, especially for fine-grained action recognition. A visualization of the hidden layer output in Siamese networks shows its ability to effectively separate visually similar actions. This leads to better classification performance than linear embedding on fine-grained action recognition. All of the above approaches are presented on large unconstrained datasets with hundreds of examples per action. However, actions in surveillance videos like snatch thefts are difficult to model because of the diverse variety of scenarios in which they occur and the very few labeled examples. Hence, we propose to utilize the universal attribute model (UAM) trained on large action datasets to represent such actions. Specifically, we show that there are similarities between certain actions in the large datasets and snatch thefts, which help in extracting a representation for snatch thefts using the attributes from the UAM. This representation is shown to be effective in distinguishing snatch thefts from regular actions with high accuracy. In summary, this thesis proposes both supervised and unsupervised approaches for representing actions which provide better discrimination than existing representations. The first approach presents a dictionary learning based sparse representation for effective discrimination of actions. Also, we propose a sparse representation for the human body based on dictionaries in order to recognize actions with rapid body deformations. In the next approach, a low-dimensional representation called action-vector for unsupervised action recognition is presented. Further, linear and non-linear embedding of action-vectors is proposed for addressing inter-action similarity and fine-grained action recognition, respectively. Finally, we propose a representation for locating snatch thefts among thousands of regular interactions in surveillance videos.

    Vision-based Person Re-identification in a Queue


    MoWLD: A Robust Motion Image Descriptor for Violence Detection

    Automatic violence detection from video is a hot topic for many video surveillance applications. However, there has been little success in designing an algorithm that can detect violence in surveillance videos with high performance. Existing methods typically apply the Bag-of-Words (BoW) model on local spatiotemporal descriptors. However, traditional spatiotemporal features are not discriminative enough, and the BoW model roughly assigns each feature vector to only one visual word and therefore ignores the spatial relationships among the features. To tackle these problems, in this paper we propose a novel Motion Weber Local Descriptor (MoWLD) in the spirit of the well-known WLD and make it a powerful and robust descriptor for motion images. We extend the WLD spatial descriptions by adding a temporal component to the appearance descriptor, which implicitly captures local motion information as well as low-level image appearance information. To eliminate redundant and irrelevant features, non-parametric Kernel Density Estimation (KDE) is employed on the MoWLD descriptor. In order to obtain more discriminative features, we adopt a sparse coding and max pooling scheme to further process the selected MoWLDs. Experimental results on three benchmark datasets demonstrate the superiority of the proposed approach over the state of the art.

    Neonatal pain detection in videos using the iCOPEvid dataset and an ensemble of descriptors extracted from Gaussian of Local Descriptors

    Diagnosing pain in neonates is difficult but critical. Although approximately thirty manual pain instruments have been developed for neonatal pain diagnosis, most are complex, multifactorial, and geared toward research. The goals of this work are twofold: 1) to develop a new video dataset for automatic neonatal pain detection called iCOPEvid (infant Classification Of Pain Expressions videos), and 2) to present a classification system that sets a challenging comparison performance on this dataset. The iCOPEvid dataset contains 234 videos of 49 neonates experiencing a set of noxious stimuli, a period of rest, and an acute pain stimulus. From these videos 20 s segments are extracted and grouped into two classes: pain (49) and nopain (185), with the nopain video segments handpicked to produce a highly challenging dataset. An ensemble of twelve global and local descriptors with a Bag-of-Features approach is utilized to improve the performance of some new descriptors based on Gaussian of Local Descriptors (GOLD). The basic classifier used in the ensembles is the Support Vector Machine, and decisions are combined by sum rule. These results are compared with standard methods, some deep learning approaches, and 185 human assessments. Our best machine learning methods are shown to outperform the human judges
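
The sum rule used above to combine the ensemble's decisions can be sketched in a few lines. The sketch assumes each classifier outputs a score matrix with one row per video segment and one column per class (for example pain and nopain), on comparable scales; the function name `sum_rule` and the example scores are illustrative, not values from the paper.

```python
import numpy as np

def sum_rule(score_matrices):
    """Fuse per-classifier (n_samples, n_classes) scores by summing them.

    Returns the fused scores and the index of the winning class per sample.
    """
    stacked = np.stack([np.asarray(s, dtype=float) for s in score_matrices])
    fused = stacked.sum(axis=0)
    return fused, fused.argmax(axis=1)

# Hypothetical scores from three classifiers for two segments.
scores_a = [[0.6, 0.4], [0.2, 0.8]]
scores_b = [[0.3, 0.7], [0.4, 0.6]]
scores_c = [[0.7, 0.3], [0.1, 0.9]]
fused, labels = sum_rule([scores_a, scores_b, scores_c])
```

In the first segment the second classifier disagrees with the other two, but the summed scores still favour the first class, which shows how the rule lets confident classifiers outweigh a single dissenting one.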