
    Human action recognition with MPEG-7 descriptors and architectures

    Modern video surveillance requires addressing high-level concepts such as humans' actions and activities. In addition, surveillance applications need to be portable over a variety of platforms, from servers to mobile devices. In this paper, we explore the potential of the MPEG-7 standard to provide interfaces, descriptors, and architectures for human action recognition from surveillance cameras. Two novel MPEG-7 descriptors, symbolic and feature-based, are presented alongside two different architectures, server-intensive and client-intensive. The descriptors and architectures are evaluated in the paper by way of a scenario analysis.

    Activity Recognition based on a Magnitude-Orientation Stream Network

    The temporal component of videos provides an important clue for activity recognition, as a number of activities can be reliably recognized based on the motion information. In view of that, this work proposes a novel temporal stream for two-stream convolutional networks based on images computed from the optical flow magnitude and orientation, named Magnitude-Orientation Stream (MOS), to learn the motion in a better and richer manner. Our method applies simple nonlinear transformations on the vertical and horizontal components of the optical flow to generate input images for the temporal stream. Experimental results, carried out on two well-known datasets (HMDB51 and UCF101), demonstrate that using our proposed temporal stream as input to existing neural network architectures can improve their performance for activity recognition. The results demonstrate that our temporal stream provides complementary information able to improve the classical two-stream methods, indicating the suitability of our approach as a temporal video representation. Comment: 8 pages, SIBGRAPI 2017.
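
    A minimal sketch of the idea, assuming OpenCV for dense optical flow: compute the flow between consecutive frames, convert its horizontal and vertical components to polar magnitude and orientation, and map them to images for the temporal stream. The 8-bit scaling below is a placeholder assumption; the exact non-linear transformations used by MOS are defined in the paper.

        # Illustrative sketch, not the authors' code: magnitude/orientation images
        # from dense optical flow (OpenCV). The scaling to 8-bit is an assumption.
        import cv2
        import numpy as np

        def mag_ori_images(prev_gray, curr_gray):
            # Dense optical flow between two consecutive grayscale frames.
            flow = cv2.calcOpticalFlowFarneback(
                prev_gray, curr_gray, None,
                pyr_scale=0.5, levels=3, winsize=15,
                iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
            # Horizontal/vertical flow components -> polar magnitude and orientation.
            mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1], angleInDegrees=True)
            # Map both channels to 8-bit images usable as temporal-stream input.
            mag_img = cv2.normalize(mag, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
            ori_img = (ang / 360.0 * 255.0).astype(np.uint8)
            return mag_img, ori_img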

    Going Deeper into Action Recognition: A Survey

    Understanding human actions in visual data is tied to advances in complementary research areas including object recognition, human dynamics, domain adaptation and semantic segmentation. Over the last decade, human action analysis has evolved from earlier schemes that were often limited to controlled environments to advanced solutions that can learn from millions of videos and apply to almost all daily activities. Given the broad range of applications, from video surveillance to human-computer interaction, scientific milestones in action recognition are achieved ever more rapidly, quickly rendering what was once state of the art obsolete. This motivated us to provide a comprehensive review of the notable steps taken towards recognizing human actions. To this end, we start our discussion with the pioneering methods that use handcrafted representations, and then navigate into the realm of deep-learning-based approaches. We aim to remain objective throughout this survey, touching upon encouraging improvements as well as inevitable setbacks, in the hope of raising fresh questions and motivating new research directions for the reader.

    Towards practical automated human action recognition

    University of Technology, Sydney. Faculty of Engineering and Information Technology. Modern video surveillance requires addressing high-level concepts such as humans' actions and activities. Automated human action recognition is an interesting research area, as well as one of the main trends in the automated video surveillance industry. The typical goal of action recognition is that of labelling an image sequence (video) with one out of a set of action labels. In general, it requires the extraction of a feature set from the relevant video, followed by the classification of the extracted features. Despite the many approaches for feature set extraction and classification proposed to date, some barriers to practical action recognition still exist. We argue that recognition accuracy, speed, robustness and the required hardware are the main factors in building a practical human action recognition system to be run on a typical PC for a real-time video surveillance application. For example, a computationally heavy set of measurements may prevent practical implementation on common platforms. The main focus of this thesis is challenging the main difficulties and proposing solutions towards a practical action recognition system. The main outstanding difficulties that we have challenged in this thesis include 1) initialisation issues with model training; 2) feature sets of limited computational weight suitable for real-time application; 3) model robustness to outliers; and 4) pending issues with the standardisation of software interfaces. In the following, we provide a description of our contributions to the resolution of these issues. Amongst the different approaches for classifying actions, graphical models such as the hidden Markov model (HMM) have been widely exploited by many researchers. Such models include observation probabilities which are generally modelled by mixtures of Gaussian components. When learning an HMM by way of Expectation-Maximisation (EM) algorithms, arbitrary choices must be made for their initial parameters. The initial choices have a major impact on the parameters at convergence and, in turn, on the recognition accuracy. This dependence forces us to repeat training with different initial parameters until satisfactory cross-validation accuracy is attained. Such a process is overall empirical and time consuming. We argue that one-off initialisation can offer a better trade-off between training time and accuracy, and as one of the main contributions of this thesis, we propose two methods for deterministic initialisation of the Gaussian components' centres. The first method is a time segmentation-based approach which divides each training sequence into the requested number of clusters (the product of the number of HMM states and the number of Gaussian components in each state) in the time domain. Then, the clusters' centres are averaged among all the training sequences to compute the initial centre for each Gaussian component. The second approach is a histogram-based approach which tries to initialise the components' centres with the most popular values among the training data in terms of density (similar to mode-seeking approaches). The histogram-based approach is performed incrementally, considering one feature at a time. Either centre initialisation approach is followed by dispatching the resulting Gaussian components onto HMM states. The reference component dispatching method exploits an arbitrary order for dispatching.
    In contrast, we propose two more intelligent methods based on placing components with closer centres in the same state, which can improve the correct recognition rate. Experiments over three human action video datasets (Weizmann [1], MuHAVi [2] and Hollywood [3]) prove that our proposed deterministic initialisation methods are capable of achieving accuracy above the average of repeated random initialisations (by about 1 to 3 per cent in a 6-run random initialisation experiment) and comparable to the best. At the same time, one-off deterministic initialisation can save the required training time substantially compared to repeated random initialisations, e.g. up to 83% in the case of 6 runs of random initialisation. The proposed methods are general, as they naturally extend to other models where observation densities are conditioned on discrete latent variables, such as dynamic Bayesian networks (DBNs) and switching models. As another contribution, we propose a simple and computationally lightweight feature set, named sectorial extreme points, which requires only 1.6 ms per frame for extraction on a reference PC. We believe a lightweight feature set is more appropriate for the task of action recognition in real-time surveillance applications, with the usual requirement of processing 25 frames per second (PAL video rate). The proposed feature set represents the coordinates of the extreme points in the contour of a subject's foreground mask. Various experiments prove the strength of the proposed feature set in terms of classification accuracy compared to similar feature sets, such as the star skeleton [4] (by more than 3%) and the well-known projection histograms (by up to 7%). Another main issue in density modelling of the extracted features is the outlier problem. The extraction of human features from videos is often inaccurate and prone to outliers. Such outliers can severely affect density modelling when the Gaussian distribution is used as the model, since it is short-tailed and highly sensitive to outliers. Hence, outliers can affect the classification accuracy of HMM-based action recognition approaches that exploit the Gaussian distribution as the base component. In contrast, the Student's t-distribution is more robust to outliers thanks to its longer tail and can be exploited for density modelling to improve the recognition rate in the presence of abnormal data. As another main contribution, we present an HMM which uses mixtures of t-distributions as observation probabilities and apply it to the recognition task. The experiments conducted over the Weizmann and MuHAVi datasets with various feature sets report a remarkable improvement of up to 9% in classification accuracy when using an HMM with mixtures of t-distributions instead of mixtures of Gaussians. Using our proposed sectorial extreme points feature set, we have achieved the maximum possible classification accuracy (100%) over the Weizmann dataset. This achievement should be considered jointly with the fact that we have used a lightweight feature set. On a different note, and from the implementation viewpoint, surveillance software for automated human action recognition requires portability over a variety of platforms, from servers to mobile devices. Current products mainly target low-level video analysis tasks, e.g. video annotation, instead of higher-level ones, such as action recognition.
    Therefore, we explore the potential of the MPEG-7 standard to provide a standard interface platform (through descriptors and architectures) for human action recognition from surveillance cameras. As the last contribution of this work, we present two novel MPEG-7 descriptors, one symbolic and the other feature-based, alongside two different architectures: the server-intensive architecture, which is more suitable for "thin" client devices such as PDAs, and the client-intensive architecture, which is more appropriate for "thick" clients such as desktops. We evaluate the proposed descriptors and architectures by way of a scenario analysis. We believe that through the four contributions of this thesis, human action recognition systems have become more practical. While some contributions are specific to generative models such as the HMM, other contributions are more general and can be exploited with other classification approaches. We acknowledge that the entire area of human action recognition is progressing at an enormous pace, and that other outstanding issues are being resolved by research groups world-wide. We hope that the reader will enjoy the content of this work.
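
    A minimal sketch of the time-segmentation-based centre initialisation described in the abstract above, assuming each training sequence is a (T x D) array of per-frame features; the handling of unequal segment lengths and the smarter dispatching onto HMM states are simplified here.

        # Illustrative sketch: deterministic initialisation of Gaussian component
        # centres by time segmentation. Each sequence is split into
        # K = n_states * n_comp_per_state equal time segments; segment means are
        # then averaged over all training sequences.
        import numpy as np

        def time_segmentation_centres(sequences, n_states, n_comp_per_state):
            k = n_states * n_comp_per_state          # total number of Gaussian components
            per_seq_centres = []
            for seq in sequences:                    # seq: (T, D) array of frame features
                segments = np.array_split(seq, k, axis=0)
                per_seq_centres.append([seg.mean(axis=0) for seg in segments])
            # Average the k segment means across all training sequences.
            centres = np.mean(np.asarray(per_seq_centres), axis=0)   # shape (k, D)
            # Simplest dispatching: consecutive components go to consecutive states
            # (the thesis proposes dispatching based on centre proximity instead).
            return centres.reshape(n_states, n_comp_per_state, -1)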

    Advanced content-based semantic scene analysis and information retrieval: the SCHEMA project

    The aim of the SCHEMA Network of Excellence is to bring together a critical mass of universities, research centers, industrial partners and end users, in order to design a reference system for content-based semantic scene analysis, interpretation and understanding. Relevant research areas include: content-based multimedia analysis and automatic annotation of semantic multimedia content, combined textual and multimedia information retrieval, the semantic web, the MPEG-7 and MPEG-21 standards, user interfaces and human factors. In this paper, recent advances in content-based analysis, indexing and retrieval of digital media within the SCHEMA Network are presented. These advances will be integrated in the SCHEMA module-based, expandable reference system.

    Group Invariant Deep Representations for Image Instance Retrieval

    Most image instance retrieval pipelines are based on the comparison of vectors known as global image descriptors between a query image and the database images. Due to their success in large-scale image classification, representations extracted from Convolutional Neural Networks (CNN) are quickly gaining ground on Fisher Vectors (FVs) as state-of-the-art global descriptors for image instance retrieval. While CNN-based descriptors are generally noted for good retrieval performance at lower bitrates, they nevertheless present a number of drawbacks, including a lack of robustness to common object transformations such as rotations compared with their interest-point-based FV counterparts. In this paper, we propose a method for computing invariant global descriptors from CNNs. Our method implements a recently proposed mathematical theory for invariance in a sensory cortex modeled as a feedforward neural network. The resulting global descriptors can be made invariant to multiple arbitrary transformation groups while retaining good discriminativeness. Based on a thorough empirical evaluation using several publicly available datasets, we show that our method is able to significantly and consistently improve retrieval results every time a new type of invariance is incorporated. We also show that our method, which has few parameters, is not prone to overfitting: improvements generalize well across datasets with different properties with regard to invariances. Finally, we show that our descriptors compare favourably to other state-of-the-art compact descriptors in similar bitrate ranges, exceeding the highest retrieval results reported in the literature on some datasets. A dedicated dimensionality reduction step (quantization or hashing) may be able to further improve the competitiveness of the descriptors.
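
    A minimal sketch of the general idea of pooling CNN descriptors over a transformation group, here a finite sampling of rotations; the descriptor extractor and the average pooling are placeholder assumptions and not the paper's exact construction.

        # Illustrative sketch: approximate rotation invariance by pooling global
        # CNN descriptors over transformed copies of the image (the group orbit).
        # `extract_descriptor` stands for any fixed-size CNN-based global descriptor.
        import numpy as np

        def orbit_pooled_descriptor(image, extract_descriptor, angles=(0, 90, 180, 270)):
            # image: PIL.Image; rotation is one example of a transformation group.
            descs = [extract_descriptor(image.rotate(a, expand=True)) for a in angles]
            pooled = np.stack(descs).mean(axis=0)              # average pooling over the orbit
            return pooled / (np.linalg.norm(pooled) + 1e-12)   # L2-normalise for comparison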

    Magnitude-Orientation Stream Network and Depth Information applied to Activity Recognition

    The temporal component of videos provides an important clue for activity recognition, as a number of activities can be reliably recognized based on the motion information. In view of that, this work proposes a novel temporal stream for two-stream convolutional networks based on images computed from the optical flow magnitude and orientation, named Magnitude-Orientation Stream (MOS), to learn the motion in a better and richer manner. Our method applies simple non-linear transformations on the vertical and horizontal components of the optical flow to generate input images for the temporal stream. Moreover, we also employ depth information as a weighting scheme on the magnitude information, to compensate for the distance of the subjects performing the activity from the camera. Experimental results, carried out on two well-known datasets (UCF101 and NTU), demonstrate that using our proposed temporal stream as input to existing neural network architectures can improve their performance for activity recognition. The results demonstrate that our temporal stream provides complementary information able to improve the classical two-stream methods, indicating the suitability of our approach as a temporal video representation. Keywords: two-stream convolutional networks, spatiotemporal information, optical flow, depth information.
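
    A minimal sketch of the depth-weighting idea, assuming a depth map aligned with the optical-flow magnitude; the specific weighting function below is an assumption, not the paper's exact formula.

        # Illustrative sketch: boost the flow magnitude of distant subjects, whose
        # image-plane motion is smaller, using a normalised depth map as a weight.
        import numpy as np

        def depth_weighted_magnitude(magnitude, depth, eps=1e-6):
            # magnitude: (H, W) optical-flow magnitude; depth: (H, W) aligned depth map.
            d = (depth - depth.min()) / (depth.max() - depth.min() + eps)
            return magnitude * (1.0 + d)             # larger depth -> stronger weighting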

    Video Classification With CNNs: Using The Codec As A Spatio-Temporal Activity Sensor

    We investigate video classification via a two-stream convolutional neural network (CNN) design that directly ingests information extracted from compressed video bitstreams. Our approach begins with the observation that all modern video codecs divide the input frames into macroblocks (MBs). We demonstrate that selective access to MB motion vector (MV) information within compressed video bitstreams can also provide for selective, motion-adaptive, MB pixel decoding (a.k.a. MB texture decoding). This in turn allows for the derivation of spatio-temporal video activity regions at extremely high speed in comparison to conventional full-frame decoding followed by optical flow estimation. In order to evaluate the accuracy of a video classification framework based on such activity data, we independently train two CNN architectures on MB texture and MV correspondences and then fuse their scores to derive the final classification of each test video. Evaluation on two standard datasets shows that the proposed approach is competitive with the best two-stream video classification approaches found in the literature. At the same time: (i) a CPU-based realization of our MV extraction is over 977 times faster than GPU-based optical flow methods; (ii) selective decoding is up to 12 times faster than full-frame decoding; (iii) our proposed spatial and temporal CNNs perform inference at 5 to 49 times lower cloud computing cost than the fastest methods from the literature. Comment: Accepted in IEEE Transactions on Circuits and Systems for Video Technology. Extension of ICIP 2017 conference paper.
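
    A minimal sketch of the late score fusion between the two independently trained CNNs (MB texture and motion vectors); equal fusion weights are an assumption.

        # Illustrative sketch: fuse per-class scores from the texture CNN and the
        # motion-vector CNN to classify each test video.
        import numpy as np

        def softmax(x):
            e = np.exp(x - x.max(axis=-1, keepdims=True))
            return e / e.sum(axis=-1, keepdims=True)

        def fuse_and_classify(texture_logits, mv_logits, w_texture=0.5, w_mv=0.5):
            fused = w_texture * softmax(texture_logits) + w_mv * softmax(mv_logits)
            return fused.argmax(axis=-1)             # predicted class per video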

    Compressed Video Action Recognition

    Training robust deep video representations has proven to be much more challenging than learning deep image representations. This is in part due to the enormous size of raw video streams and the high temporal redundancy; the true and interesting signal is often drowned in too much irrelevant data. Motivated by the fact that this superfluous information can be reduced by up to two orders of magnitude by video compression (using H.264, HEVC, etc.), we propose to train a deep network directly on the compressed video. This representation has a higher information density, and we found the training to be easier. In addition, the signals in a compressed video provide free, albeit noisy, motion information. We propose novel techniques to use them effectively. Our approach is about 4.6 times faster than Res3D and 2.7 times faster than ResNet-152. On the task of action recognition, our approach outperforms all the other methods on the UCF-101, HMDB-51, and Charades datasets. Comment: CVPR 2018 (selected for spotlight presentation).
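
    One possible route to the "free, albeit noisy, motion information" mentioned above is to ask the decoder to export its motion vectors; the sketch below assumes PyAV with FFmpeg's export_mvs side data, and is an illustration rather than the paper's own pipeline.

        # Illustrative sketch (assumes PyAV and FFmpeg's export_mvs side data):
        # read the motion vectors the decoder already computes, instead of
        # estimating optical flow from fully decoded frames.
        import av

        def frame_motion_vectors(path):
            # Yields, per decoded frame, a list of (dst_x, dst_y, motion_x, motion_y).
            with av.open(path) as container:
                stream = container.streams.video[0]
                stream.codec_context.options = {"flags2": "+export_mvs"}
                for frame in container.decode(stream):
                    mvs = frame.side_data.get("MOTION_VECTORS")
                    if mvs is None:                  # e.g. intra-coded frames carry no MVs
                        yield []
                        continue
                    # Fields follow FFmpeg's AVMotionVector structure.
                    yield [(mv.dst_x, mv.dst_y, mv.motion_x, mv.motion_y) for mv in mvs]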