2,590 research outputs found

    Gaussian mixture model classifiers for detection and tracking in UAV video streams.

    Get PDF
    Masters Degree. University of KwaZulu-Natal, Durban.Manual visual surveillance systems are subject to a high degree of human-error and operator fatigue. The automation of such systems often employs detectors, trackers and classifiers as fundamental building blocks. Detection, tracking and classification are especially useful and challenging in Unmanned Aerial Vehicle (UAV) based surveillance systems. Previous solutions have addressed challenges via complex classification methods. This dissertation proposes less complex Gaussian Mixture Model (GMM) based classifiers that can simplify the process; where data is represented as a reduced set of model parameters, and classification is performed in the low dimensionality parameter-space. The specification and adoption of GMM based classifiers on the UAV visual tracking feature space formed the principal contribution of the work. This methodology can be generalised to other feature spaces. This dissertation presents two main contributions in the form of submissions to ISI accredited journals. In the first paper, objectives are demonstrated with a vehicle detector incorporating a two stage GMM classifier, applied to a single feature space, namely Histogram of Oriented Gradients (HoG). While the second paper demonstrates objectives with a vehicle tracker using colour histograms (in RGB and HSV), with Gaussian Mixture Model (GMM) classifiers and a Kalman filter. The proposed works are comparable to related works with testing performed on benchmark datasets. In the tracking domain for such platforms, tracking alone is insufficient. Adaptive detection and classification can assist in search space reduction, building of knowledge priors and improved target representations. Results show that the proposed approach improves performance and robustness. Findings also indicate potential further enhancements such as a multi-mode tracker with global and local tracking based on a combination of both papers

    Action Recognition Using Particle Flow Fields

    Get PDF
    In recent years, research in human action recognition has advanced on multiple fronts to address various types of actions including simple, isolated actions in staged data (e.g., KTH dataset), complex actions (e.g., Hollywood dataset), and naturally occurring actions in surveillance videos (e.g, VIRAT dataset). Several techniques including those based on gradient, flow, and interest-points, have been developed for their recognition. Most perform very well in standard action recognition datasets, but fail to produce similar results in more complex, large-scale datasets. Action recognition on large categories of unconstrained videos taken from the web is a very challenging problem compared to datasets like KTH (six actions), IXMAS (thirteen actions), and Weizmann (ten actions). Challenges such as camera motion, different viewpoints, huge interclass variations, cluttered background, occlusions, bad illumination conditions, and poor quality of web videos cause the majority of the state-of-the-art action recognition approaches to fail. An increasing number of categories and the inclusion of actions with high confusion also increase the difficulty of the problem. The approach taken to solve this action recognition problem depends primarily on the dataset and the possibility of detecting and tracking the object of interest. In this dissertation, a new method for video representation is proposed and three new approaches to perform action recognition in different scenarios using varying prerequisites are presented. The prerequisites have decreasing levels of difficulty to obtain: 1) Scenario requires human detection and trackiii ing to perform action recognition; 2) Scenario requires background and foreground separation to perform action recognition; and 3) No pre-processing is required for action recognition. First, we propose a new video representation using optical flow and particle advection. The proposed “Particle Flow Field” (PFF) representation has been used to generate motion descriptors and tested in a Bag of Video Words (BoVW) framework on the KTH dataset. We show that particle flow fields has better performance than other low-level video representations, such as 2D-Gradients, 3D-Gradients and optical flow. Second, we analyze the performance of the state-of-the-art technique based on the histogram of oriented 3D-Gradients in spatio temporal volumes, where human detection and tracking are required. We use the proposed particle flow field and show superior results compared to the histogram of oriented 3D-Gradients in spatio temporal volumes. The proposed method, when used for human action recognition, just needs human detection and does not necessarily require human tracking and figure centric bounding boxes. It has been tested on KTH (six actions), Weizmann (ten actions), and IXMAS (thirteen actions, 4 different views) action recognition datasets. Third, we propose using the scene context information obtained from moving and stationary pixels in the key frames, in conjunction with motion descriptors obtained using Bag of Words framework, to solve the action recognition problem on a large (50 actions) dataset with videos from the web. We perform a combination of early and late fusion on multiple features to handle the huge number of categories. We demonstrate that scene context is a very important feature for performing action recognition on huge datasets. iv The proposed method needs separation of moving and stationary pixels, and does not require any kind of video stabilization, person detection, or tracking and pruning of features. Our approach obtains good performance on a huge number of action categories. It has been tested on the UCF50 dataset with 50 action categories, which is an extension of the UCF YouTube Action (UCF11) Dataset containing 11 action categories. We also tested our approach on the KTH and HMDB51 datasets for comparison. Finally, we focus on solving practice problems in representing actions by bag of spatio temporal features (i.e. cuboids), which has proven valuable for action recognition in recent literature. We observed that the visual vocabulary based (bag of video words) method suffers from many drawbacks in practice, such as: (i) It requires an intensive training stage to obtain good performance; (ii) it is sensitive to the vocabulary size; (iii) it is unable to cope with incremental recognition problems; (iv) it is unable to recognize simultaneous multiple actions; (v) it is unable to perform recognition frame by frame. In order to overcome these drawbacks, we propose a framework to index large scale motion features using Sphere/Rectangle-tree (SR-tree) for incremental action detection and recognition. The recognition comprises of the following two steps: 1) recognizing the local features by non-parametric nearest neighbor (NN), and 2) using a simple voting strategy to label the action. It can also provide localization of the action. Since it does not require feature quantization it can efficiently grow the feature-tree by adding features from new training actions or categories. Our method provides an effective way for practical incremental action recognition. Furthermore, it can handle large scale datasets because the SR-tree is a disk-based v data structure. We tested our approach on two publicly available datasets, the KTH dataset and the IXMAS multi-view dataset, and achieved promising results

    Fast and Accurate Algorithm for Eye Localization for Gaze Tracking in Low Resolution Images

    Full text link
    Iris centre localization in low-resolution visible images is a challenging problem in computer vision community due to noise, shadows, occlusions, pose variations, eye blinks, etc. This paper proposes an efficient method for determining iris centre in low-resolution images in the visible spectrum. Even low-cost consumer-grade webcams can be used for gaze tracking without any additional hardware. A two-stage algorithm is proposed for iris centre localization. The proposed method uses geometrical characteristics of the eye. In the first stage, a fast convolution based approach is used for obtaining the coarse location of iris centre (IC). The IC location is further refined in the second stage using boundary tracing and ellipse fitting. The algorithm has been evaluated in public databases like BioID, Gi4E and is found to outperform the state of the art methods.Comment: 12 pages, 10 figures, IET Computer Vision, 201

    Modeling geometric-temporal context with directional pyramid co-occurrence for action recognition

    Get PDF
    In this paper, we present a new geometric-temporal representation for visual action recognition based on local spatio-temporal features. First, we propose a modified covariance descriptor under the log-Euclidean Riemannian metric to represent the spatio-temporal cuboids detected in the video sequences. Compared with previously proposed covariance descriptors, our descriptor can be measured and clustered in Euclidian space. Second, to capture the geometric-temporal contextual information, we construct a directional pyramid co-occurrence matrix (DPCM) to describe the spatio-temporal distribution of the vector-quantized local feature descriptors extracted from a video. DPCM characterizes the co-occurrence statistics of local features as well as the spatio-temporal positional relationships among the concurrent features. These statistics provide strong descriptive power for action recognition. To use DPCM for action recognition, we propose a directional pyramid co-occurrence matching kernel to measure the similarity of videos. The proposed method achieves the state-of-the-art performance and improves on the recognition performance of the bag-of-visual-words (BOVWs) models by a large margin on six public data sets. For example, on the KTH data set, it achieves 98.78% accuracy while the BOVW approach only achieves 88.06%. On both Weizmann and UCF CIL data sets, the highest possible accuracy of 100% is achieved
    • …
    corecore