18 research outputs found

    Detecting and tracking people in real-time

    The problem of detecting and tracking people in images and video has been the subject of a great deal of research, but remains a challenging task. Being able to detect and track people would have an impact in a number of fields, such as driverless vehicles, automated surveillance, and human-computer interaction. The difficulties that must be overcome include coping with variations in appearance between different people, changes in lighting, and the need to detect people across multiple scales. As well as having high accuracy, it is desirable for a technique to evaluate an image with low latency between receiving the image and producing a result. This thesis explores methods for detecting and tracking people in images and video. Techniques are implemented on a desktop computer, with an emphasis on low latency. The problem of detection is examined first. The well-established integral channel features detector is introduced and reimplemented, and several novel variations are made to the features used by the detector. Results are given to quantify the accuracy and the speed of the developed detectors on the INRIA person dataset. The method is further extended by examining the prospect of using multiple classifiers in conjunction. It is shown that using a classifier together with a version of the same classifier reflected in the vertical axis can improve performance. A novel method for clustering images of people to find modes of appearance is also presented. This involves using boosting classifiers to map a set of images to vectors, to which K-means clustering is applied. Boosting classifiers are then trained on these clustered datasets to create sets of multiple classifiers, and it is demonstrated that these sets of classifiers can be evaluated on images with only a small increase in the running time over single classifiers. The problem of single target tracking is addressed using the mean shift algorithm. Mean shift tracking works by finding the best colour match for a target from frame to frame. A novel form of mean shift tracking through scale is developed, and the problem of multiple target tracking is addressed by using boosting classifiers in conjunction with Kalman filters. Tests are carried out on the CAVIAR dataset, which gives representative examples of surveillance scenarios, to show the performance of the proposed approaches.
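
The mean shift step mentioned in this abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: it assumes a per-pixel weight map is already available (for example, a colour-histogram back-projection of the target), and the function name `mean_shift` and its parameters are hypothetical. The window is moved repeatedly to the weighted centroid of the pixels it covers until it stops moving.

```python
import numpy as np

def mean_shift(weights, center, half_size, max_iter=20):
    """Move a square window to the weighted centroid until it converges.

    weights   : 2D array, higher values mean a better colour match (assumed given)
    center    : (row, col) starting position of the window centre
    half_size : window extends half_size pixels on each side of the centre
    """
    cy, cx = center
    h, w = weights.shape
    for _ in range(max_iter):
        # Clip the window to the image bounds.
        top, bottom = max(cy - half_size, 0), min(cy + half_size + 1, h)
        left, right = max(cx - half_size, 0), min(cx + half_size + 1, w)
        window = weights[top:bottom, left:right]
        total = window.sum()
        if total == 0:
            break  # no evidence inside the window; stay where we are
        ys, xs = np.mgrid[top:bottom, left:right]
        ny = int(round((ys * window).sum() / total))
        nx = int(round((xs * window).sum() / total))
        if (ny, nx) == (cy, cx):
            break  # converged
        cy, cx = ny, nx
    return cy, cx

# Toy weight map with a single 3x3 blob centred at row 12, column 14.
weights = np.zeros((20, 20))
weights[11:14, 13:16] = 1.0
result = mean_shift(weights, center=(10, 12), half_size=3)
```

In a tracker, the converged position from one frame would typically seed the search in the next frame.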

    Vulnerable road users and connected autonomous vehicles interaction: a survey

    There is a group of users within the vehicular traffic ecosystem known as Vulnerable Road Users (VRUs). VRUs include pedestrians, cyclists, and motorcyclists, among others. Connected autonomous vehicles (CAVs), in turn, combine communication technologies that keep the vehicle continuously connected with automation technologies that assist or replace the human driver during the driving process. Autonomous vehicles are envisioned as a viable way to reduce road accidents, providing a generally safe environment for all road users, especially the most vulnerable. One of the problems facing autonomous vehicles is to generate mechanisms that facilitate their integration not only within the mobility environment, but also into the road society in a safe and efficient way. In this paper, we analyze and discuss how this integration can take place, reviewing the work that has been developed in recent years in each of the stages of the vehicle-human interaction, analyzing the challenges of vulnerable users and proposing solutions that contribute to solving these challenges. This work was partially funded by the Ministry of Economy, Industry, and Competitiveness of Spain under Grant: Supervision of drone fleet and optimization of commercial operations flight plans, PID2020-116377RB-C21.

    Algorithmic issues in visual object recognition

    This thesis is divided into two parts covering two aspects of research in the area of visual object recognition. Part I is about human detection in still images. Human detection is a challenging computer vision task due to the wide variability in human visual appearances and body poses. In this part, we present several enhancements to human detection algorithms. First, we present an extension to the integral images framework to allow for constant time computation of non-uniformly weighted summations over rectangular regions using a bundle of integral images. Such a computational element is commonly used in constructing gradient-based feature descriptors, which are the most successful in shape-based human detection. Second, we introduce deformable features as an alternative to the conventional static features used in classifiers based on boosted ensembles. Deformable features can enhance the accuracy of human detection by adapting to pose changes that can be described as translations of body features. Third, we present a comprehensive evaluation framework for cascade-based human detectors. The presented framework facilitates comparison between cascade-based detection algorithms, provides a confidence measure for results, and deploys a practical evaluation scenario. Part II explores the possibilities of enhancing the speed of core algorithms used in visual object recognition using the computing capabilities of Graphics Processing Units (GPUs). First, we present an implementation of Graph Cut on GPUs, which achieves up to 4x speedup compared to a CPU implementation. The Graph Cut algorithm has many applications related to visual object recognition such as segmentation and 3D point matching. Second, we present an efficient sparse approximation of kernel matrices for GPUs that can significantly speed up kernel based learning algorithms, which are widely used in object detection and recognition. We present an implementation of the Affinity Propagation clustering algorithm based on this representation, which is about 6 times faster than another GPU implementation based on a conventional sparse matrix representation.
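
For readers unfamiliar with the integral images framework that this thesis extends, the following is a minimal sketch of the standard, uniformly weighted case only. The thesis's non-uniformly weighted extension using a bundle of integral images is not reproduced here, and the function names are illustrative assumptions. Once the table is built, the sum over any rectangle takes four lookups regardless of its size.

```python
import numpy as np

def integral_image(img):
    """Summed-area table with a zero row and column prepended."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = np.cumsum(np.cumsum(img, axis=0), axis=1)
    return ii

def rect_sum(ii, top, left, bottom, right):
    """Sum of img[top:bottom, left:right] using four table lookups."""
    return ii[bottom, right] - ii[top, right] - ii[bottom, left] + ii[top, left]

img = np.arange(12).reshape(3, 4)
ii = integral_image(img)
inner = rect_sum(ii, 1, 1, 3, 3)  # same region as img[1:3, 1:3]
```

The padded zero row and column avoid special cases for rectangles that touch the top or left image border.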

    Human detection in surveillance videos and its applications - a review

    Detecting human beings accurately in a visual surveillance system is crucial for diverse application areas including abnormal event detection, human gait characterization, congestion analysis, person identification, gender classification and fall detection for elderly people. The first step of the detection process is to detect an object which is in motion. Object detection could be performed using background subtraction, optical flow and spatio-temporal filtering techniques. Once detected, a moving object could be classified as a human being using shape-based, texture-based or motion-based features. A comprehensive review with comparisons on available techniques for detecting human beings in surveillance videos is presented in this paper. The characteristics of a few benchmark datasets as well as the future research directions on human detection have also been discussed.
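
Of the motion detection techniques listed in this abstract, background subtraction is the simplest to illustrate. The sketch below assumes a running-average background model on grayscale frames; the function names, the learning rate `alpha` and the `threshold` value are hypothetical choices, not taken from the paper.

```python
import numpy as np

def update_background(background, frame, alpha=0.05):
    """Blend the new frame into the running-average background model."""
    return (1.0 - alpha) * background + alpha * frame

def foreground_mask(background, frame, threshold=25.0):
    """Mark pixels that differ from the background by more than the threshold."""
    return np.abs(frame.astype(np.float64) - background) > threshold

# Toy example: a static grey scene with a bright 2x2 object entering it.
background = np.full((4, 4), 100.0)
frame = np.full((4, 4), 100.0)
frame[1:3, 1:3] = 200.0

mask = foreground_mask(background, frame)
background = update_background(background, frame)
```

The resulting mask would then be passed to a classifier that decides whether the moving region is a human.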

    Automatic Multi-Scale and Multi-Object Pedestrian and Car Detection in Digital Images Based on the Discriminative Generalized Hough Transform and Deep Convolutional Neural Networks

    Many approaches have been suggested for automatic pedestrian and car detection to cope with the large variability regarding object size, occlusion, background variability, aspect and so forth. Current state-of-the-art deep learning-based frameworks rely either on a proposal generation mechanism (e.g., "Faster R-CNN") or on the inspection of image quadrants / octants (e.g., "YOLO" or "SSD"), which are then further processed with deep convolutional neural networks (CNNs). In this thesis, the Discriminative Generalized Hough Transform (DGHT), which operates on edge images, is analyzed for the application to automatic multi-scale and multi-object pedestrian and car detection in 2D digital images. The analysis motivates the use of the DGHT as an efficient proposal generation mechanism, followed by a proposal (bounding box) refinement and proposal acceptance or rejection based on a deep CNN. The impact of the different components of the resulting DGHT object detection pipeline, as well as of the amount of DGHT training data, on the detection performance is analyzed in detail. Due to the low false negative rate and the low number of candidates of the DGHT as well as the high classification accuracy of the CNN, competitive performance to the state-of-the-art in pedestrian and car detection is obtained on the IAIR database with far fewer generated proposals than other proposal-generating algorithms, being outperformed only by YOLOv2 fine-tuned to IAIR cars. By evaluations on further databases (without retraining or adaptation) the generalization capability of the DGHT object detection pipeline is shown.

    Real-time video scene analysis with heterogeneous processors

    Field-Programmable Gate Arrays (FPGAs) and General Purpose Graphics Processing Units (GPUs) allow acceleration and real-time processing of computationally intensive computer vision algorithms. The decision to use either architecture in any application is determined by task-specific priorities such as processing latency, power consumption and algorithm accuracy. This choice is normally made at design time on a heuristic or fixed algorithmic basis; here we propose an alternative method for automatic runtime selection. In this thesis, we describe our PC-based system architecture containing both platforms; this provides greater flexibility and allows dynamic selection of processing platforms to suit changing scene priorities. Using the Histograms of Oriented Gradients (HOG) algorithm for pedestrian detection, we comprehensively explore algorithm implementation on FPGA, GPU and a combination of both, and show that the effect of data transfer time on overall processing performance is significant. We also characterise performance of each implementation and quantify tradeoffs between power, time and accuracy when moving processing between architectures, then specify the optimal architecture to use when prioritising each of these. We apply this new knowledge to a real-time surveillance application representative of anomaly detection problems: detecting parked vehicles in videos. Using motion detection and car and pedestrian HOG detectors implemented across multiple architectures to generate detections, we use trajectory clustering and a Bayesian contextual motion algorithm to generate an overall scene anomaly level. This is in turn used to select the architectures on which to run the compute-intensive detectors for the next frame, with higher anomalies selecting faster, higher-power implementations. Comparing dynamic context-driven prioritisation of system performance against a fixed mapping of algorithms to architectures shows that our dynamic mapping method is 10% more accurate at detecting events than the power-optimised version, at the cost of 12W higher power consumption.


    Modeling Shape, Appearance and Motion for Human Movement Analysis

    Shape, Appearance and Motion are the most important cues for analyzing human movements in visual surveillance. Representation of these visual cues should be rich, invariant and discriminative. We present several approaches to model and integrate them for human detection and segmentation, person identification, and action recognition. First, we describe a hierarchical part-template matching approach to simultaneous human detection and segmentation combining local part-based and global shape-based schemes. For learning generic human detectors, a pose-adaptive representation is developed based on a hierarchical tree matching scheme and combined with a support vector machine classifier to perform human/non-human classification. We also formulate multiple occluded human detection using a Bayesian framework and optimize it through an iterative process. We evaluated the approach on several public pedestrian datasets. Second, given regions of interest provided by human detectors, we introduce an approach that iteratively estimates segmentation via a generalized Expectation-Maximization algorithm. The approach incorporates local Markov random field constraints and global pose inferences to propagate beliefs over image space iteratively to determine a coherent segmentation. Additionally, a layered occlusion model and a probabilistic occlusion reasoning scheme are introduced to handle inter-occlusion. The approach is tested on a wide variety of real-life images. Third, we describe an approach to appearance-based person recognition. In learning, we perform discriminative analysis through pairwise coupling of training samples, and estimate a set of normalized invariant profiles by marginalizing likelihood ratio functions which reflect local appearance differences. In recognition, we calculate discriminative information-based distances by a soft voting approach, and combine them with appearance-based distances for nearest neighbor classification. We evaluated the approach on videos of 61 individuals under significant illumination and viewpoint changes. Fourth, we describe a prototype-based approach to action recognition. During training, a set of action prototypes is learned in a joint shape and motion space via k-means clustering; during testing, humans are tracked while a frame-to-prototype correspondence is established by nearest neighbor search, and then actions are recognized using dynamic prototype sequence matching. Similarity matrices used for sequence matching are efficiently obtained by look-up table indexing. We evaluated the approach on several action datasets.

    Person re-Identification over distributed spaces and time

    Replicating the human visual system and cognitive abilities that the brain uses to process the information it receives is an area of substantial scientific interest. With the prevalence of video surveillance cameras, a portion of this scientific drive has gone into providing useful automated counterparts to human operators. A prominent task in visual surveillance is that of matching people between disjoint camera views, or re-identification. This allows operators to locate people of interest, to track people across cameras and can be used as a precursory step to multi-camera activity analysis. However, due to the contrasting conditions between camera views and their effects on the appearance of people, re-identification is a non-trivial task. This thesis proposes solutions for reducing the visual ambiguity in observations of people between camera views. It first looks at a method for mitigating the effects on the appearance of people under differing lighting conditions between camera views. This thesis builds on work modelling inter-camera illumination based on known pairs of images. A Cumulative Brightness Transfer Function (CBTF) is proposed to estimate the mapping of colour brightness values based on limited training samples. Unlike previous methods that use a mean-based representation for a set of training samples, the cumulative nature of the CBTF retains colour information from underrepresented samples in the training set. Additionally, the bi-directionality of the mapping function is explored to try and maximise re-identification accuracy by ensuring samples are accurately mapped between cameras. Secondly, an extension is proposed to the CBTF framework that addresses the issue of changing lighting conditions within a single camera. As the CBTF requires manually labelled training samples it is limited to static lighting conditions and is less effective if the lighting changes. This Adaptive CBTF (A-CBTF) differs from previous approaches that either do not consider lighting change over time, or rely on camera transition time information to update. By utilising contextual information drawn from the background in each camera view, an estimation of the lighting change within a single camera can be made. This background lighting model allows the mapping of colour information back to the original training conditions and thus removes the need for retraining. Thirdly, a novel reformulation of re-identification as a ranking problem is proposed. Previous methods use a score based on a direct distance measure of set features to form a correct/incorrect match result. Rather than offering an operator a single outcome, the ranking paradigm is to give the operator a ranked list of possible matches and allow them to make the final decision. By utilising a Support Vector Machine (SVM) ranking method, a weighting on the appearance features can be learned that capitalises on the fact that not all image features are equally important to re-identification. Additionally, an Ensemble-RankSVM is proposed to address scalability issues by separating the training samples into smaller subsets and boosting the trained models. Finally, the thesis looks at a practical application of the ranking paradigm in a real-world system. The system encompasses both the re-identification stage and the precursory extraction and tracking stages to form an aid for CCTV operators. Segmentation and detection are combined to extract relevant information from the video, while several combinations of matching techniques are combined with temporal priors to form a more comprehensive overall matching criterion. The effectiveness of the proposed approaches is tested on datasets obtained from a variety of challenging environments including offices, apartment buildings, airports and outdoor public spaces.
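
The abstract does not give the exact formulation of the CBTF, so the following is only a sketch of the general idea under stated assumptions: brightness histograms from all training samples of a camera are accumulated before normalisation (rather than averaging per-sample transfer functions), and the mapping is then obtained by matching the two cumulative histograms. The function names and the single-channel 8-bit input are hypothetical.

```python
import numpy as np

def cumulative_histogram(samples, levels=256):
    """Accumulate histograms over all samples, then normalise the running sum."""
    hist = np.zeros(levels, dtype=np.float64)
    for sample in samples:
        hist += np.bincount(np.ravel(sample), minlength=levels)
    cum = np.cumsum(hist)
    return cum / cum[-1]

def cbtf(samples_a, samples_b, levels=256):
    """Map each brightness level in camera A to a level in camera B."""
    cum_a = cumulative_histogram(samples_a, levels)
    cum_b = cumulative_histogram(samples_b, levels)
    # For each level in A, find the first level in B whose cumulative
    # mass reaches the same fraction of pixels.
    mapping = np.searchsorted(cum_b, cum_a, side="left")
    return np.clip(mapping, 0, levels - 1)

# Toy training pair: camera B sees the same person 10 levels brighter.
samples_a = [np.array([10, 20, 30], dtype=np.uint8)]
samples_b = [np.array([20, 30, 40], dtype=np.uint8)]
mapping = cbtf(samples_a, samples_b)
```

Because the histograms are summed before normalisation, brightness values that occur in only a few training samples still contribute to the mapping, which is the property the abstract attributes to the cumulative formulation.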