
    Graph-based topic models for trajectory clustering in crowd videos

    Probabilistic topic models, such as latent Dirichlet allocation (LDA) and correlated topic models (CTM), have recently emerged as powerful statistical tools for processing video content. They share an important property, i.e., using a common set of topics to model all data. However, this property can be too restrictive for modeling complex visual data such as crowd scenes, where multiple fields of heterogeneous data jointly provide rich information about objects and events. This paper proposes graph-based extensions of LDA and CTM, referred to as GLDA and GCTM, to learn and analyze motion patterns by trajectory clustering in a highly cluttered and crowded environment. Unlike previous works that relied on a scene prior, we apply a spatio-temporal graph (STG) to uncover the spatial and temporal coherence between the trajectories of crowd motion during the learning process. The presented models advance conventional approaches by integrating manifold-based clustering as initialization and iterative statistical inference as optimization. The outputs of GLDA and GCTM are mid-level features representing motion patterns, which are later used to generate trajectory clusters. Experiments on three different datasets show the effectiveness of the approaches in trajectory clustering and crowd motion modeling.
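    The GLDA/GCTM models themselves are not reproduced here, but the initialization step the abstract names, manifold-based clustering over a spatio-temporal graph of trajectories, can be sketched roughly as below. The trajectory format, the affinity definition, and the use of scikit-learn's spectral clustering are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (assumptions, not the paper's code): build a spatio-temporal
# affinity graph over trajectory segments and cluster it with a manifold-based
# (spectral) method, as an initialization for topic-model inference.
import numpy as np
from sklearn.cluster import SpectralClustering

def st_affinity(traj_a, traj_b, sigma_s=5.0, sigma_t=10.0):
    """Spatio-temporal affinity between two (N, 3) trajectories of (x, y, t)."""
    ds = np.linalg.norm(traj_a[:, :2].mean(0) - traj_b[:, :2].mean(0))
    dt = abs(traj_a[:, 2].mean() - traj_b[:, 2].mean())
    return np.exp(-(ds / sigma_s) ** 2) * np.exp(-(dt / sigma_t) ** 2)

def init_clusters(trajectories, n_patterns=4):
    """Cluster trajectory segments on the spatio-temporal graph."""
    n = len(trajectories)
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(i, n):
            W[i, j] = W[j, i] = st_affinity(trajectories[i], trajectories[j])
    labels = SpectralClustering(
        n_clusters=n_patterns, affinity="precomputed", random_state=0
    ).fit_predict(W)
    return labels

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    trajs = [np.column_stack([rng.normal(c, 1.0, (20, 2)),
                              np.arange(20)[:, None]])
             for c in (0, 0, 10, 10)]
    print(init_clusters(trajs, n_patterns=2))  # two spatial groups expected
```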

    Graph-based correlated topic model for trajectory clustering in crowded videos

    This paper presents a graph-based correlated topic model (GCTM) to analyse various motion patterns by trajectory clustering in a highly cluttered and crowded environment. Unlike existing methods that address trajectory clustering and crowd motion modelling using local motion features such as optical flow, it builds on trajectory segments extracted from crowded scenes. Correlated topic models have previously been applied to learn mid-level features in crowded scenes; however, they depend on scene priors in the learning process. GCTM addresses this issue by using a spatio-temporal graph and manifold-based clustering as initialization and iterative statistical inference as optimization. The output of GCTM is a set of mid-level features that serve as input to the final step, which generates trajectory clusters. Experiments on two different datasets show the effectiveness of the approach in trajectory clustering and crowd motion modelling.
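    GCTM is not available as an off-the-shelf library, so the sketch below substitutes scikit-learn's plain LDA simply to illustrate the pipeline shape the abstract describes: quantise trajectory segments into motion "words", learn mid-level topic features, then cluster trajectories by their topic mixtures. The grid size, direction quantisation, and the k-means final step are all assumptions for illustration.

```python
# Stand-in pipeline sketch: bag-of-motion-words per trajectory -> topic
# features -> clustering. Plain LDA is used here only as a placeholder
# for the paper's graph-based correlated topic model.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.cluster import KMeans

GRID, DIRS = 8, 4  # 8x8 spatial grid, 4 quantised motion directions

def trajectory_to_counts(traj, frame_size=(64, 64)):
    """Histogram over words, where a word = (grid cell, motion direction)."""
    counts = np.zeros(GRID * GRID * DIRS)
    for (x0, y0), (x1, y1) in zip(traj[:-1], traj[1:]):
        cx = min(int(x0 / frame_size[0] * GRID), GRID - 1)
        cy = min(int(y0 / frame_size[1] * GRID), GRID - 1)
        ang = np.arctan2(y1 - y0, x1 - x0)            # step direction
        d = int(((ang + np.pi) / (2 * np.pi)) * DIRS) % DIRS
        counts[(cy * GRID + cx) * DIRS + d] += 1
    return counts

def cluster_trajectories(trajs, n_topics=5, n_clusters=3):
    X = np.array([trajectory_to_counts(t) for t in trajs])
    theta = LatentDirichletAllocation(n_components=n_topics,
                                      random_state=0).fit_transform(X)
    return KMeans(n_clusters=n_clusters, n_init=10,
                  random_state=0).fit_predict(theta)
```

    The topic mixtures play the role of the mid-level features mentioned in the abstract; the final clustering step operates on those mixtures rather than on raw trajectory points.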

    Activity understanding and unusual event detection in surveillance videos

    Computer scientists have made ceaseless efforts to replicate the cognitive video understanding abilities of human brains in autonomous vision systems. As video surveillance cameras become ubiquitous, there is a surge in studies on automated activity understanding and unusual event detection in surveillance videos. Nevertheless, video content analysis in public scenes remains a formidable challenge due to intrinsic difficulties such as severe inter-object occlusion in crowded scenes and the poor quality of recorded surveillance footage. Moreover, it is nontrivial to achieve robust detection of unusual events, which are rare, ambiguous, and easily confused with noise. This thesis proposes solutions for resolving ambiguous visual observations and overcoming the unreliability of conventional activity analysis methods by exploiting multi-camera visual context and human feedback. The thesis first demonstrates the importance of learning visual context for establishing reliable reasoning on observed activity in a camera network. In the proposed approach, a new Cross Canonical Correlation Analysis (xCCA) is formulated to discover and quantify time-delayed pairwise correlations of regional activities observed within and across multiple camera views. This thesis shows that learning time-delayed pairwise activity correlations offers valuable contextual information for (1) spatial and temporal topology inference of a camera network, (2) robust person re-identification, and (3) accurate activity-based video temporal segmentation. Crucially, in contrast to conventional methods, the proposed approach does not rely on either intra-camera or inter-camera object tracking; it can thus be applied to low-quality surveillance videos featuring severe inter-object occlusions. Second, to detect global unusual events across multiple disjoint cameras, this thesis extends visual context learning from pairwise relationships to global time-delayed dependencies between regional activities. Specifically, a Time Delayed Probabilistic Graphical Model (TD-PGM) is proposed to model the multi-camera activities and their dependencies. Subtle global unusual events are detected and localised using the model as context-incoherent patterns across multiple camera views. In the model, different nodes represent activities in different decomposed regions from different camera views, and the directed links between nodes encode time-delayed dependencies between activities observed within and across camera views. In order to learn optimised time-delayed dependencies in a TD-PGM, a novel two-stage structure learning approach is formulated by combining constraint-based and score-based structure learning methods. Third, to cope with visual context changes over time, this two-stage structure learning approach is extended to permit tractable incremental updates of both the TD-PGM parameters and its structure. As opposed to most existing studies that assume a static model once learned, the proposed incremental learning allows a model to adapt itself to reflect changes in the current visual context, such as subtle behaviour drift over time or the removal/addition of cameras. Importantly, the incremental structure learning is achieved without either exhaustive search in a large graph structure space or storing all past observations in memory, making the proposed solution memory and time efficient. Fourth, an active learning approach is presented to incorporate human feedback for on-line unusual event detection.
Contrary to most existing unsupervised methods that perform passive mining for unusual events, the proposed approach automatically requests supervision for critical points to resolve ambiguities of interest, leading to more robust detection of subtle unusual events. The active learning strategy is formulated as a stream-based solution, i.e. it decides on the fly whether to request a label for each unlabelled sample observed in sequence. It adaptively selects between two active learning criteria, namely a likelihood criterion and an uncertainty criterion, to achieve (1) discovery of unknown event classes and (2) refinement of the classification boundary. The effectiveness of the proposed approaches is validated using videos captured from busy public scenes such as underground stations and traffic intersections.
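    As a rough illustration of the stream-based active learning strategy described above (not the thesis code), the sketch below requests a label when either a likelihood criterion flags a possible unknown event class or an uncertainty criterion flags a sample near the decision boundary. The Gaussian-mixture density, logistic-regression classifier, and thresholds are stand-in assumptions.

```python
# Sketch of a stream-based active learner with likelihood and uncertainty
# criteria; model choices here are placeholders, not the thesis's models.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.linear_model import LogisticRegression

class StreamActiveLearner:
    """Decide on the fly whether to ask a human for a label."""

    def __init__(self, lik_thresh=-8.0, unc_thresh=0.15):
        self.density = None              # models p(x) of labelled data seen so far
        self.clf = None                  # current event classifier
        self.X, self.y = [], []
        self.lik_thresh, self.unc_thresh = lik_thresh, unc_thresh

    def should_query(self, x):
        x = np.asarray(x, dtype=float).reshape(1, -1)
        if self.density is None or self.clf is None:
            return True                                  # cold start
        # Likelihood criterion: low p(x) suggests an unknown event class.
        if self.density.score_samples(x)[0] < self.lik_thresh:
            return True
        # Uncertainty criterion: small margin between the top two classes.
        p = np.sort(self.clf.predict_proba(x)[0])[::-1]
        return (p[0] - p[1]) < self.unc_thresh if len(p) > 1 else False

    def add_label(self, x, label):
        self.X.append(np.asarray(x, dtype=float))
        self.y.append(label)
        if len(set(self.y)) >= 2:                        # need two classes to fit
            X = np.vstack(self.X)
            self.density = GaussianMixture(n_components=1, random_state=0).fit(X)
            self.clf = LogisticRegression(max_iter=200).fit(X, self.y)
```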

    Visual Analysis of Extremely Dense Crowded Scenes

    Visual analysis of dense crowds is particularly challenging due to the large number of individuals, occlusions, clutter, and the few pixels available per person, conditions which rarely occur in ordinary surveillance scenarios. This dissertation aims to address these challenges in images and videos of extremely dense crowds containing hundreds to thousands of humans. The goal is to tackle the fundamental problems of counting, detecting and tracking people in such images and videos using visual and contextual cues that are automatically derived from the crowded scenes. For counting in an image of an extremely dense crowd, we propose to leverage multiple sources of information to compute an estimate of the number of individuals present in the image. Our approach relies on sources such as low-confidence head detections, repetition of texture elements (using SIFT), and frequency-domain analysis to estimate counts, along with the confidence associated with observing individuals, in an image region. Furthermore, we employ a global consistency constraint on counts using a Markov Random Field, which caters for disparity in counts in local neighborhoods and across scales. We tested this approach on crowd images with head counts ranging from 94 to 4543 and obtained encouraging results. Through this approach, we are able to count people in images of high-density crowds, unlike previous methods which are only applicable to videos of low- to medium-density crowded scenes. However, the counting procedure outputs just a single number for a large patch or an entire image. With just the counts, it becomes difficult to measure the counting error for a query image with an unknown number of people. For this, we propose to localize humans by finding repetitive patterns in the crowd image. Starting with detections from an underlying head detector, we correlate them within the image after their selection through several criteria: in a pre-defined grid, locally, or at multiple scales by automatically finding the patches that are most representative of recurring patterns in the crowd image. Finally, the set of generated hypotheses is selected using binary integer quadratic programming with Special Ordered Set (SOS) Type 1 constraints. Human detection is another important problem in the analysis of crowded scenes, where the goal is to place a bounding box on the visible parts of individuals. Primarily applicable to images depicting medium- to high-density crowds containing several hundred humans, it is a crucial prerequisite for many other visual tasks, such as tracking, action recognition or detection of anomalous behaviors exhibited by individuals in a dense crowd. For detecting humans, we explore context in dense crowds in the form of a locally-consistent scale prior which captures the similarity in scale in local neighborhoods, with smooth variation over the image. Using the scale and confidence of detections obtained from an underlying human detector, we infer scale and confidence priors using a Markov Random Field. In an iterative mechanism, the confidences of detections are modified to reflect consistency with the inferred priors, and the priors are updated based on the new detections. The final set of detections obtained is then reasoned about for occlusion using Binary Integer Programming, where overlaps and relations between parts of individuals are encoded as linear constraints.
Both human detection and occlusion reasoning in this approach are solved with local neighbor-dependent constraints, thereby respecting the inter-dependence between individuals that is characteristic of dense crowd analysis. In addition, we propose a mechanism to detect different combinations of body parts without requiring annotations for individual combinations. Once human detection and localization are performed, we then use them for tracking people in dense crowds. Similar to the use of context as a scale prior for human detection, we exploit it in the form of motion concurrence for tracking individuals in dense crowds. The proposed method for tracking provides an alternative and complementary approach to methods that require modeling of crowd flow. At the same time, it is less likely to fail in the case of dynamic crowd flows and anomalies, since it relies minimally on previous frames. The approach begins with the automatic identification of prominent individuals in the crowd that are easy to track. Then, we use Neighborhood Motion Concurrence to model the behavior of individuals in a dense crowd, which predicts the position of an individual based on the motion of its neighbors. When an individual moves with the crowd flow, we use Neighborhood Motion Concurrence to predict motion, while leveraging five-frame instantaneous flow in the case of dynamically changing flow and anomalies. All these aspects are then embedded in a framework which imposes a hierarchy on the order in which the positions of individuals are updated. Results are reported on eight sequences of medium- to high-density crowds, and our approach performs on par with existing approaches without learning or modeling patterns of crowd flow. We experimentally demonstrate the efficacy and reliability of our algorithms by quantifying the performance of counting, localization, as well as human detection and tracking, on new and challenging datasets containing hundreds to thousands of humans in a given scene.
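    The Neighborhood Motion Concurrence idea mentioned above can be illustrated with a small sketch (assumptions, not the dissertation's implementation): predict a person's next position from the displacements of nearby individuals, weighted by proximity, and fall back to the person's own velocity when no neighbours are close. The Gaussian weighting is an illustrative choice.

```python
# Sketch of neighbour-weighted position prediction for tracking in a crowd.
import numpy as np

def nmc_predict(positions, velocities, idx, sigma=20.0):
    """Predict the next position of person `idx`.

    positions  : (N, 2) array of current (x, y) positions
    velocities : (N, 2) array of per-person displacements since the last frame
    """
    p = positions[idx]
    d = np.linalg.norm(positions - p, axis=1)
    w = np.exp(-(d / sigma) ** 2)
    w[idx] = 0.0                      # exclude the person themselves
    if w.sum() < 1e-6:                # isolated individual: use own velocity
        return p + velocities[idx]
    v_neigh = (w[:, None] * velocities).sum(0) / w.sum()
    return p + v_neigh

if __name__ == "__main__":
    pos = np.array([[10., 10.], [12., 11.], [50., 50.]])
    vel = np.array([[1., 0.], [1., 0.2], [0., -1.]])
    print(nmc_predict(pos, vel, idx=0))   # follows nearby crowd motion
```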

    Automatic human behaviour anomaly detection in surveillance video

    This thesis work focuses upon developing the capability to automatically evaluate and detect anomalies in human behaviour from surveillance video. We work with static monocular cameras in crowded urban surveillance scenarios, particularly airports and commercial shopping areas. Typically a person is 100 to 200 pixels high in a scene ranging from 10 to 20 metres in width and depth, populated by 5 to 40 people at any given time. Our procedure evaluates human behaviour unobtrusively to determine outlying behavioural events, flagging abnormal events to the operator. In order to achieve automatic human behaviour anomaly detection we address the challenge of interpreting behaviour within the context of the social and physical environment. We develop and evaluate a process for measuring social connectivity between individuals in a scene using motion and visual attention features. To do this we use mutual information and Euclidean distance to build a social similarity matrix which encodes the social connection strength between any two individuals. We develop a second contextual basis which acts by segmenting a surveillance environment into behaviourally homogeneous subregions that represent high-traffic slow regions and queuing areas. We model the heterogeneous scene in homogeneous subgroups using both contextual elements. We bring the social contextual information, the scene context, the motion, and visual attention features together to demonstrate a novel human behaviour anomaly detection process which finds outlier behaviour from a short sequence of video. The method, Nearest Neighbour Ranked Outlier Clusters (NN-RCO), is based upon modelling behaviour as a time-independent sequence of behaviour events, and can be trained in advance or set upon a single sequence. We find that in a crowded scene the application of mutual-information-based social context makes it possible to prevent self-justifying groups and to propagate anomalies through a social network, granting a greater anomaly detection capability. Scene context uniformly improves the detection of anomalies in all the datasets we test upon. We additionally demonstrate that our work is applicable to other data domains, demonstrating this upon Automatic Identification Signal data in the maritime domain. Our work is capable of identifying abnormal shipping behaviour, using joint motion dependency as an analogue for social connectivity, and similarly segmenting the shipping environment into homogeneous regions.
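    One ingredient named in the abstract, a social similarity matrix built from mutual information and Euclidean distance, can be sketched roughly as below. The direction discretisation, the Gaussian distance kernel, and the way the two terms are combined are illustrative assumptions, not the thesis's exact formulation; tracks are assumed to be synchronised and of equal length.

```python
# Sketch: pairwise social similarity from mutual information between
# discretised motion-direction histories, attenuated by spatial distance.
import numpy as np
from sklearn.metrics import mutual_info_score

def social_similarity(tracks, n_dir_bins=8, sigma=30.0):
    """tracks: list of (T, 2) arrays of per-frame (x, y) positions."""
    def dir_codes(track):
        v = np.diff(track, axis=0)
        ang = np.arctan2(v[:, 1], v[:, 0])
        edges = np.linspace(-np.pi, np.pi, n_dir_bins + 1)[1:-1]
        return np.digitize(ang, edges)          # per-frame direction symbol

    codes = [dir_codes(t) for t in tracks]
    n = len(tracks)
    S = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            mi = mutual_info_score(codes[i], codes[j])
            dist = np.linalg.norm(tracks[i].mean(0) - tracks[j].mean(0))
            S[i, j] = mi * np.exp(-(dist / sigma) ** 2)
    return S
```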

    A spatio-temporal learning approach for crowd activity modelling to detect anomalies

    With security and surveillance gaining paramount importance in recent years, it has become important to reliably automate some surveillance tasks for monitoring crowded areas. The need to automate this process also supports human operators, who are overwhelmed by the large number of security screens to monitor. Crowd events such as excess usage throughout the day, sudden peaks in crowd volume, and chaotic motion (obvious to spot) all emerge over time, which requires constant monitoring in order to be informed of the event build-up. To ease this task, the computer vision community has been addressing some surveillance tasks using image processing and machine learning techniques. Currently, tasks such as crowd density estimation or people counting, crowd detection and abnormal crowd event detection are being addressed. Most of the work has focused on crowd detection and estimation, with the focus slowly shifting to crowd event learning for abnormality detection. This thesis addresses crowd abnormality detection. However, by way of the modelling approach used, the tasks of crowd detection and estimation are also handled implicitly. The existing approaches in the literature have a number of drawbacks that keep them from being scalable to any public scene. Most pieces of work use simple scene settings where motion occurs wholly in the near-field or far-field of the camera view. Thus, with assumptions on the expected location of person motion, small blobs are arbitrarily filtered out as noise when they may be legitimate motion in the far-field. Such an approach makes it difficult to deal with complex scenes where entry/exit points occur in the centre of the scene, or where multiple pathways run from the near-field to the far-field of the camera view and produce blobs of differing sizes. Further, most authors assume the number of directions that people's motion should exhibit rather than discovering what these may be. Approaches with such assumptions would lose accuracy when dealing with (say) a railway platform, which shows a number of motion directions, namely two-way, one-way, dispersive, etc. Finally, very few contributions use time as a video feature to model the human intuition of time-of-day abnormalities, that is, that certain motion patterns may be abnormal if they have not been seen at a given time of day. Most works use time only as an extra qualifier to spatial data for trajectory definition. In this thesis most of these drawbacks are addressed in the modelling of crowd activity. Firstly, no assumptions are made on scene structure or on the blob sizes resulting from it. The optical flow algorithm used is robust, and even the noise present (which is in fact unwanted motion of swaying hands and legs, as opposed to that of the torso) is fairly consistent and can therefore be factored into the modelling. Blobs, no matter their size, are not discarded, as they may be legitimate emerging motion in the far-field. The modelling also deals with paths extending from the far-field to the near-field of the camera view and segments these such that each segment contains self-comparable fields of motion. The need for a normalisation factor for comparisons across near- and far-field motion fields would imply prior knowledge of the scene. As the system is intended for generic public locations with varying scene structures, normalisation is not an option in the processing used, and yet near- and far-field motion changes are accounted for.
Secondly, this thesis describes a system that learns the true distribution of motion along the detected paths and maintains it. The approach is such that doing so does not generalise the direction distributions, which would cause a loss in precision. No impositions are made on the expected motion: if the underlying motion is well defined (one-way or two-way), it is represented as a well-defined distribution, and as a mixture of directions if the underlying motion presents itself as such. Finally, time is used as a video feature to allow activity to reinforce itself on a daily basis, such that motion patterns for a given time and space begin to define themselves through reinforcement; this acts as the model used for abnormality detection in time and space (spatio-temporal). The system has been tested with real-world datasets with varying camera fields of view. The testing showed no false negatives and very few false positives, and the system detects crowd abnormalities well with respect to the ground truths of the datasets used.
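    A minimal sketch, under stated assumptions, of the high-level idea above: per-location, per-time-of-day motion statistics are reinforced daily, and motion that is rare for that place and time is scored as anomalous. The cell grid, direction bins, and frequency-based anomaly score are illustrative choices, not the thesis's exact model.

```python
# Sketch of a reinforced spatio-temporal activity model for anomaly scoring.
import numpy as np

class SpatioTemporalActivityModel:
    def __init__(self, grid=(8, 8), n_dirs=8, n_time_slots=24):
        # counts[time_slot, cell_y, cell_x, direction], with a Laplace prior
        self.counts = np.ones((n_time_slots, *grid, n_dirs))
        self.grid, self.n_dirs = grid, n_dirs

    def _bins(self, x, y, angle, frame_size):
        cy = min(int(y / frame_size[1] * self.grid[0]), self.grid[0] - 1)
        cx = min(int(x / frame_size[0] * self.grid[1]), self.grid[1] - 1)
        d = int(((angle + np.pi) / (2 * np.pi)) * self.n_dirs) % self.n_dirs
        return cy, cx, d

    def reinforce(self, hour, x, y, angle, frame_size=(640, 480)):
        """Daily reinforcement: accumulate observed motion for this time slot."""
        cy, cx, d = self._bins(x, y, angle, frame_size)
        self.counts[hour, cy, cx, d] += 1

    def anomaly_score(self, hour, x, y, angle, frame_size=(640, 480)):
        """High when this direction is rare for this cell and hour."""
        cy, cx, d = self._bins(x, y, angle, frame_size)
        cell = self.counts[hour, cy, cx]
        return -np.log(cell[d] / cell.sum())
```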