
    From Visual Saliency to Video Behaviour Understanding

    In a world of ever-increasing amounts of video data, we are forced to abandon traditional, fully manual methods of scene interpretation. Under such circumstances, some form of automation is highly desirable, but this is a very open-ended problem of high complexity. Dealing with such large amounts of data is a non-trivial task that requires efficient, selective extraction of the parts of a scene which have the potential to develop a higher semantic meaning, alone or in combination with others. In particular, the types of video data in need of automated analysis tend to be outdoor scenes with high levels of activity generated by either foreground or background. Such dynamic scenes add considerable complexity to the problem, since we cannot rely on motion energy alone to detect regions of interest. Furthermore, the behaviour of these regions of motion can differ greatly while still being highly dependent, both spatially and temporally, on the movement of other objects within the scene. Modelling these dependencies, while eliminating as much redundancy from the feature extraction process as possible, is the challenge addressed by this thesis. In the first half, finding the right mechanism to extract and represent meaningful features from dynamic scenes with no prior knowledge is investigated. Meaningful or salient information is treated as the parts of a scene that stand out or seem unusual or interesting to us. The novelty of the work is that it is able to select salient scales in both space and time at which a particular spatio-temporal volume is considered interesting relative to the rest of the scene. By quantifying the temporal saliency values of regions of motion, it is possible to consider their importance in both the long and the short term. Variations in entropy over spatio-temporal scales are used to select a context-dependent measure of the local scene dynamics. 
A method of quantifying temporal saliency is devised based on the variation of the entropy of the intensity distribution in a spatio-temporal volume over increasing scales. Entropy is used instead of traditional filter methods because the stability or predictability of the intensity distribution over scales of a local spatio-temporal region can be defined more robustly relative to the context of its neighbourhood, even for regions exhibiting high intensity variation due to heavy texture. Results show that it is possible to extract both locally salient features and globally salient temporal features from contrasting scenarios. In the second part of the thesis, focus shifts towards binding these spatio-temporally salient features together so that some semantic meaning can be inferred from their interaction. Interaction, in this sense, refers to any form of temporally correlated behaviour between any salient regions of motion in a scene. Feature binding as a mechanism for interactive behaviour understanding is particularly important if we consider that regions of interest may not be significant individually, but carry much more semantic meaning when considered in combination. Temporally correlated behaviour is identified and classified using accumulated co-occurrences of salient features at two levels. First, co-occurrences are accumulated for spatio-temporally proximate salient features to form a local representation. Then, at the next level, the co-occurrences of these locally spatio-temporally bound features are accumulated again in order to discover unusual behaviour in the scene. The novelty of this work is that no assumptions are made about whether interacting regions should be spatially proximate. Furthermore, no prior knowledge of the scene topology is used. Results show that it is possible to detect unusual interactions between regions of motion, from which higher levels of semantics can be visually inferred. 
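The thesis itself gives no implementation; as a rough, hedged illustration of the idea above, the sketch below scores a spatio-temporal position by the entropy of its local intensity histogram over increasing scales, weighted by inter-scale entropy change, in the spirit of Kadir-Brady scale saliency. The function names, bin counts, and weighting scheme are assumptions for illustration, not the author's method.

```python
import numpy as np

def intensity_entropy(patch, bins=16):
    """Shannon entropy (bits) of the intensity histogram of a patch."""
    hist, _ = np.histogram(patch, bins=bins, range=(0.0, 1.0))
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def scale_saliency(volume, center, scales=(2, 4, 6, 8)):
    """Entropy of a spatio-temporal cube around `center` (t, y, x) at
    increasing scales; the peak entropy weighted by the mean inter-scale
    change is a crude saliency score for that location."""
    t, y, x = center
    entropies = []
    for s in scales:
        patch = volume[max(t - s, 0):t + s,
                       max(y - s, 0):y + s,
                       max(x - s, 0):x + s]
        entropies.append(intensity_entropy(patch))
    entropies = np.array(entropies)
    return float(entropies.max() * np.abs(np.diff(entropies)).mean())
```

A textured, changing region scores higher than a uniform one, mirroring the abstract's claim that entropy variation over scales separates interesting volumes from their context.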
In the final part of the thesis, a more specific investigation of human behaviour is addressed through the classification and detection of interactions between two human subjects. Here, further modifications are made to the feature extraction process in order to quantify the spatio-temporal saliency of a region of motion. These features are then grouped to find the people in the scene. A loose pose distribution model is then extracted for each person, and canonical correlation analysis is used to find salient correlations between the poses of two interacting people. These canonical factors can be formed into trajectories and used for classification; Levenshtein distance is then used to categorise the features. The novelty of the work is that the interactions do not have to be spatially connected or proximate to be recognised. Furthermore, the data used is outdoors and cluttered, with a non-stationary background. Results show that co-occurrence techniques have the potential to provide a more generalised, compact, and meaningful representation of dynamic interactive scene behaviour. Funded by EPSRC and part-funded by QinetiQ Ltd; a travel grant was contributed by RAEng.
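As a sketch of the final classification step only (the CCA stage is omitted), canonical-factor trajectories can be quantised into symbol strings and compared by Levenshtein distance with a nearest-neighbour rule. The symbol alphabet and interaction labels below are hypothetical, not taken from the thesis.

```python
def levenshtein(a, b):
    """Edit distance between two sequences, via a single-row DP table."""
    m, n = len(a), len(b)
    d = list(range(n + 1))
    for i in range(1, m + 1):
        prev, d[0] = d[0], i
        for j in range(1, n + 1):
            # prev holds the diagonal cell d[i-1][j-1]
            prev, d[j] = d[j], min(d[j] + 1,        # deletion
                                   d[j - 1] + 1,    # insertion
                                   prev + (a[i - 1] != b[j - 1]))  # substitution
    return d[n]

def classify(query, labelled):
    """Nearest-neighbour label for a quantised factor trajectory,
    given (template, label) pairs."""
    return min(labelled, key=lambda pair: levenshtein(query, pair[0]))[1]
```

A usage example with hypothetical templates: `classify("AABC", [("AABB", "handshake"), ("CCDD", "wave")])` picks the label whose template needs the fewest edits.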

    Unmasking Clever Hans Predictors and Assessing What Machines Really Learn

    Current learning machines have successfully solved hard application problems, reaching high accuracy and displaying seemingly "intelligent" behavior. Here we apply recent techniques for explaining decisions of state-of-the-art learning machines and analyze various tasks from computer vision and arcade games. This showcases a spectrum of problem-solving behaviors ranging from naive and short-sighted to well-informed and strategic. We observe that standard performance evaluation metrics can be oblivious to distinguishing these diverse problem-solving behaviors. Furthermore, we propose our semi-automated Spectral Relevance Analysis, which provides a practically effective way of characterizing and validating the behavior of nonlinear learning machines. This helps to assess whether a learned model indeed delivers reliably for the problem it was conceived for. Furthermore, our work intends to add a voice of caution to the ongoing excitement about machine intelligence and pledges to evaluate and judge some of these recent successes in a more nuanced manner. Comment: Accepted for publication in Nature Communications.

    Techniques For Video Surveillance: Automatic Video Editing And Target Tracking

    Typical video surveillance control rooms include a collection of monitors connected to a large camera network, with many fewer operators than monitors. The cameras are usually cycled through the monitors, with provisions for manual override to display a camera of interest. In addition, cameras are often provided with pan, tilt and zoom capabilities to capture objects of interest. In this dissertation, we develop novel ways to control the limited resources by focusing them into acquiring and visualizing the critical information contained in the surveyed scenes. First, we consider the problem of cropping surveillance videos. This process chooses a trajectory that a small sub-window can take through the video, selecting the most important parts of the video for display on a smaller monitor area. We model the information content of the video simply, by whether the image changes at each pixel. Then we show that we can find the globally optimal trajectory for a cropping window by using a shortest path algorithm. In practice, we can speed up this process without affecting the results, by stitching together trajectories computed over short intervals. This also reduces system latency. We then show that we can use a second shortest path formulation to find good cuts from one trajectory to another, improving coverage of interesting events in the video. We describe additional techniques to improve the quality and efficiency of the algorithm, and show results on surveillance videos. Second, we turn our attention to the problem of tracking multiple agents moving amongst obstacles, using multiple cameras. Given an environment with obstacles, and many people moving through it, we construct a separate narrow field of view video for as many people as possible, by stitching together video segments from multiple cameras over time. We employ a novel approach to assign cameras to people as a function of time, with camera switches when needed. 
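The globally optimal cropping-window trajectory described earlier in this abstract can be sketched as a simple dynamic program. This is a hedged illustration: one-dimensional window positions, a hypothetical per-frame change-count score, and a linear motion penalty stand in for the dissertation's exact shortest-path formulation.

```python
import numpy as np

def best_crop_trajectory(scores, move_cost=1.0):
    """scores[t, p] = information captured by window position p at frame t
    (e.g. the count of changed pixels inside the window).  Returns the
    position sequence minimising total (-score + movement penalty),
    i.e. a shortest path through the frame-by-position graph."""
    T, P = scores.shape
    cost = -scores[0].astype(float)          # best cost ending at each position
    back = np.zeros((T, P), dtype=int)       # back-pointers for path recovery
    pos = np.arange(P)
    for t in range(1, T):
        # trans[p_new, p_old]: cost of arriving at p_new from p_old
        trans = cost[None, :] + move_cost * np.abs(pos[:, None] - pos[None, :])
        back[t] = np.argmin(trans, axis=1)
        cost = trans.min(axis=1) - scores[t]
    path = [int(np.argmin(cost))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```

For a bright spot drifting one position per frame, the recovered path follows the spot rather than parking the window, since the movement penalty is outweighed by the captured score.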
The problem is modeled as a bipartite graph and the solution corresponds to a maximum matching. As people move, the solution is efficiently updated by computing an augmenting path rather than by solving for a new matching. This reduces computation time by an order of magnitude. In addition, solving for the shortest augmenting path minimizes the number of camera switches at each update. When not all people can be covered by the available cameras, we cluster as many people as possible into small groups, then assign cameras to groups using a minimum cost matching algorithm. We test our method using numerous runs from different simulators. Third, we relax the restriction of using fixed cameras in tracking agents. In particular, we study the problem of maintaining a good view of an agent moving amongst obstacles by a moving camera, possibly fixed to a pursuing robot. This is known as a two-player pursuit-evasion game. Using a mesh discretization of the environment, we develop an algorithm that determines, given initial positions of both pursuer and evader, if the evader can take any moving strategy to go out of sight of the pursuer, and thus win the game. If it is decided that there is no winning strategy for the evader, we also compute a pursuer's trajectory that keeps the evader within sight, for every trajectory that the evader can take. We study the effect of varying the mesh size on both the efficiency and accuracy of our algorithm. Finally, we show some earlier work that has been done in the domain of anomaly detection. Based on modeling the co-occurrence statistics of moving objects in time and space, experiments on synthetic data are described, in which the time intervals and locations of unusual activity are identified.
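The bipartite camera-to-person assignment can be sketched with Kuhn's augmenting-path algorithm. The incremental update the abstract describes would re-run `augment` only for the camera whose coverage changed rather than rebuilding the whole matching; the data layout here is an assumption for illustration, not the dissertation's exact formulation.

```python
def max_matching(adj):
    """adj[cam] = list of people that camera can cover.
    Returns a maximum-size dict person -> camera (Kuhn's algorithm:
    each call to `augment` searches for one augmenting path)."""
    match = {}  # person -> camera
    def augment(cam, seen):
        for person in adj[cam]:
            if person in seen:
                continue
            seen.add(person)
            # take a free person, or re-route the camera currently holding them
            if person not in match or augment(match[person], seen):
                match[person] = cam
                return True
        return False
    for cam in range(len(adj)):
        augment(cam, set())
    return match
```

For example, with cameras covering `[[0, 1], [0], [2]]`, camera 1 can only see person 0, so the augmenting path re-routes camera 0 to person 1 and all three people are covered.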

    Activity understanding and unusual event detection in surveillance videos

    PhD. Computer scientists have made ceaseless efforts to replicate the cognitive video understanding abilities of human brains in autonomous vision systems. As video surveillance cameras become ubiquitous, there has been a surge in studies on automated activity understanding and unusual event detection in surveillance videos. Nevertheless, video content analysis in public scenes remains a formidable challenge due to intrinsic difficulties such as severe inter-object occlusion in crowded scenes and the poor quality of recorded surveillance footage. Moreover, it is nontrivial to achieve robust detection of unusual events, which are rare, ambiguous, and easily confused with noise. This thesis proposes solutions for resolving ambiguous visual observations and overcoming the unreliability of conventional activity analysis methods by exploiting multi-camera visual context and human feedback. The thesis first demonstrates the importance of learning visual context for establishing reliable reasoning about observed activity in a camera network. In the proposed approach, a new Cross Canonical Correlation Analysis (xCCA) is formulated to discover and quantify time-delayed pairwise correlations of regional activities observed within and across multiple camera views. This thesis shows that learning time-delayed pairwise activity correlations offers valuable contextual information for (1) spatial and temporal topology inference of a camera network, (2) robust person re-identification, and (3) accurate activity-based video temporal segmentation. Crucially, in contrast to conventional methods, the proposed approach does not rely on either intra-camera or inter-camera object tracking; it can thus be applied to low-quality surveillance videos featuring severe inter-object occlusions. Second, to detect global unusual events across multiple disjoint cameras, this thesis extends visual context learning from pairwise relationships to global time-delayed dependencies between regional activities. 
Specifically, a Time Delayed Probabilistic Graphical Model (TD-PGM) is proposed to model the multi-camera activities and their dependencies. Subtle global unusual events are detected and localised using the model as context-incoherent patterns across multiple camera views. In the model, different nodes represent activities in different decomposed regions from different camera views, and the directed links between nodes encode time-delayed dependencies between activities observed within and across camera views. In order to learn optimised time-delayed dependencies in a TD-PGM, a novel two-stage structure learning approach is formulated by combining both constraint-based and score-search-based structure learning methods. Third, to cope with visual context changes over time, this two-stage structure learning approach is extended to permit tractable incremental updates of both the TD-PGM parameters and its structure. As opposed to most existing studies that assume a static model once learned, the proposed incremental learning allows a model to adapt itself to reflect changes in the current visual context, such as subtle behaviour drift over time or the removal/addition of cameras. Importantly, the incremental structure learning is achieved without either exhaustive search in a large graph structure space or storing all past observations in memory, making the proposed solution memory and time efficient. Fourth, an active learning approach is presented to incorporate human feedback for on-line unusual event detection. Contrary to most existing unsupervised methods that perform passive mining for unusual events, the proposed approach automatically requests supervision for critical points to resolve ambiguities of interest, leading to more robust detection of subtle unusual events. The active learning strategy is formulated as a stream-based solution, i.e. it decides on the fly whether to request a label for each unlabelled sample observed in sequence. 
It adaptively selects between two active learning criteria, namely a likelihood criterion and an uncertainty criterion, to achieve (1) discovery of unknown event classes and (2) refinement of the classification boundary. The effectiveness of the proposed approaches is validated using videos captured from busy public scenes such as underground stations and traffic intersections.
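The xCCA formulation above is multivariate; as a simplified, hedged stand-in, the sketch below recovers the time delay at which two univariate regional-activity series are most correlated, which is the core quantity the topology-inference step relies on. Function name and lag convention are assumptions.

```python
import numpy as np

def delayed_correlation(x, y, max_lag):
    """Scan lags in [-max_lag, max_lag]; a positive lag means activity in
    region y follows activity in region x by `lag` frames.  Returns the
    (correlation, lag) pair with the largest absolute correlation."""
    x = (x - x.mean()) / x.std()
    y = (y - y.mean()) / y.std()
    best = (0.0, 0)
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            a, b = x[:len(x) - lag], y[lag:]   # x(t) against y(t + lag)
        else:
            a, b = x[-lag:], y[:len(y) + lag]
        r = float(np.mean(a * b))
        if abs(r) > abs(best[0]):
            best = (r, lag)
    return best
```

Estimated delays between camera regions can then serve as edge candidates for the kind of topology inference and time-delayed dependency structure the thesis describes.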

    Change blindness: eradication of gestalt strategies

    Arrays of eight texture-defined rectangles were used as stimuli in a one-shot change blindness (CB) task in which there was a 50% chance that one rectangle would change orientation between two successive presentations separated by an interval. CB was eliminated by cueing the target rectangle in the first stimulus, reduced by cueing in the interval, and unaffected by cueing in the second presentation. This supports the idea that a representation was formed that persisted through the interval before being 'overwritten' by the second presentation (Landman et al., 2003, Vision Research 43, 149-164). Another possibility is that participants used some kind of grouping or Gestalt strategy. To test this, we changed the spatial position of the rectangles in the second presentation by shifting them along imaginary spokes (by ±1 degree) emanating from the central fixation point. There was no significant difference in performance between this and the standard task [F(1,4)=2.565, p=0.185]. This may suggest two things: (i) Gestalt grouping is not used as a strategy in these tasks, and (ii) further weight is given to the argument that objects may be stored in, and retrieved from, a pre-attentional store during this task.

    Visual attention models for far-field scene analysis

    Thesis (Ph.D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2007. This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections. Includes bibliographical references (p. 141-146). The amount of information available to an intelligent monitoring system is simply too vast to process in its entirety. One way to address this issue is by developing attentive mechanisms that recognize parts of the input as more interesting than others. We apply this concept to the domain of far-field activity analysis by addressing the problem of determining where to look in a scene in order to capture interesting activity in progress. We pose the problem of attention as an unsupervised learning problem, in which the task is to learn from long-term observation a model of the usual pattern of activity. Such a statistical scene model then makes it possible to detect and attend to examples of unusual activity. We present two data-driven scene modeling approaches. In the first, we model the pattern of individual observations (instances) of moving objects at each scene location as a mixture of Gaussians. In the second approach, we model the pattern of sequences of observations -- tracks -- by grouping them into clusters. We employ a similarity measure that combines comparisons of multiple attributes -- such as size, position, and velocity -- in a principled manner so that only tracks that are spatially similar and have similar attributes at spatially corresponding points are grouped together. We group the tracks using spectral clustering and represent the scene model as a mixture of Gaussians in the spectral embedding space. New examples of activity can be efficiently classified by projection into the embedding space. 
We demonstrate clustering and unusual activity detection results on a week of activity in the scene (about 40,000 moving object tracks) and show that human perceptual judgments of unusual activity are well-correlated with the statistical model. The human validation suggests that the track-based anomaly detection framework would perform well as a classifier for unusual events. To our knowledge, our work is the first to evaluate a statistical scene modeling and anomaly detection framework against human judgments. By Tomáš Ižo, Ph.D.
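As an illustrative sketch of the spectral step only: a Gaussian affinity over hypothetical scalar track descriptors replaces the thesis's multi-attribute similarity measure, and the rows of the leading eigenvectors give the embedding in which clusters (and, with a mixture model on top, anomalies) are found.

```python
import numpy as np

def spectral_embedding(features, k=2, sigma=1.0):
    """Gaussian affinity between items, symmetric normalisation, then the
    top-k eigenvectors of D^{-1/2} S D^{-1/2} as the embedding
    (one row per item)."""
    d2 = (features[:, None] - features[None, :]) ** 2
    S = np.exp(-d2 / (2 * sigma ** 2))          # affinity matrix
    deg = S.sum(axis=1)
    L = S / np.sqrt(np.outer(deg, deg))         # normalised affinity
    _, vecs = np.linalg.eigh(L)                 # eigenvalues ascending
    return vecs[:, -k:]                         # leading eigenvectors
```

Items from the same cluster land on nearly identical embedding rows, so a mixture of Gaussians fitted in this space separates the groups, matching the clustering pipeline the abstract describes.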

    Neural and Behavioral Consequences of Perceptual Organization using Proto-Objects

    The human visual system utilizes attention to direct processing towards areas of interest. In particular, certain objects in a visual scene can be salient, meaning they attract attention rather than being the targets of some search process. Visual salience appears to be driven by the formation of visual proto-objects, which have been hypothesized to cause an increase in synchronous firing between neurons encoding parts of an object. This thesis approaches proto-objects both at a behavioral level and at the low level of analyzing synchrony. At the behavioral level, existing studies of visual salience rely on many repetitive trials or on task instructions that tell study participants what to do, which can influence attentive behavior in a top-down manner, confounding the measurement of salience. I introduce an experimental paradigm that records attentional selections from subjects without any such information, and use this paradigm to analyze whether proto-objects interact in the determination of salience. The results show that the uniqueness of an object does indeed attract attention, and I develop a model that normalizes among proto-objects to explain the measured data. At the neuronal level, I develop a more rapid method to perform jitter hypothesis tests for detecting the presence of synchronous spiking between pairs of neurons. While the detection of synchrony does imply some connection between neurons, I also show that inferring a change in common input from changes in synchrony is not possible.
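The thesis's rapid jitter method is not reproduced here; as a plain Monte-Carlo version of the underlying hypothesis test, the sketch below counts near-coincident spikes between two trains and compares against surrogates in which one train's spike times are jittered. All parameters (coincidence window, jitter width, surrogate count) are illustrative.

```python
import numpy as np

def jitter_test(spikes_a, spikes_b, window=0.005, jitter=0.02,
                n_surrogates=1000, seed=0):
    """Monte-Carlo jitter test: p-value for the null hypothesis that the
    observed count of near-coincident spikes (within `window` seconds) is
    explained by spike timing only at the coarse `jitter` timescale."""
    rng = np.random.default_rng(seed)

    def coincidences(a, b):
        # for each spike in sorted `a`, distance to its nearest spike in sorted `b`
        idx = np.searchsorted(b, a)
        lo = np.abs(a - b[np.clip(idx - 1, 0, len(b) - 1)])
        hi = np.abs(a - b[np.clip(idx, 0, len(b) - 1)])
        return int(np.sum(np.minimum(lo, hi) <= window))

    b = np.sort(spikes_b)
    observed = coincidences(np.sort(spikes_a), b)
    exceed = 0
    for _ in range(n_surrogates):
        surrogate = np.sort(spikes_a + rng.uniform(-jitter, jitter, len(spikes_a)))
        if coincidences(surrogate, b) >= observed:
            exceed += 1
    return (exceed + 1) / (n_surrogates + 1)   # add-one corrected p-value
```

Jittering destroys fine-timescale synchrony while preserving slow rate co-variation, so a small p-value indicates synchrony beyond what slow common modulation predicts.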

    A cross-modal investigation into the relationships between bistable perception and a global temporal mechanism

    When the two eyes are presented with sufficiently different images, Binocular Rivalry (BR) occurs. BR is a form of bistable perception involving stochastic alternations in awareness between distinct images shown to each eye. It has been suggested that the dynamics of BR are due to the activity of a central temporal process and are linked to involuntary mechanisms of selective attention (i.e., exogenous attention). To test these ideas, stimuli designed to evoke exogenous attention and central temporal processes were employed during BR observation. These stimuli included auditory and visual looming motion and streams of transient events of varied temporal rate and pattern. Although these stimuli exerted a strong impact over some aspects of BR, they were unable to override its characteristic stochastic pattern of alternations completely. It is concluded that BR is subject to distributed influences but, ultimately, is achieved in neural processing areas specific to the binocular conflict.

    Amazon Nights II: Electric Boogaloo - Neural Adaptations for Communication in Three Species of Weakly Electric Fish

    Sensory systems must extract useful information from environments awash in noise and confounding input. Studying how salient signals are encoded and filtered from these natural backgrounds is a key problem in neuroscience. Communication is a particularly tractable tool for studying this problem, as it is a ubiquitous task that all organisms must accomplish, is easily compared across species, and is of significant ethological relevance. In this chapter I describe what is currently known, and what remains unknown, about how sensory systems are adapted for the challenges of encoding conspecific signals, particularly in environments complicated by conspecific-generated noise. The second half of this chapter describes why weakly electric fish are particularly well suited to investigating how communication can shape the nervous system to accomplish this task.