5 research outputs found
Spatial and temporal background modelling of non-stationary visual scenes
PhD
The prevalence of electronic imaging systems in everyday life has become increasingly apparent
in recent years. Applications are to be found in medical scanning, automated manufacture, and
perhaps most significantly, surveillance. Metropolitan areas, shopping malls, and road traffic
management all employ and benefit from an unprecedented quantity of video cameras for monitoring
purposes. But the high cost and limited effectiveness of employing humans as the final
link in the monitoring chain has driven scientists to seek solutions based on machine vision techniques.
Whilst the field of machine vision has enjoyed consistent rapid development in the last
20 years, some of the most fundamental issues still remain to be solved in a satisfactory manner.
Central to a great many vision applications is the concept of segmentation, and in particular,
most practical systems perform background subtraction as one of the first stages of video
processing. This involves separation of ‘interesting foreground’ from the less informative but
persistent background. But the definition of what is ‘interesting’ is somewhat subjective, and
liable to be application specific. Furthermore, the background may be interpreted as including
the visual appearance of normal activity of any agents present in the scene, human or otherwise.
Thus a background model might be called upon to absorb lighting changes, moving trees and
foliage, or normal traffic flow and pedestrian activity, in order to effect what might be termed in
‘biologically-inspired’ vision as pre-attentive selection. This challenge is one of the Holy Grails
of the computer vision field, and consequently the subject has received considerable attention.
This thesis sets out to address some of the limitations of contemporary methods of background
segmentation by investigating methods of inducing local mutual support amongst pixels
in three starkly contrasting paradigms: (1) locality in the spatial domain, (2) locality in the short-term
time domain, and (3) locality in the domain of cyclic repetition frequency.
Conventional per pixel models, such as those based on Gaussian Mixture Models, offer no
spatial support between adjacent pixels at all. At the other extreme, eigenspace models impose
a structure in which every image pixel bears the same relation to every other pixel. But Markov
Random Fields permit definition of arbitrary local cliques by construction of a suitable graph, and
are used here to facilitate a novel structure capable of exploiting probabilistic local cooccurrence
of adjacent Local Binary Patterns. The result is a method exhibiting strong sensitivity to multiple
learned local pattern hypotheses, whilst relying solely on monochrome image data.
Many background models enforce temporal consistency constraints on a pixel in an attempt to
confirm background membership before it is accepted as part of the model, and typically some
control over this process is exercised by a learning rate parameter. But in busy scenes, a true
background pixel may be visible for a relatively small fraction of the time and in a temporally
fragmented fashion, thus hindering such background acquisition. However, support in terms of
temporal locality may still be achieved by using Combinatorial Optimization to derive short-term
background estimates which induce a similar consistency, but are considerably more robust
to disturbance. A novel technique is presented here in which the short-term estimates act as
‘pre-filtered’ data from which a far more compact eigen-background may be constructed.
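
The role of the learning rate mentioned above can be illustrated with a minimal sketch. It assumes a single running Gaussian per pixel, which is a deliberately simplified stand-in for conventional per-pixel models and is not the Combinatorial Optimization technique of the thesis. The function name `update_background` and the parameters `alpha` (learning rate) and `k` (match threshold in standard deviations) are hypothetical.

```python
def update_background(mean, var, pixel, alpha=0.05, k=2.5):
    """Update one pixel's running Gaussian; return (mean, var, is_foreground).

    Assumption: a single Gaussian per pixel, blended with learning rate alpha.
    """
    diff = pixel - mean
    # A sample is foreground if it lies more than k standard deviations away.
    is_foreground = diff * diff > (k * k) * var
    if not is_foreground:
        # Only matching samples are absorbed into the background model.
        mean = mean + alpha * diff
        # Floor the variance so the model never becomes over-confident.
        var = max((1 - alpha) * var + alpha * diff * diff, 1.0)
    return mean, var, is_foreground


def run_model(samples, mean=100.0, var=25.0):
    """Feed a sequence of grey levels through the model and collect the flags."""
    flags = []
    for pixel in samples:
        mean, var, fg = update_background(mean, var, pixel)
        flags.append(fg)
    return mean, var, flags


# A background pixel (about 100) that is repeatedly occluded by traffic (200).
final_mean, final_var, flags = run_model([100, 101, 99, 200, 100, 200, 200, 100])
```

In this toy sequence the occluding samples are rejected, so the model only learns from the frames in which the background is actually visible. In a busy scene those frames are few and fragmented, which is the acquisition problem the paragraph describes.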
Many scenes entail elements exhibiting repetitive periodic behaviour. Some road junctions
employing traffic signals are among these, yet little is to be found amongst the literature regarding
the explicit modelling of such periodic processes in a scene. Previous work focussing on gait
recognition has demonstrated approaches based on recurrence of self-similarity by which local
periodicity may be identified. The present work harnesses and extends this method in order
to characterize scenes displaying multiple distinct periodicities by building a spatio-temporal
model. The model may then be used to highlight abnormality in scene activity. Furthermore, a
Phase Locked Loop technique with a novel phase detector is detailed, enabling such a model to
maintain correct synchronization with scene activity in spite of noise and drift of periodicity.
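
As a rough illustration of identifying periodicity through recurrence of self-similarity, the sketch below compares a one-dimensional signal (for example, the intensity of one pixel over time) with lagged copies of itself and reports the lag at which the two are most alike. The function `estimate_period` and the mean-absolute-difference measure are assumptions made for illustration; the spatio-temporal model and the Phase Locked Loop described in the thesis are not reproduced here.

```python
def estimate_period(signal, min_lag=2, max_lag=None):
    """Return the lag (in samples) at which the signal best matches itself."""
    n = len(signal)
    if max_lag is None:
        max_lag = n // 2
    best_lag, best_score = None, float("inf")
    for lag in range(min_lag, max_lag + 1):
        # Mean absolute difference between the signal and its lagged copy.
        total = sum(abs(signal[t] - signal[t + lag]) for t in range(n - lag))
        score = total / (n - lag)
        # Strict comparison keeps the smallest lag when multiples tie.
        if score < best_score:
            best_lag, best_score = lag, score
    return best_lag


# A toy traffic-signal cycle: three frames "on", three frames "off", repeated.
cycle = [1, 1, 1, 0, 0, 0] * 5
period = estimate_period(cycle)
```

For this noise-free toy cycle the estimate is 6 frames. Real video would need a more robust similarity measure and some tolerance to drift, which is where a synchronisation mechanism such as the one described above becomes relevant.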
This thesis contends that these three approaches are all manifestations of the same broad
underlying concept: local support in each of the space, time and frequency domains, and furthermore,
that the support can be harnessed practically, as will be demonstrated experimentally.
Segmenting highly textured nonstationary background
Detection of unusual objects amongst a highly textured background is a difficult problem, especially when the texture is manifest in the temporal dimension as well. Outdoor scenes involving waving trees or moving water are examples of such a scenario, but are nevertheless frequently encountered in real world vision applications. By defining a simple but rotationally sensitive Local Binary Pattern (LBP) operator and applying it in a probabilistic sense we present a compact but useful feature for tackling moving textures. But as we demonstrate, this alone is not sufficient for good segmentation in difficult circumstances. Cooccurrence of different features in a pixel’s local neighbourhood provides a powerful mechanism for boosting the reliability of the foreground/background decision task. By using the conditional probabilities yielded by pairwise cooccurrence of 4-connected pixels, and casting the problem as one of Combinatorial Optimization, our results show that useful segmentation is possible from challenging dynamic backgrounds.
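
To make the notion of a rotationally sensitive Local Binary Pattern concrete, the following sketch computes the standard 8-neighbour LBP code for one pixel. This is an assumption: the abstract does not specify the exact operator used, and the probabilistic treatment and pairwise cooccurrence step are not shown. The names `OFFSETS` and `lbp_code` are hypothetical.

```python
# Clockwise neighbour offsets (row, col), starting at the top-left pixel.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]


def lbp_code(image, r, c):
    """Return the 8-bit LBP code of the interior pixel at (r, c)."""
    centre = image[r][c]
    code = 0
    for bit, (dr, dc) in enumerate(OFFSETS):
        # Set the bit when the neighbour is at least as bright as the centre.
        if image[r + dr][c + dc] >= centre:
            code |= 1 << bit
    return code


# A bright edge along the top of the patch ...
top_edge = [[9, 9, 9],
            [1, 5, 1],
            [1, 1, 1]]
# ... and the same patch rotated 90 degrees clockwise.
right_edge = [[1, 1, 9],
              [1, 5, 9],
              [1, 1, 9]]

code_top = lbp_code(top_edge, 1, 1)      # bits 0, 1, 2 set
code_right = lbp_code(right_edge, 1, 1)  # bits 2, 3, 4 set
```

The two patches differ only by a rotation yet produce different codes (7 and 28), which is the sense in which such an operator is rotationally sensitive: the orientation of local structure is preserved in the feature rather than discarded.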
Detection and Classification of Multiple Person Interaction
Institute of Perception, Action and Behaviour
This thesis investigates the classification of the behaviour of multiple persons when
viewed from a video camera. A constrained case of multiple person interaction
in the form of team games is investigated first. A comparison is given between
modelling individual features (using a hierarchical dynamic model) and modelling the
team as a whole (using a support vector machine). It is shown that for team
games such as handball it is preferable to model the whole team. In such instances
correct classification rates of over 80% are attained. A more general case of
interaction is then considered. Classification of interacting people in a surveillance
situation over several datasets is then investigated. We introduce a new feature set and
compare several methods with the previous best published method (Oliver 2000) and
demonstrate an improvement in performance. Classification rates of over 95% on real
video data sequences are demonstrated. An investigation into how the length of time a
sequence is observed affects classification is then performed. This results in a classifier improved by over
2%, which uses a class-dependent window size. The question of detecting pre/post and
actual fighting situations is then addressed. A hierarchical AdaBoost classifier is used
to demonstrate the ability to classify such situations. It is demonstrated that such an
approach can classify 91% of fighting situations correctly.
Semantic Spaces for Video Analysis of Behaviour
PhD
There is ever-growing interest from the computer vision community in human behaviour
analysis based on visual sensors. This interest generally includes: (1) behaviour recognition -
given a video clip or specific spatio-temporal volume of interest, classify it into one or more
of a set of pre-defined categories; (2) behaviour retrieval - given a video or textual description
as query, search for video clips with related behaviour; (3) behaviour summarisation - given a
number of video clips, summarise representative and distinct behaviours. Although considerable
effort has been dedicated to the problems mentioned above, few works have attempted to
analyse human behaviours in a semantic space. In this thesis, we define semantic spaces as a
collection of high-dimensional Euclidean spaces in which semantically meaningful events, e.g. individual
words, phrases and visual events, can be represented as vectors or distributions, which are
referred to as semantic representations. With the semantic space, semantic texts and visual events
can be quantitatively compared by inner product, distance and divergence. The introduction of
semantic spaces brings several benefits for visual analysis. For example, discovering semantic
representations for visual data can facilitate semantically meaningful video summarisation, retrieval
and anomaly detection. A semantic space can also seamlessly bridge categories and datasets which
are conventionally treated independently. This has encouraged the sharing of data and knowledge
across categories and even datasets to improve recognition performance and reduce labelling effort.
Moreover, a semantic space has the ability to generalise a learned model beyond known classes,
which is usually referred to as zero-shot learning. Nevertheless, discovering such a semantic
space is non-trivial for two reasons. (1) A semantic space is hard to define manually. Humans have
a good sense of the semantic relatedness between visual and textual instances, but a
measurable and finite semantic space can be difficult to construct with limited manual supervision.
As a result, the semantic space is constructed from data and learned in an unsupervised
manner. (2) It is hard to build a universal semantic space, i.e. the space is always context
dependent. It is therefore important to build the semantic space upon selected data such that it is always
meaningful within the context. Even with a well-constructed semantic space, challenges are still
present, including (3) how to represent visual instances in the semantic space, and (4) how to mitigate
the misalignment of visual feature and semantic spaces across categories and even datasets
when knowledge/data are generalised. This thesis tackles the above challenges by exploiting data
from different sources and building contextual semantic spaces with which data and knowledge
can be transferred and shared to facilitate general video behaviour analysis.
To demonstrate the efficacy of semantic space for behaviour analysis, we focus on studying
real world problems including surveillance behaviour analysis, zero-shot human action recognition
and zero-shot crowd behaviour recognition with techniques specifically tailored for the
nature of each problem.
Firstly, for video surveillance scenes, we propose to discover semantic representations from
the visual data in an unsupervised manner. This is due to the wide availability of unlabelled
visual data in surveillance systems. By representing visual instances in the semantic space, data
and annotations can be generalised to new events and even new surveillance scenes. Specifically,
to detect abnormal events this thesis studies a geometrical alignment between semantic representations
of events across scenes. Semantic actions can thus be transferred to new scenes and
abnormal events can be detected in an unsupervised way. To model multiple surveillance scenes
simultaneously, we show how to learn a shared semantic representation across a group of semantically
related scenes through a multi-layer clustering of scenes. With multi-scene modelling we
show how to improve surveillance tasks including scene activity profiling/understanding, cross-scene
query-by-example, behaviour classification, and video summarisation.
Secondly, to avoid extremely costly and ambiguous video annotation, we investigate how
to generalise recognition models learned from known categories to novel ones, which is often
termed zero-shot learning. To exploit the limited human supervision, e.g. category names,
we construct the semantic space via a word-vector representation trained on a large textual corpus
in an unsupervised manner. The representation of a visual instance in the semantic space is obtained by
learning a visual-to-semantic mapping. We notice that blindly applying the mapping learned
from known categories to novel categories can cause bias and degrade performance,
a problem termed domain shift. To solve this problem we employ techniques including semi-supervised
learning, self-training, hubness correction, multi-task learning and domain adaptation.
In combination, these methods achieve state-of-the-art performance on the zero-shot human action
recognition task.
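
The zero-shot pipeline described above (map a visual feature into the semantic space, then match it against the word vectors of unseen category names) can be sketched as follows. Everything concrete here is an assumption for illustration: the toy two-dimensional "word vectors", the fixed linear mapping `weights`, and the function names are hypothetical, and none of the domain-shift remedies listed above are implemented.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def project(features, weights):
    """Linear visual-to-semantic mapping; each row of weights gives one output."""
    return [sum(w * x for w, x in zip(row, features)) for row in weights]


def zero_shot_classify(features, weights, class_vectors):
    """Pick the unseen class whose word vector is closest to the projection."""
    z = project(features, weights)
    return max(class_vectors, key=lambda name: cosine(z, class_vectors[name]))


# Hypothetical mapping (in practice learned from known categories).
weights = [[1.0, 0.0, 0.0],
           [0.0, 1.0, 0.0]]
# Hypothetical word vectors for two categories never seen with video labels.
class_vectors = {"running": [1.0, 0.1], "swimming": [0.1, 1.0]}

label_a = zero_shot_classify([0.9, 0.2, 0.5], weights, class_vectors)
label_b = zero_shot_classify([0.1, 0.8, 0.3], weights, class_vectors)
```

Because the class prototypes come from text alone, a new category can be added by supplying its word vector, without any labelled video. The mapping, however, is learned on other categories, which is why the bias described above as domain shift arises.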
Lastly, we study the possibility of re-using known and manually labelled semantic crowd
attributes to recognise rare and unknown crowd behaviours. This task is termed zero-shot
crowd behaviour recognition. Crucially, we point out that given the multi-labelled nature of
semantic crowd attributes, zero-shot recognition can be improved by exploiting the co-occurrence
between attributes.
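
One simple way to quantify co-occurrence between multi-label attributes is to estimate, from annotated clips, the conditional probability that attribute b is present given that attribute a is present. The sketch below does only that. It is an assumption about how such statistics might be gathered, not the thesis's method, and the attribute names and the function `cooccurrence_probabilities` are hypothetical.

```python
from collections import Counter
from itertools import permutations


def cooccurrence_probabilities(annotations):
    """Estimate P(b present | a present) for every ordered attribute pair seen."""
    single = Counter()
    pair = Counter()
    for labels in annotations:
        unique = set(labels)
        single.update(unique)                  # clips containing each attribute
        pair.update(permutations(unique, 2))   # clips containing both, per order
    return {(a, b): pair[(a, b)] / single[a] for (a, b) in pair}


# Toy multi-label annotations, one set of attributes per clip.
clips = [
    {"outdoor", "crowded"},
    {"outdoor", "crowded"},
    {"outdoor", "fight"},
    {"indoor", "crowded"},
]
probs = cooccurrence_probabilities(clips)
```

In this toy data every clip labelled "fight" is also "outdoor", so `probs[("fight", "outdoor")]` is 1.0, while "outdoor" implies "crowded" in only two of three clips. Statistics of this kind could, for example, be used to re-weight attribute predictions for an unseen behaviour.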
To summarise, this thesis studies methods for analysing video behaviours and demonstrates
that exploring semantic spaces for video analysis is advantageous and more importantly enables
multi-scene analysis and zero-shot learning beyond conventional learning strategies.