LCrowdV: Generating Labeled Videos for Simulation-based Crowd Behavior Learning
We present a novel procedural framework to generate an arbitrary number of
labeled crowd videos (LCrowdV). The resulting crowd video datasets are used to
design accurate algorithms or training models for crowded scene understanding.
Our overall approach is composed of two components: a procedural simulation
framework for generating crowd movements and behaviors, and a procedural
rendering framework to generate different videos or images. Each video or image
is automatically labeled based on the environment, number of pedestrians,
density, behavior, flow, lighting conditions, viewpoint, noise, etc.
Furthermore, we can increase the realism by combining synthetically-generated
behaviors with real-world background videos. We demonstrate the benefits of
LCrowdV over prior labeled crowd datasets by improving the accuracy of
pedestrian detection and crowd behavior classification algorithms. LCrowdV
would be released on the WWW.
Automatic object classification for surveillance videos.
PhD
The recent popularity of surveillance video systems, especially those located in urban
scenarios, demands the development of visual techniques for monitoring purposes.
A primary step towards intelligent surveillance video systems consists of automatic
object classification, which still remains an open research problem and the keystone
for the development of more specific applications.
Typically, object representation is based on the inherent visual features. However,
psychological studies have demonstrated that human beings can routinely categorise
objects according to their behaviour. The gap in understanding
between the features automatically extracted by a computer, such as appearance-based
features, and the concepts unconsciously perceived by human beings but
unattainable for machines, such as behaviour features, is most commonly known
as the semantic gap. Consequently, this thesis proposes to narrow the semantic gap
and bring together machine and human understanding for object classification.
Thus, a Surveillance Media Management framework is proposed to automatically detect and
classify objects by analysing the physical properties inherent in their appearance
(machine understanding) and the behaviour patterns which require a higher level of
understanding (human understanding). Finally, a probabilistic multimodal fusion
algorithm bridges the gap by performing an automatic classification that considers both
machine and human understanding.
The performance of the proposed Surveillance Media Management framework
has been thoroughly evaluated on outdoor surveillance datasets. The experiments
conducted demonstrated that the combination of machine and human understanding
substantially enhanced the object classification performance. Finally, the inclusion
of human reasoning and understanding provides the essential information to bridge
the semantic gap towards smart surveillance video systems.
Semantic Spaces for Video Analysis of Behaviour
PhD
There is ever-growing interest in the computer vision community in human behaviour
analysis based on visual sensors. This interest generally includes: (1) behaviour recognition -
given a video clip or a specific spatio-temporal volume of interest, classify it into one or more
of a set of pre-defined categories; (2) behaviour retrieval - given a video or textual description
as a query, search for video clips with related behaviour; (3) behaviour summarisation - given a
number of video clips, summarise representative and distinct behaviours. Although countless
efforts have been dedicated to the problems mentioned above, few works have attempted to
analyse human behaviours in a semantic space. In this thesis, we define semantic spaces as a
collection of high-dimensional Euclidean spaces in which semantically meaningful events, e.g. an
individual word, phrase or visual event, can be represented as vectors or distributions, which are
referred to as semantic representations. Within a semantic space, texts and visual events
can be quantitatively compared by inner product, distance and divergence. The introduction of
semantic spaces brings several benefits for visual analysis. For example, discovering semantic
representations for visual data can facilitate semantically meaningful video summarisation, retrieval
and anomaly detection. A semantic space can also seamlessly bridge categories and datasets which
are conventionally treated as independent. This encourages the sharing of data and knowledge
across categories and even datasets to improve recognition performance and reduce labelling effort.
Moreover, a semantic space can generalise a learned model beyond known classes,
which is usually referred to as zero-shot learning. Nevertheless, discovering such a semantic
space is non-trivial because: (1) a semantic space is hard to define manually. Humans have
a good sense of the semantic relatedness between visual and textual instances, but a
measurable and finite semantic space can be difficult to construct with limited manual supervision.
As a result, the semantic space is constructed from data in an unsupervised
manner; (2) it is hard to build a universal semantic space, i.e. the space is always context
dependent, so it is important to build the semantic space upon selected data such that it is
meaningful within the context. Even with a well-constructed semantic space, challenges
remain, including: (3) how to represent visual instances in the semantic space; and (4) how to mitigate
the misalignment of visual feature and semantic spaces across categories and even datasets
when knowledge/data are generalised. This thesis tackles the above challenges by exploiting data
from different sources and building contextual semantic spaces with which data and knowledge
can be transferred and shared to facilitate general video behaviour analysis.
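The abstract above says that semantic representations (vectors or distributions) can be compared by inner product, distance and divergence. The following is a minimal sketch of those three comparisons, not the thesis's implementation. The function names and toy vectors are hypothetical, and Kullback-Leibler divergence is used only as one common choice of divergence; the abstract does not say which one is meant.

```python
import math


def inner_product(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def euclidean_distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same support.

    Terms with p_i == 0 contribute nothing. Assumes q_i > 0 wherever
    p_i > 0 (an assumption of this sketch).
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical toy representations, for illustration only.
text_vec = [0.9, 0.1, 0.0]      # e.g. an embedded word or phrase
visual_vec = [0.8, 0.2, 0.1]    # e.g. an embedded visual event
topic_a = [0.7, 0.2, 0.1]       # distributions must sum to 1
topic_b = [0.5, 0.3, 0.2]

similarity = inner_product(text_vec, visual_vec)
distance = euclidean_distance(text_vec, visual_vec)
divergence = kl_divergence(topic_a, topic_b)
```

Inner product and distance apply to vector representations, while a divergence only makes sense when the representations are probability distributions.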
To demonstrate the efficacy of semantic space for behaviour analysis, we focus on studying
real world problems including surveillance behaviour analysis, zero-shot human action recognition
and zero-shot crowd behaviour recognition with techniques specifically tailored for the
nature of each problem.
Firstly, for video surveillance scenes, we propose to discover semantic representations from
the visual data in an unsupervised manner. This is due to the wide availability of unlabelled
visual data in surveillance systems. By representing visual instances in the semantic space, data
and annotations can be generalised to new events and even new surveillance scenes. Specifically,
to detect abnormal events this thesis studies a geometrical alignment between semantic representations
of events across scenes. Semantic actions can thus be transferred to new scenes and
abnormal events can be detected in an unsupervised way. To model multiple surveillance scenes
simultaneously, we show how to learn a shared semantic representation across a group of semantically
related scenes through a multi-layer clustering of scenes. With multi-scene modelling we
show how to improve surveillance tasks including scene activity profiling/understanding, cross-scene
query-by-example, behaviour classification, and video summarisation.
Secondly, to avoid extremely costly and ambiguous video annotation, we investigate how
to generalise recognition models learned from known categories to novel ones, which is often
termed zero-shot learning. To exploit the limited human supervision, e.g. category names,
we construct the semantic space via a word-vector representation trained on a large textual corpus
in an unsupervised manner. The representation of a visual instance in the semantic space is obtained by
learning a visual-to-semantic mapping. We notice that blindly applying the mapping learned
from known categories to novel categories can cause bias and degrade performance,
a problem termed domain shift. To solve this problem we employ techniques including semi-supervised
learning, self-training, hubness correction, multi-task learning and domain adaptation.
All these methods in combination achieve state-of-the-art performance in the zero-shot human action
recognition task.
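The zero-shot step described above (map a visual feature into the word-vector space, then match it against category names never seen in training) can be sketched as follows. This is a simplified illustration under stated assumptions, not the thesis's method. The linear mapping `W` is assumed to have been learned already from known categories, cosine similarity is an assumed matching rule, and every name and number is hypothetical. The domain-shift remedies listed in the abstract (self-training, hubness correction and so on) are not modelled.

```python
import math


def project(W, x):
    """Apply a linear visual-to-semantic mapping W (list of rows) to feature x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_classify(x, W, class_vectors):
    """Return the unseen class whose word vector is closest to the projected feature."""
    z = project(W, x)
    return max(class_vectors, key=lambda name: cosine(z, class_vectors[name]))


# Hypothetical 2-D example: identity mapping and two unseen categories.
W = [[1.0, 0.0],
     [0.0, 1.0]]
unseen_classes = {"run": [1.0, 0.0], "jump": [0.0, 1.0]}

prediction = zero_shot_classify([0.9, 0.2], W, unseen_classes)
```

Only the category names (through their word vectors) are needed for the unseen classes, which is what makes the approach zero-shot.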
Lastly, we study the possibility of re-using known and manually labelled semantic crowd
attributes to recognise rare and unknown crowd behaviours. This task is termed zero-shot
crowd behaviour recognition. Crucially, we point out that given the multi-labelled nature of
semantic crowd attributes, zero-shot recognition can be improved by exploiting the co-occurrence
between attributes.
To summarise, this thesis studies methods for analysing video behaviours and demonstrates
that exploring semantic spaces for video analysis is advantageous and, more importantly, enables
multi-scene analysis and zero-shot learning beyond conventional learning strategies.
Abnormal Crowd Behavior Detection Using Motion Information Images and Convolutional Neural Networks
We introduce a novel method for abnormal crowd event detection in surveillance videos.
Particularly, our work focuses on panic and escape behavior detection that may appear because of violent
events and natural disasters. First, optical flow vectors are computed to generate a motion information
image (MII) for each frame, and then MIIs are used to train a convolutional neural network (CNN) for
abnormal crowd event detection. The proposed MII is a new formulation that provides a visual appearance of
crowd motion. The proposed MIIs make the discrimination between normal and abnormal behaviors easier.
The MII is mainly based on the optical flow magnitude and the angle difference computed between the optical
flow vectors in consecutive frames. A CNN is employed to learn normal and abnormal crowd behaviors
using MIIs. The MII generation and its combination with a CNN form a new approach in the context of
abnormal crowd behavior detection. Experiments are performed on commonly used datasets such as UMN
and PETS2009. Evaluation indicates that our method achieves the best results.
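The abstract states only that the MII is based on optical flow magnitude and the angle difference between flow vectors in consecutive frames; it does not give the formula. The sketch below is therefore an assumption made for illustration: each pixel value is the current flow magnitude multiplied by the wrapped angle difference. The function names and the nested-list flow format are hypothetical. Computing the flow fields themselves and training the CNN are outside the sketch.

```python
import math


def angle_difference(a, b):
    """Smallest absolute difference between two angles, in [0, pi]."""
    d = abs(a - b) % (2 * math.pi)
    return min(d, 2 * math.pi - d)


def motion_information_image(prev_flow, curr_flow):
    """Build an MII-like map from two consecutive flow fields.

    Each flow field is a list of rows of (dx, dy) tuples. The pixel value
    magnitude * angle_difference is an assumed combination, not the
    paper's exact formulation.
    """
    mii = []
    for prev_row, curr_row in zip(prev_flow, curr_flow):
        row = []
        for (pdx, pdy), (cdx, cdy) in zip(prev_row, curr_row):
            magnitude = math.hypot(cdx, cdy)
            delta = angle_difference(math.atan2(cdy, cdx), math.atan2(pdy, pdx))
            row.append(magnitude * delta)
        mii.append(row)
    return mii


# Hypothetical 1x2 flow fields: the first pixel keeps its direction,
# the second turns by 90 degrees while moving faster.
prev_flow = [[(1.0, 0.0), (1.0, 0.0)]]
curr_flow = [[(2.0, 0.0), (0.0, 3.0)]]

mii = motion_information_image(prev_flow, curr_flow)
```

Under this assumed formulation, steady motion yields values near zero, while fast motion with an abrupt change of direction (as in panic or escape) yields large values.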