Semantic Spaces for Video Analysis of Behaviour
PhD
There is ever-growing interest from the computer vision community in human behaviour
analysis based on visual sensors. This interest generally includes: (1) behaviour recognition -
given a video clip or a specific spatio-temporal volume of interest, classify it into one or more
of a set of pre-defined categories; (2) behaviour retrieval - given a video or textual description
as a query, search for video clips with related behaviour; (3) behaviour summarisation - given a
number of video clips, summarise representative and distinct behaviours. Although considerable
effort has been dedicated to the problems mentioned above, few works have attempted to
analyse human behaviours in a semantic space. In this thesis, we define semantic spaces as a
collection of high-dimensional Euclidean spaces in which semantically meaningful events, e.g. individual
words, phrases and visual events, can be represented as vectors or distributions, which are
referred to as semantic representations. Within a semantic space, texts and visual events
can be quantitatively compared by inner product, distance and divergence. The introduction of
semantic spaces can bring many benefits to visual analysis. For example, discovering semantic
representations for visual data can facilitate semantically meaningful video summarisation, retrieval
and anomaly detection. A semantic space can also seamlessly bridge categories and datasets that
are conventionally treated independently. This encourages the sharing of data and knowledge
across categories and even datasets to improve recognition performance and reduce labelling effort.
Moreover, a semantic space can generalise learned models beyond known classes,
which is usually referred to as zero-shot learning. Nevertheless, discovering such a semantic
space is non-trivial because: (1) a semantic space is hard to define manually. Humans generally have
a good sense of the semantic relatedness between visual and textual instances, but a
measurable and finite semantic space can be difficult to construct with limited manual supervision.
As a result, the semantic space is instead learned from data in an unsupervised
manner; (2) it is hard to build a universal semantic space, i.e. the space is always context
dependent. It is therefore important to build the semantic space upon selected data such that it is always
meaningful within the context. Even with a well-constructed semantic space, challenges
remain, including: (3) how to represent visual instances in the semantic space; and (4) how to mitigate
the misalignment of visual feature and semantic spaces across categories and even datasets
when knowledge and data are generalised. This thesis tackles the above challenges by exploiting data
from different sources and building contextual semantic spaces with which data and knowledge
can be transferred and shared to facilitate general video behaviour analysis.
To demonstrate the efficacy of semantic space for behaviour analysis, we focus on studying
real-world problems including surveillance behaviour analysis, zero-shot human action recognition
and zero-shot crowd behaviour recognition with techniques specifically tailored for the
nature of each problem.
Firstly, for video surveillance scenes, we propose to discover semantic representations from
the visual data in an unsupervised manner. This is motivated by the wide availability of unlabelled
visual data in surveillance systems. By representing visual instances in the semantic space, data
and annotations can be generalised to new events and even new surveillance scenes. Specifically,
to detect abnormal events, this thesis studies a geometrical alignment between semantic representations
of events across scenes. Semantic actions can thus be transferred to new scenes and
abnormal events can be detected in an unsupervised way. To model multiple surveillance scenes
simultaneously, we show how to learn a shared semantic representation across a group of semantically
related scenes through a multi-layer clustering of scenes. With multi-scene modelling we
show how to improve surveillance tasks including scene activity profiling/understanding, cross-scene
query-by-example, behaviour classification, and video summarisation.
Secondly, to avoid extremely costly and ambiguous video annotation, we investigate how
to generalise recognition models learned from known categories to novel ones, which is often
termed zero-shot learning. To exploit the limited human supervision available, e.g. category names,
we construct the semantic space via a word-vector representation trained on a large textual corpus
in an unsupervised manner. The representation of a visual instance in the semantic space is obtained by
learning a visual-to-semantic mapping. We notice that blindly applying the mapping learned
from known categories to novel categories can cause bias and degrade performance,
a problem termed domain shift. To solve this problem we employ techniques including semi-supervised
learning, self-training, hubness correction, multi-task learning and domain adaptation.
In combination, these methods achieve state-of-the-art performance in the zero-shot human action
recognition task.
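The visual-to-semantic mapping described in this paragraph can be illustrated with a minimal sketch. It assumes a linear mapping fitted by ridge regression and nearest-prototype matching by cosine similarity; the abstract does not state either choice. The function names, the synthetic features and the stand-in "word vectors" are all hypothetical, and none of the domain-shift remedies listed above are included.

```python
import numpy as np


def fit_visual_to_semantic(X, S, lam=1.0):
    """Fit a linear map W (assumed ridge regression) so that X @ W approximates S.

    X: (n, d) visual features; S: (n, k) word vector of each sample's class.
    """
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ S)


def zero_shot_predict(X, W, prototypes):
    """Label each row of X with the index of the most cosine-similar prototype."""
    Z = X @ W
    Z = Z / (np.linalg.norm(Z, axis=1, keepdims=True) + 1e-12)
    P = prototypes / (np.linalg.norm(prototypes, axis=1, keepdims=True) + 1e-12)
    return np.argmax(Z @ P.T, axis=1)


def demo_zero_shot(seed=0):
    """Synthetic check: train on 20 'seen' classes, test on 3 'unseen' ones."""
    rng = np.random.default_rng(seed)
    k, d, per_class = 5, 10, 30
    A = rng.normal(size=(k, d))  # hypothetical semantic-to-visual generator

    def sample(protos):
        labels = np.repeat(np.arange(len(protos)), per_class)
        X = protos[labels] @ A + 0.05 * rng.normal(size=(len(labels), d))
        return X, labels

    seen = rng.normal(size=(20, k))   # stand-ins for seen-class word vectors
    unseen = 2.0 * np.eye(k)[:3]      # stand-ins for unseen-class word vectors
    X_train, y_train = sample(seen)
    W = fit_visual_to_semantic(X_train, seen[y_train])
    X_test, y_test = sample(unseen)
    return float(np.mean(zero_shot_predict(X_test, W, unseen) == y_test))
```

Calling `demo_zero_shot()` returns the fraction of unseen-class samples labelled correctly on this toy data. The mapping is fitted only on seen classes, which is exactly the setting in which the domain shift mentioned above can arise on real data.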
Lastly, we study the possibility of re-using known and manually labelled semantic crowd
attributes to recognise rare and unknown crowd behaviours. This task is termed zero-shot
crowd behaviour recognition. Crucially, we point out that given the multi-labelled nature of
semantic crowd attributes, zero-shot recognition can be improved by exploiting the co-occurrence
between attributes.
To summarise, this thesis studies methods for analysing video behaviours and demonstrates
that exploring semantic spaces for video analysis is advantageous and, more importantly, enables
multi-scene analysis and zero-shot learning beyond conventional learning strategies.
DOMAIN ADAPTIVE OBJECT RECOGNITION AND DETECTION
Discriminative learning algorithms rely on the assumption that training and test data are drawn from the same marginal probability distribution. In real world applications, however, this assumption is often violated and results in a significant performance drop. We often have sufficient labeled training data from single or multiple "source" domains but wish to learn a classifier which performs well on a "target" domain with a different distribution and no labeled training data. In visual object detection, for example, where the goal is to locate the objects of interest in a given image, it may be infeasible to collect training data to model the enormous variety of possible combinations of pose, background, resolution, and lighting conditions affecting object appearance. Thus, we generally expect to encounter instances or domains at test time for which we have seen little or no training data.
To this end, we first propose a framework for domain adaptive object recognition and detection using Transfer Component Analysis, an unsupervised domain adaptation and dimensionality reduction technique. The idea is to obtain a transformation in feature space to a latent subspace that reduces the distance between the source and target data distributions. We evaluate the effectiveness of this approach for vehicle detection using video frames from 50 different surveillance cameras.
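As an illustration of the idea in this paragraph, the following is a minimal sketch of Transfer Component Analysis in its commonly stated form: the leading eigenvectors of (KLK + mu*I)^(-1) KHK define the latent subspace. The linear kernel, the value of `mu`, the helper names and the synthetic two-domain data are assumptions made for the example, not details taken from the abstract.

```python
import numpy as np


def tca_linear(Xs, Xt, n_components=2, mu=1.0):
    """Sketch of Transfer Component Analysis with a linear kernel (an assumption).

    Returns source and target data projected onto a shared latent subspace.
    """
    X = np.vstack([Xs, Xt])
    ns, nt = len(Xs), len(Xt)
    n = ns + nt
    K = X @ X.T                                   # linear kernel matrix
    e = np.concatenate([np.full(ns, 1.0 / ns), np.full(nt, -1.0 / nt)])
    L = np.outer(e, e)                            # MMD coefficient matrix
    H = np.eye(n) - np.full((n, n), 1.0 / n)      # centering matrix
    # Leading eigenvectors of (K L K + mu I)^{-1} K H K span the subspace.
    M = np.linalg.solve(K @ L @ K + mu * np.eye(n), K @ H @ K)
    vals, vecs = np.linalg.eig(M)
    top = np.argsort(-vals.real)[:n_components]
    Z = K @ vecs[:, top].real
    return Z[:ns], Z[ns:]


def standardized_mean_gap(A, B):
    """Norm of the difference of domain means, each feature scaled by pooled std."""
    scale = np.vstack([A, B]).std(axis=0) + 1e-12
    return float(np.linalg.norm((A.mean(axis=0) - B.mean(axis=0)) / scale))


def demo_tca(seed=0):
    """Synthetic domains that differ by a shift along one low-variance feature."""
    rng = np.random.default_rng(seed)
    stds = np.array([3.0, 3.0, 1.0, 1.0, 1.0])
    Xs = rng.normal(size=(100, 5)) * stds
    Xt = rng.normal(size=(100, 5)) * stds
    Xt[:, 4] += 2.0                               # the domain shift
    Zs, Zt = tca_linear(Xs, Xt)
    return standardized_mean_gap(Xs, Xt), standardized_mean_gap(Zs, Zt)
```

`demo_tca()` returns the standardized gap between the domain means before and after projection. On this toy data the projection favours high-variance directions along which the two domains have matching means, so the gap should shrink. A detector would then be trained on the projected source data and applied to the projected target data.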
Next, we explore the problem of extreme class imbalance present when performing fully unsupervised domain adaptation for object detection. The main challenge arises from the fact that images in unconstrained settings are mostly occupied by the background (negative class). Therefore, random sampling will not be effective in obtaining a sufficient number of positive samples from the target domain, which is required by any adaptation method. We propose a variation of the co-learning technique that automatically constructs a more balanced set of samples from the target domain. We compare the performance of our technique with other approaches such as unbiased learning from multiple datasets and self-learning.
Finally, we propose a novel approach for unsupervised domain adaptation. Our method learns a set of binary attributes for classification that captures the structural information of the data distribution in the target domain itself. The key insight is finding attributes that are discriminative across categories and predictable across domains. We formulate our optimization problem to learn these attributes and the classifier jointly. We evaluate the performance of our method on a wide range of tasks including cross-domain object recognition and sentiment analysis on textual data in both inductive and transductive settings. We achieve a performance that significantly exceeds the state-of-the-art results on standard benchmarks. In many cases we reach the same-domain performance, the upper bound, in unsupervised domain adaptation scenarios.
Adapting pedestrian detectors to new domains: A comprehensive review.
Successful detection and localisation of pedestrians is an important goal in computer vision, which is a core area of Artificial Intelligence. State-of-the-art pedestrian detectors proposed in the literature have reached impressive performance on certain datasets. However, it has been pointed out that these detectors tend not to perform well when applied to specific scenes that differ from the training datasets in some way. Because of this, domain adaptation approaches have recently become popular as a way to adapt existing detectors to new domains and improve performance there. There is a real need to review and critically analyse the state-of-the-art domain adaptation algorithms, especially in the area of object and pedestrian detection. In this paper, we survey the most relevant and important state-of-the-art results for domain adaptation for image and video data, with a particular focus on pedestrian detection. Areas related to domain adaptation are also included in our review. We make observations, draw conclusions from the representative papers, and give practical recommendations on which methods should be preferred in the different situations that practitioners may encounter in real life.
Learning Transferable Representations for Visual Recognition
In the last half-decade, a new renaissance of machine learning has originated from the application of convolutional neural networks to visual recognition tasks. It is believed that a combination of big curated data and novel deep learning techniques can lead to unprecedented results. However, the increasingly large training data is still a drop in the ocean compared with scenarios in the wild. In this work, we focus on learning transferable representations in neural networks to ensure the models stay robust, even given different data distributions. We present three exemplar topics in three chapters, respectively: zero-shot learning, domain adaptation, and generalizable adversarial attack. By zero-shot learning, we enable models to predict labels not seen in the training phase. By domain adaptation, we improve a model's performance on the target domain by mitigating its discrepancy from a labeled source model, without any target annotation. Finally, the generalizable adversarial attack focuses on learning an adversarial camouflage that ideally would work in every possible scenario. Despite sharing the same transfer learning philosophy, each of the proposed topics poses a unique challenge requiring a unique solution. In each chapter, we introduce the problem and present our solution to it. We also discuss other researchers' approaches and compare our solution to theirs in the experiments.
3D objects and scenes classification, recognition, segmentation, and reconstruction using 3D point cloud data: A review
Three-dimensional (3D) point cloud analysis has become an attractive
subject in realistic imaging and machine vision due to its simplicity,
flexibility and powerful capacity for visualization. The
representation of scenes and buildings using 3D shapes and formats has enabled
many applications, including automatic driving and scene and object
reconstruction. Nevertheless, working with this emerging type of data has
been a challenging task for object representation, scene recognition,
segmentation, and reconstruction. In this regard, a significant effort has
recently been devoted to developing novel strategies, using different
techniques such as deep learning models. To that end, we present in this paper
a comprehensive review of existing tasks on 3D point cloud: a well-defined
taxonomy of existing techniques is performed based on the nature of the adopted
algorithms, application scenarios, and main objectives. Various tasks performed
on 3D point cloud data are investigated, including object and scene
detection, recognition, segmentation and reconstruction. In addition, we
introduce a list of used datasets, we discuss respective evaluation metrics and
we compare the performance of existing solutions to better inform the
state-of-the-art and identify their limitations and strengths. Lastly, we
elaborate on current challenges facing the subject of technology and future
trends attracting considerable interest, which could be a starting point for
upcoming research studies.