Minimising Human Annotation for Scalable Person Re-Identification
PhD
Among the diverse tasks performed by an intelligent distributed multi-camera surveillance system,
person re-identification (re-id) is one of the most essential. Re-id refers to associating an
individual or a group of people across non-overlapping cameras at different times and locations,
and forms the foundation of a variety of applications ranging from security and forensic search
to quotidian retail and health care. Although re-id has attracted rapidly increasing academic interest
over the past decade, launching a practical re-id system in real-world environments remains a
non-trivial and unsolved problem, due to the ambiguous and noisy nature of surveillance
data and the potentially dramatic visual appearance changes caused by uncontrolled variations in
human poses and divergent viewing conditions across distributed camera views.
To mitigate such visual ambiguity and appearance variations, most existing re-id approaches
rely on constructing fully supervised machine learning models with extensively labelled training
datasets, which is unscalable for practical real-world applications. In particular, human annotators
must exhaustively search over a vast quantity of offline collected data and manually label
cross-view matched images of a large population between every possible camera pair. Moreover,
even after this prohibitively expensive human effort has been spent, a trained re-id model often
does not generalise or transfer easily, due to the elastic and dynamic operating conditions
of a surveillance system. Motivated by this, this thesis proposes several scalable re-id approaches
with significantly reduced human supervision that can be readily applied in practice.
More specifically, this thesis has developed and investigated four new approaches for reducing
human labelling effort in real-world re-id as follows:
Chapter 3 The first approach is affinity mining from unlabelled data. Unlike most
existing supervised approaches, this work aims to model the discriminative information for re-id
not from human annotations but from the vast amount of unlabelled person image
data, and is thus applicable to both semi-supervised and unsupervised re-id. This is non-trivial since the
human-annotated identity matching correspondence is often the key to discriminative re-id modelling.
In this chapter, an alternative strategy is explored by specifically mining two types of
affinity relationships among unlabelled data: (1) inter-view data affinity and (2) intra-view data
affinity. In particular, with such affinity information encoded as constraints, a Regularised Kernel
Subspace Learning model is developed to explicitly reduce inter-view appearance variations
and simultaneously enhance intra-view appearance disparity for more discriminative re-id matching.
Consequently, annotation costs can be greatly reduced, and a scalable re-id model can readily
leverage plentiful unlabelled data, which is inexpensive to collect.
Chapter 4 The second approach is saliency discovery from unlabelled data. This chapter
continues to investigate what can be learned from unlabelled images without identity
labels annotated by humans. Instead of the affinity mining proposed in Chapter 3, a different solution
is proposed: discovering localised visual saliency in person appearances.
Intuitively, salient and atypical appearances can uniquely and representatively
describe and identify an individual, whilst often also being robust to view changes and detection variances.
Motivated by this, an unsupervised Generative Topic Saliency model is proposed to jointly
perform foreground extraction, saliency detection, as well as discriminative re-id matching. This
approach completely avoids the exhaustive annotation effort for model training, and thus better
scales to real-world applications. Moreover, its automatically discovered re-id saliency representations
are shown to be semantically interpretable, suitable for generating useful visual analysis
for deployable user-oriented software tools.
Chapter 5 The third approach is incremental learning from actively labelled data. Since
learning from unlabelled data alone yields less discriminative matching results, and in some cases
there will be limited human labelling resources available for re-id modelling, this chapter
investigates how to maximise a model’s discriminative capability with minimal
labelling effort. The challenges are to (1) automatically select the most representative data from
a vast amount of noisy or ambiguous unlabelled data in order to maximise model discrimination
capacity; and (2) incrementally update the model parameters to accelerate machine responses
and reduce human waiting time. To that end, this thesis proposes a regression based re-id model,
characterised by its very fast and efficient incremental model updates. Furthermore, an effective
active data sampling algorithm with three novel joint exploration-exploitation criteria is designed,
to make automatic data selection feasible with notably reduced human labelling costs. Such an
approach ensures that annotation effort is spent only on the few data samples that are most critical
to the model’s generalisation capability, instead of being exhausted by blindly labelling many noisy
and redundant training samples.
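The abstract does not give the update rule of the regression-based model. As a generic illustration of why a regression model can absorb one newly labelled sample cheaply, the sketch below uses a standard recursive least-squares step for ridge regression. The function name `rls_update`, the two-dimensional toy data and the regulariser `lam` are assumptions made for illustration and are not taken from the thesis.

```python
def rls_update(P, w, x, y):
    """One recursive least-squares step for a scalar regression target.

    P: current inverse of (lam * I + X^T X), as a d x d list of lists
    w: current weight vector (length d)
    x: feature vector of the newly labelled sample (length d)
    y: target value of the newly labelled sample
    Returns the updated (P, w) without re-solving the full regression.
    """
    d = len(x)
    # P is symmetric, so P x is all that is needed for the rank-one update.
    Px = [sum(P[i][j] * x[j] for j in range(d)) for i in range(d)]
    denom = 1.0 + sum(x[i] * Px[i] for i in range(d))
    gain = [v / denom for v in Px]
    # Prediction error of the current model on the new sample.
    error = y - sum(w[i] * x[i] for i in range(d))
    w_new = [w[i] + gain[i] * error for i in range(d)]
    # Sherman-Morrison update of the inverse matrix.
    P_new = [[P[i][j] - gain[i] * Px[j] for j in range(d)] for i in range(d)]
    return P_new, w_new


# Hypothetical toy stream generated by y = 2 * x1 + 3 * x2.
lam = 1e-6  # assumed ridge regulariser
P = [[1.0 / lam, 0.0], [0.0, 1.0 / lam]]
w = [0.0, 0.0]
samples = [([1.0, 0.0], 2.0), ([0.0, 1.0], 3.0), ([1.0, 1.0], 5.0)]
for x, y in samples:
    P, w = rls_update(P, w, x, y)
# w is now close to [2.0, 3.0]
```

Each step costs on the order of d squared operations for d features, independent of how many samples have already been seen, which is the property that keeps human waiting time short in an interactive labelling loop.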
Chapter 6 The last technical area of this thesis is human-in-the-loop learning from relevance
feedback. Whilst the previous chapters mainly investigate techniques to reduce human supervision for
model training, this chapter motivates a novel research area to further minimise the human effort
spent in the re-id deployment stage. In real-world applications where the camera network and potential
gallery size increase dramatically, even state-of-the-art re-id models yield much
poorer re-id performance, and human involvement at the deployment stage is inevitable. To minimise
such human efforts and maximise re-id performance, this thesis explores an alternative
approach to re-id by formulating a hybrid human-computer learning paradigm with humans in
the model matching loop. Specifically, a Human Verification Incremental Learning model is formulated
which does not require any pre-labelled training data and is therefore scalable to new camera
pairs. Moreover, the proposed model learns cumulatively from human feedback to provide an instant
improvement to the re-id ranking of each probe on-the-fly, and is thus scalable to large gallery sizes. It
has been demonstrated that the proposed re-id model achieves significantly superior re-id results
whilst consuming much less human supervision effort.
To facilitate a holistic understanding of this thesis, the main studies are summarised
and framed into a graphical abstract as shown in Figur
Real-world Human Re-identification: Attributes and Beyond.
PhD
Surveillance systems capable of performing a diverse range of tasks that support human intelligence
and analytical efforts are becoming widespread and crucial due to increasing threats
upon national infrastructure and evolving business and governmental analytical requirements.
Surveillance data can be critical for crime-prevention, forensic analysis, and counter-terrorism
activities in civilian and governmental agencies alike. However, visual surveillance data
must currently be parsed by trained human operators, so its utility is offset by the
inherent training and staffing costs. The automated analysis of surveillance video is
therefore of great scientific interest. One of the open problems within this area is that of reliably
matching humans between disjoint surveillance camera views, termed re-identification.
Automated re-identification facilitates human operational efficiency in the grouping of disparate
and fragmented people observations through space and time into individual personal identities,
a pre-requisite for higher-level surveillance tasks. However, due to the complex nature of real-world
scenes and the highly variable nature of human appearance, reliably re-identifying people
is non-trivial.
Most re-identification approaches developed so far rely on low-level visual feature matching
approaches that aim to match human detections against a known gallery of potential matches.
However, for many applications an initial detection of a human may be unavailable or a low-level
feature representation may not be sufficiently invariant to photometric or geometric variability
inherent between camera views. This thesis begins by proposing a “mid-level” human-semantic
representation that applies expert human knowledge of surveillance task execution to the task
of re-identifying people in order to compute an attribute-based description of a human. It further
shows how this attribute-based description is synergistic with low-level data-derived features,
enhancing re-identification accuracy, and how further performance benefits are gained
by employing a discriminatively learned distance metric. Finally, a novel “zero-shot” scenario is
proposed in which a visual probe is unavailable but re-identification is still possible via a manually
provided semantic attribute description. The approach is extensively evaluated using several
public benchmark datasets.
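As a rough illustration of the “zero-shot” idea, the sketch below ranks gallery entries against a manually provided attribute description, with no probe image involved. The binary attribute names, the gallery contents and the use of Hamming distance are all assumptions made for this example. The abstract does not specify the attribute ontology or the matching function used in the thesis.

```python
def rank_gallery_by_attributes(query, gallery):
    """Rank gallery ids by Hamming distance to a binary attribute query.

    query:   list of 0/1 attribute values supplied by a human operator
    gallery: dict mapping person id -> list of 0/1 attribute values
             (in practice these would come from attribute detectors)
    Returns the person ids sorted from best to worst match.
    """
    def hamming(a, b):
        return sum(1 for u, v in zip(a, b) if u != v)

    return sorted(gallery, key=lambda pid: hamming(query, gallery[pid]))


# Hypothetical attributes: [wears_hat, carries_bag, dark_jacket, long_hair]
gallery = {
    "person_a": [1, 0, 1, 0],
    "person_b": [0, 1, 1, 1],
    "person_c": [0, 1, 0, 1],
}
# Operator description: no hat, carries a bag, dark jacket, long hair.
query = [0, 1, 1, 1]
ranking = rank_gallery_by_attributes(query, gallery)
print(ranking)  # prints ['person_b', 'person_c', 'person_a']
```

A real system would more plausibly compare soft attribute scores rather than hard 0/1 values, but the ranking step has the same shape.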
One challenge in constructing an attribute-based and human-semantic representation is the
requirement for extensive annotation. Mitigating this annotation cost, in order to present a realistic
and scalable re-identification system, is the motivation for the second technical area of this thesis,
where transfer learning and data mining are investigated in two different approaches. Discriminative
methods trade annotation cost for enhanced performance. Because discriminative person
re-identification models operate between two camera views, annotation cost scales
quadratically with the number of cameras in the entire network. For practical re-identification, this
is an unreasonable expectation and prohibitively expensive. By leveraging flexible multi-source
transfer of re-identification models, part of this cost may be alleviated. Specifically, it is possible
to leverage prior re-identification models learned for a set of source-view pairs (domains), and
flexibly combine those to obtain good re-identification performance for a given target-view pair
with greatly reduced annotation requirements.
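The quadratic growth mentioned above follows from counting unordered camera pairs: a network of n cameras has n(n-1)/2 of them. The short sketch below tabulates that count for a few network sizes. The function name `camera_pairs` and the chosen sizes are illustrative only.

```python
from math import comb


def camera_pairs(num_cameras: int) -> int:
    """Number of unordered camera pairs in a network of num_cameras cameras."""
    return comb(num_cameras, 2)


# Each pair would need its own labelled cross-view training data
# if a separate discriminative model were trained per pair.
for n in (2, 10, 100):
    print(n, camera_pairs(n))
# prints:
# 2 1
# 10 45
# 100 4950
```

Going from 10 to 100 cameras multiplies the number of view pairs by 110, which is why reusing models learned on existing source-view pairs is attractive.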
The volume of exhaustive annotation effort required for attribute-driven re-identification
scales linearly with the number of cameras and attributes. Real-world operation of an attribute-enabled,
distributed camera network would therefore also require prohibitive quantities of annotation effort
by human experts. This effort is completely avoided by taking a data-driven approach to attribute
computation, learning an effective associated representation from large volumes of crawled
Internet data. By training on a larger and more diverse array of examples, this representation
is more view-invariant and generalisable than attributes trained on conventional scales. These
automatically discovered attributes are shown to provide a valuable representation that significantly
improves re-identification performance. Moreover, a method to map them onto existing
expert-annotated ontologies is contributed.
In the final contribution of this thesis, the underlying assumptions about visual surveillance
equipment and re-identification are challenged and the thesis motivates a novel research area
using dynamic, mobile platforms. Such platforms violate the common assumption shared by
most previous research, namely that surveillance devices are always stationary, relative to the
observed scene. The most important new challenge discovered in this exciting area is that
unconstrained video is too challenging for traditional discriminative methods
that rely on the explicit modelling of appearance translations when modelling view-pairs,
or even a single view. A new dataset was collected by a remote-operated vehicle using control
software developed to simulate a fully-autonomous re-identification unmanned aerial vehicle programmed
to fly in proximity with humans until images of sufficient quality for re-identification
are obtained. Variations of the standard re-identification model are investigated in an enhanced
re-identification paradigm, and new challenges with this distinct form of re-identification are elucidated.
Finally, conventional wisdom regarding re-identification in light of these observations is
re-examined