Learning a Discriminative Null Space for Person Re-identification
Most existing person re-identification (re-id) methods focus on learning the
optimal distance metrics across camera views. Typically a person's appearance
is represented using features of thousands of dimensions, whilst only hundreds
of training samples are available due to the difficulties in collecting matched
training images. With the number of training samples much smaller than the
feature dimension, the existing methods thus face the classic small sample size
(SSS) problem and have to resort to dimensionality reduction techniques and/or
matrix regularisation, which lead to loss of discriminative power. In this
work, we propose to overcome the SSS problem in re-id distance metric learning
by matching people in a discriminative null space of the training data. In this
null space, images of the same person are collapsed into a single point thus
minimising the within-class scatter to the extreme and maximising the relative
between-class separation simultaneously. Importantly, it has a fixed dimension,
a closed-form solution and is very efficient to compute. Extensive experiments
carried out on five person re-identification benchmarks including VIPeR,
PRID2011, CUHK01, CUHK03 and Market1501 show that such a simple approach beats
the state-of-the-art alternatives, often by a big margin.
Comment: accepted by CVPR2016
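To make the construction concrete, below is a minimal NumPy sketch of learning such a null-space projection: the null space of the within-class scatter is computed inside the range of the total scatter, so that same-class samples collapse to a point after projection. The function name, tolerances and the SVD-based restriction step are illustrative assumptions, not details taken from the paper.

    import numpy as np

    def null_space_projection(X, y, tol=1e-10):
        # X: (n, d) feature matrix, y: (n,) identity labels.
        # Returns W whose columns span a null space of the within-class
        # scatter inside the range of the total scatter, so that after
        # projection all images of one person collapse to a single point.
        d = X.shape[1]
        mu = X.mean(axis=0)
        St = (X - mu).T @ (X - mu)                  # total scatter
        Sw = np.zeros((d, d))                       # within-class scatter
        for c in np.unique(y):
            Xc = X[y == c]
            diff = Xc - Xc.mean(axis=0)
            Sw += diff.T @ diff
        # Restrict to range(St): discards the trivial null directions that
        # exist only because d >> n (the small sample size setting).
        U, s, _ = np.linalg.svd(St)
        U = U[:, s > tol * s.max()]
        evals, evecs = np.linalg.eigh(U.T @ Sw @ U)
        V = evecs[:, evals < tol * max(evals.max(), 1.0)]
        return U @ V                                # project with X @ (U @ V)

Gallery and probe images would then be matched by nearest-neighbour search on the projected features X @ W.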
Direct kernel biased discriminant analysis: a new content-based image retrieval relevance feedback algorithm
In recent years, a variety of relevance feedback (RF) schemes have been developed to improve the performance of content-based image retrieval (CBIR). Given user feedback information, the key to an RF scheme is how to select a subset of image features to construct a suitable dissimilarity measure. Among various RF schemes, biased discriminant analysis (BDA) based RF is one of the most promising. It is based on the observation that all positive samples are alike, while in general each negative sample is negative in its own way. However, to use BDA, the small sample size (SSS) problem is a big challenge, as users tend to give a small number of feedback samples. To explore solutions to this issue, this paper proposes a direct kernel BDA (DKBDA), which is less sensitive to SSS. An incremental DKBDA (IDKBDA) is also developed to speed up the analysis. Experimental results are reported on a real-world image collection to demonstrate that the proposed methods outperform the traditional kernel BDA (KBDA) and the support vector machine (SVM) based RF algorithms.
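For illustration, the sketch below implements the simpler linear BDA criterion rather than the kernelised DKBDA of the paper: negative samples are scattered away from the positive-sample mean while positives are kept compact, with a ridge-style regulariser standing in for the paper's SSS handling. All names and parameter values are assumptions.

    import numpy as np
    from scipy.linalg import eigh

    def bda_projection(X_pos, X_neg, n_dims=1, reg=1e-2):
        # Scatter negatives away from the POSITIVE mean while keeping
        # positives compact; `reg` is a ridge term standing in for the
        # paper's handling of the small sample size problem.
        d = X_pos.shape[1]
        m_pos = X_pos.mean(axis=0)
        Sp = (X_pos - m_pos).T @ (X_pos - m_pos)    # positive scatter
        Sn = (X_neg - m_pos).T @ (X_neg - m_pos)    # negatives about m_pos
        evals, evecs = eigh(Sn, Sp + reg * np.eye(d))
        order = np.argsort(evals)[::-1]             # largest ratio first
        return evecs[:, order[:n_dims]]

The returned directions would define the dissimilarity measure for the next retrieval round, ranking images near the positive feedback samples higher.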
Cross-View Learning
PhD thesis
Key to achieving more efficient machine intelligence is the capability to analyse and understand
data across different views, which can be camera views or modality views (such as visual and
textual). One generic learning paradigm for automatically understanding data from different
views is cross-view learning, which includes cross-view matching, cross-view fusion and
cross-view generation. Specifically, this thesis investigates two of them, cross-view matching
and cross-view generation, by developing new methods for the following specific computer
vision problems.
The first problem is cross-view matching for person re-identification, in which a person is
captured by multiple non-overlapping camera views and the objective is to match him/her across views
among a large number of imposters. Typically a person’s appearance is represented using features
of thousands of dimensions, whilst only hundreds of training samples are available due
to the difficulties in collecting matched training samples. With the number of training samples
much smaller than the feature dimension, the existing methods thus face the classic small sample
size (SSS) problem and have to resort to dimensionality reduction techniques and/or matrix
regularisation, which lead to loss of discriminative power for cross-view matching. To that end,
this thesis proposes to overcome the SSS problem in subspace learning by matching cross-view
data in a discriminative null space of the training data.
The second problem is cross-view matching for zero-shot learning, where data are drawn
from different modalities, each corresponding to a different view (e.g. visual or textual), in
contrast to the single-modality data considered in the first problem. This is inherently more
challenging as the gap between different views becomes larger. Specifically, the zero-shot
learning problem can be solved if the visual representation/view of the data (object) and its
textual view are matched. This requires learning a joint embedding space into which data from
different views can be projected for nearest neighbour search. This thesis argues that the key
to making zero-shot learning models succeed is choosing the right embedding space. Different
from most existing zero-shot learning models, which utilise a textual or an intermediate space
as the embedding space for achieving cross-view matching, the proposed method uniquely explores
the visual space as the embedding space. This thesis finds that in the visual space the
subsequent nearest neighbour search suffers much less from the hubness problem and thus becomes
more effective. Moreover, the model provides a natural mechanism for jointly optimising multiple
textual modalities in an end-to-end manner, which demonstrates significant advantages over
existing methods.
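As a sketch of this idea under stated assumptions, the snippet below performs zero-shot classification with the visual space as the embedding space: semantic class vectors are mapped into the visual feature space by a learned projection (here a placeholder function), and test images are labelled by nearest-neighbour search among the projected class prototypes.

    import numpy as np

    def zsl_classify(visual_feats, class_semantics, project):
        # `project` is a placeholder for a learned semantic-to-visual
        # mapping (e.g. a small neural network); classification is
        # nearest-neighbour search in the VISUAL space, where the
        # hubness problem is milder than in the semantic space.
        protos = project(class_semantics)           # (n_classes, d_visual)
        protos = protos / np.linalg.norm(protos, axis=1, keepdims=True)
        feats = visual_feats / np.linalg.norm(visual_feats, axis=1,
                                              keepdims=True)
        return (feats @ protos.T).argmax(axis=1)    # cosine nearest neighbour

The direction of the mapping is the design choice being illustrated: projecting semantics into the visual space, rather than visual features into a textual space, keeps the search in the space where hubs are less severe.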
The last problem is cross-view generation for image captioning, which aims to automatically
generate textual sentences from visual images. Most existing image captioning studies are limited
to investigating variants of deep learning-based image encoders, improving the inputs for the
subsequent deep sentence decoders. Existing methods have two limitations: (i) they are trained
to maximise the likelihood of each ground-truth word given the previous ground-truth words and
the image, termed Teacher-Forcing. This strategy may cause a mismatch between training and
testing since at test-time the model uses the previously generated words from the model distribution
to predict the next word. This exposure bias can result in error accumulation in sentence
generation during test time, since the model has never been exposed to its own predictions. (ii)
the training supervision metric, such as the widely used cross-entropy loss, differs from the
evaluation metrics used at test time. In other words, the model is not directly optimised
towards the task expectation, and the learned model is therefore suboptimal. One main underlying
reason is that the evaluation metrics are non-differentiable and therefore much harder to
optimise against. This thesis overcomes the above problems by exploring reinforcement learning.
Specifically, a novel actor-critic based learning approach is formulated to directly maximise
the reward, i.e. the actual natural language processing quality metrics of interest. Compared
to existing reinforcement learning based captioning models, the new method uniquely enables
per-token advantage and value computation, leading to better model training.
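A minimal PyTorch-style sketch of such a per-token actor-critic objective is given below, assuming a sampled caption whose sentence-level metric score (e.g. CIDEr) serves as the reward; the function and variable names are hypothetical and the undiscounted return is a simplifying assumption, not a detail taken from the thesis.

    import torch

    def actor_critic_caption_loss(log_probs, values, reward):
        # log_probs: (T,) log-probabilities of the sampled caption tokens
        # values:    (T,) critic estimates of the return before each token
        # reward:    scalar sentence-level metric (e.g. CIDEr) for the
        #            sampled caption, received once generation finishes
        returns = torch.full_like(values, float(reward))
        advantage = returns - values                # per-token advantage
        actor_loss = -(log_probs * advantage.detach()).sum()
        critic_loss = advantage.pow(2).sum()        # regress critic to return
        return actor_loss + critic_loss

Because the critic supplies a value estimate at every step, each token receives its own advantage signal instead of a single sentence-level baseline, which is the per-token property the paragraph above describes.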