This thesis is concerned with improving the effectiveness of nearest neighbour search.
Nearest neighbour search is the problem of finding the most similar data-points to a query in a database, and is a fundamental operation that has found application in many fields. In this thesis the focus is placed on hashing-based approximate nearest neighbour search methods that generate similar binary hashcodes for similar data-points. These hashcodes can be used as indices into the buckets of hashtables for fast search. This work explores how the quality of search can be improved by learning task-specific binary hashcodes.
The generation of a binary hashcode comprises two main steps, carried out sequentially: projection of the image feature vector onto the normal vectors of a set of hyperplanes that partition the input feature space, followed by a quantisation operation that binarises the resulting projections against a single threshold to obtain the hashcode.
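For concreteness, this two-step pipeline can be sketched as follows. This is a minimal illustration only: the hyperplane matrix W and threshold vector t stand in for whatever a particular method samples or learns.

```python
import numpy as np

def generate_hashcodes(X, W, t):
    """Two-step hashcode generation: project, then binarise.

    X: (n, d) data-points; W: (d, b) hyperplane normal vectors;
    t: (b,) one quantisation threshold per projected dimension
    (commonly zero for mean-centred data).
    """
    P = X @ W                        # step 1: projection onto the normals
    return (P > t).astype(np.uint8)  # step 2: single-threshold quantisation
```

Each row of the resulting matrix is a b-bit hashcode that can serve directly as the key of a hashtable bucket.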
The degree to which these operations preserve the relative distances between the data-points in the input feature space has a direct influence on the effectiveness of using
the resulting hashcodes for nearest neighbour search. In this thesis I argue that the
retrieval effectiveness of existing hashing-based nearest neighbour search methods can
be increased by learning the thresholds and hyperplanes based on the distribution of
the input data.
The first contribution is a model for learning multiple quantisation thresholds. I demonstrate that the best threshold positioning is projection-specific and introduce a novel clustering algorithm for threshold optimisation.
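A minimal sketch of this idea, under the assumption that thresholds are placed at the midpoints between adjacent one-dimensional cluster centres (the clustering objective developed in the thesis differs), might read:

```python
import numpy as np
from sklearn.cluster import KMeans

def learn_thresholds(projections, n_thresholds):
    """Cluster the projections of one hyperplane (a 1-D problem) and
    place a threshold midway between each pair of adjacent centres."""
    km = KMeans(n_clusters=n_thresholds + 1, n_init=10)
    km.fit(projections.reshape(-1, 1))
    centres = np.sort(km.cluster_centers_.ravel())
    return (centres[:-1] + centres[1:]) / 2.0

def quantise_multi(projections, thresholds):
    """Binarise one projected dimension against several thresholds,
    yielding multiple bits per hyperplane (a thermometer-style code)."""
    return (projections[:, None] > thresholds[None, :]).astype(np.uint8)
```

Because the thresholds are fitted to the empirical distribution of each projected dimension, their positions are projection-specific rather than fixed in advance.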
The second contribution extends this algorithm by learning the optimal allocation of quantisation thresholds per hyperplane. In doing so I argue that some hyperplanes are naturally more effective than others at capturing the distribution of the data and should therefore attract a greater share of the threshold budget.
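Purely as an illustration of the allocation idea (the scoring function and greedy rule below are assumptions for exposition, not the objective used in the thesis), a fixed threshold budget could be distributed as follows:

```python
import numpy as np

def allocate_thresholds(scores, budget):
    """Toy greedy allocation: repeatedly grant one more threshold to the
    hyperplane with the highest score-weighted marginal gain, assuming
    diminishing returns for each extra threshold on the same hyperplane.

    scores: (b,) per-hyperplane quality scores, higher = more informative;
    budget: total number of thresholds to distribute.
    """
    alloc = np.zeros(len(scores), dtype=int)
    for _ in range(budget):
        gain = scores / (alloc + 1)  # diminishing returns
        alloc[np.argmax(gain)] += 1
    return alloc
```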
The third contribution focuses on the complementary problem of learning the hashing hyperplanes. I introduce a multi-step iterative model that, in the first step, regularises the hashcodes over a data-point adjacency graph, which encourages similar data-points to be assigned similar hashcodes. In the second step, binary classifiers are learnt to separate opposing bits with maximum margin. This algorithm is extended to learn hyperplanes that can generate similar hashcodes for similar data-points in two different feature spaces (e.g. text and images).
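The two steps can be rendered schematically as follows. This is an illustrative sketch under simplifying assumptions (sign-based smoothing over a row-normalised adjacency matrix S, one linear SVM per bit, and every bit column retaining both classes), not the exact formulation developed in the thesis:

```python
import numpy as np
from sklearn.svm import LinearSVC

def refine_hashcodes(B, S, X, alpha=0.5, n_iters=3):
    """B: (n, b) initial bits in {-1, +1}; S: (n, n) row-normalised
    adjacency matrix over the data-points; X: (n, d) features."""
    for _ in range(n_iters):
        # Step 1: graph regularisation -- blend each point's bits with its
        # neighbours' so adjacent points drift towards equal hashcodes.
        B = np.sign(alpha * B + (1.0 - alpha) * (S @ B))
        B[B == 0] = 1
        # Step 2: max-margin separation -- fit one linear classifier per
        # bit; its weight vector becomes the re-learnt hyperplane.
        W = np.stack(
            [LinearSVC().fit(X, B[:, k]).coef_.ravel()
             for k in range(B.shape[1])],
            axis=1,
        )
        B = np.sign(X @ W)  # regenerate hashcodes from the new hyperplanes
        B[B == 0] = 1
    return W, B
```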
Individually, the performance of these algorithms is often superior to that of competitive baselines. I unify my contributions by demonstrating that learning the hyperplanes and thresholds as part of the same model can yield an additive increase in retrieval effectiveness.