thesis

New Algorithms andMethodology for Analysing Distances

Abstract

Distances arise in a wide variety of di�erent contexts, one of which is partitional clustering, that is, the problem of �nding groups of similar objects within a set of objects.¿ese groups are seemingly very easy to �nd for humans, but very di�cult to �nd for machines as there are two major di�culties to be overcome: the �rst de�ning an objective criterion for the vague notion of “groups of similar objects”, and the second is the computational complexity of �nding such groups given a criterion. In the �rst part of this thesis, we focus on the �rst di�culty and show that even seemingly similar optimisation criteria used for partitional clustering can produce vastly di�erent results. In the process of showing this we develop a new metric for comparing clustering solutions called the assignment metric. We then prove some new NP-completeness results for problems using two related “sum-of-squares” clustering criteria. Closely related to partitional clustering is the problem of hierarchical clustering. We extend and formalise this problem to the problem of constructing rooted edge-weighted X-trees, that is trees with a leafset X. It is well known that an X-tree can be uniquely reconstructed from a distance on X if the distance is an ultrametric. But in practice the complete distance on X may not always be available. In the second part of this thesis we look at some of the circumstances under which a tree can be uniquely reconstructed from incomplete distance information. We use a concept called a lasso and give some theoretical properties of a special type of lasso. We then develop an algorithm which can construct a tree together with a lasso from partial distance information and show how this can be applied to various incomplete datasets

    Similar works