In this thesis, we propose a new algorithm that automatically detects the number of
clusters in a tree-structured data set by denoising generalized node values in the tree
using the lifting “one coefficient at a time” (LOCAAT) algorithm introduced by Jansen et al.
(2001). Our algorithm can be applied to any multidimensional data set, using the compactness
value as the node value, or to phylogenetic data sets (DNA sequences), using either the
compactness value or the dissimilarity score as the node value. The compactness value is
defined as the average distance from the centroid of each possible cluster in the tree, and
the dissimilarity score is the average number of loci at which at least one of the sequences
under the node of interest carries a different nucleotide.
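The two node values can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the implementation used in the thesis: the function names `compactness` and `dissimilarity` are hypothetical, Euclidean distance is assumed for the compactness value, and the dissimilarity score is read here as the proportion of loci at which the sequences under a node do not all share the same nucleotide, which is only one possible reading of the definition above.

```python
import numpy as np


def compactness(points):
    """Average Euclidean distance from the points under a node to their centroid.

    Assumption: Euclidean distance; the source does not name the metric.
    """
    pts = np.asarray(points, dtype=float)
    centroid = pts.mean(axis=0)
    return float(np.linalg.norm(pts - centroid, axis=1).mean())


def dissimilarity(sequences):
    """Proportion of loci at which the aligned sequences are not all identical.

    Assumption: sequences are aligned and of equal length, and the
    "average" in the definition is taken over loci.
    """
    if len({len(seq) for seq in sequences}) != 1:
        raise ValueError("sequences must be aligned and of equal length")
    # zip(*sequences) iterates over loci (alignment columns).
    variable_loci = sum(1 for locus in zip(*sequences) if len(set(locus)) > 1)
    return variable_loci / len(sequences[0])
```

For instance, two points at (0, 0) and (2, 0) have centroid (1, 0) and a compactness value of 1.0, while the sequences `ACGT`, `ACGA` and `ACTT` differ at two of four loci, giving a dissimilarity score of 0.5 under this reading.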
For multidimensional data sets, we consider each node in the tree as a possible location
of a cluster after denoising the tree with LOCAAT. For each possible cluster, we determine
how much departure from the centroid of the cluster we can allow while still treating the
objects under the node of interest as a single cluster. If the denoised values of a node and
of all its child nodes are less than or equal to the allowed departure from the centroids of
their clusters, a cluster is located at this node. We also propose another version of our
algorithm, based on non-decimated lifting (Knight & Nason, 2009), in which we generate a
probability of being clustered for each node. If a node and all its child nodes have a
probability of being clustered less than or equal to the probability of acceptance,
θ ∈ [0, 1], a cluster is located
at this node. We compare our algorithms with internal cluster validity indices (CVIs)
available in the literature on several artificial data sets and a real data set. In
addition, we compare the performance of each method using available external cluster
validity scores.
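The cluster-location rule can be sketched as a tree traversal. This is a simplified illustration, not the algorithm of the thesis: it takes already denoised node values as input (no LOCAAT step is performed), it assumes a single allowed departure `allowed` for the whole tree, it reads "child nodes" as all descendants, and it reports only the top-most qualifying nodes. The names `Node` and `locate_clusters` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    value: float  # denoised node value, assumed to be computed beforehand
    children: list = field(default_factory=list)


def locate_clusters(root, allowed):
    """Return names of the top-most nodes whose value, and the values of
    all their descendants, are less than or equal to `allowed`."""
    within = {}

    def mark(node):
        # Post-order pass: a node qualifies only if all descendants qualify.
        children_ok = [mark(child) for child in node.children]
        within[id(node)] = node.value <= allowed and all(children_ok)
        return within[id(node)]

    def collect(node, found):
        # Pre-order pass: stop at the first qualifying node on each branch.
        if within[id(node)]:
            found.append(node.name)
        else:
            for child in node.children:
                collect(child, found)
        return found

    mark(root)
    return collect(root, [])
```

In this sketch the number of detected clusters is simply the length of the returned list; leaves that do not fall under any qualifying internal node are reported as singleton clusters.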
For phylogenetic data sets, we evaluate the performance of our algorithms and other
CVIs using both the compactness value and the dissimilarity score as the node value. To
compute the compactness value for a phylogenetic tree, we first need to find the position
of each species in R^p using multidimensional scaling (MDS); we can then identify which
species share similar features using our algorithm. If we use the dissimilarity score as the
node value, we cluster similar species together by determining how much difference we can
allow between species. We evaluate the performance of our algorithms on several artificial
data sets and a real data set.
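The embedding step can be sketched with classical (Torgerson) MDS, which recovers coordinates in R^p from a matrix of pairwise distances. The source does not state which MDS variant is used, so the choice of classical MDS, the function name `classical_mds` and the default `p=2` are assumptions made for illustration.

```python
import numpy as np


def classical_mds(distances, p=2):
    """Embed n objects in R^p from an n x n matrix of pairwise distances.

    Assumption: classical (Torgerson) MDS; other variants would differ.
    """
    d = np.asarray(distances, dtype=float)
    n = d.shape[0]
    # Double-centre the squared distances to obtain a Gram matrix.
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (d ** 2) @ centering
    # eigh returns eigenvalues in ascending order; keep the p largest.
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:p]
    # Clip small negative eigenvalues caused by non-Euclidean distances.
    scale = np.sqrt(np.clip(eigvals[order], 0.0, None))
    return eigvecs[:, order] * scale
```

The rows of the returned array are the estimated positions of the species, from which a compactness value can then be computed for each node. When the input distances are exactly Euclidean in p dimensions, the pairwise distances between the rows reproduce the input matrix up to numerical error.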
In the final part of the thesis, we propose a visualization tool for cophylogenetic data
sets. We consider only the case of two associated phylogenetic trees, and we apply our
algorithm to the host and parasite trees separately to provide a summary of these data
sets. We evaluate the performance of our algorithm on two well-known cophylogenetic
data sets.