Minimizing the average distance to a closest leaf in a phylogenetic tree
When performing an analysis on a collection of molecular sequences, it can be
convenient to reduce the number of sequences under consideration while
maintaining some characteristic of a larger collection of sequences. For
example, one may wish to select a subset of high-quality sequences that
represent the diversity of a larger collection of sequences. One may also wish
to specialize a large database of characterized "reference sequences" to a
smaller subset that is as close as possible on average to a collection of
"query sequences" of interest. Such a representative subset can be useful
whenever one wishes to find a set of reference sequences that is appropriate to
use for comparative analysis of environmentally-derived sequences, such as for
selecting "reference tree" sequences for phylogenetic placement of metagenomic
reads. In this paper we formalize these problems in terms of the minimization
of the Average Distance to the Closest Leaf (ADCL) and investigate algorithms
to perform the relevant minimization. We show that the greedy algorithm is not
effective, show that a variant of the Partitioning Around Medoids (PAM)
heuristic gets stuck in local minima, and develop an exact dynamic programming
approach. Using this exact program, we note that PAM appears to perform well
on simulated trees and is faster than the exact algorithm for
small trees. On the other hand, the exact program gives solutions for all
numbers of leaves less than or equal to the given desired number of leaves,
while PAM only gives a solution for the pre-specified number of leaves. Via
application to real data, we show that the ADCL criterion chooses chimeric
sequences less often than random subsets, while the maximization of
phylogenetic diversity chooses them more often than random. These algorithms
have been implemented in publicly available software.
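A minimal sketch of the ADCL criterion and a brute-force minimization (illustrative only: the function names and the toy distance matrix are assumptions, and the paper's exact dynamic program avoids this exponential enumeration):

```python
from itertools import combinations

def adcl(D, subset):
    """Average Distance to the Closest Leaf: average, over all leaves,
    of each leaf's distance to the nearest selected leaf."""
    return sum(min(D[i][j] for j in subset) for i in range(len(D))) / len(D)

def best_subset(D, k):
    """Brute-force ADCL minimization over all k-leaf subsets.
    Exponential in k; shown only to make the objective concrete."""
    return min(combinations(range(len(D)), k), key=lambda s: adcl(D, s))

# Hypothetical patristic distances between 4 leaves (two tight pairs).
D = [[0, 2, 6, 6],
     [2, 0, 6, 6],
     [6, 6, 0, 2],
     [6, 6, 2, 0]]
print(best_subset(D, 2))  # one leaf from each pair minimizes ADCL
```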
Faster k-Medoids Clustering: Improving the PAM, CLARA, and CLARANS Algorithms
Clustering non-Euclidean data is difficult, and one of the most widely used
algorithms besides hierarchical clustering is Partitioning Around Medoids
(PAM), also simply referred to as k-medoids. In Euclidean geometry the mean,
as used in k-means, is a good estimator for the
cluster center, but this does not hold for arbitrary dissimilarities. PAM uses
the medoid instead, the object with the smallest dissimilarity to all others in
the cluster. This notion of centrality can be used with any (dis-)similarity,
and thus is of high relevance to many domains such as biology that require the
use of Jaccard, Gower, or more complex distances.
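As a minimal illustration of this notion of centrality with a non-Euclidean dissimilarity (the helper names and toy data are hypothetical, not code from the paper):

```python
def medoid(items, dist):
    """Return the item with the smallest total dissimilarity to all
    items in the cluster; works with any dissimilarity function."""
    return min(items, key=lambda m: sum(dist(m, x) for x in items))

def jaccard(a, b):
    """Jaccard distance between two sets: 1 - |intersection| / |union|."""
    return 1 - len(a & b) / len(a | b) if (a | b) else 0.0

clusters = [{1, 2, 3}, {1, 2}, {2, 3, 4}, {7, 8}]
print(medoid(clusters, jaccard))  # {1, 2, 3} is the most central set
```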
A key issue with PAM is its high run time cost. We propose modifications to
the PAM algorithm to achieve an O(k)-fold speedup in the second SWAP phase of
the algorithm, while still finding the same results as the original PAM
algorithm. If we slightly relax the choice of swaps performed (at comparable
quality), we can further accelerate the algorithm by performing up to k swaps
in each iteration. With the substantially faster SWAP, we can now also explore
alternative strategies for choosing the initial medoids. We also show how the
CLARA and CLARANS algorithms benefit from these modifications. These
improvements can easily be combined with earlier approaches for using PAM and
CLARA on big data (some of
which use PAM as a subroutine, hence can immediately benefit from these
improvements), where the performance with high k becomes increasingly
important.
In experiments on real data with k=100, we observed a 200-fold speedup
compared to the original PAM SWAP algorithm, making PAM applicable to larger
data sets as long as we can afford to compute a distance matrix, and in
particular to higher k (at k=2, the new SWAP was only 1.5 times faster, as the
speedup is expected to increase with k).
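The shared-accumulation idea behind the SWAP speedup can be sketched as follows: caching each point's nearest and second-nearest medoid distances lets a single pass over the points evaluate one candidate against every current medoid at once, instead of once per medoid. This is a simplified illustration under assumed names and toy data, not the authors' implementation:

```python
def swap_deltas(D, points, medoids, candidate):
    """Change in total deviation if `candidate` replaces each medoid,
    computed for ALL medoids in one pass (a shared-accumulator sketch)."""
    d1, d2, nearest = {}, {}, {}
    for p in points:
        (a, m), (b, _) = sorted((D[p][mm], mm) for mm in medoids)[:2]
        d1[p], nearest[p], d2[p] = a, m, b
    deltas = {m: 0.0 for m in medoids}
    shared = 0.0
    for p in points:
        dc = D[p][candidate]
        # Any swap lets p defect to the candidate if it is closer.
        shared += min(dc, d1[p]) - d1[p]
        # Only removing p's own medoid forces p onto min(candidate, second-nearest).
        deltas[nearest[p]] += min(dc, d2[p]) - min(dc, d1[p])
    return {m: deltas[m] + shared for m in medoids}

positions = [0, 1, 5, 6, 2]  # hypothetical 1-D toy data
D = [[abs(x - y) for y in positions] for x in positions]
print(swap_deltas(D, range(5), [0, 3], 4))  # neither swap improves: all deltas >= 0
```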
Time series averaging from a probabilistic interpretation of time-elastic kernels
In the light of regularized dynamic time warping kernels, this paper
reconsiders the concept of time-elastic centroid (TEC) for a set of time
series. From this perspective, we first show how TEC can be addressed as a
preimage problem. Unfortunately, this preimage problem is ill-posed, may
suffer from over-fitting (especially for long time series), and obtaining even
a sub-optimal solution involves heavy computational costs. We then derive two
new algorithms based on a probabilistic interpretation of kernel alignment
matrices, expressed in terms of probability distributions over sets of
alignment paths. The first algorithm is an iterative agglomerative heuristic
inspired by the state-of-the-art DTW barycenter averaging (DBA) algorithm,
proposed specifically for the Dynamic Time Warping measure. The second
algorithm performs a classical averaging of the aligned samples but also
averages the times of occurrence of the aligned samples; it exploits a
straightforward progressive agglomerative heuristic. An experiment comparing,
on 45 time series datasets, the classification error rates of
first-nearest-neighbor classifiers that use a single medoid or centroid
estimate to represent each category shows that: i) centroid-based approaches
significantly outperform medoid-based approaches; ii) in the considered
experiments, the two proposed algorithms outperform the state-of-the-art DBA
algorithm; and iii) the second proposed algorithm, which averages jointly in
the sample space and along the time axis, emerges as the most robust
time-elastic averaging heuristic, with an interesting noise-reduction
capability.
Index Terms: Time series averaging, time-elastic kernels, Dynamic Time
Warping, time series clustering and classification
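For reference, the Dynamic Time Warping measure underlying DBA and the proposed heuristics can be sketched as the classic dynamic program (a minimal version for 1-D series, not the regularized kernel variant the paper builds on):

```python
def dtw(a, b):
    """Dynamic Time Warping distance via the standard O(len(a)*len(b))
    dynamic program with absolute-difference local cost."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # stretch a
                                 cost[i][j - 1],      # stretch b
                                 cost[i - 1][j - 1])  # advance both
    return cost[n][m]

print(dtw([0, 1, 2, 2], [0, 1, 1, 2]))  # 0.0: the warping absorbs the repeats
```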
Sparse Partitioning Around Medoids
Partitioning Around Medoids (PAM, k-Medoids) is a popular clustering
technique to use with arbitrary distance functions or similarities, where each
cluster is represented by its most central object, called the medoid or the
discrete median. In operations research, this family of problems is also known
as the facility location problem (FLP). FastPAM recently introduced a speedup for
large k to make it applicable for larger problems, but the method still has a
runtime quadratic in N. In this chapter, we discuss a sparse and asymmetric
variant of this problem, to be used for example on graph data such as road
networks. By exploiting sparsity, we can avoid the quadratic runtime and memory
requirements, and make this method scalable to even larger problems, as long as
we are able to build a small enough graph of sufficient connectivity to perform
local optimization. Furthermore, we consider asymmetric cases, where the set of
medoids is not identical to the set of points to be covered (or in the
interpretation of facility location, where the possible facility locations are
not identical to the consumer locations). Because of sparsity, it may be
impossible to cover all points with just k medoids when k is too small, which
would render the problem unsolvable and breaks common heuristics for finding a
good starting condition. We hence consider determining k as a part of the
optimization problem and propose to first construct a greedy initial solution
with a larger k, then to optimize the problem by alternating between PAM-style
"swap" operations where the result is improved by replacing medoids with better
alternatives and "remove" operations to reduce the number of medoids k until neither
allows further improving the result quality. We demonstrate the usefulness of
this method on a problem from electrical engineering, with the input graph
derived from cartographic data.
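The greedy build phase described above can be sketched as follows; the dictionary-based sparse distances, the UNCOVERED penalty, and all names are illustrative assumptions rather than the chapter's implementation:

```python
UNCOVERED = 1e9  # large penalty standing in for "unreachable in the sparse graph"

def greedy_init(dist, candidates, points, k):
    """Greedily add the candidate location that most reduces the total
    distance of points to their nearest chosen medoid. `dist[p]` is a
    sparse dict {candidate: distance}; candidates and points may be
    disjoint sets (the asymmetric facility-location case)."""
    best = {p: UNCOVERED for p in points}
    chosen = []
    for _ in range(min(k, len(candidates))):
        def gain(c):
            return sum(max(best[p] - dist[p].get(c, UNCOVERED), 0.0)
                       for p in points)
        c = max((c for c in candidates if c not in chosen), key=gain)
        if gain(c) <= 0:
            break  # no remaining candidate improves coverage
        chosen.append(c)
        for p in points:
            best[p] = min(best[p], dist[p].get(c, UNCOVERED))
    return chosen

# Toy sparse instance: consumers a-c, possible facility sites X-Z.
dist = {"a": {"X": 1, "Y": 5}, "b": {"X": 2, "Y": 1}, "c": {"Y": 3, "Z": 1}}
print(greedy_init(dist, ["X", "Y", "Z"], ["a", "b", "c"], 2))  # ['Y', 'X']
```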
BanditPAM++: Faster k-medoids Clustering
Clustering is a fundamental task in data science with wide-ranging
applications. In k-medoids clustering, cluster centers must be actual
datapoints and arbitrary distance metrics may be used; these features allow for
greater interpretability of the cluster centers and the clustering of exotic
objects, respectively. k-medoids clustering has recently grown in popularity
due to the discovery of more efficient k-medoids algorithms. In particular,
recent research has proposed BanditPAM, a randomized k-medoids algorithm with
state-of-the-art complexity and clustering accuracy. In this paper, we present
BanditPAM++, which accelerates BanditPAM via two algorithmic improvements, and
is faster than BanditPAM in complexity and substantially faster than BanditPAM
in wall-clock runtime. First, we demonstrate that BanditPAM has a special
structure that allows the reuse of clustering information within each
iteration. Second, we demonstrate that BanditPAM has additional structure that
permits the reuse of information across different iterations. These
observations inspire our proposed algorithm, BanditPAM++, which returns the
same clustering solutions as BanditPAM but is often several times faster. For
example, on the CIFAR10 dataset, BanditPAM++ returns the same results as
BanditPAM but runs over 10× faster. Finally, we provide a high-performance C++
implementation of BanditPAM++, callable from Python and R, that may be of
interest to practitioners, at https://github.com/motiwari/BanditPAM. Auxiliary
code to reproduce all of our experiments via a one-line script is available at
https://github.com/ThrunGroup/BanditPAM_plusplus_experiments.
Comment: NeurIPS 202
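The sampling idea underlying randomized k-medoids algorithms like BanditPAM can be shown in miniature: estimate a candidate's average distance from a random sample of points instead of scanning all n of them. This is an illustrative sketch only; BanditPAM itself additionally maintains confidence intervals and adaptively eliminates candidates, and all names and data below are assumptions:

```python
import random

def estimated_loss(D, candidate, points, sample_size, rng):
    """Estimate the average distance from `candidate` to the data by
    sampling reference points rather than scanning all of them."""
    sample = rng.sample(points, min(sample_size, len(points)))
    return sum(D[candidate][p] for p in sample) / len(sample)

rng = random.Random(0)
D = [[0, 1, 4],   # hypothetical distance matrix; point 1 is central
     [1, 0, 2],
     [4, 2, 0]]
points = [0, 1, 2]
best = min(points, key=lambda c: estimated_loss(D, c, points, 3, rng))
print(best)  # 1: the central point has the lowest estimated average distance
```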