Ward's Hierarchical Clustering Method: Clustering Criterion and Agglomerative Algorithm
The Ward error sum of squares hierarchical clustering method has been very
widely used since its first description by Ward in a 1963 publication. It has
also been generalized in various ways. However, there are different
interpretations in the literature and there are different implementations of
the Ward agglomerative algorithm in commonly used software systems, including
differing expressions of the agglomerative criterion. Our survey work and case
studies will be useful for all those involved in developing software for data
analysis using Ward's hierarchical clustering method.
Comment: 20 pages, 21 citations, 4 figures
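As an illustration of why the expression of the criterion matters, here is a minimal sketch of one commonly stated form of the Ward merge cost: the increase in the error sum of squares when two clusters are joined. The function names (`centroid`, `sse`, `ward_cost`) are hypothetical, and this is not necessarily the exact expression the survey settles on.

```python
def centroid(points):
    """Coordinate-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def sse(points):
    """Error sum of squares: squared Euclidean distances to the centroid."""
    c = centroid(points)
    return sum(sum((x - m) ** 2 for x, m in zip(p, c)) for p in points)


def ward_cost(a, b):
    """Increase in SSE caused by merging clusters a and b.

    Closed form: |a||b| / (|a| + |b|) * ||centroid(a) - centroid(b)||^2.
    """
    ca, cb = centroid(a), centroid(b)
    d2 = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * d2


a = [(0.0, 0.0), (2.0, 0.0)]
b = [(5.0, 0.0)]
# The closed form agrees with the directly computed SSE increase.
direct = sse(a + b) - sse(a) - sse(b)
closed = ward_cost(a, b)
```

Implementations that work with distances rather than squared distances, or that drop the size-dependent factor, produce different merge heights, which is one way the variants mentioned above can diverge.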
Recognizing Treelike k-Dissimilarities
A k-dissimilarity D on a finite set X, |X| >= k, is a map from the set of
size k subsets of X to the real numbers. Such maps naturally arise from
edge-weighted trees T with leaf-set X: Given a subset Y of X of size k, D(Y) is
defined to be the total length of the smallest subtree of T with leaf-set Y.
In case k = 2, it is well-known that 2-dissimilarities arising in this way can
be characterized by the so-called "4-point condition". However, in case k > 2
Pachter and Speyer recently posed the following question: Given an arbitrary
k-dissimilarity, how do we test whether this map comes from a tree? In this
paper, we provide an answer to this question, showing that for k >= 3 a
k-dissimilarity on a set X arises from a tree if and only if its restriction to
every 2k-element subset of X arises from some tree, and that 2k is the least
possible subset size to ensure that this is the case. As a corollary, we show
that there exists a polynomial-time algorithm to determine when a
k-dissimilarity arises from a tree. We also give a 6-point condition for
determining when a 3-dissimilarity arises from a tree, that is similar to the
aforementioned 4-point condition.
Comment: 18 pages, 4 figures
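For the k = 2 case, the 4-point condition can be sketched as follows: for every four elements, the two largest of the three pairwise sums must be equal. This is a minimal sketch that checks only quadruples of distinct elements and assumes the input is already a symmetric metric; the helper names (`symmetric`, `satisfies_four_point`) are hypothetical, and the sketch does not cover the 6-point or 2k-subset tests from the paper.

```python
from itertools import combinations


def symmetric(pairs):
    """Build a nested dict d[u][v] == d[v][u] from {(u, v): weight}."""
    d = {}
    for (u, v), w in pairs.items():
        d.setdefault(u, {})[v] = w
        d.setdefault(v, {})[u] = w
    return d


def satisfies_four_point(d, points, tol=1e-9):
    """Return True if every quadruple of distinct points passes the test."""
    for w, x, y, z in combinations(points, 4):
        sums = sorted([
            d[w][x] + d[y][z],
            d[w][y] + d[x][z],
            d[w][z] + d[x][y],
        ])
        # The two largest sums must coincide (up to tolerance).
        if sums[2] - sums[1] > tol:
            return False
    return True


# Leaf distances of a quartet tree ab|cd with all edge lengths equal to 1.
tree_like = symmetric({
    ("a", "b"): 2, ("c", "d"): 2,
    ("a", "c"): 3, ("a", "d"): 3, ("b", "c"): 3, ("b", "d"): 3,
})
# Perturbing a single distance breaks the condition.
not_tree_like = symmetric({
    ("a", "b"): 2, ("c", "d"): 2,
    ("a", "c"): 4, ("a", "d"): 3, ("b", "c"): 3, ("b", "d"): 3,
})
```

Checking all quadruples takes time polynomial in |X|, which is the kind of local test the abstract generalizes to subsets of size 2k.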
Faster k-Medoids Clustering: Improving the PAM, CLARA, and CLARANS Algorithms
Clustering non-Euclidean data is difficult, and one of the most used
algorithms besides hierarchical clustering is the popular algorithm
Partitioning Around Medoids (PAM), also simply referred to as k-medoids. In
Euclidean geometry the mean, as used in k-means, is a good estimator for the
cluster center, but this does not hold for arbitrary dissimilarities. PAM uses
the medoid instead, the object with the smallest dissimilarity to all others in
the cluster. This notion of centrality can be used with any (dis-)similarity,
and thus is of high relevance to many domains such as biology that require the
use of Jaccard, Gower, or more complex distances.
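The medoid definition above can be sketched in a few lines. This is an illustrative example only: the function names are hypothetical, and Jaccard distance on sets is used simply because it is one of the dissimilarities the abstract mentions.

```python
def jaccard_distance(a, b):
    """Jaccard distance between two sets: 1 - |a & b| / |a | b|."""
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return 1.0 - len(a & b) / len(union)


def medoid(cluster, dist):
    """Object of the cluster with the smallest total dissimilarity to the rest."""
    return min(cluster, key=lambda c: sum(dist(c, o) for o in cluster))


cluster = [frozenset({1, 2}), frozenset({1, 2, 3}), frozenset({2, 3, 4})]
center = medoid(cluster, jaccard_distance)
```

Unlike a mean, the result is always one of the input objects, so only pairwise dissimilarities are needed.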
A key issue with PAM is its high run time cost. We propose modifications to
the PAM algorithm that achieve an O(k)-fold speedup in the second (SWAP) phase of
the algorithm while still finding the same results as the original PAM
algorithm. If we slightly relax the choice of swaps performed (at comparable
quality), we can further accelerate the algorithm by performing up to k swaps
in each iteration. With the substantially faster SWAP, we can now also explore
alternative strategies for choosing the initial medoids. We also show how the
CLARA and CLARANS algorithms benefit from these modifications. They can easily be
combined with earlier approaches to use PAM and CLARA on big data (some of
which use PAM as a subroutine, hence can immediately benefit from these
improvements), where the performance with high k becomes increasingly
important.
In experiments on real data with k=100, we observed a 200-fold speedup
compared to the original PAM SWAP algorithm, making PAM applicable to larger
data sets as long as we can afford to compute a distance matrix, and in
particular to higher k (at k=2, the new SWAP was only 1.5 times faster, as the
speedup is expected to increase with k).
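For context, a naive baseline of the SWAP phase can be sketched as below. This is not the accelerated variant proposed in the paper: it recomputes the total deviation from scratch for every (medoid, non-medoid) candidate and applies the single best improving swap per iteration. The function names are hypothetical, and `dist` is assumed to be a precomputed square distance matrix given as a list of lists.

```python
def total_deviation(dist, medoids):
    """Sum over all points of the distance to their nearest medoid."""
    return sum(min(row[m] for m in medoids) for row in dist)


def pam_swap(dist, medoids):
    """Apply best-improvement swaps until no swap lowers the total deviation."""
    medoids = list(medoids)
    best_td = total_deviation(dist, medoids)
    while True:
        best_swap = None
        for i in range(len(medoids)):
            for x in range(len(dist)):
                if x in medoids:
                    continue
                # Replace the i-th medoid by the non-medoid x and re-evaluate.
                candidate = medoids[:i] + [x] + medoids[i + 1:]
                td = total_deviation(dist, candidate)
                if td < best_td:
                    best_td, best_swap = td, candidate
        if best_swap is None:
            return medoids, best_td
        medoids = best_swap


# Six points on a line forming two obvious groups; start from a poor choice.
values = [0, 1, 2, 10, 11, 12]
dist = [[abs(u - v) for v in values] for u in values]
final_medoids, final_td = pam_swap(dist, [0, 1])
```

Every candidate evaluation here scans all points and all medoids, which is the redundant work a faster SWAP would aim to avoid.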
Interactive genetic algorithm for user-centered design of distributed conservation practices in a watershed: An examination of user preferences in objective space and user behavior
Interactive Genetic Algorithms (IGA) are advanced human-in-the-loop optimization methods that enable humans to give feedback, based on their subjective and unquantified preferences and knowledge, during the algorithm's search process. While these methods are gaining popularity in multiple fields, there is a critical lack of data and analyses on (a) the nature of interactions of different humans with interfaces of decision support systems (DSS) that employ IGA in water resources planning problems and on (b) the effect of human feedback on the algorithm's ability to search for design alternatives desirable to end-users. In this paper, we present results and analyses of observational experiments in which different human participants (surrogates and stakeholders) interacted with an IGA-based, watershed DSS called WRESTORE to identify plans of conservation practices in a watershed. The main goal of this paper is to evaluate how the IGA adapts its search process in the objective space to a user's feedback, and identify whether any similarities exist in the objective space of plans found by different participants. Some participants focused on the entire watershed, while others focused only on specific local subbasins. Additionally, two different hydrology models were used to identify any potential differences in interactive search outcomes that could arise from differences in the numerical values of benefits displayed to participants. Results indicate that stakeholders, in comparison to their surrogates, were more likely to use multiple features of the DSS interface to collect information before giving feedback, and dissimilarities existed among participants in the objective space of design alternatives