Ward's Hierarchical Clustering Method: Clustering Criterion and Agglomerative Algorithm

Author: AK JAIN
B ROUX LE
D WISHART
F MURTAGH
Fionn Murtagh
GJ SZÉKELY
GN LANCE
JC GOWER
JH WARD
JP BENZÉCRI
L KAUFMAN
L ORLÓCI
M BRUYNOOGHE
M JAMBU
MR ANDERBERG
P LEGENDRE
Pierre Legendre
RA FISHER
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 11/12/2011

The Ward error sum of squares hierarchical clustering method has been very widely used since its first description by Ward in a 1963 publication. It has also been generalized in various ways. However, there are different interpretations in the literature, and there are different implementations of the Ward agglomerative algorithm in commonly used software systems, including differing expressions of the agglomerative criterion. Our survey work and case studies will be useful for all those involved in developing software for data analysis using Ward's hierarchical clustering method.
Comment: 20 pages, 21 citations, 4 figures
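As an illustration of what an "expression of the agglomerative criterion" can look like, the sketch below uses one commonly cited form: the cost of merging clusters A and B is |A||B| / (|A| + |B|) times the squared Euclidean distance between their centroids, which equals the increase in the error sum of squares caused by the merge. The abstract does not state which form the paper adopts, so this choice, the function names (`centroid`, `ess`, `ward_cost`), and the sample points are assumptions made for illustration only.

```python
# Minimal sketch (assumed form of the criterion, not taken from the paper):
# Ward merge cost = increase in the error sum of squares (ESS) after merging.

def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(points)
    dim = len(points[0])
    return [sum(p[d] for p in points) / n for d in range(dim)]

def ess(points):
    """Error sum of squares: squared distances of points to their centroid."""
    c = centroid(points)
    return sum(sum((x - m) ** 2 for x, m in zip(p, c)) for p in points)

def ward_cost(a, b):
    """Closed-form merge cost: |a||b| / (|a| + |b|) * ||c_a - c_b||^2."""
    ca, cb = centroid(a), centroid(b)
    sq_dist = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * sq_dist

# Hypothetical sample clusters in the plane.
cluster_a = [[0.0, 0.0], [2.0, 0.0]]
cluster_b = [[5.0, 4.0]]

# The direct ESS increase and the closed form should agree.
direct_increase = ess(cluster_a + cluster_b) - ess(cluster_a) - ess(cluster_b)
closed_form = ward_cost(cluster_a, cluster_b)
print(direct_increase, closed_form)
```

Whether a given software package feeds distances or squared distances into its update formula is exactly the kind of difference the abstract alludes to, so the package documentation should be checked before comparing outputs.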

arXiv.org e-Print Archive

Goldsmiths Research Online

Crossref

De Montfort University Open Research Archive

Recognizing Treelike k-Dissimilarities

Author: A Schrijver
AD Gordon
Andreas Spillner
AWM Dress
C Bocci
C Hayashi
D Levy
DP Faith
E Rubei
G Soete de
H-J Bandelt
J Culberson
J Felsenstein
K Zaretsky
Katharina T. Huber
L Pachter
M Steel
M-M Deza
MJ Warrens
N Grishin
S Joly
Sven Herrmann
Vincent Moulton
WJ Heiser
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2012

A k-dissimilarity D on a finite set X, |X| >= k, is a map from the set of size-k subsets of X to the real numbers. Such maps naturally arise from edge-weighted trees T with leaf-set X: given a subset Y of X of size k, D(Y) is defined to be the total length of the smallest subtree of T with leaf-set Y. In case k = 2, it is well known that 2-dissimilarities arising in this way can be characterized by the so-called "4-point condition". However, in case k > 2, Pachter and Speyer recently posed the following question: given an arbitrary k-dissimilarity, how do we test whether this map comes from a tree? In this paper, we provide an answer to this question, showing that for k >= 3 a k-dissimilarity on a set X arises from a tree if and only if its restriction to every 2k-element subset of X arises from some tree, and that 2k is the least possible subset size to ensure that this is the case. As a corollary, we show that there exists a polynomial-time algorithm to determine when a k-dissimilarity arises from a tree. We also give a 6-point condition for determining when a 3-dissimilarity arises from a tree, which is similar to the aforementioned 4-point condition.
Comment: 18 pages, 4 figures
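For the k = 2 case mentioned above, the 4-point condition can be checked by brute force. The sketch below assumes the usual statement of the condition: for any four (not necessarily distinct) elements x, y, z, w, the two largest of the three sums d(x,y) + d(z,w), d(x,z) + d(y,w), d(x,w) + d(y,z) are equal. It also assumes non-negative edge weights. The function names and both example dissimilarities are hypothetical, and the 6-point condition for k = 3 from the paper is not reproduced here.

```python
# Minimal sketch of a brute-force 4-point condition test for 2-dissimilarities.
from itertools import combinations_with_replacement

def dist(d, x, y):
    """Look up a symmetric dissimilarity stored under frozenset keys."""
    return 0.0 if x == y else d[frozenset((x, y))]

def satisfies_four_point(points, d, tol=1e-9):
    """True if, for every quadruple, the two largest pairing sums coincide."""
    for x, y, z, w in combinations_with_replacement(points, 4):
        sums = sorted([
            dist(d, x, y) + dist(d, z, w),
            dist(d, x, z) + dist(d, y, w),
            dist(d, x, w) + dist(d, y, z),
        ])
        if sums[2] - sums[1] > tol:
            return False
    return True

leaves = ["a", "b", "c", "d"]

# Hypothetical tree: a, b hang off node u (lengths 1, 2); c, d hang off
# node v (lengths 3, 4); the internal edge u-v has length 5.
tree_metric = {
    frozenset(("a", "b")): 3.0, frozenset(("c", "d")): 7.0,
    frozenset(("a", "c")): 9.0, frozenset(("a", "d")): 10.0,
    frozenset(("b", "c")): 10.0, frozenset(("b", "d")): 11.0,
}

# Shortest-path metric of a 4-cycle a-b-c-d-a with unit edges: not treelike.
cycle_metric = {
    frozenset(("a", "b")): 1.0, frozenset(("b", "c")): 1.0,
    frozenset(("c", "d")): 1.0, frozenset(("a", "d")): 1.0,
    frozenset(("a", "c")): 2.0, frozenset(("b", "d")): 2.0,
}

print(satisfies_four_point(leaves, tree_metric))
print(satisfies_four_point(leaves, cycle_metric))
```

Allowing repeated elements in the quadruple means the check also covers the triangle inequality; the loop visits on the order of |X|^4 quadruples, so it is only practical for small sets.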

arXiv.org e-Print Archive

Crossref

University of East Anglia digital repository

Faster k-Medoids Clustering: Improving the PAM, CLARA, and CLARANS Algorithms

Author: AP Reynolds
C Lucasius
E Schubert
H Bock
H Kriegel
H Park
Leonard Kaufman
ML Overton
RT Ng
V Estivill-Castro
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 29/10/2019

Clustering non-Euclidean data is difficult, and one of the most used algorithms besides hierarchical clustering is the popular algorithm Partitioning Around Medoids (PAM), also simply referred to as k-medoids. In Euclidean geometry the mean (as used in k-means) is a good estimator for the cluster center, but this does not hold for arbitrary dissimilarities. PAM uses the medoid instead, the object with the smallest dissimilarity to all others in the cluster. This notion of centrality can be used with any (dis-)similarity, and thus is of high relevance to many domains such as biology that require the use of Jaccard, Gower, or more complex distances. A key issue with PAM is its high run time cost. We propose modifications to the PAM algorithm that achieve an O(k)-fold speedup in the second (SWAP) phase of the algorithm while still finding the same results as the original PAM algorithm. If we slightly relax the choice of swaps performed (at comparable quality), we can further accelerate the algorithm by performing up to k swaps in each iteration. With the substantially faster SWAP, we can now also explore alternative strategies for choosing the initial medoids. We also show how the CLARA and CLARANS algorithms benefit from these modifications. The approach can easily be combined with earlier approaches to use PAM and CLARA on big data (some of which use PAM as a subroutine, hence can immediately benefit from these improvements), where the performance with high k becomes increasingly important. In experiments on real data with k=100, we observed a 200-fold speedup compared to the original PAM SWAP algorithm, making PAM applicable to larger data sets as long as we can afford to compute a distance matrix, and in particular to higher k (at k=2, the new SWAP was only 1.5 times faster, as the speedup is expected to increase with k).
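To make the medoid definition and the SWAP phase concrete, the sketch below computes a medoid from a precomputed dissimilarity matrix and evaluates every single (medoid, non-medoid) exchange the naive way, by recomputing the total deviation each time. This is only the unaccelerated baseline idea, not the faster algorithm the abstract proposes; the function names and the one-dimensional sample data are assumptions made for illustration.

```python
# Minimal sketch of the medoid definition and a naive PAM-style swap search.
# It recomputes the full cost per candidate swap, so it is deliberately slow.

def medoid(members, dist):
    """Index in `members` with the smallest summed dissimilarity to the rest."""
    return min(members, key=lambda i: sum(dist[i][j] for j in members))

def total_deviation(medoids, dist):
    """Sum over all objects of the dissimilarity to their nearest medoid."""
    return sum(min(dist[i][m] for m in medoids) for i in range(len(dist)))

def best_swap(medoids, dist):
    """Return (cost, medoid_out, candidate_in) for the best single swap.

    medoid_out and candidate_in are None if no swap lowers the cost.
    """
    best = (total_deviation(medoids, dist), None, None)
    for m in medoids:
        for c in range(len(dist)):
            if c in medoids:
                continue
            candidate = [c if x == m else x for x in medoids]
            cost = total_deviation(candidate, dist)
            if cost < best[0]:
                best = (cost, m, c)
    return best

# Hypothetical data: six points on a line, dissimilarity = absolute difference.
values = [0, 1, 2, 10, 11, 12]
dist = [[abs(a - b) for b in values] for a in values]

print(medoid([0, 1, 2], dist))   # medoid of the left group (an index)
print(best_swap([0, 1], dist))   # best single swap from a poor start
```

Any dissimilarity can be plugged in by building `dist` differently, which is the property the abstract highlights for Jaccard or Gower distances.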

arXiv.org e-Print Archive

Crossref

Interactive genetic algorithm for user-centered design of distributed conservation practices in a watershed: An examination of user preferences in objective space and user behavior

Author: Babbar-Sebens Meghna
Kleinberg Austin
Mukhopadhyay Snehasis
Piemonti Adriana Debora
Publication venue: 'Wiley'
Publication date: 01/05/2017

Interactive Genetic Algorithms (IGA) are advanced human-in-the-loop optimization methods that enable humans to give feedback, based on their subjective and unquantified preferences and knowledge, during the algorithm's search process. While these methods are gaining popularity in multiple fields, there is a critical lack of data and analyses on (a) the nature of interactions of different humans with interfaces of decision support systems (DSS) that employ IGA in water resources planning problems and (b) the effect of human feedback on the algorithm's ability to search for design alternatives desirable to end-users. In this paper, we present results and analyses of observational experiments in which different human participants (surrogates and stakeholders) interacted with an IGA-based watershed DSS called WRESTORE to identify plans of conservation practices in a watershed. The main goal of this paper is to evaluate how the IGA adapts its search process in the objective space to a user's feedback, and to identify whether any similarities exist in the objective space of plans found by different participants. Some participants focused on the entire watershed, while others focused only on specific local subbasins. Additionally, two different hydrology models were used to identify any potential differences in interactive search outcomes that could arise from differences in the numerical values of benefits displayed to participants. Results indicate that stakeholders, in comparison to their surrogates, were more likely to use multiple features of the DSS interface to collect information before giving feedback, and that dissimilarities existed among participants in the objective space of design alternatives.

IUPUIScholarWorks