1,426 research outputs found

    Recovering the number of clusters in data sets with noise features using feature rescaling factors

    In this paper we introduce three methods for re-scaling data sets, aiming to improve the likelihood that clustering validity indexes return the true number of spherical Gaussian clusters in the presence of additional noise features. Our methods obtain feature re-scaling factors by taking into account the structure of a given data set and the intuitive idea that different features may have different degrees of relevance in different clusters. We experiment with the Silhouette (using the squared Euclidean, Manhattan, and pth power of the Minkowski distance), Dunn's, Calinski–Harabasz and Hartigan indexes on data sets containing spherical Gaussian clusters with and without noise features. We conclude that our methods indeed increase the chances of estimating the true number of clusters in a data set. Peer reviewed.
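    The sketch below illustrates the general recipe the abstract describes, not the authors' actual rescaling factors: a hypothetical dispersion-based rescaling rule is applied first, and the Silhouette index is then scanned over candidate numbers of clusters. The rescaling rule, the preliminary k-means step and all parameter names are illustrative assumptions.

```python
# Minimal sketch (not the paper's method): rescale features so that those
# carrying little cluster structure are shrunk, then pick the k that
# maximises the Silhouette index on the rescaled data.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def rescale(X, k_init=3):
    """Hypothetical rescaling rule based on a preliminary k-means partition."""
    X = np.asarray(X, dtype=float)
    labels = KMeans(n_clusters=k_init, n_init=10, random_state=0).fit_predict(X)
    total = X.var(axis=0) + 1e-12            # total variance per feature
    within = np.zeros(X.shape[1])
    for c in np.unique(labels):
        Xc = X[labels == c]
        within += Xc.shape[0] * Xc.var(axis=0)
    within /= X.shape[0]                      # pooled within-cluster variance
    return X * (1.0 - within / total)         # factor close to 0 for pure-noise features

def estimate_k(X, k_range=range(2, 11)):
    """Return the k maximising the Silhouette index on the rescaled data."""
    Xr = rescale(X)
    best_k, best_score = None, -np.inf
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(Xr)
        score = silhouette_score(Xr, labels)  # change `metric` for Manhattan/Minkowski variants
        if score > best_score:
            best_k, best_score = k, score
    return best_k
```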

    A-Wardpβ: effective hierarchical clustering using the Minkowski metric and a fast k-means initialisation

    In this paper we make two novel contributions to hierarchical clustering. First, we introduce an anomalous pattern initialisation method for hierarchical clustering algorithms, called A-Ward, capable of substantially reducing the time they take to converge. This method generates an initial partition with a sufficiently large number of clusters, which allows the cluster merging process to start from this partition rather than from a trivial partition composed solely of singletons. Our second contribution is an extension of the Ward and Wardp algorithms to the situation where the feature weight exponent can differ from the exponent of the Minkowski distance. This new method, called A-Wardpβ, is able to generate a much wider variety of clustering solutions. We also demonstrate that its parameters can be estimated reasonably well by using a cluster validity index. We perform numerous experiments using data sets with two types of noise: the insertion of noise features and the blurring of within-cluster values of some features. These experiments allow us to conclude that: (i) our anomalous pattern initialisation method does indeed reduce the time a hierarchical clustering algorithm takes to complete, without negatively impacting its cluster recovery ability; (ii) A-Wardpβ provides better cluster recovery than both Ward and Wardp.
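    As a rough sketch of the two-stage idea (a non-trivial initial partition followed by Ward-style merging), the code below uses k-means as a stand-in for the paper's anomalous pattern initialisation and then runs Ward agglomeration on the resulting centroids. It ignores cluster sizes during merging, which the exact Ward criterion would take into account, and all parameter names are illustrative.

```python
# Sketch of "initial partition -> hierarchical merging"; k-means is only a
# stand-in for the paper's anomalous pattern initialisation.
import numpy as np
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import linkage, fcluster

def hierarchical_with_init(X, n_init_clusters=30, n_final_clusters=3):
    # 1) Initial partition with many small clusters.
    km = KMeans(n_clusters=n_init_clusters, n_init=10, random_state=0).fit(X)

    # 2) Ward merging starts from the centroids of that partition instead of
    #    from singletons, which is what makes the procedure fast.
    #    (A faithful Ward criterion would also weight centroids by cluster size.)
    Z = linkage(km.cluster_centers_, method="ward")
    centroid_labels = fcluster(Z, t=n_final_clusters, criterion="maxclust")

    # 3) Each point inherits the final cluster of its initial centroid.
    return centroid_labels[km.labels_]
```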

    The Minkowski central partition as a pointer to a suitable distance exponent and consensus partitioning

    The Minkowski weighted K-means (MWK-means) is a recently developed clustering algorithm capable of computing feature weights. The cluster-specific weights in MWK-means follow the intuitive idea that a feature with low variance should have a greater weight than a feature with high variance. The final clustering found by this algorithm depends on the selection of the Minkowski distance exponent. This paper explores the possibility of using the central Minkowski partition in the ensemble of all Minkowski partitions to select an optimal value of the Minkowski exponent. The central Minkowski partition also appears to be a good consensus partition. Furthermore, we discovered some striking correlations between the Minkowski profile, defined as a mapping of the Minkowski exponent values into the average similarity values of the optimal Minkowski partitions, and the Adjusted Rand Index vectors resulting from comparing the obtained partitions to the ground truth. Our findings were confirmed by a series of computational experiments involving synthetic Gaussian clusters and real-world data.
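    For reference, the cluster-specific weight update that gives MWK-means this behaviour has, up to notation, the form below; it is quoted from the earlier MWK-means literature rather than from this abstract, so treat it as background rather than as this paper's contribution.

```latex
% Within-cluster Minkowski dispersion of feature v in cluster S_k, and the
% corresponding feature weight (smaller dispersion => larger weight):
\[
  D_{kv} = \sum_{i \in S_k} \lvert x_{iv} - c_{kv} \rvert^{p},
  \qquad
  w_{kv} = \frac{1}{\sum_{u=1}^{V} \left( D_{kv} / D_{ku} \right)^{1/(p-1)}}.
\]
```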

    Clustering Partially Observed Graphs via Convex Optimization

    This paper considers the problem of clustering a partially observed unweighted graph, i.e., one where for some node pairs we know there is an edge between them, for some others we know there is no edge, and for the remaining pairs we do not know whether or not there is an edge. We want to organize the nodes into disjoint clusters so that there is relatively dense (observed) connectivity within clusters and sparse connectivity across clusters. We take a novel yet natural approach to this problem by focusing on finding the clustering that minimizes the number of "disagreements", i.e., the sum of the number of (observed) missing edges within clusters and (observed) present edges across clusters. Our algorithm uses convex optimization; its basis is a reduction of disagreement minimization to the problem of recovering an (unknown) low-rank matrix and an (unknown) sparse matrix from their partially observed sum. We evaluate the performance of our algorithm on the classical Planted Partition/Stochastic Block Model. Our main theorem provides sufficient conditions for the success of our algorithm as a function of the minimum cluster size, edge density and observation probability; in particular, the results characterize the tradeoff between the observation probability and the edge density gap. When there are a constant number of clusters of equal size, our results are optimal up to logarithmic factors. Comment: This is the final version published in the Journal of Machine Learning Research (JMLR). Partial results appeared in the International Conference on Machine Learning (ICML) 201
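    A minimal cvxpy sketch of the low-rank-plus-sparse decomposition idea described above follows; the paper's actual convex program, constraints and weighting differ, and the regularisation parameter and bounds used here are illustrative assumptions.

```python
# Illustrative low-rank + sparse recovery on the observed entries only
# (not the paper's exact formulation).
import numpy as np
import cvxpy as cp

def cluster_matrix(A_obs, mask, lam=0.1):
    """A_obs: observed adjacency values (0/1); mask: 1 where a pair is observed."""
    n = A_obs.shape[0]
    L = cp.Variable((n, n), symmetric=True)   # low-rank part ~ ideal cluster matrix
    S = cp.Variable((n, n), symmetric=True)   # sparse part ~ disagreements
    objective = cp.Minimize(cp.norm(L, "nuc") + lam * cp.sum(cp.abs(S)))
    constraints = [
        cp.multiply(mask, L + S) == cp.multiply(mask, A_obs),  # agree on observed pairs
        L >= 0, L <= 1,
    ]
    cp.Problem(objective, constraints).solve()
    return L.value  # round / threshold to read off the clusters
```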

    Self-similar aftershock rates

    In many important systems exhibiting crackling noise, that is, intermittent avalanche-like relaxation responses with power-law, and thus self-similar, distributions of event sizes, the "laws" for the rate of activity after large events are not consistent with the overall self-similar behavior expected on theoretical grounds. This is particularly true for seismicity, and a satisfying solution to this paradox has remained outstanding. Here, we propose a generalized description of aftershock rates which is both self-similar and consistent with all other known self-similar features. Comparing our theoretical predictions with high-resolution earthquake data from Southern California, we find excellent agreement, providing in particular clear evidence for a unified description of aftershocks and foreshocks. This may offer an improved way of performing time-dependent seismic hazard assessment and earthquake forecasting.
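    For context, the conventional aftershock-rate description the abstract refers to is the Omori-Utsu law, shown below; the paper proposes a self-similar generalisation of this form, which is not reproduced here.

```latex
% Omori-Utsu law: rate of aftershocks a time t after a mainshock, with
% productivity K, time offset c and decay exponent p (typically close to 1).
\[
  n(t) = \frac{K}{(c + t)^{p}}.
\]
```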

    One-point fluctuation analysis of the high-energy neutrino sky

    We perform the first one-point fluctuation analysis of the high-energy neutrino sky. This method proves especially well suited to contemporary neutrino data, as it allows the properties of the astrophysical components of the high-energy flux detected by the IceCube telescope to be studied even with low statistics and in the absence of point-source detections. Besides the veto-passing atmospheric foregrounds, we adopt a simple model of the high-energy neutrino background by assuming two main extra-galactic components: star-forming galaxies and blazars. By leveraging multi-wavelength data from Herschel and Fermi, we predict the spectral and anisotropic probability distributions for their expected neutrino counts in IceCube. We find that star-forming galaxies are likely to remain a diffuse background due to the poor angular resolution of IceCube, and we determine an upper limit on the number of shower events that can reasonably be associated with blazars. We also find that upper limits on the contribution of blazars to the measured flux are unfavourably affected by the skewness of the blazar flux distribution. One-point event clustering and likelihood analyses of the IceCube HESE data suggest that this method has the potential to dramatically improve over more conventional model-based analyses, especially for the next generation of neutrino telescopes. Comment: 41 pages, 6 figures, 2 tables; different blazar model than v1 but same result
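    The toy sketch below shows what a one-point ("P(k)") analysis measures in its simplest form: the histogram of event counts per sky pixel, compared against the Poisson distribution expected from a purely diffuse component. The pixelisation, the Poisson baseline and the function names are illustrative assumptions; the paper's likelihood and detector modelling are far more detailed.

```python
# Toy one-point fluctuation analysis: empirical P(k) of events per sky pixel
# versus a purely diffuse (Poisson) expectation. Illustrative only.
import numpy as np
from scipy.stats import poisson

def one_point_histogram(event_pixels, n_pixels):
    """event_pixels: pixel index of each detected event (0..n_pixels-1)."""
    counts = np.bincount(event_pixels, minlength=n_pixels)
    k = np.arange(counts.max() + 1)
    pk = np.bincount(counts, minlength=k.size) / n_pixels
    return k, pk  # fraction of pixels containing exactly k events

def diffuse_pk(mean_counts_per_pixel, k):
    # An isotropic diffuse component gives a Poisson one-point distribution;
    # clustered point sources (e.g. blazars) skew P(k) towards high counts.
    return poisson.pmf(k, mean_counts_per_pixel)
```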