9 research outputs found
On clustering stability
JEL Classification: C100; C150; C380
This work is dedicated to the evaluation of the stability of clustering solutions, namely
the stability of crisp clusterings or partitions. We specifically refer to stability as the
concordance of clusterings across several samples. In order to evaluate stability, we use
a weighted cross-validation procedure, the result of which is summarized by simple and
paired agreement index values. To remove the share of these values that is due to
agreement by chance, we propose a new method, IADJUST, that resorts to simulated
cross-classification tables. This contribution makes it feasible to correct any index of
agreement.
Experiments on stability rely on 540 simulated data sets, design factors being the
number of clusters, their balance and overlap. Six real data sets with a priori known clusters
are also considered. The experiments illustrate the precision and
pertinence of the IADJUST procedure and reveal the distribution of the indices
under the hypothesis of agreement by chance. We therefore recommend that the use of
adjusted indices become common practice when addressing stability. We then compare the
stability of two clustering algorithms and conclude that Expectation-Maximization
(EM) results are more stable than K-means results on unbalanced data sets.
Finally, we explore the relationship between stability and external validity of a
clustering solution. When the results of all experimental scenarios are considered, there is a
strong correlation between stability and external validity. However, within a specific
experimental scenario (when a practical clustering task is considered), we find no
relationship between stability and agreement with ground truth.
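The idea of correcting an agreement index for chance by simulation can be sketched as follows. This is only an illustration under stated assumptions, not the authors' IADJUST procedure: the function names `rand_index` and `chance_adjusted` are hypothetical, chance agreement is simulated by shuffling one label vector (which keeps the margins of the cross-classification table fixed), and the index is assumed to have a maximum of 1.

```python
import random
from itertools import combinations


def rand_index(a, b):
    """Simple Rand index: share of object pairs on which two partitions agree."""
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)


def chance_adjusted(index, a, b, n_sim=500, seed=0):
    """Adjust `index` using its simulated mean under agreement by chance.

    Assumption: shuffling one label vector preserves both sets of cluster
    sizes while breaking any association between the two partitions.
    """
    rng = random.Random(seed)
    shuffled = list(b)
    simulated = []
    for _ in range(n_sim):
        rng.shuffle(shuffled)
        simulated.append(index(a, shuffled))
    expected = sum(simulated) / n_sim
    if expected == 1.0:
        # Degenerate case (e.g. a single cluster): no room for correction.
        return 0.0
    # Usual chance-correction form, assuming the index has maximum 1.
    return (index(a, b) - expected) / (1.0 - expected)


if __name__ == "__main__":
    first = [0, 0, 0, 1, 1, 1, 2, 2, 2]
    second = [0, 1, 2, 0, 1, 2, 0, 1, 2]
    print(rand_index(first, second), chance_adjusted(rand_index, first, second))
```

Because the index is passed in as a function, the same wrapper can in principle be applied to any agreement index computed from two label vectors, which mirrors the claim that any index can be corrected.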
Variability analysis of the hierarchical clustering algorithms and its implication on consensus clustering
Clustering is one of the most important unsupervised learning tools when no prior knowledge about the data set is available. Clustering algorithms aim to find the underlying structure of a data set, taking into account clustering criteria, properties of the data and a specific way of comparing data. Many clustering algorithms have been proposed in the literature with a common goal: given a set of objects, group similar objects in the same cluster and dissimilar objects in different clusters.
Hierarchical clustering algorithms are of great importance in data analysis because they provide knowledge about the data structure. The graphical representation of the resulting partitions through a dendrogram may give more information than the clustering obtained by non-hierarchical algorithms. Using different clustering methods on the same data set, or the same method with different initializations (different parameters), can produce different clusterings. Several studies have therefore been concerned with validating the resulting clusterings by analyzing their stability or variability, and there has been increasing interest in the problem of determining a consensus clustering.
This work empirically analyzes the clustering variability delivered by hierarchical algorithms and investigates some consensus clustering techniques. Based on the variability of the hierarchical clusterings, we select the most suitable consensus clustering technique from those in the literature. Results on a range of synthetic and real data sets reveal significant differences in the variability of hierarchical clusterings as well as different performances of the consensus clustering techniques.
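One widely used family of consensus techniques combines base clusterings through a co-association matrix. The abstract does not name the techniques it compares, so the following is only a minimal sketch of that general idea; the function names and the 0.5 threshold are assumptions made for illustration.

```python
def co_association(partitions):
    """Fraction of base partitions in which each pair of objects shares a cluster."""
    n = len(partitions[0])
    counts = [[0] * n for _ in range(n)]
    for labels in partitions:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[c / len(partitions) for c in row] for row in counts]


def consensus(partitions, threshold=0.5):
    """Consensus labels: connected components of the graph that links two
    objects when their co-association exceeds `threshold` (equivalent to
    cutting a single-link hierarchy at that level)."""
    matrix = co_association(partitions)
    n = len(matrix)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and matrix[i][j] > threshold:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels


if __name__ == "__main__":
    base = [[0, 0, 1, 1, 2], [0, 0, 1, 1, 1], [0, 0, 0, 1, 1]]
    print(consensus(base))
```

In this toy input the three base partitions disagree about the middle and last objects, and the consensus keeps only the groupings supported by a majority of them.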
Two Years of Aerosol Properties and Direct Radiative Effects Measured at a Representative Southeastern U.S. Site
The southeastern U.S. is one of only a small number of regions worldwide that have not exhibited warming over the past century. Recent studies (Goldstein et al., 2009) show that negative aerosol direct radiative effects are consistent with a warm-season regional cooling effect linked to secondary organic aerosol loading. Two years of NOAA-ESRL supported aerosol measurements made at the Appalachian Atmospheric Interdisciplinary Research (AppalAIR) facility at Appalachian State University (36.214 N, 81.693 W, 1080 m ASL) are presented, along with satellite-based measurements (MODIS-Aqua) of aerosol optical depth, cloud fraction, and surface albedo. Aerosol optical property statistics are placed in the context of those made at other U.S. ESRL stations. Direct aerosol radiative effect calculations reveal high seasonal variability, with negative broadband summer forcing values of approximately -10 W/m² (-4 W/m²) when actual (standard) cloud fraction, surface albedo, and single-scattering albedo values are used. Hierarchical cluster analyses were used to broadly classify the aerosol source types that influence the southeastern U.S. aerosol optical properties. Recently added aerosol hygroscopic growth measurements (a sample of which is presented) will facilitate improved aerosol source type classification and aerosol light scattering humidity dependence scaling of direct radiative effect calculations.
Contribution to the knowledge of hierarchical clustering algorithms and consensus clustering. Studies applied to personal recognition by hands biometrics
In exploratory data analysis, hierarchical clustering algorithms, with their different features, can provide different clusterings when applied to the same data set. When several clusterings are available, each one identifying a specific data structure, consensus clustering offers a way to deal with this issue.
The work reported here consists of two parts:
In the first part, we explore the profile of the base hierarchical clusterings, according to their variability, in order to obtain the consensus clustering. As a first result, we identify which consensus clustering technique performs better than the others, depending on the characteristics of the hierarchical clusterings used as a base. This result allows us to identify a sufficient condition for the existence of a consensus clustering and to define a new strategy for evaluating it. It also leads to the study of a new property of hierarchical clustering algorithms.
In the second part, we explore a real-world application. In a first analysis, we use data sets derived from biometric features extracted from hands for personal recognition. We show that the hierarchical clusterings obtained by the SEP/COP algorithms can provide highly accurate results when applied to these data sets. Furthermore, we found a recognition rate that increased to 100%, compared with the rates found in the literature. In a second analysis, we consider the application of consensus clustering techniques to the problem of identifying a person's relatives from hand biometrics. The results indicate that a photograph of the hand carries information that allows the identification of a person's family members. According to our data, however, the results were not very positive (we observed a probability of 95% for the parents, and 94% for a sibling, of being in the half of the most similar hands), which we believe is due to the poor quality of the photographs we used. Nevertheless, the results indicate that the technique has potential: if the photographs are collected using a scanner with fixed pins, the hand may be an interesting alternative for identifying the parentage of missing children when consensus clustering is applied.
An Evaluation of Organization Methods for Data Types Commonly Used in the Geographic Domain
This dissertation designed and implemented approaches to assess the suitability of commonly used unsupervised and supervised grouping methods on data types commonly used in the geographic domain. Four different types of data have been indexed for organization: a full-text data set depicting 30 years of cartographic literature, a raster data set consisting of physiographic characteristics of the U.S., a suite of GIS software commands used in hydrologic analysis, and a catalog of cartographic generalization algorithms. Various clustering and classification methods from the field of statistics and machine learning were evaluated for organizing these different data types. By systematically applying all types of data organization to each type of indexed data, this research addresses the question of whether certain indexing strategies influence the effectiveness of the organization methods. Depending on the data set and the indexing method applied, some clustering and classification methods performed better than others.
The experiments of this dissertation demonstrate that, through the systematic evaluation and validation of clustering and classification results, recommendations for organizing data can be formulated based on cluster and classification indices. Furthermore, through systematic evaluation and application of the six clustering and classification methods, it is possible to match the indexing strategy and the organization method for each of the four data sets used in this dissertation.
Who dies where, when and why? Modelling determinants and space-time risk of infant, child and adult mortality in rural South Africa, 1992-2008
Ph.D., University of the Witwatersrand, Faculty of Health Sciences, 201
The anonymous 1821 translation of Goethe's Faust :a cluster analytic approach
PhD Thesis
This study tests the hypothesis proposed by Frederick Burwick and James McKusick in
2007 that Samuel Taylor Coleridge was the author of the anonymous translation of
Goethe's Faust published by Thomas Boosey in 1821. The approach to hypothesis testing
is stylometric. Specifically, function word usage is selected as the stylometric criterion,
and 80 function words are used to define an 80-dimensional function word frequency
profile vector for each text in the corpus of Coleridge's literary works and for a selection
of works by a range of contemporary English authors. Each profile vector is a point in
80-dimensional vector space, and cluster analytic methods are used to determine the
distribution of profile vectors in the space. If the hypothesis being tested is valid, then the
profile for the 1821 translation should be closer in the space to works known to be by
Coleridge than to works by the other authors. The cluster analytic results show, however,
that this is not the case, and the conclusion is that the Burwick and McKusick hypothesis
is falsified relative to the stylometric criterion and analytic methodology used
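The profile-vector idea described above can be sketched in a few lines. This is an illustrative sketch only: the eight-word list stands in for the thesis's 80 function words, which the abstract does not enumerate, and the tokenizer, the use of relative frequencies, the Euclidean distance and the function names are all assumptions rather than the study's actual methodology.

```python
import math
import re

# Hypothetical short list; the study uses 80 function words.
FUNCTION_WORDS = ["the", "and", "of", "to", "in", "that", "but", "with"]


def profile(text, words=FUNCTION_WORDS):
    """Relative frequency of each function word: one coordinate per word."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total = len(tokens) or 1  # avoid division by zero on empty text
    return [tokens.count(w) / total for w in words]


def distance(p, q):
    """Euclidean distance between two profile vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))


def closest(target, candidates):
    """Name of the candidate text whose profile lies nearest to the target."""
    target_profile = profile(target)
    return min(candidates,
               key=lambda name: distance(target_profile, profile(candidates[name])))


if __name__ == "__main__":
    texts = {
        "A": "the sea and the sky and the shore",
        "B": "to go in haste but to return with care",
    }
    print(closest("the wind and the rain", texts))
```

Under the hypothesis tested in the thesis, the profile of the 1821 translation would be expected to fall nearer to Coleridge's known works than to those of the other authors; a full analysis would apply cluster analytic methods to all profile vectors rather than a single nearest-neighbour lookup.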
Cluster validation using information stability measures
In this work, a novel technique is presented for cluster validation based on cluster stability properties. The proposed stability index is based on the variation of some information measures over the partitions generated by a given clustering model, a variation caused by the variability in the clustering solutions produced by different sample sets. © 2009 Elsevier B.V. All rights reserved
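The abstract does not say which information measures the index uses. As a hedged illustration of how an information measure can quantify disagreement between two partitions of the same objects, the sketch below computes entropy, mutual information and the variation of information; choosing these particular measures is an assumption, not the paper's stated method.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in nats) of the cluster-size distribution."""
    n = len(labels)
    return -sum(c / n * math.log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information (in nats) between two partitions of the same objects."""
    n = len(a)
    size_a, size_b, joint = Counter(a), Counter(b), Counter(zip(a, b))
    return sum(c / n * math.log(c * n / (size_a[x] * size_b[y]))
               for (x, y), c in joint.items())


def variation_of_information(a, b):
    """Zero when the partitions coincide up to relabeling; larger values
    indicate less stable (more different) clustering solutions."""
    return entropy(a) + entropy(b) - 2 * mutual_information(a, b)


if __name__ == "__main__":
    print(variation_of_information([0, 0, 1, 1], [0, 1, 0, 1]))
```

Averaging such a measure over partitions obtained from different samples would give one possible stability score, with lower values suggesting a more stable clustering model.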