9 research outputs found

On clustering stability

Author: Amorim Maria José de Pina da Cruz
Publication date: 01/01/2016

JEL Classification: C100; C150; C380

This work is dedicated to evaluating the stability of clustering solutions, specifically crisp clusterings (partitions). We define stability as the concordance of clusterings across several samples. To evaluate stability, we use a weighted cross-validation procedure whose result is summarized by the values of simple and paired agreement indices. To remove agreement by chance from these values, we propose a new method, IADJUST, that relies on simulated cross-classification tables. This contribution makes it feasible to correct any index of agreement. The stability experiments rely on 540 simulated data sets, the design factors being the number of clusters, their balance, and their overlap. Six real data sets with a priori known clusters are also considered. The experiments illustrate the precision and pertinence of the IADJUST procedure and reveal the distribution of the indices under the hypothesis of agreement by chance. We therefore recommend that the use of adjusted indices become common practice when addressing stability. We then compare the stability of two clustering algorithms and conclude that Expectation-Maximization (EM) results are more stable than K-means results on unbalanced data sets. Finally, we explore the relationship between stability and the external validity of a clustering solution. When the results of all experimental scenarios are considered, there is a strong correlation between stability and external validity. However, within a specific experimental scenario (that is, a practical clustering task), we find no relationship between stability and agreement with the ground truth.

Repositório Institucional do ISCTE-IUL
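
The chance-correction idea in the abstract above can be illustrated with a minimal sketch. This is not the IADJUST procedure itself, whose details the abstract does not give. It is a generic permutation-based correction, assuming that chance agreement can be simulated by shuffling one partition's labels (which keeps both sets of cluster sizes fixed) and that the index has a maximum of 1. The function names `rand_index` and `chance_adjusted` are hypothetical.

```python
import random
from itertools import combinations


def rand_index(a, b):
    """Fraction of object pairs on which two partitions agree."""
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)


def chance_adjusted(index_fn, a, b, n_sim=500, seed=0):
    """Correct an agreement index using its simulated mean under chance.

    Assumption of this sketch: chance is modelled by shuffling the labels
    of b, and the index has a maximum value of 1.
    """
    rng = random.Random(seed)
    observed = index_fn(a, b)
    shuffled = list(b)
    total = 0.0
    for _ in range(n_sim):
        rng.shuffle(shuffled)
        total += index_fn(a, shuffled)
    expected = total / n_sim
    if expected == 1.0:
        # Degenerate case: every shuffle agrees perfectly, nothing to adjust.
        return 0.0
    return (observed - expected) / (1.0 - expected)
```

Because `index_fn` is passed in as an argument, the same wrapper can be applied to any agreement index that takes two label lists, which mirrors the abstract's point that the correction is not tied to one index.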

Variability analysis of the hierarchical clustering algorithms and its implication on consensus clustering

Author: Sousa L. (Lucia)
Publication venue: Arunai Publications Private Limited
Publication date: 01/06/2017

Clustering is one of the most important unsupervised learning tools when no prior knowledge about the data set is available. Clustering algorithms aim to find the underlying structure of a data set, taking into account the clustering criteria, the properties of the data, and the specific way the data are compared. Many clustering algorithms have been proposed in the literature with a common goal: given a set of objects, group similar objects in the same cluster and dissimilar objects in different clusters. Hierarchical clustering algorithms are of great importance in data analysis because they provide knowledge about the data structure. Because the resulting partitions are represented graphically as a dendrogram, they may give more information than the clusterings obtained by non-hierarchical algorithms. Using different clustering methods on the same data set, or the same clustering method with different initializations (different parameters), can produce different clusterings. Several studies have therefore been concerned with validating the resulting clusterings in terms of stability or variability, and there has been increasing interest in the problem of determining a consensus clustering. This work empirically analyzes the clustering variability delivered by hierarchical algorithms and also investigates some consensus clustering techniques. Based on the variability of the hierarchical clusterings, we select the most suitable consensus clustering technique among those in the literature. Results on a range of synthetic and real data sets reveal significant differences in the variability of hierarchical clusterings, as well as different performances of the consensus clustering techniques

Neliti

Two Years of Aerosol Properties and Direct Radiative Effects Measured at a Representative Southeastern U.S. Site

Author: Beuttell William Bullitt
NC DOCKS at Appalachian State University
Publication date: 01/01/2011

The southeastern U.S. is one of only a small number of regions worldwide that have not exhibited warming over the past century. Recent studies (Goldstein et al., 2009) show that negative aerosol direct radiative effects are consistent with a warm-season regional cooling effect linked to secondary organic aerosol loading. Two years of NOAA-ESRL supported aerosol measurements made at the Appalachian Atmospheric Interdisciplinary Research (AppalAIR) facility at Appalachian State University (36.214 N, 81.693 W, 1080 m ASL) are presented, along with satellite-based measurements (MODIS-Aqua) of aerosol optical depth, cloud fraction, and surface albedo. Aerosol optical property statistics are placed in the context of those made at other U.S. ESRL stations. Direct aerosol radiative effect calculations reveal high seasonal variability, with negative broadband summer forcing values of approximately -10 W/m² (-4 W/m²) when actual (standard) cloud fraction, surface albedo, and single-scattering albedo values are used. Hierarchical cluster analyses were used to broadly classify the aerosol source types that influence the aerosol optical properties of the southeastern U.S. Recently added aerosol hygroscopic growth measurements (a sample of which is presented) will facilitate improved aerosol source type classification and scaling of the direct radiative effect calculations for the humidity dependence of aerosol light scattering

The University of North Carolina at Greensboro

Contribution to the knowledge of hierarchical clustering algorithms and consensus clustering. Studies applied to personal recognition by hands biometrics

Author: Sousa Lúcia
Publication date: 02/04/2015

In exploratory data analysis, hierarchical clustering algorithms can, depending on their features, produce different clusterings when applied to the same data set. When several clusterings are available, each identifying a specific data structure, consensus clustering helps to deal with this issue. The work reported here is composed of two parts. In the first part, we explore the profile of the base hierarchical clusterings, according to their variability, in order to obtain the consensus clustering. As a first result, we identified which consensus clustering technique performs better than the others, depending on the characteristics of the hierarchical clusterings used as a base. This result allows us to identify a sufficient condition for the existence of a consensus clustering and to define a new strategy for evaluating it. It also leads to the study of a new property of hierarchical clustering algorithms. In the second part, we explore a real-world application. In a first analysis, we use data sets derived from biometric features extracted from hands for personal recognition. We show that the hierarchical clusterings obtained by the SEP/COP algorithms can give highly accurate results when applied to these data sets. Furthermore, we achieved a 100% recognition rate, an improvement over the rates reported in the literature. In a second analysis, we consider the application of consensus clustering techniques to the problem of identifying people's parentage from hand biometrics. The results indicate that a photograph of the hand contains information that allows a person's family members to be identified. However, with our data the results were not very positive (we observed a probability of 95% for the parents, and 94% for a sibling, of being in the more similar half of the hands), which we believe is due to the poor quality of the photographs we used. Nevertheless, the results indicate that the technique has potential: if the photographs are collected using a scanner with fixed pins, the hand may be an interesting alternative for identifying the parentage of missing children when consensus clustering is applied

Repositório Científico do Instituto Politécnico de Viseu

An Evaluation of Organization Methods for Data Types Commonly Used in the Geographic Domain

Author: Wendel Jochen
Publication venue: University of Colorado Boulder
Publication date: 01/01/2013

This dissertation designed and implemented approaches to assess the suitability of commonly used unsupervised and supervised grouping methods for data types commonly used in the geographic domain. Four different types of data were indexed for organization: a full-text data set depicting 30 years of cartographic literature, a raster data set consisting of physiographic characteristics of the U.S., a suite of GIS software commands used in hydrologic analysis, and a catalog of cartographic generalization algorithms. Various clustering and classification methods from the fields of statistics and machine learning were evaluated for organizing these different data types. By systematically applying all types of data organization to each type of indexed data, this research addresses the question of whether certain indexing strategies influence the effectiveness of the organization methods. Depending on the data set and the indexing method applied, some clustering and classification methods performed better than others. The experiments demonstrate that, through systematic evaluation and validation of clustering and classification results, recommendations for organizing data can be formulated based on cluster and classification indices. Furthermore, through systematic evaluation and application of the six clustering and classification methods, it is possible to match the indexing strategy and the organization method for each of the four data sets used in this dissertation

CU Scholar Institutional Repository

Who dies where, when and why? Modelling determinants and space-time risk of infant, child and adult mortality in rural South Africa, 1992-2008

Author: Sartorius Benn
Publication date: 18/05/2012

Ph.D., University of the Witwatersrand, Faculty of Health Sciences, 201

Wits Institutional Repository on DSPACE

The anonymous 1821 translation of Goethe's Faust: a cluster analytic approach

Author: Aljumily Refat
Publication venue: Newcastle University
Publication date: 01/01/2015

PhD Thesis

This study tests the hypothesis proposed by Frederick Burwick and James McKusick in 2007 that Samuel Taylor Coleridge was the author of the anonymous translation of Goethe's Faust published by Thomas Boosey in 1821. The approach to hypothesis testing is stylometric. Specifically, function word usage is selected as the stylometric criterion, and 80 function words are used to define an 80-dimensional function word frequency profile vector for each text in the corpus of Coleridge's literary works and for a selection of works by a range of contemporary English authors. Each profile vector is a point in 80-dimensional vector space, and cluster analytic methods are used to determine the distribution of profile vectors in the space. If the hypothesis being tested is valid, then the profile for the 1821 translation should be closer in the space to works known to be by Coleridge than to works by the other authors. The cluster analytic results show, however, that this is not the case, and the conclusion is that the Burwick and McKusick hypothesis is falsified relative to the stylometric criterion and analytic methodology used

Newcastle University eTheses
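
The function word profile described in the abstract above can be sketched as follows. This is a minimal illustration, not the thesis's method: the tokenizer, the use of relative frequencies, and the Euclidean distance are all assumptions, and the function names `profile` and `euclidean` are hypothetical. A real analysis would use the thesis's list of 80 function words, which is not given here.

```python
import math
import re


def profile(text, function_words):
    """Relative frequency of each function word in the text.

    The result has one dimension per function word, so a list of
    80 words yields an 80-dimensional profile vector.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    total = len(tokens) or 1  # avoid division by zero on empty text
    return [tokens.count(word) / total for word in function_words]


def euclidean(u, v):
    """Euclidean distance between two profile vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))
```

Under these assumptions, a disputed text is attributed by checking whether its profile lies closer to the profiles of one author's known works than to those of the other authors.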

Cluster validation using information stability measures

Authors: Pascual Damaris; Pla Filiberto; Sánchez Garreta Josep Salvador
Publication venue: Elsevier BV
Publication date: 01/01/2010

In this work, a novel technique is presented that addresses the problem of cluster validation based on cluster stability properties. The proposed stability index is based on the variation of some information measures over the partitions generated by a given clustering model, where the variation arises from the variability in the clustering solutions produced by different sample sets. © 2009 Elsevier B.V. All rights reserved

Repositori Institucional de la Universitat Jaume I