990 research outputs found

    On clustering stability

    JEL Classification: C100; C150; C380
    This work is dedicated to evaluating the stability of clustering solutions, namely the stability of crisp clusterings or partitions. We specifically refer to stability as the concordance of clusterings across several samples. To evaluate stability, we use a weighted cross-validation procedure whose result is summarized by the values of simple and paired agreement indices. To exclude the amount of agreement by chance from these values, we propose a new method, IADJUST, that resorts to simulated cross-classification tables. This contribution makes the chance correction of any agreement index viable. The experiments on stability rely on 540 simulated data sets, the design factors being the number of clusters, their balance, and their overlap. Six real data sets with a priori known clusters are also considered. The experiments conducted illustrate the precision and pertinence of the IADJUST procedure and reveal the distribution of the indices under the hypothesis of agreement by chance. We therefore recommend that the use of adjusted indices become common practice when addressing stability. We then compare the stability of two clustering algorithms and conclude that Expectation-Maximization (EM) results are more stable than K-means results on unbalanced data sets. Finally, we explore the relationship between stability and the external validity of a clustering solution. When the results of all experimental scenarios are considered, there is a strong correlation between stability and external validity. However, within a specific experimental scenario (when a practical clustering task is considered), we find no relationship between stability and agreement with the ground truth.
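    The chance-correction idea can be illustrated with a small sketch. This is not the paper's IADJUST procedure itself; it is a generic Monte Carlo analogue (in Python rather than any code the authors provide, with a function name of our choosing) that estimates the expected value of an agreement index under random cross-classification with fixed margins via label permutation, then applies the usual adjustment formula (observed - expected) / (max - expected):

    ```python
    import numpy as np
    from sklearn.metrics import rand_score

    rng = np.random.default_rng(0)

    def chance_adjusted(index_fn, labels_a, labels_b, n_sim=1000):
        """Correct an agreement index for chance via a permutation null.

        Permuting one labelling simulates cross-classification tables with
        fixed margins; the expected index under agreement by chance is the
        mean over the simulated tables.
        """
        observed = index_fn(labels_a, labels_b)
        b = np.asarray(labels_b)
        expected = float(np.mean(
            [index_fn(labels_a, rng.permutation(b)) for _ in range(n_sim)]
        ))
        max_index = 1.0  # assumes the raw index is bounded above by 1
        return (observed - expected) / (max_index - expected)

    part_a = [0, 0, 0, 1, 1, 1, 2, 2, 2]
    part_b = [0, 0, 1, 1, 1, 2, 2, 2, 2]
    print(chance_adjusted(rand_score, part_a, part_b))
    ```

    The same wrapper works for any index bounded above by 1 (here the Rand index from scikit-learn), which mirrors the abstract's point that the simulation approach makes the correction of any agreement index viable.
    
    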

    Skewed Factor Models Using Selection Mechanisms

    Traditional factor models explicitly or implicitly assume that the factors follow a multivariate normal distribution; that is, only moments up to order two are involved. However, in real data problems it may happen that the first two moments cannot explain the factors. Motivated by this, we devise three new skewed factor models, the skew-normal, the skew-t, and the generalized skew-normal factor models, each based on a selection mechanism on the factors. ECME algorithms are adopted to estimate the related parameters for statistical inference. Monte Carlo simulations validate our new models, and we demonstrate the need for skewed factor models using the classic open/closed book exam scores dataset.
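    The selection mechanism underlying such models can be sketched in a few lines. In Azzalini's classical construction (a simpler univariate analogue of the factor-level mechanism in the abstract, not the authors' model), if (X, W) are bivariate normal with correlation delta, then X conditioned on the event W > 0 is skew-normal. A minimal Python sketch, with a hypothetical function name:

    ```python
    import numpy as np

    def skew_normal_by_selection(n, delta=0.8, seed=1):
        """Draw n skew-normal variates via a selection mechanism:
        keep X from a bivariate normal (X, W) with corr(X, W) = delta
        only when the latent selection variable W is positive.
        """
        rng = np.random.default_rng(seed)
        cov = [[1.0, delta], [delta, 1.0]]
        # Oversample: about half of the latent draws satisfy W > 0.
        draws = rng.multivariate_normal([0.0, 0.0], cov, size=4 * n)
        x, w = draws[:, 0], draws[:, 1]
        selected = x[w > 0]  # the selection step
        return selected[:n]

    sample = skew_normal_by_selection(5000, delta=0.8)
    # A positive delta shifts mass to the right of zero.
    print(sample.mean())
    ```

    Replacing the normal law of the factors by such a selection-conditioned law is what lets moments beyond order two enter the model.
    
    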

    Generalized Measures for the Evaluation of Community Detection Methods

    Community detection can be considered a variant of cluster analysis applied to complex networks. For this reason, existing studies have been using tools derived from this field when evaluating community detection algorithms. However, these tools are not completely relevant in the context of network analysis, because they ignore an essential part of the available information: the network structure. They can therefore lead to incorrect interpretations. In this article, we review these measures and illustrate this limitation. We propose a modification to solve this problem and apply it to the three most widespread measures: purity, the Rand index, and normalized mutual information (NMI). We then perform an experimental evaluation on artificially generated networks with realistic community structure. We assess the relevance of the modified measures by comparison with their traditional counterparts, and also relative to the topological properties of the community structures. On these data, the modified NMI turns out to provide the most relevant results. Comment: the R source code (based on the igraph library) for the measures described in this article is freely available on GitHub: https://github.com/CompNet/TopoMeasure
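    The three traditional measures the article modifies are easy to compute in their standard, non-topological form; the topology-aware versions are in the TopoMeasure repository linked above. A minimal sketch in Python rather than the repository's R, using scikit-learn for the Rand index and NMI and a hand-rolled purity helper (our naming):

    ```python
    import numpy as np
    from sklearn.metrics import rand_score, normalized_mutual_info_score

    def purity(true_labels, pred_labels):
        """Fraction of nodes whose detected community's majority
        ground-truth class matches their own ground-truth class."""
        true = np.asarray(true_labels)
        pred = np.asarray(pred_labels)
        total = 0
        for c in np.unique(pred):
            members = true[pred == c]
            # Count the most common ground-truth class in community c.
            total += np.bincount(members).max()
        return total / len(true)

    ground_truth = [0, 0, 0, 0, 1, 1, 1, 1]
    detected     = [0, 0, 0, 1, 1, 1, 1, 1]
    print(purity(ground_truth, detected))
    print(rand_score(ground_truth, detected))
    print(normalized_mutual_info_score(ground_truth, detected))
    ```

    All three scores treat the node labels as a flat clustering, which is exactly the limitation the article points out: a node misplaced across a dense cut and one misplaced between loosely connected communities cost the same.
    
    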

    Cluster validity in clustering methods


    Model-Based Clustering, Classification, and Discriminant Analysis Using the Generalized Hyperbolic Distribution: MixGHD R package

    The MixGHD package for R performs model-based clustering, classification, and discriminant analysis using the generalized hyperbolic distribution (GHD). This approach is suitable for data that can be considered a realization of a (multivariate) continuous random variable. The GHD has the advantage of being flexible due to its skewness, concentration, and index parameters; as such, clustering methods that use this distribution are capable of estimating clusters characterized by different shapes. The package provides five different models, all based on the GHD, an efficient routine for discriminant analysis, and a function to measure cluster agreement. This paper is split into three parts: the first is devoted to the formulation of each method, extending them for classification and discriminant analysis applications; the second focuses on the algorithms; and the third shows the use of the package on real datasets.