74 research outputs found

    Fuzzy cluster validation using the partition negentropy criterion

    The final publication is available at Springer via http://dx.doi.org/10.1007/978-3-642-04277-5_24. Proceedings of the 19th International Conference, Limassol, Cyprus, September 14-17, 2009.
    We introduce the Partition Negentropy Criterion (PNC) for cluster validation. It is a cluster validity index that rewards the average normality of the clusters, measured by means of the negentropy, and penalizes their overlap, measured by the partition entropy. The PNC is aimed at finding well-separated clusters whose shape is approximately Gaussian. We use the new index to validate fuzzy partitions in a set of synthetic clustering problems, and compare the results to those obtained by the AIC, BIC and ICL criteria. The partitions are obtained by fitting a Gaussian mixture model to the data using the EM algorithm. We show that, when the real clusters are normally distributed, all the criteria are able to correctly assess the number of components, with AIC and BIC allowing a higher cluster overlap. However, when the real cluster distributions are not Gaussian (i.e. not the distribution assumed by the mixture model), the PNC outperforms the other indices, correctly evaluating the number of clusters while the other criteria (especially AIC and BIC) tend to overestimate it.
    This work has been partially supported with funds from MEC BFU2006-07902/BFI, CAM S-SEM-0255-2006 and CAM/UAM project CCG08-UAM/TIC-442.
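    The overlap penalty described above is the partition entropy of the fuzzy memberships. A minimal sketch of that ingredient, fitting a Gaussian mixture with EM as in the paper (the negentropy/normality term and the exact weighting of the PNC are omitted; variable names are ours, not the authors'):

```python
# Sketch: fit a Gaussian mixture with EM and compute the partition entropy
# that the PNC uses as its overlap penalty. The negentropy (normality) term
# of the index is NOT computed here.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# two well-separated Gaussian blobs in 2-D
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(6, 1, (200, 2))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
U = gmm.predict_proba(X)            # fuzzy memberships u_nk
eps = 1e-12                         # guard against log(0)
partition_entropy = -np.mean(np.sum(U * np.log(U + eps), axis=1))
print(partition_entropy)
```

    With well-separated clusters the responsibilities are close to 0/1 and the partition entropy is near zero; heavy overlap drives it towards log K, which is what the PNC penalizes.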

    Normality-based validation for crisp clustering

    This is the author’s version of a work that was accepted for publication in Pattern Recognition. Changes resulting from the publishing process, such as peer review, editing, corrections, structural formatting, and other quality control mechanisms may not be reflected in this document. Changes may have been made to this work since it was submitted for publication. A definitive version was subsequently published in Pattern Recognition, 43, 36, (2010), DOI 10.1016/j.patcog.2009.09.018.
    We introduce a new validity index for crisp clustering that is based on the average normality of the clusters. Unlike methods based on inter-cluster and intra-cluster distances, this index emphasizes the cluster shape by using a high order characterization of its probability distribution. The normality of a cluster is characterized by its negentropy, a standard measure of the distance to normality which evaluates the difference between the cluster's entropy and the entropy of a normal distribution with the same covariance matrix. The definition of the negentropy involves the distribution's differential entropy. However, we show that it is possible to avoid its explicit computation by considering only negentropy increments with respect to the initial data distribution, where all the points are assumed to belong to the same cluster. The resulting negentropy increment validity index only requires the computation of covariance matrices. We have applied the new index to an extensive set of artificial and real problems where it provides, in general, better results than other indices, both with respect to the prediction of the correct number of clusters and to the similarity among the real clusters and those inferred.
    This work has been partially supported with funds from MEC BFU2006-07902/BFI, CAM S-SEM-0255-2006 and CAM/UAM CCG08-UAM/TIC-442.
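    As described, the index reduces to cluster priors and covariance log-determinants. A sketch consistent with that description (the exact expression and sign convention in the paper may differ; here lower values are treated as better, and the function name is ours):

```python
import numpy as np

def negentropy_increment(X, labels):
    """Sketch of a negentropy-increment-style score for a crisp partition:
    combines each cluster's prior and covariance log-determinant with the
    log-determinant of the whole data set. Lower = better partition."""
    X = np.asarray(X, float)
    _, logdet0 = np.linalg.slogdet(np.cov(X.T))   # all points as one cluster
    n = len(X)
    total = 0.0
    for k in np.unique(labels):
        Xk = X[labels == k]
        p = len(Xk) / n                            # cluster prior
        _, logdet_k = np.linalg.slogdet(np.cov(Xk.T))
        total += 0.5 * p * logdet_k - p * np.log(p)
    return total - 0.5 * logdet0

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (300, 2)), rng.normal(8, 1, (300, 2))])
good = np.repeat([0, 1], 300)       # the true two-cluster partition
bad = rng.integers(0, 2, 600)       # a random split of the same data
print(negentropy_increment(X, good), negentropy_increment(X, bad))
```

    On this toy data the true partition scores lower (better) than the random split, since compact Gaussian clusters have much smaller covariance log-determinants than the full mixture.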

    The effect of low number of points in clustering validation via the negentropy increment

    This is the author’s version of a work that was accepted for publication in Neurocomputing. Changes resulting from the publishing process, such as peer review, editing, corrections, structural formatting, and other quality control mechanisms may not be reflected in this document. Changes may have been made to this work since it was submitted for publication. A definitive version was subsequently published in Neurocomputing, 74, 16, (2011), DOI: 10.1016/j.neucom.2011.03.023. Selected papers of the 10th International Work-Conference on Artificial Neural Networks (IWANN2009).
    We recently introduced the negentropy increment, a validity index for crisp clustering that quantifies the average normality of the clustering partitions using the negentropy. This index can satisfactorily deal with clusters with heterogeneous orientations, scales and densities. One of the main advantages of the index is the simplicity of its calculation, which only requires the computation of the log-determinants of the covariance matrices and the prior probabilities of each cluster. The negentropy increment provides validation results which are in general better than those from other classic cluster validity indices. However, when the number of data points in a partition region is small, the estimate of the log-determinant of the covariance matrix can be very poor. This affects the proper quantification of the index, and therefore the quality of the clustering, so additional requirements such as a minimum number of points in each region are needed. Although this kind of constraint can provide good results, it needs to be adjusted depending on parameters such as the dimension of the data space. In this article we investigate how the estimation of the negentropy increment of a clustering partition is affected by the presence of regions with a small number of points. We find that the error in this estimation depends on the number of points in each region, but not on the scale or orientation of their distribution, and show how to correct this error in order to obtain an unbiased estimator of the negentropy increment. We also quantify the amount of uncertainty in the estimation. As we show, both for 2D synthetic problems and multidimensional real benchmark problems, these results can be used to validate clustering partitions with a substantial improvement.
    This work has been funded by DGUI-CAM/UAM (Project CCG10-UAM/TIC-5864).

    Evaluation of negentropy-based cluster validation techniques in problems with increasing dimensionality

    The aim of a crisp cluster validity index is to quantify the quality of a given data partition. It makes it possible to select the best partition out of a set of potential ones, and to determine the number of clusters. Recently, negentropy-based cluster validation has been introduced. This new approach seems to perform better than other state-of-the-art techniques, and its computation is quite simple. However, like many other cluster validation approaches, it presents problems when some partition regions have a small number of points. Different heuristics have been proposed to cope with this problem. In this article we systematically analyze the performance of different negentropy-based validation approaches, including a new heuristic, in clustering problems of increasing dimensionality, and compare them to reference criteria such as AIC and BIC. Our results on synthetic data suggest that the newly proposed negentropy-based validation strategy can outperform AIC and BIC when the ratio of the number of points to the dimension is not high, which is a very common situation in most real applications.
    The authors thank the financial support from DGUI-CAM/UAM (Project CCG10-UAM/TIC-5864).
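    The reference criteria mentioned above are straightforward to reproduce: scikit-learn's GaussianMixture exposes AIC and BIC directly. A sketch of BIC-based selection of the number of components on easy synthetic data (not the paper's experimental setup):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(3)
# three well-separated Gaussian clusters in 2-D
X = np.vstack([rng.normal(c, 0.5, (150, 2)) for c in (0, 5, 10)])

# fit mixtures with k = 1..6 components and keep the BIC of each
bic = {k: GaussianMixture(k, n_init=3, random_state=0).fit(X).bic(X)
       for k in range(1, 7)}
best_k = min(bic, key=bic.get)      # BIC is minimized at the chosen model
print(best_k)
```

    On data this well separated BIC recovers the true number of components; the article's point is that this breaks down when points are scarce relative to the dimension.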

    Incremental Cluster Validity Indices for Online Learning of Hard Partitions: Extensions and Comparative Study

    Validation is one of the most important aspects of clustering, particularly when the user is designing a trustworthy or explainable system. However, most clustering validation approaches require batch calculation. This is an important gap because of the value of clustering in real-time data streaming and other online learning applications. Therefore, interest has grown in providing online alternatives for validation. This paper extends the incremental cluster validity index (iCVI) family by presenting incremental versions of Calinski-Harabasz (iCH), Pakhira-Bandyopadhyay-Maulik (iPBM), WB index (iWB), Silhouette (iSIL), Negentropy Increment (iNI), Representative Cross Information Potential (irCIP), Representative Cross Entropy (irH), and Conn_Index (iConn_Index). This paper also provides a thorough comparative study of the behavior of these iCVIs under correct, under- and over-partitioning, alongside the Partition Separation (PS) index and four recently introduced iCVIs: incremental Xie-Beni (iXB), incremental Davies-Bouldin (iDB), and incremental generalized Dunn's indices 43 and 53 (iGD43 and iGD53). Experiments were carried out using a framework that was designed to be as agnostic as possible to the clustering algorithms. The results on synthetic benchmark data sets showed that while evidence of most under-partitioning cases could be inferred from the behaviors of the majority of these iCVIs, over-partitioning was found to be a more challenging problem, detected by fewer of them. Interestingly, over-partitioning, rather than under-partitioning, was more prominently detected in the real-world data experiments within this study. The expansion of iCVIs provides significant novel opportunities for assessing and interpreting the results of unsupervised lifelong learning in real time, wherein samples cannot be reprocessed due to memory and/or application constraints.
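    Incremental CVIs avoid batch recomputation by updating per-cluster statistics one sample at a time. A sketch of the generic building block such indices rely on, a Welford-style online update of a cluster's mean and scatter (this is not any specific iCVI from the paper; the class name is ours):

```python
import numpy as np

class OnlineCluster:
    """Incrementally maintained cluster statistics: count, mean, and sum of
    squared distances to the mean, updated one sample at a time."""
    def __init__(self, dim):
        self.n = 0
        self.mean = np.zeros(dim)
        self.ssq = 0.0              # within-cluster sum of squares

    def add(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n            # Welford mean update
        self.ssq += delta @ (x - self.mean)    # uses old and new mean

rng = np.random.default_rng(4)
X = rng.normal(2.0, 1.0, (1000, 3))
c = OnlineCluster(3)
for x in X:                          # stream the points one by one
    c.add(x)

# the online statistics match the batch quantities without storing the stream
print(np.allclose(c.mean, X.mean(axis=0)),
      np.isclose(c.ssq, ((X - X.mean(axis=0)) ** 2).sum()))
```

    Compactness-based indices (e.g. CH- or WB-style) can be recomputed in O(1) per sample from such state, which is what makes their incremental versions feasible for streaming data.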

    Assessing the Number of Components in Mixture Models: a Review.

    Despite the widespread application of finite mixture models, the decision of how many classes are required to adequately represent the data is, according to many authors, an important but unsolved issue. This work aims to review, describe and organize the available approaches designed to help the selection of the adequate number of mixture components (including Monte Carlo test procedures, information criteria and classification-based criteria); we also provide some published simulation results about their relative performance, with the purpose of identifying the scenarios where each criterion is more effective.
    Keywords: finite mixture; number of mixture components; information criteria; simulation studies.

    Neuroengineering of Clustering Algorithms

    Cluster analysis can be broadly divided into multivariate data visualization, clustering algorithms, and cluster validation. This dissertation contributes neural network-based techniques to perform all three unsupervised learning tasks. Particularly, the first paper provides a comprehensive review on adaptive resonance theory (ART) models for engineering applications and provides context for the four subsequent papers. These papers are devoted to enhancements of ART-based clustering algorithms from (a) a practical perspective by exploiting the visual assessment of cluster tendency (VAT) sorting algorithm as a preprocessor for ART offline training, thus mitigating ordering effects; and (b) an engineering perspective by designing a family of multi-criteria ART models: dual vigilance fuzzy ART and distributed dual vigilance fuzzy ART (both of which are capable of detecting complex cluster structures), merge ART (aggregates partitions and lessens ordering effects in online learning), and cluster validity index vigilance in fuzzy ART (features a robust vigilance parameter selection and alleviates ordering effects in offline learning). The sixth paper consists of enhancements to data visualization using self-organizing maps (SOMs), depicting information-theoretic similarity measures between neighboring neurons on the reduced-dimension, topology-preserving SOM grid. This visualization's parameters are estimated using samples selected via a single-linkage procedure, thereby generating heatmaps that portray more homogeneous within-cluster similarities and crisper between-cluster boundaries. The seventh paper presents incremental cluster validity indices (iCVIs) realized by (a) incorporating existing formulations of online computations for clusters' descriptors, or (b) modifying an existing ART-based model and incrementally updating local density counts between prototypes. Moreover, this last paper provides the first comprehensive comparison of iCVIs in the computational intelligence literature --Abstract, page iv

    Validación de clusters basada en la negentropía de las particiones (Cluster validation based on the negentropy of the partitions)

    Clustering techniques group a set of points according to a similarity criterion, seeking that points belonging to the same cluster are more similar to each other than to the rest of the points. The main objective of this final degree project is the study and evaluation of negentropy-based cluster validation methods, and their comparison with other, more traditional methods. To that end, a study of the state of the art was carried out, in which different clustering methods and different validation methods were evaluated. The clustering technique used in this project fits a mixture of Gaussians to the data using the EM algorithm; each Gaussian in the returned model corresponds to a cluster. Each data set is fitted with different numbers of Gaussians, yielding models with different numbers of clusters. The models returned by the EM algorithm are evaluated with different clustering validation methods, which measure the quality of each model according to the criterion used by each validation method. Among these methods is the one analyzed in this project, Negentropy-based Validation, and two methods already established in the context of mixture models, AIC and BIC, against which the comparisons are made. For the evaluation of the method, a battery of synthetic problems was generated, choosing the variables involved in each problem so that the results obtained at the end of the analysis allow the performance of the three methods to be compared over a very wide range of situations. The analysis leads to the following conclusions: AIC behaves very poorly, while the negentropy-based method improves on the performance of BIC in most cases, emerging as a strong candidate for use in applications with real data. Part of the results obtained in this study have been published in an international journal (1).
