15 research outputs found

    Fuzzy cluster validation using the partition negentropy criterion

    Full text link
    The final publication is available at Springer via http://dx.doi.org/10.1007/978-3-642-04277-5_24. Proceedings of the 19th International Conference, Limassol, Cyprus, September 14-17, 2009.
    We introduce the Partition Negentropy Criterion (PNC) for cluster validation. It is a cluster validity index that rewards the average normality of the clusters, measured by means of the negentropy, and penalizes their overlap, measured by the partition entropy. The PNC is aimed at finding well-separated clusters whose shape is approximately Gaussian. We use the new index to validate fuzzy partitions in a set of synthetic clustering problems and compare the results to those obtained by the AIC, BIC and ICL criteria. The partitions are obtained by fitting a Gaussian Mixture Model to the data using the EM algorithm. We show that, when the real clusters are normally distributed, all the criteria are able to correctly assess the number of components, with AIC and BIC allowing a higher cluster overlap. However, when the real cluster distributions are not Gaussian (the distribution assumed by the mixture model), the PNC outperforms the other indices, being able to correctly evaluate the number of clusters while the other criteria (especially AIC and BIC) tend to overestimate it.
    This work has been partially supported with funds from MEC BFU2006-07902/BFI, CAM S-SEM-0255-2006 and CAM/UAM project CCG08-UAM/TIC-442.
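    As a rough illustration of the model-selection setup described above (not the authors' implementation), the sketch below fits a Gaussian Mixture Model by EM for several candidate numbers of components and reports AIC, BIC and the partition-entropy term that the PNC uses as its overlap penalty; the synthetic data and the candidate range of k are assumptions made for demonstration.

```python
# Hedged sketch: GMM model selection with AIC/BIC plus the partition-entropy
# overlap term used by the PNC. Synthetic data and the k range are illustrative.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two well-separated Gaussian clusters (assumed toy data).
X = np.vstack([rng.normal(0.0, 1.0, size=(200, 2)),
               rng.normal(5.0, 1.0, size=(200, 2))])

for k in range(1, 6):                      # candidate numbers of components
    gmm = GaussianMixture(n_components=k, random_state=0).fit(X)
    resp = gmm.predict_proba(X)            # fuzzy memberships u_ik
    # Partition entropy: -(1/N) * sum_i sum_k u_ik log u_ik (overlap penalty).
    pe = -np.mean(np.sum(resp * np.log(np.clip(resp, 1e-12, None)), axis=1))
    print(f"k={k}  AIC={gmm.aic(X):9.1f}  BIC={gmm.bic(X):9.1f}  partition entropy={pe:.3f}")
```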

    A data analysis methodology based on multi-stage fuzzy clustering

    Get PDF
    The paper proposes a methodology for the multi-stage application of fuzzy automatic-classification methods to problems of intelligent analysis and processing of multidimensional data. The result of a numerical experiment on an artificial data set is presented and preliminary conclusions are formulated.

    Quality indices for (practical) clustering evaluation

    Get PDF
    WOS:000271584000004 (Web of Science accession number).
    Clustering quality or validation indices allow the quality of a clustering to be evaluated in order to support the selection of a specific partition or clustering structure in its natural unsupervised setting, where the real solution is unknown or unavailable. In this paper, we investigate the use of quality indices, mostly based on the concepts of cluster compactness and separation, for the evaluation of clustering results (partitions in particular). This work intends to offer a general perspective on the appropriate use of quality indices for the purpose of clustering evaluation. After presenting some commonly used indices, as well as indices recently proposed in the literature, key issues regarding the practical use of quality indices are addressed. A general methodological approach is presented which considers the identification of appropriate index thresholds. This general approach is compared with the simple use of quality indices for evaluating a clustering solution.
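    For context (this is a generic sketch, not taken from the paper), the snippet below computes three widely used compactness/separation-based internal indices with scikit-learn; the k-means labelling and the toy data are assumptions made purely to show the calls.

```python
# Hedged sketch: common internal validity indices built on compactness and
# separation, evaluated for a k-means partition of toy data (illustrative only).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, calinski_harabasz_score, davies_bouldin_score

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.5, size=(100, 2)) for c in (0.0, 4.0, 8.0)])

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print("silhouette        :", silhouette_score(X, labels))         # higher is better
print("Calinski-Harabasz :", calinski_harabasz_score(X, labels))  # higher is better
print("Davies-Bouldin    :", davies_bouldin_score(X, labels))     # lower is better
```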

    S-Divergence-Based Internal Clustering Validation Index

    Get PDF
    A clustering validation index (CVI) is employed to evaluate an algorithm’s clustering results. Generally, CVIs can be split into three classes, namely internal, external, and relative cluster validation. Most existing internal CVIs are designed around compactness (CM) and separation (SM): the distance between cluster centers is measured by the SM, whereas the CM measures the variance within a cluster. However, the SM between groups is not always captured accurately in highly overlapping classes. In this article, we devise a novel internal CVI that can be regarded as a complementary measure to the landscape of available internal CVIs. Initially, a database’s clusters are modeled as non-parametric density functions estimated using kernel density estimation. The S-divergence (SD) and S-distance are then introduced for measuring the SM and the CM, respectively. The SD is defined based on the concept of Hermitian positive definite matrices applied to density functions. The proposed internal CVI (PM) is the ratio of CM to SM. The PM outperforms the legacy measures presented in the literature on both synthetic and real-world databases in various scenarios, according to empirical results from four popular clustering algorithms: fuzzy k-means, spectral clustering, density peak clustering, and density-based spatial clustering of applications with noise (DBSCAN).
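    As a hedged illustration of the S-divergence itself (the paper's full index additionally works on kernel density estimates, which is not reproduced here), the sketch below computes the symmetric log-determinant form of the S-divergence between two cluster covariance matrices treated as positive definite matrices; using covariances in place of the density-based formulation is an assumption.

```python
# Hedged sketch: S-divergence between two symmetric positive definite matrices,
# D_S(A, B) = log det((A + B) / 2) - 0.5 * (log det A + log det B),
# used here as a separation measure between cluster covariances (assumption:
# covariances stand in for the paper's density-based formulation).
import numpy as np

def s_divergence(A: np.ndarray, B: np.ndarray) -> float:
    _, logdet_m = np.linalg.slogdet((A + B) / 2.0)
    _, logdet_a = np.linalg.slogdet(A)
    _, logdet_b = np.linalg.slogdet(B)
    return logdet_m - 0.5 * (logdet_a + logdet_b)

rng = np.random.default_rng(2)
cluster1 = rng.normal(0.0, 1.0, size=(300, 3))
cluster2 = rng.normal(2.0, 2.0, size=(300, 3))
A = np.cov(cluster1, rowvar=False)
B = np.cov(cluster2, rowvar=False)
print("S-divergence between cluster covariances:", s_divergence(A, B))
```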

    Application of fuzzy c-means clustering for analysis of chemical ionization mass spectra: insights into the gas-phase chemistry of NO3-initiated oxidation of isoprene

    Get PDF
    Oxidation of volatile organic compounds (VOCs) can lead to the formation of secondary organic aerosol (SOA), a significant component of atmospheric fine particles, which can affect air quality, human health, and climate. However, current understanding of the formation mechanism of SOA is still incomplete, which is due not only to the complexity of the chemistry but also to analytical challenges in SOA precursor detection and quantification. Recent instrumental advances, especially the development of high-resolution time-of-flight chemical ionization mass spectrometry (CIMS), have greatly enhanced the capability to detect low- and extremely low-volatility organic molecules (L/ELVOCs). Although the detection and characterization of low-volatility vapors have largely improved our understanding of SOA formation, analyzing and interpreting complex mass spectrometric data remains a challenging task. This necessitates the use of dimension-reduction techniques to simplify mass spectrometric data with the purpose of extracting chemical and kinetic information about the investigated system. Here we present an approach that uses fuzzy c-means clustering (FCM) to analyze CIMS data from chamber experiments investigating the gas-phase chemistry of nitrate-radical-initiated oxidation of isoprene. The performance of FCM was evaluated and validated. By applying FCM, various oxidation products were classified into different groups according to their chemical and kinetic properties, and the common patterns of their time series were identified, which gave insights into the chemistry of the system investigated. The chemical properties are characterized by elemental ratios and average carbon oxidation state, and the kinetic behaviors are parameterized with a generation number and an effective rate coefficient (describing the average reactivity of a species) using the gamma kinetic parameterization model. In addition, the fuzziness of the FCM algorithm offers a way to separate isomers, or the different chemical processes a species is involved in, which could be useful for mechanism development. Overall, FCM is a widely applicable technique for simplifying complex mass spectrometric data, and the chemical and kinetic properties derived from the clustering can be used to understand the reaction system of interest.
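    To make the clustering step concrete (this is a generic fuzzy c-means sketch, not the authors' analysis pipeline), the code below implements the standard FCM update equations in NumPy and applies them to a few synthetic, normalized time series; the number of clusters, the fuzzifier m, and the toy traces are assumptions.

```python
# Hedged sketch: standard fuzzy c-means (FCM) in NumPy applied to synthetic,
# normalized time series. Cluster count, fuzzifier and data are illustrative.
import numpy as np

def fuzzy_c_means(X, c=2, m=2.0, n_iter=100, seed=0):
    """X: (n_samples, n_features). Returns cluster centers and memberships."""
    rng = np.random.default_rng(seed)
    u = rng.random((X.shape[0], c))
    u /= u.sum(axis=1, keepdims=True)              # rows sum to 1
    for _ in range(n_iter):
        um = u ** m
        centers = (um.T @ X) / um.sum(axis=0)[:, None]
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        dist = np.clip(dist, 1e-10, None)
        inv = dist ** (-2.0 / (m - 1.0))
        u = inv / inv.sum(axis=1, keepdims=True)   # membership update
    return centers, u

# Toy "time series": early-peaking vs late-rising traces, normalized to [0, 1].
t = np.linspace(0, 1, 50)
early = np.exp(-5 * t) + 0.05 * np.random.default_rng(1).random((20, 50))
late = t ** 2 + 0.05 * np.random.default_rng(2).random((20, 50))
X = np.vstack([early, late])
X = (X - X.min(axis=1, keepdims=True)) / np.ptp(X, axis=1, keepdims=True)

centers, u = fuzzy_c_means(X, c=2)
print("hard labels:", u.argmax(axis=1))
```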

    Determining the number of clusters and distinguishing overlapping clusters in data analysis

    Get PDF
    The clustering process builds a collection of objects (clusters) that are similar within the same group and dissimilar when they belong to different groups. In this thesis, we address two major problems in data analysis: 1) automatically determining the number of clusters in a data set for which no information about the underlying structure is available; 2) the phenomenon of overlap between clusters. Most clustering algorithms suffer from the problem of determining the number of clusters, which is often left to the user. The classical approach is based on an iterative process that minimizes an objective function called a validity index. Our goals are to: 1) develop a new validity index to measure the quality of a partition produced by a clustering algorithm; 2) propose a new fuzzy clustering algorithm that determines the number of clusters automatically. An application of the new algorithm to feature selection in a database is presented. Overlap between clusters is one of the difficult problems in statistical pattern recognition, and most clustering algorithms have difficulty distinguishing clusters that overlap. In this thesis, we develop a theory that formally characterizes the overlap between clusters in a Gaussian mixture model. From this theory, we derive a new algorithm that computes the degree of overlap between clusters in the multidimensional case. Within this framework, we study the factors that affect the theoretical value of the overlap degree. We show how the theory can be used to generate valid, concrete test data for an objective evaluation of validity indices with respect to their ability to distinguish overlapping clusters. Finally, the theory is applied to colour image segmentation using a hierarchical clustering algorithm.
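    The thesis's formal overlap measure is not reproduced here; as a loosely related, hedged sketch, the code below estimates the overlap between two Gaussian mixture components by Monte Carlo, as the probability that a point drawn from one component receives a higher posterior from the other. The component parameters and sample size are assumptions.

```python
# Hedged sketch: Monte Carlo estimate of overlap between two Gaussian components,
# measured as the probability of posterior misassignment (illustrative, not the
# thesis's formal overlap degree).
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(3)
mu1, cov1, w1 = np.array([0.0, 0.0]), np.eye(2), 0.5
mu2, cov2, w2 = np.array([2.0, 0.0]), np.eye(2), 0.5

samples = rng.multivariate_normal(mu1, cov1, size=100_000)  # drawn from component 1
p1 = w1 * multivariate_normal.pdf(samples, mu1, cov1)
p2 = w2 * multivariate_normal.pdf(samples, mu2, cov2)
overlap_1_to_2 = np.mean(p2 > p1)   # fraction assigned to the "wrong" component
print(f"estimated overlap (component 1 misassigned to 2): {overlap_1_to_2:.3f}")
```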

    A Hierarchical Method for Determining the Number of Clusters

    Get PDF
    A fundamental and difficult problem in cluster analysis is the determination of the "true" number of clusters in a dataset. The common trial-and-error method generally depends on specific clustering algorithms and is inefficient when processing large datasets. In this paper, a hierarchical method is proposed that avoids repeatedly clustering large datasets. The method first obtains the CF (clustering feature) statistics by scanning the dataset and agglomeratively generates hierarchical partitions of the dataset from the bottom up; a curve of clustering quality with respect to the varying partitions is then constructed incrementally. The partition corresponding to the extremum of the curve is used to estimate the number of clusters. A new validity index is also presented to quantify the clustering quality; it is independent of the clustering algorithm, emphasizes the geometric features of clusters, and handles noisy data and arbitrarily shaped clusters. Experimental results on both real-world and synthetic datasets demonstrate that the new method outperforms recently published approaches while significantly improving efficiency.
    Supported by the National Natural Science Foundation of China under Grant No. 10771176, the National 985 Project of China under Grant No. 0000-X07204, and the Scientific Research Foundation of Xiamen University of China under Grant No. 0630-X01117.
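    The paper's own validity index is not reproduced here; as a hedged sketch of the clustering-feature (CF) statistic it builds on, the code below maintains the standard BIRCH-style triple (N, LS, SS), merges two CFs by component-wise addition, and derives the centroid and radius from the triple. The small data arrays are assumptions.

```python
# Hedged sketch: BIRCH-style clustering feature CF = (N, LS, SS), where N is the
# point count, LS the linear-sum vector, and SS the sum of squared norms.
# Merging two CFs is component-wise addition; centroid and radius follow
# directly from the triple (toy data is illustrative only).
import numpy as np

def make_cf(points: np.ndarray):
    return points.shape[0], points.sum(axis=0), np.sum(points ** 2)

def merge_cf(cf1, cf2):
    return cf1[0] + cf2[0], cf1[1] + cf2[1], cf1[2] + cf2[2]

def centroid(cf):
    n, ls, _ = cf
    return ls / n

def radius(cf):
    n, ls, ss = cf
    return np.sqrt(max(ss / n - np.dot(ls / n, ls / n), 0.0))

a = make_cf(np.array([[1.0, 2.0], [2.0, 3.0]]))
b = make_cf(np.array([[10.0, 10.0], [11.0, 12.0]]))
merged = merge_cf(a, b)
print("centroid:", centroid(merged), " radius:", radius(merged))
```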

    Clustering of fMRI data: the elusive optimal number of clusters

    Get PDF
    Model-free methods are widely used for the processing of brain fMRI data collected under natural stimulation, sleep, or rest. Among them is the popular fuzzy c-means algorithm, commonly combined with cluster validity (CV) indices to identify the ‘true’ number of clusters (components) in an unsupervised way. CV indices may, however, reveal different optimal c-partitions for the same fMRI data, and their effectiveness can be hindered by the high data dimensionality, the limited signal-to-noise ratio, the small proportion of relevant voxels, and the presence of artefacts or outliers. Here, the author investigated the behaviour of seven robust CV indices. A new CV index that incorporates both compactness and separation measures is also introduced. Using both artificial and real fMRI data, the findings highlight the importance of looking at the behaviour of different compactness and separation measures, defined here as building blocks of CV indices, to obtain a full description of the data structure, in particular when no agreement is found between CV indices. Overall, for fMRI, it makes sense to relax the assumption that only one unique c-partition exists, and to appreciate that different c-partitions (with different optimal numbers of clusters) can be useful explanations of the data, given the hierarchical organization of many brain networks.
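    As a hedged illustration of inspecting compactness and separation as separate "building blocks" rather than a single combined index (a generic sketch, not the paper's indices), the code below reports both quantities for partitions over a range of candidate cluster numbers; the k-means stand-in, the toy data, and the distance choices are assumptions.

```python
# Hedged sketch: report a compactness measure (mean distance to the assigned
# centroid) and a separation measure (minimum distance between centroids)
# separately for each candidate number of clusters (toy data, k-means stand-in).
import numpy as np
from itertools import combinations
from sklearn.cluster import KMeans

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(c, 0.6, size=(80, 10)) for c in (0.0, 3.0, 6.0)])

for c in range(2, 7):
    km = KMeans(n_clusters=c, n_init=10, random_state=0).fit(X)
    compactness = np.mean(np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1))
    separation = min(np.linalg.norm(a - b)
                     for a, b in combinations(km.cluster_centers_, 2))
    print(f"c={c}  compactness={compactness:.3f}  separation={separation:.3f}")
```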

    Identification of overlapping red blood cells using unsupervised Bayesian classification on microscopic blood-cell images

    Get PDF
    The number of cells can be used to determine the type of disease, for example anemia caused by a lack of red blood cells or leukemia caused by an excess of white blood cells. When red blood cells are extracted automatically from microscopic images, the presence of overlapping (stacked) cells can lead to inaccuracies in counting the number of cells. This final project identifies overlapping red blood cells in microscopic blood images for red blood cell counting using unsupervised Bayesian classification. Overlapping cells are separated by modelling them as clusters. The Expectation-Maximization algorithm is used to estimate the optimal parameter values that serve as inputs to the Bayesian classifier, and a cluster validity index is used to estimate the number of overlapping red blood cells. Experiments compared the red blood cell counts from the implemented system with manual counts. Undersegmented and oversegmented indicators mark red blood cells whose overlap is not resolved precisely. The experiments gave satisfactory results, with an average accuracy of 87.94% and average undersegmented and oversegmented errors of 3.82% and 8.24%, respectively.
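    As a hedged sketch of the general idea (an EM-fitted mixture plus a model-selection criterion to estimate how many cells make up one connected blob; the project's exact validity index is not specified here, so BIC is substituted), the code below fits Gaussian mixtures with varying numbers of components to the pixel coordinates of a synthetic two-cell blob and picks the count with the lowest BIC; the blob generation is an assumption.

```python
# Hedged sketch: estimate how many overlapping "cells" form one blob by fitting
# Gaussian mixtures (EM) to the blob's pixel coordinates and selecting the
# component count with the lowest BIC (stand-in for a cluster validity index).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(5)
# Synthetic blob made of two overlapping circular "cells" (illustrative only).
cell1 = rng.normal([20.0, 20.0], 4.0, size=(500, 2))
cell2 = rng.normal([27.0, 22.0], 4.0, size=(500, 2))
pixels = np.vstack([cell1, cell2])

bics = {k: GaussianMixture(n_components=k, random_state=0).fit(pixels).bic(pixels)
        for k in range(1, 5)}
estimated_cells = min(bics, key=bics.get)
print("BIC per component count:", {k: round(v, 1) for k, v in bics.items()})
print("estimated number of overlapping cells:", estimated_cells)
```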