    Flexible parametric bootstrap for testing homogeneity against clustering and assessing the number of clusters

    There are two notoriously hard problems in cluster analysis: estimating the number of clusters, and checking whether the population to be clustered is not actually homogeneous. Given a dataset, a clustering method and a cluster validation index, this paper proposes to set up null models that capture structural features of the data that cannot be interpreted as indicating clustering. Artificial datasets are sampled from the null model with parameters estimated from the original dataset. This can be used for testing the null hypothesis of a homogeneous population against a clustering alternative. It can also be used to calibrate the validation index for estimating the number of clusters, by taking into account the expected distribution of the index under the null model for any given number of clusters. The approach is illustrated by three examples, involving various clustering techniques (partitioning around medoids, hierarchical methods, a Gaussian mixture model), validation indexes (average silhouette width, prediction strength and BIC), and issues such as mixed-type data and temporal and spatial autocorrelation.

    Analysis of FMRI Exams Through Unsupervised Learning and Evaluation Index

    In the last few years, the clustering of time series has seen significant growth and has proven effective in providing useful information in various domains. This growing interest is the result of the effort made by the scientific community in the context of temporal data mining. For these reasons, the first phase of the thesis focused on the study of data obtained from fMRI exams carried out in task-based and resting-state modes, using and comparing different clustering algorithms: the crisp algorithms Self-Organizing Map (SOM), Growing Neural Gas (GNG) and Neural Gas (NG), and a fuzzy algorithm, Fuzzy C-Means (FCM). The clustering results were evaluated using the Davies-Bouldin index (DBI or DB index). Clustering evaluation is the second topic of this thesis. Specific techniques exist for evaluating the validity of a clustering, but none is yet established for the study of fMRI exams. Furthermore, the evaluation of evaluation techniques is still an open research field. Eight clustering validation indices (CVIs) applied to fMRI data clustering are analysed: the Pakhira-Bandyopadhyay-Maulik index (crisp and fuzzy), the Fukuyama-Sugeno index, the Rezaee-Lelieveldt-Reiber index, the Wang-Sun-Jiang index, the Xie-Beni index, the Davies-Bouldin index, and the Soft Davies-Bouldin index. The validation indices themselves are then evaluated, taking into account their sub-optimal performance, through the introduction of new metrics. Finally, a new methodology for the evaluation of CVIs is introduced, which uses an ANFIS model.

    Vinayaka: A semi-supervised projected clustering method using differential evolution

    A semi-supervised projected clustering method based on differential evolution (DE) is proposed. In this method, DE optimizes a hybrid cluster validation index. The Subspace Clustering Quality Estimate index (SCQE index) is used for internal cluster validation and Gini index gain is used for external cluster validation in the proposed hybrid index. The proposed method is applied to the Wisconsin breast cancer dataset.

    Evolutionary Automatic Text Summarization using Cluster Validation Indexes

    The main problem in generating an extractive automatic text summary (EATS) is detecting the key themes of a text. For this task, unsupervised approaches cluster the sentences of the original text to find the key sentences that make up an automatic summary. The quality of an automatic summary is evaluated using similarity metrics against human-made summaries. However, the relationship between the quality of the human-made summaries and the internal quality of the clustering is unclear. First, this paper compares how well different internal clustering validation indexes correlate with the quality of human-made summaries, in order to find the index with the best correlation. Second, it proposes an evolutionary method for automatic text summarization based on that best-correlated internal clustering validation index. The proposed unsupervised method for EATS has the advantage of not requiring information about the specific classes or themes of a text, and is therefore domain- and language-independent. The strong results obtained by our method on the most competitive standard collection for EATS show that it maintains a high correlation with human-made summaries while capturing specific features of the groups, such as compactness, separation, distribution, and density.

    Medoid-based shadow value validation and visualization

    The silhouette index is a well-known internal validation criterion for clustering results. While it is a medoid-based validation index, a centroid-based validation index called the centroid-based shadow value (CSV) has also been developed. Although the two are similar, the CSV has the additional property that it can be drawn as a 2-dimensional neighborhood graph. This article proposes a new internal validation index that is medoid-based and can be visualized in a 2-dimensional plot. The proposed index behaves similarly to the silhouette index and produces a network visualization comparable to the neighborhood graph of the CSV. The network visualization has a multiplicative parameter (c) that adjusts the visibility of its edges. Moreover, because it is medoid-based, it is a more appropriate visualization technique for arbitrary data types than the neighborhood graph of the CSV.

    Multiple Imputation based Clustering Validation (MIV) for Big Longitudinal Trial Data with Missing Values in eHealth

    Web-delivered trials are an important component of eHealth services. These trials, mostly behavior-based, generate big heterogeneous data that are longitudinal and high-dimensional, with missing values. Unsupervised learning methods have been widely applied in this area; however, validating the optimal number of clusters has been challenging. Building on our multiple imputation (MI) based fuzzy clustering, MIfuzzy, we propose a new multiple imputation based validation (MIV) framework and corresponding MIV algorithms for clustering big longitudinal eHealth data with missing values and, more generally, for fuzzy-logic based clustering methods. Specifically, we detect the optimal number of clusters by auto-searching and auto-synthesizing a suite of MI-based validation methods and indices, including conventional (bootstrap or cross-validation based) and emerging (modularity-based) validation indices for general clustering methods, as well as a specific one (Xie and Beni) for fuzzy clustering. The MIV performance was demonstrated on a big longitudinal dataset from a real web-delivered trial and in simulation. The results indicate that the MI-based Xie and Beni index for fuzzy clustering is more appropriate for detecting the optimal number of clusters in such complex data. The MIV concept and algorithms could easily be adapted to other types of clustering that process big incomplete longitudinal trial data in eHealth services.