Flexible parametric bootstrap for testing homogeneity against clustering and assessing the number of clusters
There are two notoriously hard problems in cluster analysis: estimating the
number of clusters, and checking whether the population to be clustered is not
actually homogeneous. Given a dataset, a clustering method and a cluster
validation index, this paper proposes to set up null models that capture
structural features of the data that cannot be interpreted as indicating
clustering. Artificial datasets are sampled from the null model with parameters
estimated from the original dataset. This can be used for testing the null
hypothesis of a homogeneous population against a clustering alternative. It can
also be used to calibrate the validation index for estimating the number of
clusters, by taking into account the expected distribution of the index under
the null model for any given number of clusters. The approach is illustrated by
three examples involving different clustering techniques (partitioning
around medoids, hierarchical methods, a Gaussian mixture model), validation
indexes (average silhouette width, prediction strength and BIC), and issues
such as mixed-type data and temporal and spatial autocorrelation.
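The procedure described in this abstract (fit a null model, simulate from it, compare the validation index) can be sketched roughly as follows. This is only an illustrative sketch under strong simplifying assumptions that are not taken from the paper: the data are one-dimensional, the null model is a single Gaussian, the clustering method is 2-means, and the index is the average silhouette width. The function names `two_means_1d`, `avg_silhouette` and `homogeneity_test` are hypothetical.

```python
import random
import statistics


def two_means_1d(xs, iters=50):
    """Lloyd's algorithm with k=2 on 1-D data; returns a list of 0/1 labels."""
    c0, c1 = min(xs), max(xs)  # deterministic start at the two extremes
    labels = [0] * len(xs)
    for _ in range(iters):
        labels = [0 if abs(x - c0) <= abs(x - c1) else 1 for x in xs]
        g0 = [x for x, lab in zip(xs, labels) if lab == 0]
        g1 = [x for x, lab in zip(xs, labels) if lab == 1]
        if not g0 or not g1:
            break  # degenerate case: everything fell into one cluster
        new0, new1 = statistics.fmean(g0), statistics.fmean(g1)
        if (new0, new1) == (c0, c1):
            break  # centroids stopped moving
        c0, c1 = new0, new1
    return labels


def avg_silhouette(xs, labels):
    """Average silhouette width for a two-cluster labelling of 1-D data."""
    groups = {0: [], 1: []}
    for x, lab in zip(xs, labels):
        groups[lab].append(x)
    if not groups[0] or not groups[1]:
        return 0.0  # convention assumed here: a single cluster scores 0
    total = 0.0
    for x, lab in zip(xs, labels):
        own, other = groups[lab], groups[1 - lab]
        if len(own) == 1:
            continue  # singleton clusters contribute 0 by convention
        a = sum(abs(x - y) for y in own) / (len(own) - 1)  # within-cluster
        b = sum(abs(x - y) for y in other) / len(other)    # to other cluster
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(xs)


def homogeneity_test(xs, n_boot=200, seed=0):
    """Parametric bootstrap p-value for 'homogeneous' vs. 'clustered'.

    Assumed null model: one Gaussian with mean and standard deviation
    estimated from xs. Large index values count as evidence of clustering.
    """
    rng = random.Random(seed)
    mu, sigma = statistics.fmean(xs), statistics.stdev(xs)
    observed = avg_silhouette(xs, two_means_1d(xs))
    exceed = 0
    for _ in range(n_boot):
        sim = [rng.gauss(mu, sigma) for _ in xs]  # artificial null dataset
        if avg_silhouette(sim, two_means_1d(sim)) >= observed:
            exceed += 1
    return observed, (1 + exceed) / (1 + n_boot)
```

A call such as `homogeneity_test(data)` returns the observed index and a bootstrap p-value. To calibrate the index for choosing the number of clusters, the same simulation would presumably be repeated for each candidate number of clusters, which this sketch does not attempt.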
Analysis of FMRI Exams Through Unsupervised Learning and Evaluation Index
In the last few years, the clustering of time series has grown significantly and has proven effective in
providing useful information in various application domains. This growing interest in time series clustering is the
result of the effort made by the scientific community in the context of temporal data mining.
For these reasons, the first phase of the thesis focused on the study of the data obtained from fMRI exams
carried out in task-based and resting-state mode, using and comparing different clustering algorithms: the Self-Organizing Map (SOM), Growing Neural Gas (GNG) and Neural Gas (NG), which are crisp-type
algorithms, and a fuzzy algorithm, Fuzzy C-Means (FCM). The results
obtained with these clustering algorithms were evaluated using the Davies-Bouldin index (DBI or
DB index).
Clustering evaluation is the second topic of this thesis. There are specific techniques for evaluating the
validity of a clustering, but none of them is yet well established for the study of fMRI exams. Furthermore,
the evaluation of evaluation techniques is still an open research field. Eight clustering validation indexes
(CVIs) applied to fMRI data clustering will be analysed. The validation indexes that have been used are the
Pakhira Bandyopadhyay Maulik Index (crisp and fuzzy), Fukuyama Sugeno Index, Rezaee Lelieveldt Reider
Index, Wang Sun Jiang Index, Xie Beni Index, Davies Bouldin Index and Soft Davies Bouldin Index. Furthermore,
the validation indexes themselves will be evaluated through the introduction of new metrics that take into
account the sub-optimal performance obtained by the indexes. Finally, a new methodology
for the evaluation of CVIs will be introduced, which will use an ANFIS model.
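For orientation, the Davies-Bouldin index mentioned above can be computed along the following lines. This is a minimal sketch of the standard crisp definition (average, over clusters, of the worst ratio of summed within-cluster scatter to centroid distance; lower is better), assuming Euclidean distance and plain feature vectors. It is not the thesis's fMRI pipeline, and the function name `davies_bouldin` is hypothetical.

```python
import math


def davies_bouldin(points, labels):
    """Davies-Bouldin index of a crisp clustering (lower is better).

    points: sequence of equal-length numeric tuples
    labels: one cluster label per point
    """
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("the index needs at least two clusters")

    # Centroid of each cluster (coordinate-wise mean).
    centroids = {
        lab: tuple(sum(coord) / len(ps) for coord in zip(*ps))
        for lab, ps in clusters.items()
    }
    # Scatter: mean distance of the cluster's points to its centroid.
    scatter = {
        lab: sum(math.dist(p, centroids[lab]) for p in ps) / len(ps)
        for lab, ps in clusters.items()
    }

    total = 0.0
    for i in clusters:
        # Worst-case similarity of cluster i to any other cluster.
        # Note: coinciding centroids would cause a division by zero.
        total += max(
            (scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in clusters
            if j != i
        )
    return total / len(clusters)
```

For example, two tight, well-separated groups give a small value, while a labelling that mixes the groups gives a much larger one.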
Vinayaka: A semi-supervised projected clustering method using differential evolution
Vinayaka is a semi-supervised projected clustering method based on differential evolution (DE). In this method, DE optimizes a hybrid cluster validation index. The Subspace Clustering Quality Estimate index (SCQE index) is used for internal cluster validation and Gini index gain is used for external cluster validation in the proposed hybrid index. The proposed method is applied to the Wisconsin breast cancer dataset.
Evolutionary Automatic Text Summarization using Cluster Validation Indexes
The main problem in generating an extractive automatic text summary (EATS) is detecting the key themes of a text. For this task, unsupervised approaches cluster the sentences of the original text to find the key sentences that take part in an automatic summary. The quality of an automatic summary is evaluated using similarity metrics with human-made summaries. However, the relationship between the quality of the human-made summaries and the internal quality of the clustering is unclear. First, this paper compares how well the internal quality measured by different clustering validation indexes correlates with the quality of human-made summaries, in order to find the index with the best correlation. Second, an evolutionary method based on that best-correlated internal clustering validation index is proposed for the automatic text summarization task. Our proposed unsupervised method for EATS has the advantage of not requiring information regarding the specific classes or themes of a text, and is therefore domain- and language-independent. The high results obtained by our method, using the most competitive standard collection for EATS, show that our method maintains a high correlation with human-made summaries while meeting the specific features of the groups, for example, compaction, separation, distribution, and density.
Medoid-based shadow value validation and visualization
The silhouette index is a well-known internal validation measure for clustering results. While it is a medoid-based validation index, a centroid-based validation index called the centroid-based shadow value (CSV) has also been developed. Although the two are similar, the CSV has an additional unique property: it can be displayed as a 2-dimensional neighborhood graph. This article proposes a new internal validation index that is medoid-based and can also visualize the results in a 2-dimensional plot. The proposed index behaves similarly to the silhouette index and produces a network visualization that is comparable to the neighborhood graph of the CSV. The network visualization has a multiplicative parameter (c) to adjust the visibility of its edges. In addition, because it is medoid-based, it is a more appropriate visualization technique for any type of data than the neighborhood graph of the CSV.
Multiple Imputation based Clustering Validation (MIV) for Big Longitudinal Trial Data with Missing Values in eHealth
Web-delivered trials are an important component of eHealth services. These trials, mostly behavior-based, generate big heterogeneous data that are longitudinal and high-dimensional, with missing values. Unsupervised learning methods have been widely applied in this area; however, validating the optimal number of clusters has been challenging. Building upon our multiple imputation (MI) based fuzzy clustering, MIfuzzy, we propose a new multiple imputation based validation (MIV) framework and corresponding MIV algorithms for clustering big longitudinal eHealth data with missing values and, more generally, for fuzzy-logic based clustering methods. Specifically, we detect the optimal number of clusters by auto-searching and -synthesizing a suite of MI-based validation methods and indices, including conventional (bootstrap or cross-validation based) and emerging (modularity-based) validation indices for general clustering methods as well as the specific one (Xie and Beni) for fuzzy clustering. The MIV performance was demonstrated on a big longitudinal dataset from a real web-delivered trial and using simulation. The results indicate that the MI-based Xie and Beni index for fuzzy clustering is more appropriate for detecting the optimal number of clusters for such complex data. The MIV concept and algorithms could be easily adapted to different types of clustering that could process big incomplete longitudinal trial data in eHealth services.
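As background for the Xie and Beni index named in this abstract, the plain (single, complete dataset) form of the index can be sketched as below: fuzzy within-cluster compactness divided by the number of points times the smallest squared distance between cluster centers, with lower values indicating a better partition. This is only the standard index under assumed inputs. How the MIV framework combines it across multiply imputed datasets is not described here and is not reproduced. The names `xie_beni` and `_sqdist`, and the default fuzzifier `m=2.0`, are assumptions of this sketch.

```python
def _sqdist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def xie_beni(points, centers, memberships, m=2.0):
    """Xie-Beni index of a fuzzy partition (lower is better).

    points:      list of n numeric tuples
    centers:     list of K cluster-center tuples (K >= 2)
    memberships: n x K nested list, memberships[i][k] in [0, 1]
    m:           fuzzifier exponent applied to the memberships
    """
    n, k = len(points), len(centers)
    if k < 2:
        raise ValueError("the index needs at least two cluster centers")

    # Compactness: membership-weighted squared distances to the centers.
    compactness = sum(
        memberships[i][c] ** m * _sqdist(points[i], centers[c])
        for i in range(n)
        for c in range(k)
    )
    # Separation: smallest squared distance between two distinct centers.
    separation = min(
        _sqdist(centers[a], centers[b])
        for a in range(k)
        for b in range(a + 1, k)
    )
    return compactness / (n * separation)
```

In practice the index would be evaluated for several candidate numbers of clusters, and the number giving the smallest value would be preferred.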