Cluster validation by measurement of clustering characteristics relevant to the user
There are many cluster analysis methods that can produce quite different
clusterings on the same dataset. Cluster validation is about the evaluation of
the quality of a clustering; "relative cluster validation" is about using such
criteria to compare clusterings. This can be used to select one of a set of
clusterings from different methods, or from the same method run with different
parameters such as different numbers of clusters.
There are many cluster validation indexes in the literature. Most of them
attempt to measure the overall quality of a clustering by a single number, but
this can be inappropriate. There are various different characteristics of a
clustering that can be relevant in practice, depending on the aim of
clustering, such as low within-cluster distances and high between-cluster
separation.
In this paper, a number of validation criteria will be introduced that refer
to different desirable characteristics of a clustering, and that characterise a
clustering in a multidimensional way. In specific applications the user may be
interested in some of these criteria rather than others. A focus of the paper
is on methodology to standardise the different characteristics so that users
can aggregate them in a suitable way specifying weights for the various
criteria that are relevant in the clustering application at hand.
Comment: 20 pages, 2 figures
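The standardise-then-aggregate idea described in this abstract can be sketched as follows. This is a minimal illustration, not the paper's actual methodology: the criterion names, the z-score standardisation across candidate clusterings, and the example weights are all assumptions.

```python
# Sketch: combine several cluster-validation criteria into one score per
# candidate clustering by standardising each criterion across the
# candidates and applying user-chosen weights. Illustrative only.
from statistics import mean, pstdev

def aggregate_scores(criteria, weights):
    """criteria: dict name -> list of values, one per candidate clustering
    (larger = better). Returns one weighted score per candidate."""
    n = len(next(iter(criteria.values())))
    scores = [0.0] * n
    for name, values in criteria.items():
        mu, sd = mean(values), pstdev(values)
        for i, v in enumerate(values):
            z = (v - mu) / sd if sd > 0 else 0.0  # standardise this criterion
            scores[i] += weights.get(name, 0.0) * z
    return scores

# Three candidate clusterings scored on two (hypothetical) criteria;
# this user weights separation twice as heavily as homogeneity.
criteria = {"homogeneity": [0.8, 0.6, 0.7], "separation": [0.3, 0.9, 0.5]}
weights = {"homogeneity": 1.0, "separation": 2.0}
scores = aggregate_scores(criteria, weights)
best = scores.index(max(scores))  # candidate 1 wins on separation
```

The standardisation step is what makes the weights meaningful: without it, criteria measured on different scales would dominate the sum arbitrarily.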
Properties from relativistic coupled-cluster without truncation: hyperfine constants of , , and
We demonstrate an iterative scheme for coupled-cluster properties
calculations without truncating the dressed properties operator. For
validation, magnetic dipole hyperfine constants of alkaline-earth ions are
calculated with relativistic coupled-cluster theory, and the role of electron
correlation is examined. Then, a detailed analysis of the higher-order terms
is carried out. Based on the results, we arrive at an optimal form of the
dressed operator, which we recommend for properties calculations with
relativistic coupled-cluster theory.
Comment: 13 pages, 4 figures, 5 tables
clValid: An R Package for Cluster Validation
The R package clValid contains functions for validating the results of a clustering analysis. There are three main types of cluster validation measures available: "internal", "stability", and "biological". The user can choose from nine clustering algorithms in existing R packages, including hierarchical, K-means, self-organizing maps (SOM), and model-based clustering. In addition, we provide a function to perform the self-organizing tree algorithm (SOTA) method of clustering. Any combination of validation measures and clustering methods can be requested in a single function call. This allows the user to simultaneously evaluate several clustering algorithms while varying the number of clusters, to help determine the most appropriate method and number of clusters for the dataset of interest. Additionally, the package can automatically make use of the biological information contained in the Gene Ontology (GO) database to calculate the biological validation measures, via the annotation packages available in Bioconductor. The function returns an object of S4 class "clValid", which has summary, plot, print, and additional methods that allow the user to display the optimal validation scores and extract clustering results.
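clValid is an R package, so the following is only a Python analogue of what one of its "internal" measures computes: the average silhouette width of a partition, implemented from scratch on a toy one-dimensional dataset. This is not the package's code, and the silhouette is just one representative internal measure.

```python
# Sketch: average silhouette width, a standard internal validation
# measure. a = mean distance to own cluster, b = mean distance to the
# nearest other cluster; silhouette = (b - a) / max(a, b), in [-1, 1].
def avg_silhouette(points, labels):
    sils = []
    for i, (p, lab) in enumerate(zip(points, labels)):
        own = [abs(p - q) for j, (q, l) in enumerate(zip(points, labels))
               if l == lab and j != i]
        a = sum(own) / len(own) if own else 0.0
        b = min(
            sum(abs(p - q) for q, l in zip(points, labels) if l == other)
            / labels.count(other)
            for other in set(labels) if other != lab
        )
        sils.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(sils) / len(sils)

points = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
labels = [0, 0, 0, 1, 1, 1]
s = avg_silhouette(points, labels)  # near 1: compact, well-separated clusters
```

Values near 1 indicate compact, well-separated clusters; values near 0 or below suggest the partition is doubtful, which is the kind of signal the package's internal measures report.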
Flexible parametric bootstrap for testing homogeneity against clustering and assessing the number of clusters
There are two notoriously hard problems in cluster analysis, estimating the
number of clusters, and checking whether the population to be clustered is not
actually homogeneous. Given a dataset, a clustering method and a cluster
validation index, this paper proposes to set up null models that capture
structural features of the data that cannot be interpreted as indicating
clustering. Artificial datasets are sampled from the null model with parameters
estimated from the original dataset. This can be used for testing the null
hypothesis of a homogeneous population against a clustering alternative. It can
also be used to calibrate the validation index for estimating the number of
clusters, by taking into account the expected distribution of the index under
the null model for any given number of clusters. The approach is illustrated by
three examples, involving various different clustering techniques (partitioning
around medoids, hierarchical methods, a Gaussian mixture model), validation
indexes (average silhouette width, prediction strength and BIC), and issues
such as mixed-type data, and temporal and spatial autocorrelation.
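The general recipe in this abstract can be sketched as below. This is a simplified stand-in, not the paper's implementation: the validation index is the within-cluster sum of squares from a toy 1-D k-means, and the null model is simply uniform on the data range, with the observed index compared to its mean under the null (in the spirit of a gap-style calibration).

```python
# Sketch: calibrate a validation index against a null model fitted to
# the data. Index and null model here are deliberately simple stand-ins.
import random

def kmeans_wss(data, k, iters=20, seed=0):
    """Toy 1-D Lloyd's k-means; returns within-cluster sum of squares."""
    rng = random.Random(seed)
    centers = rng.sample(data, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in data:
            groups[min(range(k), key=lambda c: (x - centers[c]) ** 2)].append(x)
        centers = [sum(g) / len(g) if g else centers[i]
                   for i, g in enumerate(groups)]
    return sum((x - min(centers, key=lambda c: (x - c) ** 2)) ** 2 for x in data)

def calibrated_gap(data, k, n_boot=20, seed=1):
    """Mean null WSS minus observed WSS: how much better than 'no
    clustering' the real data does at this k."""
    rng = random.Random(seed)
    lo, hi = min(data), max(data)
    null = [kmeans_wss([rng.uniform(lo, hi) for _ in data], k)
            for _ in range(n_boot)]
    return sum(null) / len(null) - kmeans_wss(data, k)

data = [0.0, 0.2, 0.4, 9.0, 9.2, 9.4]  # two obvious clusters
gaps = {k: calibrated_gap(data, k) for k in (2, 3)}
```

A clearly positive gap at some k says the observed index beats what the homogeneous null model typically produces at that same k, which is the calibration idea the paper develops with more realistic null models.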
An empirical study on the visual cluster validation method with Fastmap
This paper presents an empirical study of the visual method for cluster validation based on the Fastmap projection. The visual cluster validation method attempts to tackle two clustering problems in data mining: verifying partitions of data created by a clustering algorithm, and identifying genuine clusters from data partitions. Both are achieved by projecting objects and clusters to the 2D space with Fastmap and visually examining the results. A Monte Carlo evaluation of the visual method was conducted, and its validation results were compared with those of two internal statistical cluster validation indices, showing that the visual method is consistent with the statistical validation methods. This indicates that the visual cluster validation method is effective and applicable to data mining applications.
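The Fastmap projection step this study relies on can be sketched as follows. Each object receives a coordinate from its distances to two pivot objects via the cosine law, and repeating the step on the residual distances yields the second axis. The pivot selection below is a simplified one-sweep version of the heuristic, purely illustrative.

```python
# Sketch of Fastmap's core step: project objects onto the line through
# two pivots using only pairwise distances, then recurse on residuals.
import math

def fastmap_axis(dist, n):
    # roughly farthest pivot pair (one sweep, not the full heuristic)
    a = 0
    b = max(range(n), key=lambda j: dist(a, j))
    a = max(range(n), key=lambda j: dist(b, j))
    dab = dist(a, b)
    if dab == 0:
        return [0.0] * n, a, b
    # cosine-law coordinate of each object along the pivot line
    return [(dist(a, i) ** 2 + dab ** 2 - dist(b, i) ** 2) / (2 * dab)
            for i in range(n)], a, b

points = [(0, 0), (1, 0), (0, 1), (5, 5)]

def d(i, j):
    return math.dist(points[i], points[j])

axis1, a, b = fastmap_axis(d, len(points))

def d2(i, j):
    # residual distance after removing the first axis
    return math.sqrt(max(d(i, j) ** 2 - (axis1[i] - axis1[j]) ** 2, 0.0))

axis2, _, _ = fastmap_axis(d2, len(points))
coords = list(zip(axis1, axis2))  # 2-D embedding for visual inspection
```

The resulting 2-D coordinates are what a human would inspect to judge whether the clusters produced by an algorithm look genuine, which is exactly the visual validation workflow the paper evaluates.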
Combining Cluster Validation Indices for Detecting Label Noise
In this paper, we show that cluster validation indices can be used for filtering mislabeled instances or class outliers prior to training in supervised learning problems. We propose a technique, entitled Cluster Validation Index (CVI)-based Outlier Filtering, in which mislabeled instances are identified and eliminated from the training set, and a classification hypothesis is then built from the set of remaining instances. The proposed approach assigns each instance several cluster validation scores representing its potential of being an outlier with respect to the clustering properties the chosen validation measures assess. We examine CVI-based Outlier Filtering and compare it against the Local Outlier Factor (LOF) detection method on ten data sets from the UCI data repository, using five well-known learning algorithms and three different cluster validation indices. In addition, we study and compare three different approaches for combining the selected cluster validation measures. Our results show that for most learning algorithms and data sets, the proposed CVI-based outlier filtering algorithm outperforms the baseline method (LOF). The greatest increase in classification accuracy has been achieved by using union or rank-based median strategies to assemble the chosen cluster validation indices and global filtering of mislabeled instances.
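The core idea can be sketched as below: score each training instance by a silhouette-style validation value computed against its class label, and drop instances whose score falls below a threshold before training. This is an illustration of the principle with a single index, not the authors' implementation, which combines several indices.

```python
# Sketch: treat class labels as a clustering and use a silhouette-style
# score per instance; a mislabeled point sits far from its own class and
# near another, giving a negative score, and is filtered out.
def label_silhouette(x, y, points, labels):
    same = [abs(x - p) for p, l in zip(points, labels) if l == y and p != x]
    a = sum(same) / len(same) if same else 0.0
    b = min(sum(abs(x - p) for p, l in zip(points, labels) if l == o)
            / labels.count(o)
            for o in set(labels) if o != y)
    return (b - a) / max(a, b) if max(a, b) > 0 else 0.0

def filter_mislabeled(points, labels, threshold=0.0):
    """Keep only instances whose label-silhouette meets the threshold."""
    return [(p, l) for p, l in zip(points, labels)
            if label_silhouette(p, l, points, labels) >= threshold]

# One point near class 0 is wrongly labeled 1; its score is strongly
# negative, so it is removed before a classifier would be trained.
points = [0.0, 0.1, 0.2, 5.0, 5.1, 0.05]
labels = [0,   0,   0,   1,   1,   1]
clean = filter_mislabeled(points, labels)
```

A classifier trained on `clean` no longer sees the mislabeled instance, which is the mechanism by which this kind of filtering can raise accuracy.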