External clustering validity index based on chi-squared statistical test
Clustering is one of the most commonly used techniques in data mining. Its main goal is
to group objects into clusters so that each group contains objects that are more similar to
each other than to objects in other clusters. The evaluation of a clustering solution is a task
carried out through the application of validity indices. These indices measure the quality
of the solution and can be classified as either internal indices, which assess quality
using the clustered data themselves, or external indices, which assess quality using
external information such as the class labels. Indices from the literature generally
determine their optimal result through a graphical representation whose results can be
interpreted imprecisely. The aim of this paper is to present a new external validity index
based on the chi-squared statistical test named Chi Index, which presents accurate results
that require no further interpretation. Chi Index was analyzed using the clustering results
of three clustering methods on 47 public datasets. Results indicate a better hit rate and a lower
error percentage compared with 15 external validity indices from the literature.

Funding: Ministerio de Economía y Competitividad TIN2014-55894-C2-R; Ministerio de Economía y Competitividad TIN2017-88209-C2-2-
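The abstract does not spell out the Chi Index formula, but the chi-squared statistic it builds on can be computed from the cluster-vs-class contingency table. A minimal sketch in plain Python (the helper names are illustrative, not the paper's):

```python
# Hedged sketch: Pearson chi-squared statistic on the cluster-vs-class
# contingency table, a plausible building block for an index like Chi Index
# (the paper's exact formula is not given in the abstract).
from collections import Counter

def contingency_table(labels_true, labels_pred):
    """Build the cluster-vs-class contingency table as a nested list."""
    classes = sorted(set(labels_true))
    clusters = sorted(set(labels_pred))
    counts = Counter(zip(labels_pred, labels_true))
    return [[counts[(k, c)] for c in classes] for k in clusters]

def chi_squared(table):
    """Pearson chi-squared statistic: sum of (observed - expected)^2 / expected."""
    n = sum(sum(row) for row in table)
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_sums[i] * col_sums[j] / n
            if exp > 0:
                stat += (obs - exp) ** 2 / exp
    return stat

true_labels = [0, 0, 0, 1, 1, 1]
pred_labels = [0, 0, 0, 1, 1, 1]   # perfect agreement with the classes
print(chi_squared(contingency_table(true_labels, pred_labels)))  # → 6.0
```

A large statistic indicates strong association between clusters and classes; a clustering independent of the classes yields a statistic near zero.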
A novel ensemble clustering for operational transients classification with application to a nuclear power plant turbine
The objective of the present work is to develop a novel approach for combining, in an ensemble, multiple base clusterings of operational transients of industrial equipment when the number of clusters in the final consensus clustering is unknown. A measure of pairwise similarity is used to quantify the co-association matrix that describes the similarity among the different base clusterings. Then, a spectral clustering technique from the literature, embedding the unsupervised K-Means algorithm, is applied to the co-association matrix to find the optimum number of clusters of the final consensus clustering, based on the Silhouette validity index. The proposed approach is developed with reference to an artificial case study, properly designed to mimic the signal trend behavior of a Nuclear Power Plant (NPP) turbine during shutdown. The results on the artificial case have been compared with those achieved by a state-of-the-art approach known as Cluster-based Similarity Partitioning and Serial Graph Partitioning and Fill-reducing Matrix Ordering Algorithms (CSPA-METIS). The comparison shows that the proposed approach identifies a final consensus clustering that classifies the transients with better accuracy and robustness than the CSPA-METIS approach. The approach is then validated on an industrial case concerning 149 shutdown transients of an NPP turbine.
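The co-association step described in this abstract can be sketched as follows; the spectral-clustering and Silhouette model-selection stages are not reproduced here, and the function name is an assumption:

```python
# Hedged sketch of a co-association matrix: entry (i, j) is the fraction of
# base clusterings that place samples i and j in the same cluster.
def co_association(base_clusterings):
    """base_clusterings: list of label lists, one per base clustering."""
    n = len(base_clusterings[0])
    m = len(base_clusterings)
    matrix = [[0.0] * n for _ in range(n)]
    for labels in base_clusterings:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    matrix[i][j] += 1.0 / m
    return matrix

# Two base clusterings of four samples; samples 0 and 1 always co-occur,
# samples 2 and 3 co-occur in only one of the two clusterings.
base = [[0, 0, 1, 1], [0, 0, 0, 1]]
print(co_association(base))
```

The resulting matrix can then be treated as a similarity matrix and fed to a spectral clustering step, as the abstract describes.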
Comparing clusterings and numbers of clusters by aggregation of calibrated clustering validity indexes
A key issue in cluster analysis is the choice of an appropriate clustering
method and the determination of the best number of clusters. Different
clusterings are optimal on the same data set according to different criteria,
and the choice of such criteria depends on the context and aim of clustering.
Therefore, researchers need to consider which data-analytic characteristics the
clusters they are aiming at are supposed to have, among them within-cluster
homogeneity, between-cluster separation, and stability. Here, a set of
internal clustering validity indexes measuring different aspects of clustering
quality is proposed, including some indexes from the literature. Users can
choose the indexes that are relevant in the application at hand. In order to
measure the overall quality of a clustering (for comparing clusterings from
different methods and/or different numbers of clusters), the index values are
calibrated for aggregation. Calibration is relative to a set of random
clusterings on the same data. Two specific aggregated indexes are proposed and
compared with existing indexes on simulated and real data.

Comment: 42 pages, 11 figures
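The calibration idea (rescaling an index value relative to its distribution over random clusterings of the same data) can be sketched as a simple z-score; the paper's actual calibration scheme may differ, and the toy index below is illustrative:

```python
# Hedged sketch of calibrating a validity index against random clusterings,
# using a z-score. The index name and helpers are assumptions.
import random
import statistics

def random_clustering(n, k, rng):
    """Assign each of n points to one of k clusters uniformly at random."""
    return [rng.randrange(k) for _ in range(n)]

def calibrate(index_fn, data, labels, k, n_random=100, seed=0):
    """Rescale index_fn(data, labels) against random clusterings of the data."""
    rng = random.Random(seed)
    ref = [index_fn(data, random_clustering(len(data), k, rng))
           for _ in range(n_random)]
    mu, sigma = statistics.mean(ref), statistics.pstdev(ref)
    return (index_fn(data, labels) - mu) / sigma if sigma > 0 else 0.0

# Toy index: total within-cluster squared distance to the cluster mean (1-D data).
def within_ss(data, labels):
    total = 0.0
    for c in set(labels):
        pts = [x for x, l in zip(data, labels) if l == c]
        m = sum(pts) / len(pts)
        total += sum((x - m) ** 2 for x in pts)
    return total

data = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
good = [0, 0, 0, 1, 1, 1]
print(calibrate(within_ss, data, good, k=2))  # negative: better than random
```

Because calibrated values are on a common random-reference scale, indexes measuring different aspects of quality can be aggregated into one overall score, which is the comparison the abstract describes.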
Identifying hidden contexts
In this study we investigate how to identify hidden contexts from the data in classification tasks.
Contexts are artifacts in the data, which do not predict the class label directly.
For instance, in a speech recognition task, speakers might have different accents, which do not directly discriminate between the spoken words.
Identifying hidden contexts is treated as a data preprocessing task that can help to build more accurate classifiers tailored to particular contexts, and it gives insight into the data structure.
We present three techniques for identifying hidden contexts; all of them hide the class label information from the input data and partition the data using clustering techniques.
We form a collection of performance measures to ensure that the resulting contexts are valid.
We evaluate the performance of the proposed techniques on thirty real datasets.
We present a case study illustrating how the identified contexts can be used to build specialized, more accurate classifiers.
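The basic recipe described above (hide the class label, cluster the remaining features, treat each cluster as a candidate context) can be sketched with plain k-means; k-means here is an assumption, not necessarily the clusterer used by the paper's three techniques:

```python
# Hedged sketch: drop the class column, cluster the remaining features, and
# treat each cluster as a candidate hidden context.
import random

def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=20, seed=0):
    """Minimal Lloyd's k-means; returns a cluster label per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: dist2(p, centers[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels

# Rows are (feature1, feature2, class); the class column is hidden before
# clustering, so the contexts cannot simply mirror the class labels.
rows = [(0.0, 0.1, "yes"), (0.2, 0.0, "no"), (5.0, 5.1, "yes"), (5.2, 5.0, "no")]
features = [r[:2] for r in rows]
contexts = kmeans(features, k=2)
print(contexts)
```

Note that the two discovered contexts group the rows by feature geometry while each context contains both a "yes" and a "no" row, matching the idea that contexts do not predict the class label directly.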