Incremental Cluster Validity Indices for Online Learning of Hard Partitions: Extensions and Comparative Study
Validation is one of the most important aspects of clustering, particularly when the user is designing a trustworthy or explainable system. However, most clustering validation approaches require batch calculation. This is an important gap because of the value of clustering in real-time data streaming and other online learning applications. Therefore, interest has grown in providing online alternatives for validation. This paper extends the incremental cluster validity index (iCVI) family by presenting incremental versions of Calinski-Harabasz (iCH), Pakhira-Bandyopadhyay-Maulik (iPBM), WB index (iWB), Silhouette (iSIL), Negentropy Increment (iNI), Representative Cross Information Potential (irCIP), Representative Cross Entropy (irH), and Conn_Index (iConn_Index). This paper also provides a thorough comparative study of the effects of correct, under-, and over-partitioning on the behavior of these iCVIs, the Partition Separation (PS) index, and four recently introduced iCVIs: incremental Xie-Beni (iXB), incremental Davies-Bouldin (iDB), and incremental generalized Dunn's indices 43 and 53 (iGD43 and iGD53). Experiments were carried out using a framework that was designed to be as agnostic as possible to the clustering algorithms. The results on synthetic benchmark data sets showed that while evidence of most under-partitioning cases could be inferred from the behaviors of the majority of these iCVIs, over-partitioning was found to be a more challenging problem, detected by fewer of them. Interestingly, over-partitioning, rather than under-partitioning, was more prominently detected in the real-world data experiments within this study. The expansion of iCVIs provides significant novel opportunities for assessing and interpreting the results of unsupervised lifelong learning in real time, wherein samples cannot be reprocessed due to memory and/or application constraints.
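To illustrate what "incremental" means for a validity index, the following is a minimal sketch of a running Calinski-Harabasz value that never revisits past samples. It is not the paper's iCH formulation, which the abstract does not give. It assumes that hard labels arrive from some external clustering algorithm, and it uses the standard running-mean and running-sum-of-squares recursions. The class name `IncrementalCH` and its methods are hypothetical.

```python
class IncrementalCH:
    """Running Calinski-Harabasz index over a stream of labelled samples.

    Illustrative sketch only: names and structure are assumptions,
    not the formulation used in the paper.
    """

    def __init__(self, dim):
        self.n = 0                 # total number of samples seen
        self.mu = [0.0] * dim      # global mean of all samples
        self.counts = {}           # label -> number of samples n_i
        self.means = {}            # label -> cluster prototype v_i
        self.compactness = {}      # label -> sum of squared distances to v_i

    def update(self, x, label):
        # Update the global mean with the usual running-mean recursion.
        self.n += 1
        self.mu = [m + (xi - m) / self.n for m, xi in zip(self.mu, x)]

        if label not in self.counts:
            # First sample of a new cluster: zero within-cluster scatter.
            self.counts[label] = 1
            self.means[label] = list(x)
            self.compactness[label] = 0.0
            return

        n_old = self.counts[label]
        n_new = n_old + 1
        v_old = self.means[label]
        d2 = sum((xi - vi) ** 2 for xi, vi in zip(x, v_old))
        # Scatter recursion: CP_new = CP_old + (n_old / n_new) * ||x - v_old||^2
        self.compactness[label] += (n_old / n_new) * d2
        self.means[label] = [vi + (xi - vi) / n_new for vi, xi in zip(v_old, x)]
        self.counts[label] = n_new

    def value(self):
        k = len(self.counts)
        wgss = sum(self.compactness.values())   # within-group sum of squares
        if k < 2 or self.n <= k or wgss == 0.0:
            return float("nan")                 # index undefined in these cases
        bgss = sum(                             # between-group sum of squares
            n_i * sum((vi - mi) ** 2 for vi, mi in zip(self.means[c], self.mu))
            for c, n_i in self.counts.items()
        )
        return (bgss / (k - 1)) / (wgss / (self.n - k))
```

Each call to `update` costs time proportional to the data dimension, and `value` costs time proportional to the number of clusters times the dimension, so the index can be tracked after every sample without storing the stream.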
Neuroengineering of Clustering Algorithms
Cluster analysis can be broadly divided into multivariate data visualization, clustering algorithms, and cluster validation. This dissertation contributes neural network-based techniques to perform all three unsupervised learning tasks. Particularly, the first paper provides a comprehensive review on adaptive resonance theory (ART) models for engineering applications and provides context for the four subsequent papers. These papers are devoted to enhancements of ART-based clustering algorithms from (a) a practical perspective by exploiting the visual assessment of cluster tendency (VAT) sorting algorithm as a preprocessor for ART offline training, thus mitigating ordering effects; and (b) an engineering perspective by designing a family of multi-criteria ART models: dual vigilance fuzzy ART and distributed dual vigilance fuzzy ART (both of which are capable of detecting complex cluster structures), merge ART (aggregates partitions and lessens ordering effects in online learning), and cluster validity index vigilance in fuzzy ART (features a robust vigilance parameter selection and alleviates ordering effects in offline learning). The sixth paper consists of enhancements to data visualization using self-organizing maps (SOMs) by depicting information-theoretic similarity measures between neighboring neurons on the reduced-dimension, topology-preserving SOM grid. This visualization's parameters are estimated using samples selected via a single-linkage procedure, thereby generating heatmaps that portray more homogeneous within-cluster similarities and crisper between-cluster boundaries. The seventh paper presents incremental cluster validity indices (iCVIs) realized by (a) incorporating existing formulations of online computations for clusters' descriptors, or (b) modifying an existing ART-based model and incrementally updating local density counts between prototypes. Moreover, this last paper provides the first comprehensive comparison of iCVIs in the computational intelligence literature. (Abstract, page iv)
A hybrid methodology for data clustering
This thesis introduces and evaluates a new hybrid method for searching for groups in data, a process referred to as cluster analysis. The Agglomerative-Partitional Clustering (APC) methodology proposed in this work is a novel solution to the clustering problem intended for use with large, noisy data sets and capable of recovering clusters of arbitrary shape.
Large sample size, noise and nonhyperellipsoidal cluster shapes can create difficulties for many clustering algorithms. Many commonly used clustering techniques are too inefficient to handle large data sets found in many data analysis problems or are limited by the fact that they implicitly or explicitly define clusters as being hyperellipsoidal (i.e. “globular” in shape) and can therefore fail to recover other types of cluster structure. Moreover, the presence of noise can also make detection of cluster structures problematic, particularly for clustering techniques that are explicitly designed to handle nonhyperellipsoidal cluster structures.
APC is able to circumvent these difficulties by hybridising a number of diverse approaches to clustering. Large data sets are dealt with by hybridising fast pattern partitioning techniques with hierarchical and density search methods. Arbitrary cluster shapes are handled by a unique linked line segment representation of cluster shape. In short, rather than representing clusters with their centroids, the clusters are represented via a piecewise linear approximation of the cluster structure. This enables APC to represent any cluster shape that is piecewise linearly approximatable.
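The contrast between a centroid and a linked line segment representation can be sketched in a few lines. The abstract does not describe how APC builds or uses its segments, so the code below only illustrates the general idea: a cluster is stored as a polyline, and a point's distance to the cluster is its distance to the nearest segment. All function names are hypothetical, and each polyline is assumed to have at least two vertices.

```python
import math


def point_segment_distance(p, a, b):
    """Euclidean distance from point p to the segment with endpoints a and b."""
    ab = [bi - ai for ai, bi in zip(a, b)]
    ap = [pi - ai for ai, pi in zip(a, p)]
    denom = sum(c * c for c in ab)
    if denom == 0.0:
        t = 0.0  # degenerate segment: a and b coincide
    else:
        # Project p onto the line through a and b, then clamp to the segment.
        t = max(0.0, min(1.0, sum(u * v for u, v in zip(ap, ab)) / denom))
    closest = [ai + t * c for ai, c in zip(a, ab)]
    return math.dist(p, closest)


def point_polyline_distance(p, vertices):
    """Distance from p to a linked chain of segments (needs >= 2 vertices)."""
    return min(
        point_segment_distance(p, a, b)
        for a, b in zip(vertices, vertices[1:])
    )


def nearest_cluster(p, clusters):
    """Return the key of the polyline in `clusters` closest to point p."""
    return min(clusters, key=lambda name: point_polyline_distance(p, clusters[name]))
```

For an L-shaped cluster such as `[(0, 0), (4, 0), (4, 4)]`, the point `(2, 1)` lies at distance 1 from the polyline, whereas its distance to the centroid of the three vertices would be larger and would not reflect the cluster's shape.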
The purpose of this thesis, therefore, is to introduce APC and to evaluate its ability to recover cluster structure under the conditions described above. First, it is argued that there is a dearth of clustering techniques that can process large, noisy data sets containing arbitrarily shaped clusters. Next, the APC approach to clustering is described in detail, including how APC is able to handle voluminous and noisy data without being constrained to any particular cluster shape. Moreover, as APC represents a hybridisation of clustering strategies and techniques, different ways of implementing APC are also evaluated.
The remainder of this thesis is concerned with the evaluation of APC. First, APC is empirically compared to other clustering methods via Monte Carlo simulation on a number of complex data sets. A wide variety of experimental conditions examining cluster shape, dispersion, noise and dimensionality are covered. The use of APC as a data reduction method is also examined. This final experiment also highlights the utility of the linked line segment representation of cluster shape proposed in this thesis.
Finally, the concluding chapter summarises the results and limitations of this thesis and discusses some future directions this research could take.