2 research outputs found
Incremental Cluster Validity Indices for Online Learning of Hard Partitions: Extensions and Comparative Study
Validation is one of the most important aspects of clustering, particularly when the user is designing a trustworthy or explainable system. However, most clustering validation approaches require batch calculation. This is an important gap because of the value of clustering in real-time data streaming and other online learning applications. Therefore, interest has grown in providing online alternatives for validation. This paper extends the incremental cluster validity index (iCVI) family by presenting incremental versions of Calinski-Harabasz (iCH), Pakhira-Bandyopadhyay-Maulik (iPBM), WB index (iWB), Silhouette (iSIL), Negentropy Increment (iNI), Representative Cross Information Potential (irCIP), Representative Cross Entropy (irH), and Conn_Index (iConn_Index). This paper also provides a thorough comparative study of the effect of correct, under-, and over-partitioning on the behavior of these iCVIs, the Partition Separation (PS) index, and four recently introduced iCVIs: incremental Xie-Beni (iXB), incremental Davies-Bouldin (iDB), and incremental generalized Dunn's indices 43 and 53 (iGD43 and iGD53). Experiments were carried out using a framework that was designed to be as agnostic as possible to the clustering algorithms. The results on synthetic benchmark data sets showed that while evidence of most under-partitioning cases could be inferred from the behaviors of the majority of these iCVIs, over-partitioning was found to be a more challenging problem, detected by fewer of them. Interestingly, over-partitioning, rather than under-partitioning, was more prominently detected in the real-world data experiments within this study. The expansion of iCVIs provides significant novel opportunities for assessing and interpreting the results of unsupervised lifelong learning in real time, wherein samples cannot be reprocessed due to memory and/or application constraints.
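To make the idea of an incremental validity index concrete, the following minimal sketch maintains a Calinski-Harabasz value over a stream of hard-labelled samples without storing them. It is an illustration only and not necessarily the formulation used in the paper: the class name `IncrementalCH`, its methods, and the toy stream are all hypothetical, and the per-cluster compactness is updated with the standard Welford-style recurrence.

```python
class IncrementalCH:
    """Streaming Calinski-Harabasz index for hard cluster assignments (sketch)."""

    def __init__(self, dim):
        self.dim = dim
        self.n_total = 0
        self.global_mean = [0.0] * dim
        self.counts = {}   # cluster label -> number of samples seen
        self.means = {}    # cluster label -> running centroid
        self.compact = {}  # cluster label -> within-cluster sum of squares

    def update(self, x, label):
        """Fold one sample into the statistics; the sample is then discarded."""
        # Running mean over all samples seen so far.
        self.n_total += 1
        self.global_mean = [m + (xi - m) / self.n_total
                            for m, xi in zip(self.global_mean, x)]
        if label not in self.counts:
            self.counts[label] = 0
            self.means[label] = [0.0] * self.dim
            self.compact[label] = 0.0
        self.counts[label] += 1
        n = self.counts[label]
        old = self.means[label]
        new = [m + (xi - m) / n for m, xi in zip(old, x)]
        # Welford-style update of the within-cluster sum of squares.
        self.compact[label] += sum((xi - o) * (xi - nw)
                                   for xi, o, nw in zip(x, old, new))
        self.means[label] = new

    def value(self):
        """Return the current index, or None when it is undefined."""
        k = len(self.counts)
        if k < 2 or self.n_total <= k:
            return None
        wgss = sum(self.compact.values())
        if wgss == 0:
            return None
        bgss = sum(n * sum((c - g) ** 2
                           for c, g in zip(self.means[lab], self.global_mean))
                   for lab, n in self.counts.items())
        return (bgss / (k - 1)) / (wgss / (self.n_total - k))


# Hypothetical toy stream: (sample, cluster label assigned by some clusterer).
stream = [((0.0, 0.0), 0), ((10.0, 0.0), 1), ((0.0, 2.0), 0), ((10.0, 2.0), 1)]
index = IncrementalCH(dim=2)
for sample, label in stream:
    index.update(sample, label)
print(index.value())
```

Because each update touches only the global mean and the statistics of one cluster, the cost per sample is independent of how many samples have already been seen, which is the property that makes such indices usable when data cannot be reprocessed.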
Large-scale pathogenicity prediction analysis of cancer-associated kinase mutations reveals variability in sensitivity and specificity of computational methods
Background: Mutations in kinases are the most frequent genetic alterations in cancer; however, experimental evidence establishing their cancerous nature is available only for a small fraction of these mutants. Aims: Prediction analysis of kinome mutations is the primary aim of this study. A further objective is to compare the performance of various software tools in pathogenicity prediction of kinase mutations. Materials and methods: We employed a set of computational tools to predict the pathogenicity of over forty-two thousand mutations and deposited the kinase-wise data in the Mendeley database (Estimated Pathogenicity of Kinase Mutants [EPKiMu]). Results: Mutations are more likely to be drivers when they are present in the kinase domain (vs. non-kinase domain) and belong to hotspot residues (vs. non-hotspot residues). We identified that, while predictive tools have low specificity in general, PolyPhen-2 had the best accuracy. Further efforts to combine all four tools by consensus, voting, or other simple methods did not significantly improve accuracy. Discussion: The study provides a large dataset of kinase mutations along with their predicted pathogenicity that can be used as a training set for future studies. Furthermore, a comparison of the sensitivity and selectivity of commonly used computational tools is presented. Conclusion: Primary-structure-based in silico tools identified more cancerous/deleterious mutations in the kinase domains and at the hotspot residues while having higher sensitivity than specificity in detecting deleterious mutations.
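The evaluation described above rests on two standard quantities, sensitivity and specificity, plus a simple way of combining several tools. A minimal sketch of both follows. The function names, the strict-majority voting rule, and the toy calls are assumptions made for illustration; they are not the study's data, tools, or exact combination method.

```python
def sensitivity_specificity(predicted, actual):
    """Return (sensitivity, specificity) for boolean deleterious/benign calls.

    Either value is None when it is undefined (no positives or no negatives).
    """
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    tn = sum(1 for p, a in zip(predicted, actual) if not p and not a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    sensitivity = tp / (tp + fn) if (tp + fn) else None
    specificity = tn / (tn + fp) if (tn + fp) else None
    return sensitivity, specificity


def majority_vote(calls_per_tool):
    """Combine per-tool calls; a mutation is deleterious only on a strict majority.

    Assumption: ties (possible with an even number of tools) count as benign.
    """
    n_tools = len(calls_per_tool)
    return [2 * sum(votes) > n_tools for votes in zip(*calls_per_tool)]


# Hypothetical calls from four unnamed tools on five mutations (True = deleterious).
tool_calls = [
    [True, True, False, False, True],
    [True, True, True, False, True],
    [True, False, False, False, False],
    [False, True, False, True, True],
]
known_labels = [True, False, False, True, True]  # hypothetical ground truth

consensus = majority_vote(tool_calls)
print(consensus)
print(sensitivity_specificity(consensus, known_labels))
```

Reporting the two quantities separately, rather than a single accuracy figure, is what allows a statement such as "higher sensitivity than specificity" to be checked per tool and for the combined call.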