11 research outputs found

    Kernel discriminant analysis and clustering with parsimonious Gaussian process models

    This work presents a family of parsimonious Gaussian process models which make it possible to build, from a finite sample, a model-based classifier in an infinite-dimensional space. The proposed parsimonious models are obtained by constraining the eigen-decomposition of the Gaussian processes modelling each class. This allows, in particular, the use of non-linear mapping functions which project the observations into infinite-dimensional spaces. It is also demonstrated that the classifier can be built directly from the observation space through a kernel function. The proposed classification method is thus able to classify data of various types, such as categorical data, functional data or networks. Furthermore, it is possible to classify mixed data by combining different kernels. The methodology is also extended to the unsupervised classification case. Experimental results on various data sets demonstrate the effectiveness of the proposed method.
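
    No code accompanies these abstracts; as a rough, hedged illustration of the idea the method above relies on (building a classifier from the observation space through a kernel function alone), here is a minimal kernel nearest-class-mean classifier. It is a far simpler relative of the parsimonious Gaussian process models, not the paper's method; the RBF kernel and the toy interface are assumptions.

```python
import numpy as np
from scipy.spatial.distance import cdist

def rbf(A, B, gamma=0.5):
    # all access to the data goes through this kernel function
    return np.exp(-gamma * cdist(A, B, "sqeuclidean"))

def kernel_nearest_mean(X_train, y_train, X_test, gamma=0.5):
    """Assign each test point to the class whose feature-space mean is closest.

    ||phi(x) - mu_c||^2 = k(x,x) - 2*mean_i k(x, xi_c) + mean_ij k(xi_c, xj_c)
    """
    classes = np.unique(y_train)
    scores = []
    for c in classes:
        Xc = X_train[y_train == c]
        cross = rbf(X_test, Xc, gamma).mean(axis=1)   # mean_i k(x, xi_c)
        within = rbf(Xc, Xc, gamma).mean()            # mean_ij k(xi_c, xj_c)
        scores.append(-2.0 * cross + within)          # k(x,x)=1 is constant, dropped
    return classes[np.argmin(np.vstack(scores), axis=0)]
```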

    The k-means algorithm: A comprehensive survey and performance evaluation

    The k-means clustering algorithm is considered one of the most powerful and popular data mining algorithms in the research community. However, despite its popularity, the algorithm has certain limitations, including problems associated with random initialization of the centroids, which can lead to unexpected convergence. Additionally, the algorithm requires the number of clusters to be defined beforehand and is sensitive to cluster shapes and outliers. A fundamental problem of the k-means algorithm is its inability to handle various data types. This paper provides a structured and synoptic overview of research conducted on the k-means algorithm to overcome such shortcomings. Variants of the k-means algorithm, including their recent developments, are discussed, and their effectiveness is investigated through experimental analysis on a variety of datasets. The detailed experimental analysis, along with a thorough comparison of different k-means clustering algorithms, differentiates this work from existing survey papers. Furthermore, it outlines a clear and thorough understanding of the k-means algorithm along with its different research directions.
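
    As a hedged sketch of the baseline algorithm the survey revisits (not code from the paper), here is Lloyd's k-means with the random centroid initialization whose pitfalls the abstract mentions; smarter seeding such as k-means++ is among the surveyed variants.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # random initialization -- the sensitivity discussed above
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assignment step: nearest centroid by squared Euclidean distance
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        # update step: each centroid moves to the mean of its members
        new = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids
```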

    Identifying and evaluating clinical subtypes of Alzheimer's disease in care electronic health records using unsupervised machine learning

    BACKGROUND: Alzheimer's disease (AD) is a highly heterogeneous disease with diverse trajectories and outcomes observed in clinical populations. Understanding this heterogeneity can enable better treatment, prognosis and disease management. Studies to date have mainly used imaging or cognition data and have been limited in terms of data breadth and sample size. Here we examine the clinical heterogeneity of Alzheimer's disease patients using electronic health records (EHR), applying multiple clustering methods to identify and characterise disease subgroups that are clinically actionable. METHODS: We identified AD patients in primary care EHR from the Clinical Practice Research Datalink (CPRD) using a previously validated rule-based phenotyping algorithm. We extracted a range of comorbidities, symptoms and demographic features as patient features. We evaluated four different clustering methods (k-means, kernel k-means, affinity propagation and latent class analysis) to cluster Alzheimer's disease patients. We compared clusters on clinically relevant outcomes and evaluated each method using measures of cluster structure, stability, efficiency of outcome prediction and replicability in external data sets. RESULTS: We identified 7,913 AD patients, with a mean age of 82; 66.2% were female. We included 21 features in our analysis. We observed 5, 2, 5 and 6 clusters with k-means, kernel k-means, affinity propagation and latent class analysis, respectively. K-means was found to produce the most consistent results based on four evaluative measures. We discovered a consistent cluster, found in three of the four methods, composed predominantly of female patients with younger disease onset (43% between ages 42 and 73), diagnosed with depression and anxiety, and with a quicker rate of progression compared to the average across other clusters. CONCLUSION: Each clustering approach produced substantially different clusters, and k-means performed best of the four methods on the four evaluative criteria. However, the consistent appearance of one particular cluster across three of the four methods potentially suggests the presence of a distinct disease subtype that merits further exploration. Our study underlines the variability of the results obtained from different clustering approaches and the importance of systematically evaluating different approaches for identifying disease subtypes in complex EHR data.
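
    A minimal sketch of the "compare several clustering methods on the same cohort" workflow described above, on synthetic stand-in data rather than CPRD records; only two of the four methods (k-means and affinity propagation) have stock scikit-learn implementations, so the other two are omitted here.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, AffinityPropagation
from sklearn.metrics import silhouette_score, adjusted_rand_score

X, _ = make_blobs(n_samples=500, centers=5, random_state=0)  # stand-in for EHR features

km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)
ap = AffinityPropagation(random_state=0).fit(X)

# cluster structure: number of clusters and silhouette per method
for name, labels in [("k-means", km.labels_), ("affinity propagation", ap.labels_)]:
    print(name, len(set(labels)), "clusters, silhouette =",
          round(silhouette_score(X, labels), 2))

# cross-method agreement, one proxy for the consistency checks described above
print("ARI between methods:", round(adjusted_rand_score(km.labels_, ap.labels_), 2))
```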

    Spectral Ranking and Unsupervised Feature Selection for Point, Collective and Contextual Anomaly Detection

    Anomaly detection problems can be classified into three categories: point anomaly detection, collective anomaly detection and contextual anomaly detection. Many algorithms have been devised to address anomaly detection of a specific type from various application domains. Nevertheless, the exact type of anomalies to be detected in practice is generally unknown in an unsupervised setting, and most of the methods in the literature favor one kind of anomaly over the others. Applying an algorithm with an incorrect assumption is unlikely to produce reasonable results. This thesis therefore investigates the possibility of applying a uniform approach that can automatically discover different kinds of anomalies. Specifically, we are primarily interested in Spectral Ranking for Anomalies (SRA) for its potential in detecting point anomalies and collective anomalies simultaneously. We show that the spectral optimization in SRA can be viewed as a relaxation of an unsupervised SVM problem under some assumptions. SRA thereby yields a bi-class classification strength measure that can be used to rank the point anomalies, along with a normal vs. abnormal classification for identifying collective anomalies. However, in dealing with contextual anomaly problems where different contexts are defined by different feature subsets, SRA and other popular methods are still not sufficient on their own. Accordingly, we propose an unsupervised backward-elimination feature selection algorithm, BAHSIC-AD, which uses the Hilbert-Schmidt Independence Criterion (HSIC) to identify the data instances that appear as anomalies in the subset of features that have strong dependence on each other. Finally, we demonstrate the effectiveness of SRA combined with BAHSIC-AD by comparing their performance with other popular anomaly detection methods on a few benchmarks, including both synthetic and real-world datasets. Our computational results show that, in practice, SRA combined with BAHSIC-AD can be a generally applicable method for detecting different kinds of anomalies.
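
    BAHSIC-AD itself is not reproduced here; the following is a rough sketch of the general idea of HSIC-guided backward elimination, dropping at each step the feature least dependent on the rest. The biased HSIC estimator, the RBF kernel and the stopping size are all assumed choices, not the thesis's exact algorithm.

```python
import numpy as np
from scipy.spatial.distance import cdist

def rbf_gram(X, gamma=1.0):
    # Gram matrix of an RBF kernel over the rows of X
    return np.exp(-gamma * cdist(X, X, "sqeuclidean"))

def hsic(K, L):
    # biased empirical HSIC: tr(K H L H) / (n-1)^2, with H the centering matrix
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

def backward_elimination(X, n_keep=2, gamma=1.0):
    feats = list(range(X.shape[1]))
    while len(feats) > n_keep:
        # score each feature by its dependence on the remaining features
        scores = [hsic(rbf_gram(X[:, [f]], gamma),
                       rbf_gram(X[:, [g for g in feats if g != f]], gamma))
                  for f in feats]
        feats.pop(int(np.argmin(scores)))  # drop the least dependent feature
    return feats  # mutually dependent subset: a candidate context for anomalies
```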

    Unsupervised Spectral Ranking For Anomaly Detection

    Anomaly detection is the problem of finding deviations from expected normal patterns. A wide variety of applications, such as fraud detection for credit cards and insurance, medical image monitoring, network intrusion detection, and military surveillance, can be viewed as anomaly detection. For anomaly detection, obtaining accurate labels, especially labels for anomalous cases, is costly and time consuming, if not practically infeasible. This makes supervised approaches less desirable for anomaly detection. In this thesis, we propose a novel unsupervised spectral ranking method for anomaly detection (SRA). Based on the first non-principal eigenvectors of Laplacian matrices, the proposed SRA can generate an anomaly ranking either with respect to a single majority class or with respect to multiple majority classes. The ranking type depends on whether the percentage of the smaller class instances (positive or negative) is larger than the expected upper bound of the anomaly ratio. We justify the proposed spectral ranking by establishing a connection between the unsupervised support vector machine optimization and the spectral Laplacian optimization problem. Using both synthetic and real data sets, we show that our proposed SRA is a meaningful and effective alternative to state-of-the-art unsupervised anomaly ranking methods. In addition, we show that, in certain scenarios, the unsupervised SRA method surpasses the state-of-the-art unsupervised anomaly ranking methods in terms of performance and robustness of parameter tuning. Finally, we demonstrate that choosing appropriate similarity measures remains crucial in applying our proposed SRA algorithm.
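
    As a hedged sketch (not the thesis code) of ranking by a first non-principal Laplacian eigenvector as described above: the RBF similarity, the symmetric normalized Laplacian and the majority-sign scoring rule are illustrative assumptions.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.linalg import eigh

def spectral_anomaly_scores(X, gamma=1.0):
    # similarity graph; as the abstract notes, the similarity measure is crucial
    W = np.exp(-gamma * squareform(pdist(X, "sqeuclidean")))
    d_inv_sqrt = np.diag(1.0 / np.sqrt(W.sum(axis=1)))
    # symmetric normalized Laplacian L = I - D^{-1/2} W D^{-1/2}
    L = np.eye(len(X)) - d_inv_sqrt @ W @ d_inv_sqrt
    _, vecs = eigh(L)                 # eigenvalues in ascending order
    f = vecs[:, 1]                    # first non-principal eigenvector
    majority = np.sign(np.median(f)) or 1.0
    return -majority * f              # larger score = further from the majority side
```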

    Unsupervised learning methods for identifying and evaluating disease clusters in electronic health records

    Introduction: Clustering algorithms are a class of algorithms that can discover groups of observations in complex data and are often used to identify subtypes of heterogeneous diseases in electronic health records (EHR). Evaluating clustering experiments for biological and clinical significance is a vital but challenging task due to the lack of consensus on best practices. As a result, the translation of findings from clustering experiments to clinical practice is limited. Aim: The aim of this thesis was to investigate and evaluate approaches that enable the evaluation of clustering experiments using EHR. Methods: We conducted a scoping review of clustering studies in EHR to identify common evaluation approaches. We systematically investigated the performance of the identified approaches using a cohort of Alzheimer's disease (AD) patients as an exemplar, comparing four different clustering methods (k-means, kernel k-means, affinity propagation and latent class analysis). Using the same population, we developed and evaluated a method (MCHAMMER) that tested whether clusterable structures exist in EHR. To develop this method we tested several cluster validation indices and methods of generating null data to see which are best at discovering clusters. To enable the robust benchmarking of evaluation approaches, we created a tool that generates synthetic EHR data containing known cluster labels across a range of clustering scenarios. Results: Across 67 EHR clustering studies, the most popular internal evaluation approach was comparing cluster results across multiple algorithms (30% of studies). We examined this approach by conducting a clustering experiment on a population of 10,065 AD patients with 21 demographic, symptom and comorbidity features. K-means found 5 clusters, kernel k-means found 2, affinity propagation found 5 and latent class analysis found 6. K-means was found to have the best clustering solution, with the highest silhouette score (0.19), and was more predictive of outcomes. The five clusters found were: typical AD (n=2026), non-typical AD (n=1640), a cardiovascular disease cluster (n=686), a cancer cluster (n=1710) and a cluster of mental health issues, smoking and early disease onset (n=1528); the last has been found in previous research as well as in the results of other clustering methods. We created a synthetic data generation tool which allows for the generation of realistic EHR clusters that can vary in separation and number of noise variables to alter the difficulty of the clustering problem. We found that decreasing cluster separation increased clustering difficulty significantly, whereas adding noise variables increased difficulty but not significantly. To develop the tool for assessing whether clusters exist, we tested different methods of null dataset generation and cluster validation indices; the best performing null dataset method was the min-max method, and the best performing indices were the Calinski-Harabasz index, with an accuracy of 94%, the Davies-Bouldin index (97%), the silhouette score (93%) and the BWC index (90%). We further found that clusters identified using the Calinski-Harabasz index were more likely to have significantly different outcomes between clusters. Lastly, we repeated the initial clustering experiment comparing 10 different pre-processing methods. The three best performing methods were the RBF kernel (2 clusters), MCA (4 clusters) and MCA with PCA (6 clusters). The MCA approach gave the best results, with the highest silhouette score (0.23) and meaningful clusters, producing 4 clusters: heart and circulatory (n=1379), early-onset mental health (n=1761), a male cluster with memory loss (n=1823) and a female cluster with more health problems (n=2244). Conclusion: We have developed and tested a series of methods and tools to enable the evaluation of EHR clustering experiments. We developed and proposed a novel cluster evaluation metric and provided a tool for benchmarking evaluation approaches in synthetic but realistic EHR data.
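
    A minimal sketch of the kind of cluster-tendency check described above (not MCHAMMER itself): score the real data against null datasets drawn with a min-max style method, here using k-means plus the Calinski-Harabasz index. The choice of k, the number of null draws and the uniform null are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score

def minmax_null(X, rng):
    # null data: each feature sampled uniformly within its observed min-max range
    return rng.uniform(X.min(axis=0), X.max(axis=0), size=X.shape)

def cluster_tendency(X, k=5, n_null=50, seed=0):
    rng = np.random.default_rng(seed)
    def score(Y):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(Y)
        return calinski_harabasz_score(Y, labels)
    real = score(X)
    null = np.array([score(minmax_null(X, rng)) for _ in range(n_null)])
    # empirical p-value: how often structureless data scores at least as well
    return real, (np.sum(null >= real) + 1) / (n_null + 1)
```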

    Unconstrained Learning Machines

    With the use of information technology in industries, a new need has arisen for analyzing large-scale data sets and automating data analysis that was once performed by human intuition and simple analog processing machines. The new generation of computer programs now has to outperform their predecessors in detecting complex and non-trivial patterns buried in data warehouses. Improved Machine Learning (ML) techniques such as Neural Networks (NNs) and Support Vector Machines (SVMs) have shown remarkable performance on supervised learning problems over the past couple of decades (e.g. anomaly detection, classification and identification, interpolation and extrapolation, etc.). Nevertheless, many such techniques have ill-conditioned structures which lack adaptability for processing exotic data or very large amounts of data. Some techniques cannot even process data in an on-line fashion. Furthermore, as the processing power of computers increases, there is a pressing need for ML algorithms to perform supervised learning tasks in less time than previously required over even larger sets of data, which means that the time and memory complexities of these algorithms must be improved. The aim of this research is to construct an improved type of SVM-like algorithm for tasks such as nonlinear classification and interpolation that is more scalable, error-tolerant and accurate. Additionally, this family of algorithms must be able to compute solutions in a controlled amount of time, preferably small with respect to modern computational technologies. These new algorithms should also be versatile enough to have useful applications in engineering, meteorology or quality control. This dissertation introduces a family of SVM-based algorithms named Unconstrained Learning Machines (ULMs) which attempt to solve the robustness, scalability and timing issues of traditional supervised learning algorithms. ULMs are not based on geometrical analogies (e.g. SVMs) or on the replication of biological models (e.g. NNs). Their construction is strictly based on statistical considerations taken from the recently developed statistical learning theory. Like SVMs, ULMs use kernel methods extensively in order to process exotic and/or non-numerical objects stored in databases and to search for hidden patterns in data with tailored measures of similarity. ULMs are applied to a variety of problems in manufacturing engineering and in meteorology. The robust nonlinear nonparametric interpolation abilities of ULMs allow for the representation of sub-millimetric deformations on the surface of manufactured parts, the selection of conforming objects, and the diagnosis and modeling of manufacturing processes. ULMs also play a role in assimilating the system states of computational weather models, removing intrinsic noise without any knowledge of the underlying mathematical models and helping to establish more accurate forecasts.
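
    The ULM algorithms themselves are not reproduced here. As a hedged stand-in that shares the traits the abstract emphasizes (kernelized, no constrained quadratic program, closed-form and hence predictable timing), here is kernel ridge regression, which solves an unconstrained regularized least-squares problem; it illustrates the flavor of approach, not the ULM method.

```python
import numpy as np
from scipy.spatial.distance import cdist

def fit(X, y, gamma=1.0, lam=1e-2):
    # unconstrained closed-form solve of (K + lam*I) alpha = y;
    # no QP constraints, so the cost is a single dense linear solve
    K = np.exp(-gamma * cdist(X, X, "sqeuclidean"))
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def predict(X_train, alpha, X_new, gamma=1.0):
    # nonlinear nonparametric interpolation through the kernel
    return np.exp(-gamma * cdist(X_new, X_train, "sqeuclidean")) @ alpha
```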

    Graph Clustering: Algorithms, Analysis and Query Design

    A wide range of applications in engineering as well as the natural and social sciences have datasets that are unlabeled. Clustering plays a major role in exploring structure in such unlabeled datasets. Owing to the heterogeneity in the applications and the types of datasets available, there are plenty of clustering objectives and algorithms. In this thesis we focus on two such clustering problems: Graph Clustering and Crowdsourced Clustering. In the first part, we consider the problem of graph clustering and study convex-optimization-based clustering algorithms. Datasets are often messy -- ridden with noise, outliers (items that do not belong to any clusters), and missing data. Therefore, we are interested in algorithms that are robust to such discrepancies. We present and analyze convex-optimization-based clustering algorithms which aim to recover the low-rank matrix that encodes the underlying cluster structure for two clustering objectives: clustering partially observed graphs and clustering similarity matrices with outliers. Using block models as generative models, we characterize the performance of these convex clustering algorithms. In particular, we provide explicit bounds, without any large unknown constants, on the problem parameters that determine the success and failure of these convex approaches. In the second part, we consider the problem of crowdsourced clustering -- the task of clustering items using answers from non-expert crowd workers who can answer similarity comparison queries. Since the workers are not experts, they provide noisy answers. Further, due to budget constraints, we cannot make all possible comparisons between items in the dataset. Thus, it is important to design queries that can reduce the noise in the responses and design algorithms that can work with noisy and partial data. We demonstrate that random triangle queries (where three items are compared per query) provide less noisy data as well as a greater quantity of data, for a fixed query budget, as compared to random edge queries (where two items are compared per query). We extend the analysis of convex clustering algorithms to show that the exact recovery guarantees hold for triangle queries despite involving dependent edges. In addition to random querying strategies, we also present a novel active querying algorithm that is guaranteed to find all the clusters, regardless of their sizes and without the knowledge of any parameters, as long as the workers are better than random guessers. We also provide a tight upper bound on the number of queries made by the proposed active querying algorithm. Apart from providing theoretical guarantees for the clustering algorithms, we also apply our algorithms to real datasets.
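
    As a hedged sketch of the convex, low-rank-plus-sparse style of program analyzed above (a generic formulation, not the thesis's exact one), written with cvxpy; the penalty weight and the box constraints on the cluster matrix are illustrative choices.

```python
import cvxpy as cp
import numpy as np

def convex_cluster(A, mask, lam=0.2):
    # A: partially observed similarity/adjacency matrix (n x n)
    # mask: 1 where an entry was observed (e.g. answered queries), else 0
    n = A.shape[0]
    L = cp.Variable((n, n), symmetric=True)  # low-rank cluster matrix
    S = cp.Variable((n, n))                  # sparse corruptions/outliers
    objective = cp.Minimize(cp.normNuc(L) + lam * cp.norm1(S))
    constraints = [cp.multiply(mask, L + S) == cp.multiply(mask, A),
                   L >= 0, L <= 1]
    cp.Problem(objective, constraints).solve()
    return L.value  # threshold the entries to read off the clusters
```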