
    Comparison of clustering techniques on a health database

    Clustering techniques aim to find hidden patterns within data and, in particular, to divide a set of observations into groups according to a set of measures. The first methods were developed in the 1930s and 1940s, and today more than a hundred exist. This work studies three hard clustering techniques (K-means, hierarchical clustering, K-medoids) and one soft clustering technique (Gaussian Mixture Models). In addition, two samples of 200 people were randomly selected from the ELSA health study, with the objectives of illustrating these methods, discovering which are best suited to these data, and determining groups of people, stratified by sex, with common health profiles. K-means and agglomerative hierarchical clustering were the best-performing techniques, whereas Gaussian Mixture Models was the method that adapted worst to the two samples analyzed.
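    As a hedged illustration (not the thesis's actual pipeline), such a comparison can be sketched with scikit-learn; the synthetic data, the number of groups k, and the silhouette criterion below are stand-ins for the standardized ELSA health variables and whatever validation the work actually used:

```python
# Minimal sketch comparing the four techniques on stand-in data.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=200, centers=3, random_state=0)  # stand-in for one ELSA sample
X = StandardScaler().fit_transform(X)
k = 3  # placeholder number of groups

labels = {
    "K-means": KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X),
    "Agglomerative (Ward)": AgglomerativeClustering(n_clusters=k).fit_predict(X),
    "GMM": GaussianMixture(n_components=k, random_state=0).fit_predict(X),
}
# K-medoids lives in the separate scikit-learn-extra package:
# from sklearn_extra.cluster import KMedoids
# labels["K-medoids"] = KMedoids(n_clusters=k, random_state=0).fit_predict(X)

for name, lab in labels.items():
    print(f"{name:>22}: silhouette = {silhouette_score(X, lab):.3f}")
```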

    Flexible parametric bootstrap for testing homogeneity against clustering and assessing the number of clusters

    Two notoriously hard problems in cluster analysis are estimating the number of clusters and checking whether the population to be clustered is actually homogeneous rather than clustered. Given a dataset, a clustering method, and a cluster validation index, this paper proposes to set up null models that capture the structural features of the data that cannot be interpreted as indicating clustering. Artificial datasets are sampled from the null model with parameters estimated from the original dataset. This can be used for testing the null hypothesis of a homogeneous population against a clustering alternative. It can also be used to calibrate the validation index for estimating the number of clusters, by taking into account the expected distribution of the index under the null model for any given number of clusters. The approach is illustrated by three examples involving different clustering techniques (partitioning around medoids, hierarchical methods, a Gaussian mixture model), validation indexes (average silhouette width, prediction strength, and BIC), and issues such as mixed-type data and temporal and spatial autocorrelation.
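    A minimal sketch of this recipe, with specific choices assumed here for concreteness: a single multivariate-normal null model, K-means as the clustering method, and average silhouette width (ASW) as the validation index (the paper also treats PAM, hierarchical methods, Gaussian mixture models, prediction strength, and BIC):

```python
# Parametric bootstrap: compare the observed index against its
# distribution under a homogeneity null fitted to the data.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def asw(X, k):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    return silhouette_score(X, labels)

def null_calibrated_asw(X, ks=range(2, 6), n_boot=100, seed=0):
    rng = np.random.default_rng(seed)
    mu, cov = X.mean(axis=0), np.cov(X, rowvar=False)  # null model estimated from the data
    out = {}
    for k in ks:
        observed = asw(X, k)
        null = [asw(rng.multivariate_normal(mu, cov, size=len(X)), k)
                for _ in range(n_boot)]
        # Monte-Carlo p-value for homogeneity vs. clustering at this k
        p_value = (1 + sum(v >= observed for v in null)) / (1 + n_boot)
        out[k] = (observed, float(np.mean(null)), p_value)  # calibrated: observed vs. null mean
    return out
```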

    Information-theoretic bounds and phase transitions in clustering, sparse PCA, and submatrix localization

    We study the problem of detecting a structured, low-rank signal matrix corrupted with additive Gaussian noise. This includes clustering in a Gaussian mixture model, sparse PCA, and submatrix localization. Each of these problems is conjectured to exhibit a sharp information-theoretic threshold, below which the signal is too weak for any algorithm to detect. We derive upper and lower bounds on these thresholds by applying the first and second moment methods to the likelihood ratio between these "planted models" and null models where the signal matrix is zero. Our bounds differ by at most a factor of root two when the rank is large (in the clustering and submatrix localization problems, when the number of clusters or blocks is large) or the signal matrix is very sparse. Moreover, our upper bounds show that for each of these problems there is a significant regime where reliable detection is information-theoretically possible but where known algorithms such as PCA fail completely, since the spectrum of the observed matrix is uninformative. This regime is analogous to the conjectured "hard but detectable" regime for community detection in sparse graphs.
    Comment: For sparse PCA and submatrix localization, we determine the information-theoretic threshold exactly in the limit where the number of blocks is large or the signal matrix is very sparse, based on a conditional second moment method, closing the factor-of-root-two gap in the first version.
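    The following is an illustration, not code from the paper: in the rank-one spiked Wigner model (one of the simplest planted models of this type), the top eigenvalue of the observed matrix separates from the noise bulk only above a spectral threshold, which is why PCA is blind in part of the detectable regime:

```python
# Spiked Wigner illustration: Y = lam * v v^T + W, with W a GOE matrix
# whose bulk spectrum ends near 2. The top eigenvalue pops out of the
# bulk (near lam + 1/lam) only when lam > 1 (the BBP transition).
import numpy as np

rng = np.random.default_rng(0)
n = 2000
v = rng.normal(size=n)
v /= np.linalg.norm(v)                      # planted unit-norm signal direction
G = rng.normal(size=(n, n))
W = (G + G.T) / np.sqrt(2 * n)              # GOE noise, bulk edge ~ 2

for lam in (0.8, 1.5):
    Y = lam * np.outer(v, v) + W
    top = np.linalg.eigvalsh(Y)[-1]
    # below the transition the top eigenvalue hugs the bulk edge (uninformative);
    # above it, the leading eigenvector correlates with the planted v
    print(f"lam = {lam}: top eigenvalue = {top:.3f}")
```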

    k-MLE: A fast algorithm for learning statistical mixture models

    We describe k-MLE, a fast and efficient local search algorithm for learning finite statistical mixtures of exponential families such as Gaussian mixture models. Mixture models are traditionally learned using the expectation-maximization (EM) soft clustering technique, which monotonically increases the incomplete (expected complete) likelihood. Given prescribed mixture weights, the hard clustering k-MLE algorithm iteratively assigns data to the most likely weighted component and updates the component models using maximum likelihood estimators (MLEs). Using the duality between exponential families and Bregman divergences, we prove that the local convergence of the complete likelihood of k-MLE follows directly from the convergence of a dual additively weighted Bregman hard clustering. The inner loop of k-MLE can be implemented using any k-means heuristic, such as Lloyd's celebrated batched updates or Hartigan's greedy swap updates. We then show how to update the mixture weights by minimizing a cross-entropy criterion, which amounts to setting each weight to the relative proportion of points in the corresponding cluster, and we reiterate the mixture parameter and mixture weight updates until convergence. Hard EM is interpreted as a special case of k-MLE in which the component update and the weight update are performed successively in the inner loop. To initialize k-MLE, we propose k-MLE++, a careful initialization of k-MLE guaranteeing probabilistically a global bound on the best possible complete likelihood.
    Comment: 31 pages; extends a preliminary paper presented at IEEE ICASSP 2012.
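    A minimal sketch of the k-MLE loop, assuming spherical unit-variance Gaussian components and a Lloyd-style inner loop (the paper covers general exponential families, Hartigan swaps, and the k-MLE++ initialization, none of which are shown here):

```python
# Hard-assignment MLE loop for a spherical Gaussian mixture:
# assign each point to the most likely *weighted* component, refit
# component MLEs, then set weights to cluster proportions.
import numpy as np

def k_mle(X, k, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), k, replace=False)].copy()  # crude init (k-MLE++ would be better)
    w = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # log w_j - ||x - mu_j||^2 / 2, up to constants shared by all components
        scores = np.log(w) - 0.5 * ((X[:, None, :] - mu[None]) ** 2).sum(-1)
        z = scores.argmax(axis=1)                    # hard assignment step
        for j in range(k):                           # per-cluster MLE update
            if np.any(z == j):
                mu[j] = X[z == j].mean(axis=0)
        w = np.bincount(z, minlength=k) / len(X)     # cross-entropy weight update
        w = np.clip(w, 1e-12, None)                  # guard against empty clusters
        w /= w.sum()
    return mu, w, z
```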

    Surrogate modeling approximation using a mixture of experts based on EM joint estimation

    An automatic method for combining several local surrogate models is presented. The method is intended to build accurate and smooth approximations of discontinuous functions for use in structural optimization problems. It relies strongly on the expectation-maximization (EM) algorithm for Gaussian mixture models (GMM). For regression, the inputs are clustered together with their output values by means of parameter estimation of the joint distribution. A local expert (linear, quadratic, artificial neural network, or moving least squares) is then built on each cluster. Lastly, the local experts are combined using the Gaussian mixture model parameters found by the EM algorithm to obtain a global model, as sketched below. This method is tested on both mathematical test cases and an engineering optimization problem from aeronautics and is found to improve the accuracy of the approximation.
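    A hedged sketch of this combination step for the linear-expert case in one dimension; the function names and setup are illustrative, not the paper's implementation:

```python
# Mixture of experts from a joint GMM: fit the GMM on (x, y), attach one
# linear expert per cluster, and gate predictions with the posterior of
# the GMM's x-marginal. Assumes every cluster receives at least 2 points.
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture
from sklearn.linear_model import LinearRegression

def fit_moe(x, y, k=3, seed=0):
    joint = np.column_stack([x, y])                  # cluster inputs jointly with outputs
    gmm = GaussianMixture(n_components=k, random_state=seed).fit(joint)
    z = gmm.predict(joint)
    experts = [LinearRegression().fit(x[z == j].reshape(-1, 1), y[z == j])
               for j in range(k)]
    return gmm, experts

def predict_moe(gmm, experts, x, d_x=1):
    # gate_j(x) ∝ pi_j * N(x; mu_j^x, Sigma_j^xx): the x-marginal of the joint GMM
    gates = np.column_stack([
        gmm.weights_[j] * multivariate_normal.pdf(
            x, mean=gmm.means_[j][:d_x], cov=gmm.covariances_[j][:d_x, :d_x])
        for j in range(len(experts))])
    gates /= gates.sum(axis=1, keepdims=True)
    preds = np.column_stack([e.predict(x.reshape(-1, 1)) for e in experts])
    return (gates * preds).sum(axis=1)               # smooth global model
```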