729 research outputs found

    Modeling and visualizing uncertainty in gene expression clusters using Dirichlet process mixtures

    Although the use of clustering methods has rapidly become one of the standard computational approaches in the literature of microarray gene expression data, little attention has been paid to uncertainty in the results obtained. Dirichlet process mixture (DPM) models provide a nonparametric Bayesian alternative to the bootstrap approach to modeling uncertainty in gene expression clustering. Most previously published applications of Bayesian model-based clustering methods have been to short time series data. In this paper, we present a case study of the application of nonparametric Bayesian clustering methods to the clustering of high-dimensional non-time-series gene expression data using full Gaussian covariances. We use the probability that two genes belong to the same cluster in a DPM model as a measure of the similarity of these gene expression profiles. Conversely, this probability can be used to define a dissimilarity measure, which, for the purposes of visualization, can be input to one of the standard linkage algorithms used for hierarchical clustering. Biologically plausible results are obtained from the Rosetta compendium of expression profiles which extend previously published cluster analyses of these data.
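
The step from co-clustering probability to a dissimilarity that feeds a linkage algorithm can be sketched as follows. This is a minimal illustration, not the paper's implementation: the label draws in `samples` are hypothetical stand-ins for posterior samples from a DPM sampler, the dissimilarity is assumed to be one minus the co-clustering probability, and `average_linkage` is a naive pure-Python average-linkage routine chosen only to keep the example self-contained.

```python
from itertools import combinations


def coclustering_probability(samples):
    """Fraction of sampled partitions in which each pair of items shares a label."""
    n = len(samples[0])
    counts = [[0] * n for _ in range(n)]
    for labels in samples:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[c / len(samples) for c in row] for row in counts]


def dissimilarity(prob):
    """Assumed dissimilarity: one minus the co-clustering probability."""
    return [[1.0 - p for p in row] for row in prob]


def average_linkage(dist):
    """Naive agglomerative clustering; returns merges as (group_a, group_b, height)."""

    def avg(a, b):
        # Mean pairwise dissimilarity between two groups of item indices.
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    clusters = [(i,) for i in range(len(dist))]
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda pair: avg(*pair))
        merges.append((a, b, avg(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


# Hypothetical cluster-label draws for four genes (one list per posterior sample).
samples = [
    [0, 0, 1, 1],
    [0, 0, 1, 2],
    [0, 0, 1, 1],
    [0, 1, 2, 2],
]

prob = coclustering_probability(samples)
dist = dissimilarity(prob)
merges = average_linkage(dist)

for group_a, group_b, height in merges:
    print(group_a, group_b, round(height, 3))
```

Cluster labels are only compared within a single sample, so the result does not depend on how the sampler happens to number its clusters from one draw to the next.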

    Model Based Automatic and Robust Spike Sorting for Large Volumes of Multi-channel Extracellular Data

    Spike sorting is a critical step for single-unit-based analysis of neural activities extracellularly and simultaneously recorded using multi-channel electrodes. When dealing with recordings from very large numbers of neurons, existing methods, which are mostly semiautomatic in nature, become inadequate. This dissertation aims at automating the spike sorting process. A high-performance, automatic and computationally efficient spike detection and clustering system, namely the M-Sorter2, is presented. The M-Sorter2 employs the modified multiscale correlation of wavelet coefficients (MCWC) for neural spike detection. At the center of the proposed M-Sorter2 are two automatic spike clustering methods. They share a common hierarchical agglomerative modeling (HAM) model search procedure to strategically form a sequence of mixture models, and a new model selection criterion called difference of model evidence (DoME) to automatically determine the number of clusters. The two methods differ in how they perform clustering to infer model parameters: one uses robust variational Bayes (RVB) and the other uses robust Expectation-Maximization (REM) for Student’s t-mixture modeling. The M-Sorter2 is thus a significantly improved approach to sorting as an automatic procedure. M-Sorter2 was evaluated and benchmarked against popular algorithms using simulated, artificial and real data with ground truth that are openly available to researchers. Simulated datasets with known statistical distributions were first used to illustrate how the clustering algorithms, namely REMHAM and RVBHAM, provide robust clustering results under commonly experienced performance-degrading conditions, such as random initialization of parameters, high dimensionality of data, low signal-to-noise ratio (SNR), ambiguous clusters, and asymmetry in cluster sizes. For the artificial dataset from single-channel recordings, the proposed sorter outperformed Wave_Clus, Plexon’s Offline Sorter and Klusta in most of the comparison cases. For the real dataset from multi-channel electrodes, tetrodes and polytrodes, the proposed sorter outperformed all comparison algorithms in terms of false positive and false negative rates. The software package presented in this dissertation is available for open access.
    Dissertation/Thesis: Doctoral Dissertation, Electrical Engineering 201

    Latent protein trees

    Unbiased, label-free proteomics is becoming a powerful technique for measuring protein expression in almost any biological sample. The output of these measurements after preprocessing is a collection of features and their associated intensities for each sample. Subsets of features within the data are from the same peptide, subsets of peptides are from the same protein, and subsets of proteins are in the same biological pathways; therefore, there is the potential for very complex and informative correlational structure inherent in these data. Recent attempts to utilize these data often focus on the identification of single features that are associated with a particular phenotype that is relevant to the experiment. However, to date, there have been no published approaches that directly model what we know to be multiple different levels of correlation structure. Here we present a hierarchical Bayesian model which is specifically designed to model such correlation structure in unbiased, label-free proteomics. This model utilizes partial identification information from peptide sequencing and database lookup as well as the observed correlation in the data to appropriately compress features into latent proteins and to estimate their correlation structure. We demonstrate the effectiveness of the model using artificial/benchmark data and in the context of a series of proteomics measurements of blood plasma from a collection of volunteers who were infected with two different strains of viral influenza. Comment: Published at http://dx.doi.org/10.1214/13-AOAS639 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org

    Modelling background air pollution exposure in urban environments: Implications for epidemiological research

    Background pollution represents the lowest levels of ambient air pollution to which the population is chronically exposed, but few studies have focused on thoroughly characterizing this regime. This study uses statistical clustering techniques as a modelling approach to characterize this pollution regime while deriving reliable information to be used as estimates of exposure in epidemiological studies. The background levels of four key pollutants in five urban areas of Andalusia (Spain) were characterized over an 11-year period (2005–2015) using four widely known clustering methods. For each pollutant data set, the first (lowest) cluster, representative of the background regime, was studied using finite mixture models, agglomerative hierarchical clustering, hidden Markov models (HMM) and k-means. The HMM clustering method outperforms the other techniques, providing important estimates of exposure related to background pollution, such as its mean, acuteness and time incidence values in the ambient air, for all the air pollutants and sites studied.
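
The idea of reading the lowest cluster as the background regime can be illustrated with the simplest of the four methods, k-means. The sketch below is only an assumption-laden toy, not the study's procedure and not the hidden Markov model that the abstract reports as best: the concentration values, the choice of three clusters, and the deterministic initialization are all hypothetical.

```python
def kmeans_1d(values, k, iterations=50):
    """Plain one-dimensional k-means with a deterministic, evenly spread start."""
    ordered = sorted(values)
    # Initial centers taken at evenly spaced positions in the sorted data
    # (an illustrative choice, not part of the study).
    centers = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    groups = [[] for _ in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda c: abs(v - centers[c]))
            groups[nearest].append(v)
        new_centers = [
            sum(g) / len(g) if g else centers[i] for i, g in enumerate(groups)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, groups


# Hypothetical hourly concentrations of one pollutant at one site.
concentrations = [8, 9, 10, 11, 12, 30, 32, 35, 60, 65]

centers, groups = kmeans_1d(concentrations, k=3)

# The cluster with the lowest center is taken as the background regime.
background_index = min(range(len(centers)), key=lambda c: centers[c])
background_mean = centers[background_index]
background_values = groups[background_index]

print("background mean:", background_mean)
print("background observations:", sorted(background_values))
```

Unlike a hidden Markov model, k-means ignores the time ordering of the observations, so a sketch like this cannot recover time-related quantities such as how long the background regime persists.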

    Determining the number of clusters and distinguishing overlapping clusters in data analysis

    Clustering builds collections of objects (clusters) that are similar within the same group and dissimilar when they belong to different groups. This thesis addresses two major problems in data analysis: 1) automatically determining the number of clusters in a data set for which no information about its underlying structure is available; 2) the phenomenon of overlap between clusters. Most clustering algorithms suffer from the problem of determining the number of clusters, which is often left to the user. The classical approach to determining the number of clusters is based on an iterative process that minimizes an objective function called a validity index. Our goal is to: 1) develop a new validity index to measure the quality of a partition produced by a clustering algorithm; 2) propose a new fuzzy clustering algorithm that automatically determines the number of clusters. An application of our new algorithm to feature selection in a database is presented. Overlap between clusters is one of the difficult problems in statistical pattern recognition. Most clustering algorithms have difficulty distinguishing overlapping clusters. In this thesis, we developed a theory that formally characterizes the overlap between clusters in a Gaussian mixture model. From this theory, we developed a new algorithm that computes the degree of overlap between clusters in the multidimensional case. Within this framework, we studied the factors that affect the theoretical value of the degree of overlap. We showed how this theory can be used to generate valid and concrete test data for an objective evaluation of validity indices with respect to their ability to distinguish overlapping clusters. Finally, our theory can be used in a color image segmentation application based on a hierarchical clustering algorithm.

    Cluster validity in clustering methods
