11 research outputs found

    Modeling and visualizing uncertainty in gene expression clusters using Dirichlet process mixtures

    Although clustering methods have rapidly become one of the standard computational approaches in the literature on microarray gene expression data, little attention has been paid to uncertainty in the results obtained. Dirichlet process mixture (DPM) models provide a nonparametric Bayesian alternative to the bootstrap approach to modeling uncertainty in gene expression clustering. Most previously published applications of Bayesian model-based clustering methods have been to short time series data. In this paper, we present a case study of the application of nonparametric Bayesian clustering methods to the clustering of high-dimensional non-time-series gene expression data using full Gaussian covariances. We use the probability that two genes belong to the same cluster in a DPM model as a measure of the similarity of these gene expression profiles. Conversely, this probability can be used to define a dissimilarity measure, which, for the purposes of visualization, can be input to one of the standard linkage algorithms used for hierarchical clustering. Biologically plausible results are obtained from the Rosetta compendium of expression profiles that extend previously published cluster analyses of these data.
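
The similarity measure described in this abstract can be illustrated with a minimal sketch. It assumes that posterior samples of cluster labels (one label per gene per sampler sweep) are already available. The function names, the toy labels, and the choice of one minus the co-clustering probability as the dissimilarity are illustrative assumptions, not the authors' implementation.

```python
def coclustering_probability(samples):
    """Fraction of posterior samples in which each pair of genes shares a cluster.

    samples: list of label lists, one list per posterior sample (hypothetical input).
    """
    n_genes = len(samples[0])
    counts = [[0.0] * n_genes for _ in range(n_genes)]
    for labels in samples:
        for i in range(n_genes):
            for j in range(n_genes):
                if labels[i] == labels[j]:
                    counts[i][j] += 1.0
    n_samples = len(samples)
    return [[c / n_samples for c in row] for row in counts]


def dissimilarity(prob):
    """Assumed dissimilarity: one minus the co-clustering probability."""
    return [[1.0 - p for p in row] for row in prob]


# Made-up cluster labels for 4 genes over 4 sampler sweeps.
samples = [
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [0, 1, 1, 1],
    [0, 0, 1, 1],
]
similarity = coclustering_probability(samples)
distance = dissimilarity(similarity)
```

A matrix like `distance` could then be converted to condensed form and handed to a standard hierarchical linkage routine for visualization.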

    Bayesian hierarchical clustering for studying cancer gene expression data with unknown statistics

    Clustering analysis is an important tool in studying gene expression data. The Bayesian hierarchical clustering (BHC) algorithm can automatically infer the number of clusters and uses Bayesian model selection to improve clustering quality. In this paper, we present an extension of the BHC algorithm. Our Gaussian BHC (GBHC) algorithm represents data as a mixture of Gaussian distributions. It uses a normal-gamma distribution as a conjugate prior on the mean and precision of each of the Gaussian components. We tested GBHC on 11 cancer and 3 synthetic datasets. The results on the cancer datasets show that in sample clustering, GBHC on average produces a clustering partition that is more concordant with the ground truth than those obtained from other commonly used algorithms. Furthermore, GBHC frequently infers a number of clusters that is close to the ground truth. In gene clustering, GBHC also produces a clustering partition that is more biologically plausible than those of several other state-of-the-art methods. This suggests GBHC as an alternative tool for studying gene expression data. The implementation of GBHC is available at https://sites.google.com/site/gaussianbhc
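
To make the role of the conjugate prior concrete, here is a minimal sketch of the standard normal-gamma update for a single univariate Gaussian component, together with the closed-form log marginal likelihood that conjugacy provides. The hyperparameter names (`mu0`, `kappa0`, `alpha0`, `beta0`) and the univariate setting are assumptions for illustration. The abstract does not say how GBHC handles multiple dimensions or sets its hyperparameters.

```python
import math


def normal_gamma_posterior(data, mu0, kappa0, alpha0, beta0):
    """Posterior hyperparameters of a normal-gamma prior after observing data."""
    n = len(data)
    mean = sum(data) / n
    sum_sq = sum((x - mean) ** 2 for x in data)
    kappa_n = kappa0 + n
    mu_n = (kappa0 * mu0 + n * mean) / kappa_n
    alpha_n = alpha0 + n / 2.0
    beta_n = (beta0 + 0.5 * sum_sq
              + kappa0 * n * (mean - mu0) ** 2 / (2.0 * kappa_n))
    return mu_n, kappa_n, alpha_n, beta_n


def log_marginal_likelihood(data, mu0, kappa0, alpha0, beta0):
    """Log probability of the data with mean and precision integrated out."""
    n = len(data)
    _, kappa_n, alpha_n, beta_n = normal_gamma_posterior(
        data, mu0, kappa0, alpha0, beta0)
    return (math.lgamma(alpha_n) - math.lgamma(alpha0)
            + alpha0 * math.log(beta0) - alpha_n * math.log(beta_n)
            + 0.5 * (math.log(kappa0) - math.log(kappa_n))
            - 0.5 * n * math.log(2.0 * math.pi))


# Toy data and hyperparameters, chosen only for illustration.
posterior = normal_gamma_posterior([1.0, 2.0, 3.0], 0.0, 1.0, 1.0, 1.0)
evidence = log_marginal_likelihood([1.0, 2.0, 3.0], 0.0, 1.0, 1.0, 1.0)
```

A closed-form marginal likelihood of this kind is what allows Bayesian model selection to compare candidate groupings without numerical integration.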

    Discovering transcriptional modules by Bayesian data integration

    Motivation: We present a method for directly inferring transcriptional modules (TMs) by integrating gene expression and transcription factor binding (ChIP-chip) data. Our model extends a hierarchical Dirichlet process mixture model to allow data fusion on a gene-by-gene basis. This encodes the intuition that co-expression and co-regulation are not necessarily equivalent, and hence we do not expect all genes to group similarly in both datasets. In particular, it allows us to identify the subset of genes that share the same structure of transcriptional modules in both datasets. Results: We find that by working on a gene-by-gene basis, our model is able to extract clusters with greater functional coherence than existing methods. By combining gene expression and transcription factor binding (ChIP-chip) data in this way, we are better able to determine the groups of genes that are most likely to represent underlying TMs.

    Integrativer Ansatz zur Identifizierung neuer, prognostisch relevanter Metagene mittels Clusteranalyse (Integrative approach to identifying new, prognostically relevant metagenes by cluster analysis)

    In Germany, breast cancer is the leading cause of cancer deaths in women. To gain insight into the processes related to the course of the disease, human genetic data can be used to identify associations between gene expression and prognosis. Clinical studies and numerous microarray experiments constantly generate an enormous volume of data. Reducing its dimensionality from thousands of genes to a smaller number is the aim of so-called metagenes, which aggregate the expression data of groups of genes with similar expression patterns and may be used to investigate complex diseases such as breast cancer. Here, a cluster analytic approach for the identification of potentially relevant metagenes is introduced. In the first step of the approach, gene expression patterns over time from MCF7 breast cancer cell lines with the receptor tyrosine kinase ErbB2 were used to obtain promising sets of genes for metagene calculation. Because overexpression of the oncogenic variant of ErbB2 in breast cancer is associated with a worse prognosis, three independent batches of MCF7/NeuT cells were exposed to doxycycline for periods of 0, 6, 12 and 24 hours as well as 3 and 14 days in independent experiments. Gene clusters with similar expression patterns were identified with the cluster analytic approaches DIB-C (difference-based clustering algorithm) and STEM (short time-series expression miner) as well as with finite and infinite mixture models. Two non-model-based algorithms, k-means and PFP (penalized frame potential), and the model-based procedure DIRECT were applied for method comparison. Potentially relevant gene groups were selected by promoter and Gene Ontology (GO) analysis. The applied methods were verified with another short time-series data set.
In the second step of the approach, these gene clusters were used to calculate metagenes from the gene expression data of 766 breast cancer patients from three breast cancer studies, and Cox models were applied to determine the effect of the detected metagenes on prognosis. Using this strategy, new metagenes associated with the metastasis-free survival of patients were identified.
In Germany, breast cancer is the most common cancer in women. Numerous clinical studies in this field have established that altered genes do not necessarily lead to the onset of the disease, but that their expression should be analyzed more closely in order to detect the carcinoma in time and thereby enable better therapies. Microarray experiments generate an enormous volume of data, and the aim is to reduce its dimensionality from several thousand genes to a considerably smaller number. One option is offered by so-called metagenes, into which genes with similar expression can be combined and which have proven to be prognostic factors for patient survival. This work presents a new integrative approach to clustering short expression time series in order to identify prognostically relevant metagenes. The first part of the approach is based on the analysis of the human breast carcinoma cell line MCF7. The oncogenic variant of the receptor tyrosine kinase ErbB2, whose overexpression is associated with a worse prognosis, was induced in these MCF7 cell lines and observed at 0, 6, 12 and 24 hours as well as 3 and 14 days after induction. Gene groups with similar expression time courses are identified using the cluster analysis approaches DIB-C (difference-based clustering algorithm) and STEM (short time-series expression miner) as well as finite and infinite mixture models. For comparison, the non-model-based algorithms k-means and PFP (penalized frame potential) are used, along with the tool DIRECT, implemented in R, as a model-based comparison. The biologically most interesting clusters are determined by Gene Ontology (GO) and promoter analysis. A further data set of short time-series expression values is successfully used to verify the methods applied here. In the second part of the approach, metagenes are formed for these groups and their prognostic relevance is examined in the breast cancer data of 766 patients by survival analysis, thereby uncovering new biologically relevant clusters.
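
As a rough illustration of how a metagene might aggregate a gene cluster, the sketch below averages the expression values of the cluster's genes for each patient. The abstract does not state which aggregation the author used, so the per-patient mean, the function name `metagene`, and the toy expression values are all assumptions.

```python
def metagene(expression, cluster_genes):
    """Per-patient mean expression over the genes of one cluster (assumed aggregation).

    expression: dict mapping gene name to a list of per-patient values.
    cluster_genes: names of the genes assigned to the cluster.
    """
    rows = [expression[gene] for gene in cluster_genes]
    n_patients = len(rows[0])
    return [sum(row[p] for row in rows) / len(rows) for p in range(n_patients)]


# Made-up expression values for three genes measured in three patients.
expression = {
    "geneA": [1.0, 2.0, 3.0],
    "geneB": [3.0, 2.0, 1.0],
    "geneC": [2.0, 5.0, 8.0],
}
score = metagene(expression, ["geneA", "geneC"])
```

Each entry of `score` is one patient's metagene value, which could then serve as a covariate in a survival model such as the Cox models mentioned in the abstract.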