10 research outputs found

    Hierarchical Dirichlet process model for gene expression clustering

    Clustering is an important data processing tool for interpreting microarray data and for genomic network inference. In this article, we propose a clustering algorithm based on the hierarchical Dirichlet process (HDP). HDP clustering introduces a hierarchical structure into the statistical model, which captures the hierarchical features prevalent in biological data such as gene expression data. We develop a Gibbs sampling algorithm based on the Chinese restaurant metaphor for HDP clustering. We apply the proposed HDP algorithm to both regulatory network segmentation and gene expression clustering. The HDP algorithm is shown to outperform several popular clustering algorithms by revealing the underlying hierarchical structure of the data. For the yeast cell cycle data, we compare the HDP result to the standard result and show that the HDP algorithm provides more information and reduces unnecessary cluster fragmentation
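The Chinese restaurant metaphor mentioned in this abstract can be illustrated with a minimal sketch. This is not the authors' HDP Gibbs sampler: it only draws cluster labels from a single-level Chinese restaurant process prior, with no likelihood term and no hierarchy. The function name `crp_assignments` and the parameter `alpha` (the concentration parameter) are hypothetical names chosen for illustration.

```python
import random


def crp_assignments(n_customers, alpha, seed=0):
    """Seat customers one by one under a Chinese restaurant process prior.

    Illustrative only: a single-level CRP, not the hierarchical (HDP)
    sampler described in the abstract.
    """
    rng = random.Random(seed)
    tables = []       # tables[k] = number of customers seated at table k
    assignments = []  # assignments[i] = table index of customer i
    for _ in range(n_customers):
        # An existing table k is chosen with weight tables[k];
        # a new table is opened with weight alpha.
        weights = tables + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(tables):
            tables.append(1)   # open a new table (a new cluster)
        else:
            tables[k] += 1     # join an existing table
        assignments.append(k)
    return assignments, tables


if __name__ == "__main__":
    labels, sizes = crp_assignments(n_customers=50, alpha=1.0)
    print("number of clusters:", len(sizes))
    print("cluster sizes:", sizes)
```

Larger values of `alpha` tend to open more tables, which is how this family of models avoids fixing the number of clusters in advance.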

    Clustering Algorithms: Their Application to Gene Expression Data

    Gene expression data hide vital information required to understand the biological processes that take place in a particular organism in relation to its environment. Deciphering the hidden patterns in gene expression data offers a tremendous opportunity to strengthen the understanding of functional genomics. The complexity of biological networks and the number of genes involved make it harder to comprehend and interpret the resulting mass of data, which consists of millions of measurements; these data also exhibit vagueness, imprecision, and noise. The use of clustering techniques is therefore a first step toward addressing these challenges and is essential in the data mining process to reveal natural structures and identify interesting patterns in the underlying data. Clustering has proven useful for revealing the natural structure inherent in gene expression data, understanding gene functions, cellular processes, and subtypes of cells, mining useful information from noisy data, and understanding gene regulation. Another benefit of clustering gene expression data is the identification of homology, which is very important in vaccine design. This review examines the various clustering algorithms applicable to gene expression data in order to identify the appropriate clustering technique that will provide stability and a high degree of accuracy in its analysis procedure

    Population food intake clusters and cardiovascular disease incidence: a Bayesian quantifying of a prospective population-based cohort study in a low and middle-income country

    Aims: This study was designed to explore the relationship between cardiovascular disease incidence and population clusters, which were established based on daily food intake. Methods: The current study examined 5,396 Iranian adults (2,627 males and 2,769 females) aged 35 years and older, who participated in a 10-year longitudinal population-based study that began in 2001. The frequency of food group consumption over the preceding year (daily, weekly, or monthly) was assessed using a 49-item qualitative food frequency questionnaire (FFQ) administered via a face-to-face interview conducted by an expert dietitian. Participants were clustered based on their dietary intake by applying the semi-parametric Bayesian approach of the Dirichlet Process. In this approach, individuals with the same multivariate distribution based on dietary intake were assigned to the same cluster. The association between the extracted population clusters and the incidence of cardiovascular diseases was examined using Cox proportional hazard models. Results: In the 10-year follow-up, 741 participants (401 men and 340 women) were diagnosed with cardiovascular diseases. Individuals were categorized into three primary dietary clusters: healthy, unhealthy, and mixed. After adjusting for potential confounders, subjects in the unhealthy cluster exhibited a higher risk for cardiovascular diseases [Hazard Ratio (HR): 2.059; 95% CI: 1.013, 4.184] compared to those in the healthy cluster. In the unadjusted model, individuals in the mixed cluster demonstrated a higher risk for cardiovascular disease than those in the healthy cluster (HR: 1.515; 95% CI: 1.097, 2.092). However, this association was attenuated after adjusting for potential confounders (HR: 1.145; 95% CI: 0.769, 1.706). Conclusion: The results show that individuals in the unhealthy cluster have roughly twice the risk of incident cardiovascular disease.
However, these associations need to be confirmed through further prospective investigations
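As a reading aid for the hazard ratios quoted in this abstract, the sketch below recovers a point estimate and a standard error from a reported 95% interval. It assumes the intervals are Wald intervals, symmetric on the log scale, which is the usual output of Cox models but is not stated in the abstract. The function name `wald_ci_summary` is hypothetical.

```python
import math


def wald_ci_summary(lower, upper, z=1.959964):
    """Recover the point estimate and log-scale standard error from a CI.

    Assumes a Wald interval: exp(log_hr - z * se), exp(log_hr + z * se).
    Under that assumption the hazard ratio is the geometric mean of the
    two bounds.
    """
    log_lower, log_upper = math.log(lower), math.log(upper)
    log_hr = (log_lower + log_upper) / 2
    se = (log_upper - log_lower) / (2 * z)
    return math.exp(log_hr), se


if __name__ == "__main__":
    # Intervals as reported in the abstract above.
    reported = {
        "unhealthy vs healthy (adjusted)": (1.013, 4.184),
        "mixed vs healthy (unadjusted)": (1.097, 2.092),
        "mixed vs healthy (adjusted)": (0.769, 1.706),
    }
    for label, (lo, hi) in reported.items():
        hr, se = wald_ci_summary(lo, hi)
        print(f"{label}: HR ~ {hr:.3f}, SE(log HR) ~ {se:.3f}")
```

Under the Wald assumption, the geometric midpoints of the three reported intervals agree with the reported hazard ratios to three decimal places. An interval whose lower bound is below 1, as in the adjusted mixed-cluster model, is what the abstract describes as an attenuated association.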

    Computational approaches for single-cell omics and multi-omics data

    Single-cell omics and multi-omics technologies have enabled the study of cellular heterogeneity with unprecedented resolution and the discovery of new cell types. Identifying heterogeneous cell types, both existing and novel, relies on efficient computational approaches, especially cluster analysis. Additionally, gene regulatory network analysis and various integrative approaches are needed to combine data across studies and different multi-omics layers. This thesis comprehensively compared Bayesian clustering models for single-cell RNA-sequencing (scRNA-seq) data, and selected integrative approaches were used to study the cell-type-specific gene regulation of the uterus. Additionally, single-cell multi-omics data integration approaches for cell heterogeneity analysis were investigated. Article I investigated analytical approaches for cluster analysis in scRNA-seq data, particularly latent Dirichlet allocation (LDA) and hierarchical Dirichlet process (HDP) models. The comparison of LDA and HDP with existing state-of-the-art methods revealed that topic-modeling-based models can be useful in scRNA-seq cluster analysis. Evaluation of the cluster qualities for LDA and HDP with intrinsic and extrinsic cluster quality metrics indicated that the clustering performance of these methods is dataset dependent. Article II and Article III focused on cell-type-specific integrative analysis of uterine or decidual stromal (dS) and natural killer (dNK) cells, which are important for successful pregnancy. Article II integrated the existing preeclampsia RNA-seq studies of the decidua with recent scRNA-seq datasets in order to investigate cell-type-specific contributions of early-onset preeclampsia (EOP) and late-onset preeclampsia (LOP). It was discovered that the dS marker genes were enriched for LOP downregulated genes and the dNK marker genes were enriched for upregulated EOP genes. 
Article III presented a gene regulatory network analysis for the subpopulations of dS and dNK cells. This study identified novel subpopulation-specific transcription factors that promote decidualization of stromal cells and dNK-mediated maternal immunotolerance. In Article IV, different strategies and methodological frameworks for data integration in single-cell multi-omics data analysis were reviewed in detail. Data integration methods were grouped into early, late and intermediate data integration strategies. The specific stage and order of data integration can have a substantial effect on the results of the integrative analysis. The central details of the approaches were presented, and potential future directions were discussed.  Computational methods for the analysis of single-cell sequencing and multi-omics results. Single-cell sequencing technologies enable the study of cellular heterogeneity at unprecedented resolution and the discovery of new cell types. Grouping, that is, cluster analysis, plays a central role in identifying cell types. Combining gene regulatory networks and different molecular data layers is also central to the analysis. The thesis compares Bayesian clustering methods and combines data collected with different methods in a cell-type-specific gene regulation analysis of the uterus. In addition, integration methods for single-cell data are surveyed comprehensively. Publication I focuses on examining analytical methods, particularly models based on latent Dirichlet allocation (LDA) and the hierarchical Dirichlet process (HDP), in the cluster analysis of single-cell data. A comprehensive comparison of these two models with existing methods revealed that topic-modeling-based methods can be useful in the cluster analysis of single-cell data. The performance of the methods also depended on the characteristics of each dataset analyzed. 
Publications II and III focus on cell-type-specific analysis of uterine stromal cells and NK immune cells, which are important for female reproductive health. Article II combined existing results on preeclampsia with the most recent single-cell sequencing results and found cell-type-specific effects of early-onset preeclampsia (EOP) and late-onset preeclampsia (LOP). Expression of the marker genes of differentiated stroma was found to decrease in LOP, and expression of NK marker genes to increase in EOP. Publication III analyzes subpopulation-specific gene regulatory networks of stromal and NK cells and their transcription factors. The study identified new subpopulation-specific regulators that promote stromal differentiation and NK-cell-mediated immunotolerance. Publication IV examines in detail strategies and methods for integrating different single-cell data layers (multi-omics). The integration methods were grouped into early, late and intermediate strategies, and the methods of each approach were presented in more detail. Possible future directions were also discussed

    Biological network models for inferring mechanism of action, characterizing cellular phenotypes, and predicting drug response

    A primary challenge in the analysis of high-throughput biological data is the abundance of correlated variables. A small change to a gene's expression or a protein's binding availability can cause significant downstream effects. The existence of such chain reactions presents challenges in numerous areas of analysis. By leveraging knowledge of the network interactions that underlie this type of data, we can often enable better understanding of biological phenomena. This dissertation will examine network-based statistical approaches to the problems of mechanism-of-action inference, characterization of gene expression changes, and prediction of drug response. First, we develop a method for multi-target perturbation detection in multi-omics biological data. We estimate a joint Gaussian graphical model across multiple data types using penalized regression, and filter for network effects. Next, we apply a set of likelihood ratio tests to identify the most likely site of the original perturbation. We also present a conditional testing procedure to allow for detection of secondary perturbations. Second, we address the problem of characterization of cellular phenotypes via Bayesian regression in the Gene Ontology (GO). In our model, we use the structure of the GO to assign changes in gene expression to functional groups, and to model the covariance between these groups. In addition to describing changes in expression, we use these functional activity estimates to predict the expression of unobserved genes. We further determine when such predictions are likely to be inaccurate by identifying GO terms with poor agreement to gene-level estimates. In a case study, we identify GO terms relevant to changes in the growth rate of S. cerevisiae. Lastly, we consider the prediction of drug sensitivity in cancer cell lines based on pathway-level activity estimates from ASSIGN, a Bayesian factor analysis model. 
We use penalized regression to predict response to various cancer treatments based on cancer subtype, pathway activity, and 2-way interactions thereof. We also present network representations of these interaction models and examine common patterns in their structure across treatments

    Integrativer Ansatz zur Identifizierung neuer, prognostisch relevanter Metagene mittels Clusteranalyse

    In Germany, breast cancer is the leading cause of cancer deaths in women. To gain insight into the processes related to the course of the disease, human genetic data can be used to identify associations between gene expression and prognosis. In the course of clinical studies and numerous microarray experiments, an enormous volume of data is constantly generated. Reducing its dimensionality from thousands of genes to a smaller number is the aim of so-called metagenes, which aggregate the expression data of groups of genes with similar expression patterns and may be used for investigating complex diseases like breast cancer. Here, a cluster-analytic approach for identifying potentially relevant metagenes is introduced. In the first step of the approach, time-course gene expression patterns of MCF7 breast cancer cell lines carrying the receptor tyrosine kinase ErbB2 were used to obtain promising sets of genes for metagene calculation. Three independent batches of MCF7/NeuT cells were exposed to doxycycline for periods of 0, 6, 12 and 24 hours as well as for 3 and 14 days in independent experiments, because overexpression of the oncogenic variant of ErbB2 in breast cancer is associated with worse prognosis. Gene clusters with similar expression patterns were identified using the cluster-analytic approaches DIB-C (difference-based clustering algorithm) and STEM (short time-series expression miner) as well as finite and infinite mixture models. Two non-model-based algorithms, k-means and PFP (penalized frame potential), as well as the model-based procedure DIRECT were applied for method comparison. Potentially relevant gene groups were selected by promoter and Gene Ontology (GO) analysis. The applied methods were verified with another short time-series data set. 
In the second step of the approach, these gene clusters were used to calculate metagenes from the gene expression data of 766 breast cancer patients from three breast cancer studies, and Cox models were applied to determine the effect of the detected metagenes on prognosis. Using this strategy, new metagenes associated with patients' metastasis-free survival were identified. In Germany, breast cancer is the most common cancer in women. Numerous clinical studies in this field have established that altered genes do not necessarily lead to onset of the disease, but that their expression should be analyzed more closely in order to detect the carcinoma in time and thereby enable better therapies. Microarray experiments generate an enormous volume of data, and the goal is to reduce its dimensionality from several thousand genes to a much smaller number. One option is so-called metagenes, into which genes with similar expression can be aggregated and which have proven to be prognostic factors for patient survival. This work presents a new integrative approach to clustering short expression time series in order to identify prognostically relevant metagenes. The first part of the approach is based on the analysis of the human breast carcinoma cell line MCF7. The oncogenic variant of the receptor tyrosine kinase ErbB2, whose overexpression is associated with a worse prognosis, was induced in these MCF7 cell lines and observed at 0, 6, 12 and 24 hours as well as 3 and 14 days after induction. Gene groups with similar expression profiles are identified using the cluster analysis approaches DIB-C (difference-based clustering algorithm) and STEM (short time-series expression miner) as well as finite and infinite mixture models. 
For comparison, the non-model-based algorithms k-means and PFP (penalized frame potential) are used, along with DIRECT, a tool implemented in R, as a model-based comparison. The biologically most interesting clusters are determined by Gene Ontology (GO) and promoter analysis. A further data set of short time-series expression values is successfully used to verify the methods applied here. In the second part of the approach, metagenes are formed for these groups and examined for their prognostic relevance in the breast cancer data of 766 patients by survival analysis, thereby uncovering new biologically relevant clusters

    Identifying Patterns of Cancer Disease Mechanisms by Mining Alternative Representations of Genomic Alterations

    Cancer is a complex disease driven by somatic genomic alterations (SGAs) that perturb signaling pathways and consequently cellular function. Identifying combinatorial patterns of pathway perturbations would provide insights into common disease mechanisms shared among tumors, which is important for guiding treatment and predicting outcome. However, identifying perturbed pathways is challenging, because the same pathway can be perturbed by different SGAs in different tumors. We started by designing a novel semantic representation that captures the functional similarity of distinct SGAs perturbing a common pathway in different tumors. This representation was used alongside the nested hierarchical Dirichlet process topic model in order to identify combinatorial patterns in altered signaling pathways. We found that the topic model was able to capture the functional relationships between topics. It was also able to identify cancer subtypes composed of tumors from different tissues of origin that exhibit different survival rates. These results led us to investigate the performance of the methodology on pan-cancer data, as well as in conjunction with cancer driver data. The results revealed that the framework was still able to identify clinically relevant features in pan-cancer data. Moreover, the addition of driver data decreased the noise in the data and improved the separation of tumors in the clustering results. This supported including driver data in our methodology. In order to have gene representations independent of the literature, we developed a biological representation that could identify functionally related genes. Its performance when used alongside topic modeling was tested. We found that the topic association patterns separated tumors by their tissue of origin. However, analyzing some of the cancer types on an individual basis still led to significant differences in survival. 
Our studies show the potential for using alternative representations in conjunction with topic modeling to investigate complex genomic diseases. With further research and refinement of this methodology, it has the potential to capture the relationship between pathways involved in cancer. This would contribute to a better understanding of cancer disease mechanisms and treatment