15 research outputs found

    Statistical integration of diverse omics data

    This thesis is concerned with statistical methodology for jointly analyzing multiple types of omics data. These datasets provide information on several biological levels, and an integrated analysis can lead to a better understanding of the whole biological system. Due to the strong correlations within and between datasets, high dimensionality, and systematic differences between datasets, novel methods are needed. We consider latent variable modeling where strong correlations are incorporated, dimension reduction is performed, and heterogeneity between omics data is modeled. The first part of the thesis studies current data integration methods applied to population cohorts and their software implementations. In the second part, we propose a novel probabilistic data integration framework to model the relation between omics data: PO2PLS. This framework allows for statistical inference and helps reduce overfitting. The PO2PLS framework can be used to integrate multiple omics data with various study designs. LUMC / Geneeskund

    Integrating omics datasets with the OmicsPLS package

    Background: With the exponential growth in available biomedical data, there is a need for data integration methods that can extract information about relationships between the data sets. However, these data sets might have very different characteristics. For interpretable results, data-specific variation needs to be quantified. For this task, Two-way Orthogonal Partial Least Squares (O2PLS) has been proposed. To facilitate application and development of the methodology, free and open-source software is required. However, no such software is available for O2PLS. Results: We introduce OmicsPLS, an open-source implementation of the O2PLS method in R. It can handle both low- and high-dimensional datasets efficiently. Generic methods for inspecting and visualizing results are implemented. Both a standard and a faster alternative cross-validation method are available to determine the number of components. A simulation study shows good performance of OmicsPLS compared to alternatives, in terms of accuracy and CPU runtime. We demonstrate OmicsPLS by integrating genetic and glycomic data. Conclusions: We propose the OmicsPLS R package: a free and open-source implementation of O2PLS for statistical data integration. OmicsPLS is available at https://cran.r-project.org/package=OmicsPLS and can be installed in R via install.packages("OmicsPLS")

    Multi-omics integration identifies key upstream regulators of pathomechanisms in hypertrophic cardiomyopathy due to truncating MYBPC3 mutations

    BACKGROUND: Hypertrophic cardiomyopathy (HCM) is the most common genetic disease of the cardiac muscle, frequently caused by mutations in MYBPC3. However, little is known about the upstream pathways and key regulators causing the disease. Therefore, we employed a multi-omics approach to study the pathomechanisms underlying HCM comparing patient hearts harboring MYBPC3 mutations to control hearts. RESULTS: Using H3K27ac ChIP-seq and RNA-seq we obtained 9310 differentially acetylated regions and 2033 differentially expressed genes, respectively, between 13 HCM and 10 control hearts. We obtained 441 differentially expressed proteins between 11 HCM and 8 control hearts using proteomics. By integrating multi-omics datasets, we identified a set of DNA regions and genes that differentiate HCM from control hearts and 53 protein-coding genes as the major contributors. This comprehensive analysis consistently points toward altered extracellular matrix formation, muscle contraction, and metabolism. Therefore, we studied enriched transcription factor (TF) binding motifs and identified 9 motif-encoded TFs, including KLF15, ETV4, AR, CLOCK, ETS2, GATA5, MEIS1, RXRA, and ZFX. Selected candidates were examined in stem cell-derived cardiomyocytes with and without mutated MYBPC3. Furthermore, we observed an abundance of acetylation signals and transcripts derived from cardiomyocytes compared to non-myocyte populations. CONCLUSIONS: By integrating histone acetylome, transcriptome, and proteome profiles, we identified major effector genes and protein networks that drive the pathological changes in HCM with mutated MYBPC3. Our work identifies 38 highly affected protein-coding genes as potential plasma HCM biomarkers and 9 TFs as potential upstream regulators of these pathomechanisms that may serve as possible therapeutic targets

    Integration and exploration of High Dimensional data

    In modern life sciences, high-dimensional, correlated data matrices are common. Ordinary linear regression fails on such data, and PCA does not take the outcome variable(s) into account. There may also be a large amount of orthogonal variation present. To deal with this, the O2PLS model (which is based on the PLS model) is derived. O2PLS is symmetric and predictive, with the covariance matrix playing an important role. The algorithm is given, and important remarks are made. A simulation study and a real data analysis are conducted. A derivation of Probabilistic O2PLS is given, with maximum likelihood as the estimation method. All calculations were done in R; code is available on request. Applied Mathematics; Statistics; Electrical Engineering, Mathematics and Computer Scienc
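
To illustrate why the covariance matrix matters here, the sketch below computes the first pair of weight vectors for plain PLS, not O2PLS and not the thesis's R code. It relies on the standard fact that, for column-centered data, these weights are the leading left and right singular vectors of the cross-product matrix X^T Y. The function names (`center`, `cross_cov`, `first_pls_weights`), the power-iteration approach, and the toy data are all assumptions made for this example.

```python
import math

def center(rows):
    """Subtract the column means from every row."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    return [[r[j] - means[j] for j in range(p)] for r in rows]

def cross_cov(X, Y):
    """Cross-product matrix S = X^T Y (p x q) of two centered data sets."""
    p, q = len(X[0]), len(Y[0])
    return [[sum(x[j] * y[k] for x, y in zip(X, Y)) for k in range(q)]
            for j in range(p)]

def normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def first_pls_weights(X, Y, iters=100):
    """Leading singular vectors (w, c) of X^T Y via power iteration.

    Assumes X^T Y is non-zero and the all-ones start vector is not
    orthogonal to the leading right singular vector.
    """
    S = cross_cov(center(X), center(Y))
    p, q = len(S), len(S[0])
    c = normalize([1.0] * q)
    for _ in range(iters):
        w = normalize([sum(S[j][k] * c[k] for k in range(q)) for j in range(p)])
        c = normalize([sum(S[j][k] * w[j] for j in range(p)) for k in range(q)])
    # w^T S c: the (unscaled) covariance between the scores Xw and Yc
    strength = sum(w[j] * S[j][k] * c[k] for j in range(p) for k in range(q))
    return w, c, strength

# Hypothetical toy data: 4 samples, 2 variables in each data set
X = [[1.0, 0.0], [2.0, 1.0], [3.0, 0.0], [4.0, 1.0]]
Y = [[1.1, 0.3], [1.9, -0.2], [3.2, 0.1], [3.8, -0.2]]
w, c, strength = first_pls_weights(X, Y)
```

The scores Xw and Yc are the one-dimensional summaries of the two data sets with maximal covariance. O2PLS additionally separates data-specific (orthogonal) variation, which this sketch does not attempt.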

    The supremum error of the Grenander estimator (De supremumfout van de Grenander schatter)

    In applied statistics, decreasing probability densities play a major role. Think of probability models for the lifetime of products, or of risks in life insurance. To cover these risks, it is crucial to have good estimation methods for these unknown probability densities. As the size of a sample from a decreasing probability density goes to infinity, the supremum distance between this density and its 'best' estimator, the Grenander estimator, converges under a number of conditions to a standard Gumbel distribution. We investigate to what extent these results are usable for finite sample sizes. A number of interesting remarks are also made that are relevant to practical application and further research. Statistics; Applied mathematics; Electrical Engineering, Mathematics and Computer Scienc
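
For readers unfamiliar with the Grenander estimator, the following is a minimal sketch of how it can be computed. It uses the standard characterization, which the abstract itself does not spell out: the estimator is the left derivative of the least concave majorant of the empirical distribution function. The function names (`grenander`, `density_at`) and the sample values are assumptions for illustration only, and the sketch assumes strictly positive observations.

```python
from bisect import bisect_left

def _cross(o, a, b):
    """Positive if the path o -> a -> b turns counter-clockwise."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def grenander(sample):
    """Return (knots, heights) of the Grenander estimator.

    The estimate equals heights[k] on the interval (knots[k], knots[k + 1]]
    and is zero outside (0, max(sample)].
    """
    xs = sorted(sample)
    if not xs or xs[0] <= 0:
        raise ValueError("sample must be non-empty and strictly positive")
    n = len(xs)
    # Points (x, F_n(x)) of the empirical CDF; for ties keep the top point.
    points = [(0.0, 0.0)]
    for i, x in enumerate(xs, start=1):
        if points[-1][0] == x:
            points[-1] = (x, i / n)
        else:
            points.append((x, i / n))
    # Upper convex hull (least concave majorant) by a monotone-chain scan:
    # drop the middle point whenever it lies on or below the chord.
    hull = []
    for pt in points:
        while len(hull) >= 2 and _cross(hull[-2], hull[-1], pt) >= 0:
            hull.pop()
        hull.append(pt)
    knots = [x for x, _ in hull]
    # Slopes of the hull segments are the density values.
    heights = [(y1 - y0) / (x1 - x0)
               for (x0, y0), (x1, y1) in zip(hull, hull[1:])]
    return knots, heights

def density_at(t, knots, heights):
    """Evaluate the step-function estimate at a point t."""
    if t <= knots[0] or t > knots[-1]:
        return 0.0
    return heights[bisect_left(knots, t) - 1]

# Hypothetical sample of five positive observations
knots, heights = grenander([0.5, 0.6, 0.7, 3.0, 3.1])
```

By construction the returned heights are decreasing and the step function integrates to one, so the result is itself a decreasing density. The supremum distance studied in the thesis is the largest absolute difference between this step function and the true density.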

    Probabilistic partial least squares model: Identifiability, estimation and application

    With a rapid increase in volume and complexity of data sets, there is a need for methods that can extract useful information, for example the relationship between two data sets measured for the same persons. The Partial Least Squares (PLS) method can be used for this dimension reduction task. Within life sciences, results across studies are compared and combined. Therefore, parameters need to be identifiable, which is not the case for PLS. In addition, PLS is an algorithm, while epidemiological study designs are often outcome-dependent and methods to analyze such data require a probabilistic formulation. Moreover, a probabilistic model provides a statistical framework for inference. To address these issues, we develop Probabilistic PLS (PPLS). We derive maximum likelihood estimators that satisfy the identifiability conditions by using an EM algorithm with a constrained optimization in the M step. We show that the PPLS parameters are identifiable up to sign. A simulation study is conducted to study the performance of PPLS compared to existing methods. The PPLS estimates performed well in various scenarios, even in high dimensions. Most notably, the estimates seem to be robust against departures from normality. To illustrate our method, we applied it to IgG glycan data from two cohorts. Our PPLS model provided insight as well as interpretable results across the two cohorts
