7 research outputs found

    Statistical Integration of Heterogeneous Data with PO2PLS

    The availability of multi-omics data has revolutionized the life sciences by creating avenues for integrated system-level approaches. Data integration links the information across datasets to better understand the underlying biological processes. However, high dimensionality, correlations and heterogeneity pose statistical and computational challenges. We propose a general framework, probabilistic two-way partial least squares (PO2PLS), which addresses these challenges. PO2PLS models the relationship between two datasets using joint and data-specific latent variables. For maximum likelihood estimation of the parameters, we implement a fast EM algorithm and show that the estimator is asymptotically normally distributed. A global test for the relationship between two datasets is proposed, and its asymptotic distribution is derived. Notably, several existing omics integration methods are special cases of PO2PLS. Via extensive simulations, we show that PO2PLS performs better than alternatives in feature selection and prediction performance. In addition, the asymptotic distribution appears to hold when the sample size is sufficiently large. We illustrate PO2PLS with two examples from commonly used study designs: a large population cohort and a small case-control study. Besides recovering known relationships, PO2PLS also identified novel findings. The methods are implemented in our R-package PO2PLS. Supplementary materials for this article are available online. Comment: 36 pages, 4 figures, submitted to the Journal of the American Statistical Association.
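The abstract describes the model only in words. As a sketch of what "joint and data-specific latent variables" can look like, the PO2PLS model for two datasets is commonly written as below. The symbols are notation assumed here for illustration, since the abstract itself defines none: x and y are the two observed data vectors, t and u the joint latent variables, t_o and u_o the data-specific latent variables, W, C, W_o, C_o the corresponding loading matrices, B a diagonal matrix linking the joint parts, and e, f, h zero-mean Gaussian noise terms.

```latex
\begin{aligned}
x &= t\,W^{\top} + t_o\,W_o^{\top} + e, \\
y &= u\,C^{\top} + u_o\,C_o^{\top} + f, \\
u &= t\,B + h.
\end{aligned}
```

Under this formulation, the relationship between the two datasets runs entirely through B, so a global test of no relationship would correspond to testing B = 0.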

    Statistical integration of multi-omics and drug screening data from cell lines

    Data integration methods are used to obtain a unified summary of multiple datasets. For multi-modal data, we propose a computational workflow to jointly analyze datasets from cell lines. The workflow comprises a novel probabilistic data integration method, named POPLS-DA, for multi-omics data. The workflow is motivated by a study on synucleinopathies where transcriptomics, proteomics, and drug screening data are measured in affected LUHMES cell lines and controls. The aim is to highlight potentially druggable pathways and genes involved in synucleinopathies. First, POPLS-DA is used to prioritize genes and proteins that best distinguish cases and controls. For these genes, an integrated interaction network is constructed where the drug screen data is incorporated to highlight druggable genes and pathways in the network. Finally, functional enrichment analyses are performed to identify clusters of synaptic and lysosome-related genes and proteins targeted by the protective drugs. POPLS-DA is compared to other single- and multi-omics approaches. We found that HSPA5, a member of the heat shock protein 70 family, was one of the most targeted genes by the validated drugs, in particular by AT1-blockers. HSPA5 and AT1-blockers have been previously linked to alpha-synuclein pathology and Parkinson's disease, showing the relevance of our findings. Our computational workflow identified new directions for therapeutic targets for synucleinopathies. POPLS-DA provided a larger interpretable gene set than other single- and multi-omics approaches. An implementation based on R and markdown is freely available online. We present a computational workflow that combines the analysis of different types of data measured in cell line studies with non-overlapping samples. We apply the workflow to measurements of gene expression, protein abundances, and a screening of a wide range of FDA-approved drugs. 
These different types of data are obtained from LUHMES brain cells and jointly analyzed to discover new treatment options in synucleinopathies, such as Parkinson's disease. Our workflow includes a new probabilistic method, named POPLS-DA. POPLS-DA combines the analysis of the genes and proteins to pinpoint a set of relevant genes and proteins that can distinguish affected and non-affected cells. Compared to other approaches, POPLS-DA found a larger set of genes relevant to the disease. Further, we constructed a network that connects the relevant genes and proteins that interact with each other. We incorporate the drug screening data to highlight which part of the network is relevant to the disease and druggable. Through additional analysis of the functionality, we discovered that the genes and proteins that are targeted by protective drugs share relevant properties, namely they are synaptic and lysosome-related genes. Notably, we found that specific types of drugs, namely AT1-blockers such as Telmisartan, are protective and target the network of relevant genes and proteins. These drugs are approved by the FDA and readily available to further investigate their potential in treating synucleinopathies. We further found that a gene named HSPA5, a member of the heat shock protein 70 family, is highly targeted by the protective drugs. This gene has been linked to Parkinson's disease in previous scientific literature. Our computational workflow and the implementation in R and markdown are freely available online.

    The risk profile of patients with COVID-19 as predictors of lung lesions severity and mortality—Development and validation of a prediction model

    Objective: We developed and validated a prediction model based on individuals' risk profiles to predict the severity of lung involvement and death in patients hospitalized with coronavirus disease 2019 (COVID-19) infection. Methods: In this retrospective study, we studied hospitalized COVID-19 patients with data on chest CT scans performed during hospital stay (February 2020-April 2021) in a training dataset (TD) (n = 2,251) and an external validation dataset (eVD) (n = 993). We used the most relevant demographic, clinical, and laboratory variables (n = 25) as potential predictors of COVID-19-related outcomes. The primary and secondary endpoints were the severity of lung involvement quantified as mild (≤25%), moderate (26–50%), severe (>50%), and in-hospital death, respectively. We applied a random forest (RF) classifier, a machine learning technique, and multivariable logistic regression analysis to study our objectives. Results: In the TD and the eVD, respectively, the mean [standard deviation (SD)] age was 57.9 (18.0) and 52.4 (17.6) years; patients with severe lung involvement [n (%): 185 (8.2) and 116 (11.7)] were significantly older [mean (SD) age: 64.2 (16.9) and 56.2 (18.9)] than the other two groups (mild and moderate). The mortality rate was higher in patients with severe (64.9 and 38.8%) compared to moderate (5.5 and 12.4%) and mild (2.3 and 7.1%) lung involvement. The RF analysis showed age, C-reactive protein (CRP) levels, and duration of hospitalization as the three most important predictors of lung involvement severity at the time of the first CT examination. Multivariable logistic regression analysis showed a significant strong association between the severity of lung involvement (continuous variable) and death; adjusted odds ratio (OR): 9.3 (95% CI: 7.1–12.1) in the TD and 2.6 (1.8–3.5) in the eVD. Conclusion: In hospitalized patients with COVID-19, the severity of lung involvement is a strong predictor of death. 
Age, CRP levels, and duration of hospitalization are the most important predictors of severe lung involvement. A simple prediction model based on available clinical and imaging data provides a validated tool that predicts the severity of lung involvement and death probability among hospitalized patients with COVID-19.
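The reported odds ratios and confidence intervals can be read on the log-odds scale, which is where logistic regression coefficients live. The following is a minimal arithmetic sketch, assuming the intervals are Wald-type (symmetric around the coefficient on the log scale); the abstract does not state how the intervals were computed, and the function name `log_scale_summary` is hypothetical. It uses only the interval bounds quoted in the abstract.

```python
import math

def log_scale_summary(lower, upper, z=1.96):
    """Recover the log-odds coefficient and standard error implied by
    a Wald-type 95% CI for an odds ratio (assumption: the CI is
    symmetric on the log scale)."""
    log_lower, log_upper = math.log(lower), math.log(upper)
    beta = (log_lower + log_upper) / 2       # midpoint on the log scale
    se = (log_upper - log_lower) / (2 * z)   # half-width divided by z
    return beta, se

# Training dataset (TD): reported 95% CI 7.1 to 12.1
beta_td, se_td = log_scale_summary(7.1, 12.1)
or_td = math.exp(beta_td)  # implied odds ratio, to compare with the reported 9.3

# External validation dataset (eVD): reported 95% CI 1.8 to 3.5
beta_evd, se_evd = log_scale_summary(1.8, 3.5)
or_evd = math.exp(beta_evd)  # implied odds ratio, to compare with the reported 2.6

print(f"TD:  beta={beta_td:.3f}, SE={se_td:.3f}, implied OR={or_td:.2f}")
print(f"eVD: beta={beta_evd:.3f}, SE={se_evd:.3f}, implied OR={or_evd:.2f}")
```

The implied odds ratio is the geometric mean of the interval bounds. Small differences from the reported point estimates are expected because the published bounds are rounded to one decimal place.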

    Evaluation of O2PLS in Omics data integration

    Background: Rapid computational and technological developments made large amounts of omics data available in different biological levels. It is becoming clear that simultaneous data analysis methods are needed for better interpretation and understanding of the underlying systems biology. Different methods have been proposed for this task, among them Partial Least Squares (PLS) related methods. To also deal with orthogonal variation, systematic variation in the data unrelated to one another, we consider the Two-way Orthogonal PLS (O2PLS): an integrative data analysis method which is capable of modeling systematic variation, while providing more parsimonious models aiding interpretation. Results: A simulation study to assess the performance of O2PLS showed positive results in both low and higher dimensions. More noise (50 % of the data) only affected the systematic part estimates. A data analysis was conducted using data on metabolomics and transcriptomics from a large Finnish cohort (DILGOM). A previous sequential study, using the same data, showed significant correlations between the Lipo-Leukocyte (LL) module and lipoprotein metabolites. The O2PLS results were in agreement with these findings, identifying almost the same set of co-varying variables. Moreover, our integrative approach identified other associative genes and metabolites, while taking into account systematic variation in the data. Including orthogonal components enhanced overall fit, but the orthogonal variation was difficult to interpret. Conclusions: Simulations showed that the O2PLS estimates were close to the true parameters in both low and higher dimensions. In the presence of more noise (50 %), the orthogonal part estimates could not distinguish well between joint and unique variation. The joint estimates were not systematically affected. 
Simultaneous analysis with O2PLS on metabolome and transcriptome data showed that the LL module, together with VLDL and HDL metabolites, was important for the relation between the metabolome and the transcriptome. This is in agreement with an earlier study. In addition, more genes and metabolites were identified as important for the joint covariation.

    Statistical integration of heterogeneous omics data: Probabilistic two-way partial least squares (PO2PLS)

    The availability of multi-omics data has revolutionized the life sciences by creating avenues for integrated system-level approaches. Data integration links the information across datasets to better understand the underlying biological processes. However, high dimensionality, correlations and heterogeneity pose statistical and computational challenges. We propose a general framework, probabilistic two-way partial least squares (PO2PLS), that addresses these challenges. PO2PLS models the relationship between two datasets using joint and data-specific latent variables. For maximum likelihood estimation of the parameters, we propose a novel fast EM algorithm and show that the estimator is asymptotically normally distributed. A global test for the relationship between two datasets is proposed, specifically addressing the high dimensionality, and its asymptotic distribution is derived. Notably, several existing data integration methods are special cases of PO2PLS. Via extensive simulations, we show that PO2PLS performs better than alternatives in feature selection and prediction performance. In addition, the asymptotic distribution appears to hold when the sample size is sufficiently large. We illustrate PO2PLS with two examples from commonly used study designs: a large population cohort and a small case–control study. Besides recovering known relationships, PO2PLS also identified novel findings. The methods are implemented in our R-package PO2PLS.

    Statistical integration of multi-omics and drug screening data from cell lines.

    Data integration methods are used to obtain a unified summary of multiple datasets. For multi-modal data, we propose a computational workflow to jointly analyze datasets from cell lines. The workflow comprises a novel probabilistic data integration method, named POPLS-DA, for multi-omics data. The workflow is motivated by a study on synucleinopathies where transcriptomics, proteomics, and drug screening data are measured in affected LUHMES cell lines and controls. The aim is to highlight potentially druggable pathways and genes involved in synucleinopathies. First, POPLS-DA is used to prioritize genes and proteins that best distinguish cases and controls. For these genes, an integrated interaction network is constructed where the drug screen data is incorporated to highlight druggable genes and pathways in the network. Finally, functional enrichment analyses are performed to identify clusters of synaptic and lysosome-related genes and proteins targeted by the protective drugs. POPLS-DA is compared to other single- and multi-omics approaches. We found that HSPA5, a member of the heat shock protein 70 family, was one of the most targeted genes by the validated drugs, in particular by AT1-blockers. HSPA5 and AT1-blockers have been previously linked to α-synuclein pathology and Parkinson's disease, showing the relevance of our findings. Our computational workflow identified new directions for therapeutic targets for synucleinopathies. POPLS-DA provided a larger interpretable gene set than other single- and multi-omics approaches. An implementation based on R and markdown is freely available online.