    Structured penalized regression for drug sensitivity prediction

    Large-scale in vitro drug sensitivity screens are an important tool in personalized oncology to predict the effectiveness of potential cancer drugs. The prediction of the sensitivity of cancer cell lines to a panel of drugs is a multivariate regression problem with high-dimensional heterogeneous multi-omics data as input data and with potentially strong correlations between the outcome variables which represent the sensitivity to the different drugs. We propose a joint penalized regression approach with structured penalty terms which allow us to utilize the correlation structure between drugs with group-lasso-type penalties and at the same time address the heterogeneity between omics data sources by introducing data-source-specific penalty factors to penalize different data sources differently. By combining integrative penalty factors (IPF) with tree-guided group lasso, we create the IPF-tree-lasso method. We present a unified framework to transform more general IPF-type methods to the original penalized method. Because the structured penalty terms have multiple parameters, we demonstrate how the interval-search Efficient Parameter Selection via Global Optimization (EPSGO) algorithm can be used to optimize multiple penalty parameters efficiently. Simulation studies show that IPF-tree-lasso can improve the prediction performance compared to other lasso-type methods, in particular for heterogeneous data sources. Finally, we employ the new methods to analyse data from the Genomics of Drug Sensitivity in Cancer project. Comment: Zhao Z, Zucknick M (2020). Structured penalized regression for drug sensitivity prediction. Journal of the Royal Statistical Society, Series C. 19 pages, 6 figures and 2 tables
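
The abstract's idea of transforming an IPF-type method back to the original penalized method can be illustrated in the simplest setting, a single-response lasso. The sketch below is not the authors' IPF-tree-lasso implementation. It assumes an objective of the form (1/2n)·||y − Σ_m X_m β_m||² + λ Σ_m f_m ||β_m||₁, and the function names (`lasso_cd`, `ipf_lasso_via_rescaling`) are hypothetical. Dividing each data source X_m by its penalty factor f_m turns the problem into an ordinary lasso, and dividing the fitted coefficients by f_m maps them back.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator used by lasso coordinate descent."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, weights=None, n_iter=500):
    """Coordinate descent for (1/2n)||y - Xb||^2 + lam * sum_j w_j |b_j|.

    `weights` are per-feature penalty weights (all ones = plain lasso).
    """
    n, p = X.shape
    w = np.ones(p) if weights is None else np.asarray(weights, dtype=float)
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    r = np.asarray(y, dtype=float).copy()  # residual y - Xb (b starts at 0)
    for _ in range(n_iter):
        for j in range(p):
            r += X[:, j] * b[j]            # remove feature j from the fit
            rho = X[:, j] @ r / n
            b[j] = soft_threshold(rho, lam * w[j]) / col_sq[j]
            r -= X[:, j] * b[j]            # add updated feature j back
    return b

def ipf_lasso_via_rescaling(blocks, y, lam, penalty_factors):
    """Fit a penalty-factor lasso by rescaling each data source.

    blocks: list of (n x p_m) arrays, one per data source (hypothetical layout).
    penalty_factors: one positive factor f_m per data source.
    """
    scaled = [Xm / f for Xm, f in zip(blocks, penalty_factors)]
    b_tilde = lasso_cd(np.hstack(scaled), y, lam)   # ordinary lasso
    scale = np.concatenate(
        [np.full(Xm.shape[1], f) for Xm, f in zip(blocks, penalty_factors)]
    )
    return b_tilde / scale                          # back to original scale
```

Calling `ipf_lasso_via_rescaling([X_expr, X_mut], y, lam=0.1, penalty_factors=[1.0, 3.0])` penalizes the second source three times as strongly as the first; the array names here are placeholders.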

    Statistical integrative omics methods for disease subtype discovery

    Disease phenotyping using omics data has become a popular approach that can potentially lead to better personalized treatment. Identifying disease subtypes via unsupervised machine learning is the first step towards this goal. With the accumulation of massive high-throughput omics data sets, omics data integration becomes essential to improve statistical power and reproducibility. In this dissertation, the sparse K-means method will be extended in two directions. The first extension is a meta-analytic framework to identify novel disease subtypes when expression profiles from multiple cohorts are available. The lasso regularization and meta-analysis can identify a unique set of gene features for subtype characterization. By adding a pattern-matching reward function, consistency of subtype signatures across studies can be achieved. The second extension integrates multi-level omics datasets by incorporating prior biological knowledge using a sparse overlapping group lasso approach. An algorithm using the alternating direction method of multipliers (ADMM) will be applied for fast optimization. For both topics, simulations and real applications in breast cancer and leukemia will show superior clustering accuracy, feature selection and functional annotation. These methods will improve the statistical power, prediction accuracy and reproducibility of disease subtype discovery analysis. Contribution to public health: The proposed methods are able to identify disease subtypes from complex multi-level or multi-cohort omics data. Disease subtype definition is essential to deliver personalized medicine, since treating different subtypes with their most appropriate medicine will achieve the most effective treatment effect and eliminate side effects. Omics data itself can provide a better definition of disease subtypes than regular pathological approaches. With multi-level or multi-cohort omics data, we are able to gain statistical power and reproducibility, and the resulting subtype definition is much more reliable, convincing and reproducible than single-study analysis

    Assisted Network Analysis in Cancer Genomics

    Cancer is a molecular disease. In the past two decades, we have witnessed a surge of high-throughput profiling in cancer research and corresponding development of high-dimensional statistical techniques. In this dissertation, the focus is on gene expression, which has played a uniquely important role in cancer research. Compared to some other types of molecular measurements, for example DNA changes, gene expressions are “closer” to cancer outcomes. In addition, processed gene expression data have good statistical properties, in particular, continuity. In the “early” cancer gene expression data analysis, attention has been on marginal properties such as mean and variance. Genes function in a coordinated way. As such, techniques that take a system perspective have been developed to also take into account the interconnections among genes. Among such techniques, graphical models, with lucid biological interpretations and satisfactory statistical properties, have attracted special attention. Graphical model-based analysis can not only lead to a deeper understanding of genes’ properties but also serve as a basis for other analyses, for example, regression and clustering. Cancer molecular studies usually have limited sizes. In graphical model-based analysis, the number of parameters to be estimated gets squared. Combined together, they lead to a serious lack of information. The overarching goal of this dissertation is to conduct more effective graphical model analysis for cancer gene expression studies. One literature review and three methodological projects have been conducted. The overall strategy is to borrow strength from additional information so as to assist gene expression graphical model estimation. In the first chapter, the literature review is conducted. The methods developed in Chapter 2 and Chapter 4 take advantage of information on regulators of gene expressions (such as methylation, copy number variation, microRNA, and others). As they belong to the vertical data integration framework, we first provide a review of such data integration for gene expression data in Chapter 1. Additionally, graphical model-based analysis for gene expression data is reviewed. Research reported in this chapter has led to a paper published in Briefings in Bioinformatics. In Chapters 2-4, to accommodate the extreme complexity of information-borrowing for graphical models, three different approaches have been proposed. In Chapter 2, two graphical models, a gene-expression-only one and a gene-expression-regulator one, are simultaneously considered. A biologically sensible hierarchy between the sparsity structures of these two networks is developed, which is the first of its kind. This hierarchy is then used to link the estimation of the two graphical models. This work has led to a paper published in Genetic Epidemiology. In Chapter 3, additional information is mined from published literature, for example, studies deposited at PubMed. The consideration is that published studies have been based on many independent experiments and can contain valuable information on genes’ interconnections. The challenge is to recognize that such information can be partial or even wrong. A two-step approach, consisting of information-guided and information-incorporated estimations, is developed. This work has led to a paper published in Biometrics. In Chapter 4, we slightly shift attention and examine the difference in graphs, which has important implications for understanding cancer development and progression. Our strategy is to link changes in gene expression graphs with those in regulator graphs, which means additional information for estimation. Note that, to make individual chapters stand-alone, there can be minor overlap in descriptions. All methodological developments in this research fit the advanced penalization paradigm, which has been popular for cancer gene expression and other molecular data analysis. This methodological coherence is highly desirable. For the methods described in Chapters 2-4, we have developed new penalized estimations which have lucid interpretations and can directly lead to variable selection (and so sparse and interpretable graphs). We have also developed effective computational algorithms and R code, which have been made publicly available at Dr. Shuangge Ma’s GitHub software repository. For the methods described in Chapters 2 and 3, statistical properties under ultrahigh-dimensional settings and mild regularity conditions have been established, giving the proposed methods a uniquely strong grounding. Statistical properties for the method developed in Chapter 4 are relatively straightforward and hence are omitted. For all the proposed methods, we have conducted extensive simulations, comparisons with the most relevant competitors, and data analysis. The practical advantage is fully established. Overall, this research has delivered a practically sensible information-incorporating strategy for improving graphical model-based analysis for cancer gene expression data, multiple highly competitive methods, R programs that can have broad utilization, and new findings for multiple cancer types

    Statistical learning methods for multi-omics data integration in dimension reduction, supervised and unsupervised machine learning

    Over the decades, many statistical learning techniques such as supervised learning, unsupervised learning, and dimension reduction have played groundbreaking roles in important tasks in biomedical research. More recently, multi-omics data integration analysis has become increasingly popular to answer many intractable biomedical questions, to improve statistical power by exploiting large sample sizes and different types of omics data, and to replicate individual experiments for validation. This dissertation covers several analytic methods and frameworks to tackle practical problems in multi-omics data integration analysis. Supervised prediction rules have been widely applied to high-throughput omics data to predict disease diagnosis, prognosis or survival risk. The top scoring pair (TSP) algorithm is a supervised discriminant rule that applies a robust, simple rank-based algorithm to identify rank-altered gene pairs in case/control classes. TSP usually generates greatly reduced accuracy in inter-study prediction (i.e., the prediction model is established in the training study and applied to an independent test study). In the first part, we introduce a MetaTSP algorithm that combines multiple transcriptomic studies and generates a robust prediction model applicable to independent test studies. One important objective of omics data analysis is clustering unlabeled patients in order to identify meaningful disease subtypes. In the second part, we propose a group structured integrative clustering method that incorporates a sparse overlapping group lasso technique and a tight clustering via regularization to integrate inter-omics regulation flow, and to encourage outlier samples to scatter away from tight clusters. We show by two real examples and simulated data that our proposed methods improve on existing integrative clustering in clustering accuracy and biological interpretation, and are able to generate coherent tight clusters. Principal component analysis (PCA) is commonly used for projection to low-dimensional space for visualization. In the third part, we introduce two meta-analysis frameworks of PCA (Meta-PCA) for analyzing multiple high-dimensional studies in a common principal component space. Theoretically, Meta-PCA identifies the meta principal component (Meta-PC) space (1) by decomposing the sum of variances and (2) by minimizing the sum of squared cosines. Applications to various simulated data show that Meta-PCAs identify the true principal component space very well, and retain robustness to noise features and outlier samples. We also propose sparse Meta-PCAs that penalize principal components in order to selectively accommodate significant principal component projections. With several simulated and real data applications, we found Meta-PCA efficient at detecting significant transcriptomic features and recognizing visual patterns for multi-omics data sets. In the future, the success of data integration analysis will play an important role in revealing the molecular and cellular processes within multiple data sets, and will facilitate disease subtype discovery and characterization that improve hypothesis generation towards precision medicine, and potentially advance public health research
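
The single-study top scoring pair rule mentioned in this abstract can be sketched as follows. This is an illustrative version of the basic TSP idea only, not the dissertation's MetaTSP algorithm. It assumes a samples-by-genes matrix, binary labels coded 0/1, and hypothetical function names (`fit_tsp`, `predict_tsp`). For every gene pair (i, j) it compares how often gene i is expressed below gene j in each class and keeps the pair with the largest difference.

```python
import numpy as np

def fit_tsp(X, y):
    """Find the gene pair whose within-sample ordering differs most by class.

    X: (samples x genes) expression matrix; y: labels in {0, 1}.
    Returns (i, j, score, less_means_class1).
    """
    X0, X1 = X[y == 0], X[y == 1]
    # p_c[i, j] = fraction of class-c samples with gene i < gene j
    p0 = (X0[:, :, None] < X0[:, None, :]).mean(axis=0)
    p1 = (X1[:, :, None] < X1[:, None, :]).mean(axis=0)
    score = np.abs(p0 - p1)
    i, j = np.unravel_index(np.argmax(score), score.shape)
    # Direction of the rule: does "gene i < gene j" vote for class 1?
    less_means_class1 = bool(p1[i, j] > p0[i, j])
    return int(i), int(j), float(score[i, j]), less_means_class1

def predict_tsp(X, i, j, less_means_class1):
    """Classify each sample from the ordering of the selected gene pair."""
    less = X[:, i] < X[:, j]
    return np.where(less == less_means_class1, 1, 0)
```

Because the rule depends only on the relative order of two genes within a sample, it is unaffected by monotone per-sample normalization, which is the robustness property the abstract alludes to. The pairwise comparison uses memory quadratic in the number of genes, so this sketch suits small gene panels only.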

    Scalable Randomized Kernel Methods for Multiview Data Integration and Prediction

    We develop scalable randomized kernel methods for jointly associating data from multiple sources and simultaneously predicting an outcome or classifying a unit into one of two or more classes. The proposed methods model nonlinear relationships in multiview data together with predicting a clinical outcome and are capable of identifying variables or groups of variables that best contribute to the relationships among the views. We use the idea that random Fourier bases can approximate shift-invariant kernel functions to construct nonlinear mappings of each view and we use these mappings and the outcome variable to learn view-independent low-dimensional representations. Through simulation studies, we show that the proposed methods outperform several other linear and nonlinear methods for multiview data integration. When the proposed methods were applied to gene expression, metabolomics, proteomics, and lipidomics data pertaining to COVID-19, we identified several molecular signatures for COVID-19 status and severity. Results from our real data application and simulations with small sample sizes suggest that the proposed methods may be useful for small sample size problems. Availability: Our algorithms are implemented in PyTorch and interfaced in R and would be made available at: https://github.com/lasandrall/RandMVLearn. Comment: 24 pages, 5 figures, 4 tables
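
The random Fourier approximation of a shift-invariant kernel that this abstract relies on can be sketched for the Gaussian (RBF) kernel k(x, y) = exp(−γ·||x − y||²). This is a generic illustration and not code from the RandMVLearn package. The kernel choice, the parameter name `gamma`, and the function names are assumptions made for the example. Frequencies are drawn from N(0, 2γI) and phases from Uniform[0, 2π], so that inner products of the mapped features approximate the kernel.

```python
import numpy as np

def random_fourier_features(X, n_features, gamma, rng):
    """Map X (n x d) to features Z with Z @ Z.T approximating the RBF kernel."""
    d = X.shape[1]
    # Spectral density of exp(-gamma * ||x - y||^2) is N(0, 2 * gamma * I)
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, n_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

def rbf_kernel(X, gamma):
    """Exact RBF kernel matrix, used here only to check the approximation."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.maximum(d2, 0.0))
```

The appeal for large problems is that downstream models work with the n × D feature matrix rather than the n × n kernel matrix, with the approximation error shrinking roughly as 1/√D.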

    Block Forests: random forests for blocks of clinical and omics covariate data

    Background: In recent years, more and more multi-omics data have become available, that is, data featuring measurements of several types of omics data for each patient. Using multi-omics data as covariate data in outcome prediction is both promising and challenging due to the complex structure of such data. Random forest is a prediction method known for its ability to render complex dependency patterns between the outcome and the covariates. Against this background we developed five candidate random forest variants tailored to multi-omics covariate data. These variants modify the split point selection of random forest to incorporate the block structure of multi-omics data and can be applied to any outcome type for which a random forest variant exists, such as categorical, continuous and survival outcomes. Using 20 publicly available multi-omics data sets with survival outcome we compared the prediction performances of the block forest variants with alternatives. We also considered the common special case of having clinical covariates and measurements of a single omics data type available. Results: We identify one variant termed “block forest” that outperformed all other approaches in the comparison study. In particular, it performed significantly better than standard random survival forest (adjusted p-value: 0.027). The two best performing variants have in common that the block choice is randomized in the split point selection procedure. In the case of having clinical covariates and a single omics data type available, the improvements of the variants over random survival forest were larger than in the case of the multi-omics data. The degrees of improvement over random survival forest varied strongly across data sets. Moreover, mandatorily considering all clinical covariates improved the performance. This result should however be interpreted with caution, because the level of predictive information contained in clinical covariates depends on the specific application. Conclusions: The new prediction method block forest for multi-omics data can significantly improve the prediction performance of random forest and outperformed alternatives in the comparison. Block forest is particularly effective for the special case of using clinical covariates in combination with measurements of a single omics data type

    Statistical Methods for Integrative Analysis, Subgroup Identification, and Variable Selection Using Cancer Genomic Data

    In recent years, comprehensive cancer genomics platforms, such as The Cancer Genome Atlas (TCGA), have provided access to an enormous amount of high-throughput genomic datasets for each patient, including gene expression, DNA copy number alteration, DNA methylation, and somatic mutation. Currently, most existing analysis approaches focus only on gene-level analysis and suffer from limited interpretability and low reproducibility of findings. Additionally, with the increasing availability of modern compositional data, including immune cellular fraction data and high-dimensional zero-inflated microbiome data, variable selection techniques for compositional data have become of great interest because they allow inference of key immune cell types (immunology data) and key microbial species (microbiome data) associated with the development and progression of various diseases. In the first dissertation aim, we address these challenges by developing a Bayesian sparse latent factor model for pathway-guided integrative genomic data analysis. Specifically, we constructed a unified framework to simultaneously identify cancer patient subgroups (clustering) and key molecular markers (variable selection) based on the joint analysis of continuous, binary and count data. In addition, we applied Polya-Gamma mixtures of normals for binary and count data to promote exact and fully automatic posterior sampling. Moreover, pathway information was used to improve accuracy and robustness in the identification of cancer patient subgroups and key molecular features. In the second dissertation aim, we developed the R package InGRiD, a comprehensive software package for pathway-guided integrative genomic data analysis. We further implemented the statistical model developed in Aim 1 and provide it as a part of this software. The third dissertation aim exploits variable selection in compositional data analysis with application to immunology data and microbiome data. Specifically, we identified key immune cell types by applying a stepwise pairwise log-ratio procedure to the immune cellular fractions data, while selecting key species in the microbiome data by using a zero-inflated Wilcoxon rank sum test. These approaches consider key components specific to these data types, such as compositionality (i.e., sum-to-one), zero inflation, and high dimensionality, among others. The proposed methods were developed and evaluated on: 1) large scale, high dimensional, and multi-modal datasets from the TCGA database, including gene expression, DNA copy number alteration, and somatic mutation data (Aim 1); 2) cellular fraction data induced from Colorectal Adenocarcinoma TCGA Pan-Cancer study (Aim 3); 3) high dimensional zero-inflated microbiome data from studies of colorectal cancer (Aim 3)

    A primer on correlation-based dimension reduction methods for multi-omics analysis

    The continuing advances of omic technologies mean that it is now more tangible to measure the numerous features collectively reflecting the molecular properties of a sample. When multiple omic methods are used, statistical and computational approaches can exploit these large, connected profiles. Multi-omics is the integration of different omic data sources from the same biological sample. In this review, we focus on correlation-based dimension reduction approaches for single omic datasets, followed by methods for pairs of omics datasets, before detailing further techniques for three or more omic datasets. We also briefly detail network methods when three or more omic datasets are available and which complement correlation-oriented tools. To aid readers new to this area, these are all linked to relevant R packages that can implement these procedures. Finally, we discuss scenarios of experimental design and present road maps that simplify the selection of appropriate analysis methods. This review will help researchers navigate the emerging methods for multi-omics, integrate diverse omic datasets appropriately, and embrace the opportunity of population multi-omics. Comment: 30 pages, 2 figures, 6 tables
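
As one concrete instance of a correlation-based method for a pair of omic datasets, classical canonical correlation analysis (CCA) can be sketched in a few lines. The abstract does not name specific methods, so choosing CCA here is an assumption, and the function name is hypothetical. The sketch uses the QR-based formulation and assumes more samples than features in each view and full column rank, conditions that high-dimensional omics data usually violate and that regularized or sparse variants are designed to relax.

```python
import numpy as np

def canonical_correlations(X, Y):
    """Canonical correlations between two views measured on the same samples.

    X: (n x p) and Y: (n x q) with n > max(p, q) and full column rank.
    Returns min(p, q) correlations in decreasing order.
    """
    Xc = X - X.mean(axis=0)          # center each feature
    Yc = Y - Y.mean(axis=0)
    Qx, _ = np.linalg.qr(Xc)         # orthonormal basis of each column space
    Qy, _ = np.linalg.qr(Yc)
    # Singular values of Qx^T Qy are the cosines of the principal angles
    rho = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    return np.clip(rho, 0.0, 1.0)    # guard against rounding just above 1
```

A first canonical correlation near 1 indicates that some linear combination of features in one dataset is almost perfectly predicted by a linear combination in the other.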