82 research outputs found

    Survival associated pathway identification with group Lp penalized global AUC maximization

    It has been demonstrated that genes in a cell do not act independently. They interact with one another to complete certain biological processes or to implement certain molecular functions. How to incorporate biological pathways or functional groups into the model and identify survival associated gene pathways is still a challenging problem. In this paper, we propose a novel iterative gradient based method for survival analysis with group Lp penalized global AUC summary maximization. Unlike LASSO, Lp (p < 1) (with its special implementation entitled adaptive LASSO) is asymptotically unbiased and has oracle properties [1]. We first extend Lp for individual gene identification to group Lp penalty for pathway selection, and then develop a novel iterative gradient algorithm for penalized global AUC summary maximization (IGGAUCS). This method incorporates the genetic pathways into global AUC summary maximization and identifies survival associated pathways instead of individual genes. The tuning parameters are determined using 10-fold cross validation with training data only. The prediction performance is evaluated using test data. We apply the proposed method to survival outcome analysis with gene expression profile and identify multiple pathways simultaneously. Experimental results with simulation and gene expression data demonstrate that the proposed procedures can be used for identifying important biological pathways that are related to survival phenotype and for building a parsimonious model for predicting the survival times.

    Kernel based methods for accelerated failure time model with ultra-high dimensional data

    Background: Most genomic data have ultra-high dimensions with more than 10,000 genes (probes). Regularization methods with L1 and Lp penalty have been extensively studied in survival analysis with high-dimensional genomic data. However, when the sample size n ≪ m (the number of genes), directly identifying a small subset of genes from ultra-high (m > 10,000) dimensional data is time-consuming and not computationally efficient. In current microarray analysis, what people really do is select a couple of thousands (or hundreds) of genes using univariate analysis or statistical tests, and then apply the LASSO-type penalty to further reduce the number of disease associated genes. This two-step procedure may introduce bias and inaccuracy and lead us to miss biologically important genes. Results: The accelerated failure time (AFT) model is a linear regression model and a useful alternative to the Cox model for survival analysis. In this paper, we propose a nonlinear kernel based AFT model and an efficient variable selection method with adaptive kernel ridge regression. Our proposed variable selection method is based on the kernel matrix and dual problem with a much smaller n × n matrix. It is very efficient when the number of unknown variables (genes) is much larger than the number of samples. Moreover, the primal variables are explicitly updated and the sparsity in the solution is exploited. Conclusions: Our proposed methods can simultaneously identify survival associated prognostic factors and predict survival outcomes with ultra-high dimensional genomic data. We have demonstrated the performance of our methods with both simulation and real data. The proposed method performs superbly with limited computational studies.
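    The point about the dual problem can be illustrated with a minimal sketch of ordinary kernel ridge regression solved in its dual form. This is not the paper's adaptive variable selection method: the Gaussian kernel, the use of uncensored log survival times as the response, the simulated data, and all function names (`rbf_kernel`, `fit_kernel_ridge_dual`, `predict_log_time`) are assumptions made for illustration. It only shows why the linear system to solve is n × n even when the number of genes m is far larger than n.

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    """Gaussian (RBF) kernel matrix between the rows of A and the rows of B."""
    sq = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))  # clip tiny negative round-off

def fit_kernel_ridge_dual(X, log_t, lam, gamma):
    """Solve (K + lam * I) alpha = log_t; the system is n x n regardless of m."""
    n = X.shape[0]
    K = rbf_kernel(X, X, gamma)
    return np.linalg.solve(K + lam * np.eye(n), log_t)

def predict_log_time(X_train, alpha, X_new, gamma):
    """Predicted log survival time: kernel between new and training rows times alpha."""
    return rbf_kernel(X_new, X_train, gamma) @ alpha

# Simulated data (assumption): n = 30 samples, m = 2000 "genes", no censoring.
rng = np.random.default_rng(0)
n, m = 30, 2000
X = rng.standard_normal((n, m))
log_t = X[:, 0] - 0.5 * X[:, 1] + 0.1 * rng.standard_normal(n)

alpha = fit_kernel_ridge_dual(X, log_t, lam=1e-3, gamma=1.0 / m)
fitted = predict_log_time(X, alpha, X, gamma=1.0 / m)
```

    Here `alpha` has one entry per sample rather than one per gene, which is the source of the computational saving described in the abstract. Handling censored observations and recovering sparse primal (per-gene) weights would need additional steps that this sketch does not attempt.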

    Modeling and simulation applications with potential impact in drug development and patient care

    Indiana University-Purdue University Indianapolis (IUPUI). Model-based drug development has become an essential element to potentially make drug development more productive by assessing the data using mathematical and statistical approaches to construct and utilize models to increase the understanding of the drug and disease. The modeling and simulation approach not only quantifies the exposure-response relationship, and the level of variability, but also identifies the potential contributors to the variability. I hypothesized that the modeling and simulation approach can: 1) leverage our understanding of pharmacokinetic-pharmacodynamic (PK-PD) relationship from pre-clinical system to human; 2) quantitatively capture the drug impact on patients; 3) evaluate clinical trial designs; and 4) identify potential contributors to drug toxicity and efficacy. The major findings for these studies included: 1) a translational PK modeling approach that predicted clozapine and norclozapine central nervous system exposures in humans relating these exposures to receptor binding kinetics at multiple receptors; 2) a population pharmacokinetic analysis of a study of sertraline in depressed elderly patients with Alzheimer’s disease that identified site specific differences in drug exposure contributing to the overall variability in sertraline exposure; 3) the utility of a longitudinal tumor dynamic model developed by the Food and Drug Administration for predicting survival in non-small cell lung cancer patients, including an exploration of the limitations of this approach; 4) a Monte Carlo clinical trial simulation approach that was used to evaluate a pre-defined oncology trial with a sparse drug concentration sampling schedule with the aim to quantify how well individual drug exposures, random variability, and the food effects of abiraterone and nilotinib were determined under these conditions; 5) a time to event analysis that facilitated the identification of candidate genes including polymorphisms associated with vincristine-induced neuropathy from several association analyses in childhood acute lymphoblastic leukemia (ALL) patients; and 6) a LASSO penalized regression model that predicted vincristine-induced neuropathy and relapse in ALL patients and provided the basis for a risk assessment of the population. Overall, results from this dissertation provide an improved understanding of treatment effect in patients with an assessment of PK/PD combined and with a risk evaluation of drug toxicity and efficacy.

    Modeling and prediction of advanced prostate cancer

    Background: Prostate cancer (PCa) is the most commonly diagnosed cancer and second leading cause of cancer-related deaths for men in Western countries. The advanced form of the disease is life-threatening with few options for curative therapies. The development of novel therapeutic alternatives would greatly benefit from a more comprehensive and tailored mathematical and statistical methodology. In particular, statistical inference of treatment effects and the prediction of time-dependent effects in both preclinical and clinical studies remains a challenging yet interesting opportunity for applied mathematicians. Such methods are likely to improve the reproducibility and translatability of results and offer possibility for novel holistic insights into disease progression, diagnosis, and prognosis. Methods: Several novel statistical and mathematical techniques were developed over the course of this thesis work for the in vivo modeling of PCa treatment responses. A matching-based, blinded randomized allocation procedure for preclinical experiments was developed that provides assistance for the statistical design of animal intervention studies, e.g., through power analysis and accounting for the stratification of individuals. For the post-intervention testing of treatment effects, two novel mixed-effects models were developed that aim to address the characteristic challenges of preclinical longitudinal experiments, including the heterogeneous response profiles observed in animal studies. Subsequently, a Finnish clinical PCa hospital registry cohort was inspected with a strong emphasis on prostate-specific antigen (PSA), the most commonly used PCa marker. After exploring the PSA trends using penalized splines, a generalized mixed-effects prediction model was implemented with a focus on the ultra-sensitive range of the PSA assay. 
Finally, for metastatic, aggressive PCa, an ensemble Cox regression methodology was developed for overall survival prediction in the DREAM 9.5 mCRPC Challenge based on open datasets from controlled clinical trials. Results: The advantages of the improved experimental design and two proposed statistical models were demonstrated in terms of both increased statistical power and accuracy in simulated and real preclinical testing settings. Penalized regression models applied to the clinical patient datasets support the use of PSA in the ultra-sensitive range together with a model for relapse prediction. Furthermore, the novel ensemble-based Cox regression model that was developed for the overall survival prediction in advanced PCa outperformed the state-of-the-art benchmark and all other models submitted to the Challenge and provided novel predictors of disease progression and treatment responses. Conclusions: The methods and results provide preclinical researchers and clinicians with novel tools for comprehensive modeling and prediction of PCa. All methodology is available as open source R statistical software packages and/or web-based graphical user interfaces

    Bayesian Variable Selection in High Dimensional Genomic Studies Using Nonlocal Priors

    The advent of new genomic technologies has resulted in production of massive data sets. The outcomes in such experiments are often binary vectors or survival times, and the covariates are gene expressions obtained from thousands of genes under study. Analysis of these data, especially gene selection for a specific outcome, requires new statistical and computational methods. In this dissertation, I address this problem and propose one such method that is shown to be advantageous in selecting explanatory variables for prediction of binary responses and survival times. I adopt a Bayesian approach that utilizes a mixture of nonlocal prior densities and point masses on the regression coefficient vectors. The proposed method provides improved performance in identifying true models and reducing estimation and prediction error rates in a number of simulation studies for both binary and survival outcomes. I also describe a computational algorithm that can be used to implement the methodology in ultrahigh-dimensional settings (p ≫ n). In particular, for survival response datasets I show that MCMC is not feasible and instead provide a computational algorithm based on a stochastic search algorithm that is scalable and p invariant. As part of the variable selection methodology, I also propose a novel approach for setting prior hyperparameters by examining the total variation distance between the prior distributions on the regression parameters and the distribution of the maximum likelihood estimator under the null distribution. An R package, BVSNLP, is also introduced in this dissertation as a final product which contains all described methodology here. It performs high dimensional Bayesian variable selection for binary and survival outcome datasets that is expected to have a variety of applications including cancer genomic studies. 
Another problem that is addressed in this dissertation is methodology for deriving and extending Uniformly Most Powerful Bayesian tests (UMPBTs) from exponential family distributions to a larger class of testing contexts. UMPBTs are an objective class of Bayesian hypothesis tests that can be considered the Bayesian counterpart of classical uniformly most powerful tests. However, they have previously been exposed for application in one parameter exponential family models. I introduce sufficient conditions for the existence of UMPBTs and propose a unified approach for their derivation. An important application of my methodology is the extension of UMPBTs to testing whether the noncentrality parameter of a χ² distribution is zero.

    Variable selection via penalized regression and the genetic algorithm using information complexity, with applications for high-dimensional -omics data

    This dissertation is a collection of examples, algorithms, and techniques for researchers interested in selecting influential variables from statistical regression models. Chapters 1, 2, and 3 provide background information that will be used throughout the remaining chapters, on topics including but not limited to information complexity, model selection, covariance estimation, stepwise variable selection, penalized regression, and especially the genetic algorithm (GA) approach to variable subsetting. In chapter 4, we fully develop the framework for performing GA subset selection in logistic regression models. We present advantages of this approach against stepwise and elastic net regularized regression in selecting variables from a classical set of ICU data. We further compare these results to an entirely new procedure for variable selection developed explicitly for this dissertation, called the post hoc adjustment of measured effects (PHAME). In chapter 5, we reproduce many of the same results from chapter 4 for the first time in a multinomial logistic regression setting. The utility and convenience of the PHAME procedure is demonstrated on a set of cancer genomic data. Chapter 6 marks a departure from supervised learning problems as we shift our focus to unsupervised problems involving mixture distributions of count data from epidemiologic fields. We start off by reintroducing Minimum Hellinger Distance estimation alongside model selection techniques as a worthy alternative to the EM algorithm for generating mixtures of Poisson distributions. We also create for the first time a GA that derives mixtures of negative binomial distributions. The work from chapter 6 is incorporated into chapters 7 and 8, where we conclude the dissertation with a novel analysis of mixtures of count data regression models. 
We provide algorithms based on single and multi-target genetic algorithms which solve the mixture of penalized count data regression models problem, and demonstrate the usefulness of this technique on HIV count data that were used in a previous study published by Gray, Massaro, et al. (2015) as well as on time-to-event data taken from the cancer genomic data sets from earlier

    Predicting normal tissue toxicity in radiotherapy : can we improve clinical decision-making?

    Variation exists between individuals in the severity of their normal tissue response to radiotherapy. This can broadly be related to radiation dosimetric variables, adjuvant cancer treatments and factors inherent to the patient, including genetics. In this PhD research, factors were identified that influence the development of acute toxicity in breast and prostate cancer patients. In addition, integrated prediction models, including genetics, were developed to predict which prostate cancer patients are most likely to develop late urinary toxicity.

    Proceedings of the 38th International Workshop on Statistical Modelling


    Rational Design of Small-Molecule Inhibitors of Protein-Protein Interactions: Application to the Oncogenic c-Myc/Max Interaction

    Protein-protein interactions (PPIs) constitute an emerging class of targets for pharmaceutical intervention pursued by both industry and academia. Despite their fundamental role in many biological processes and diseases such as cancer, PPIs are still largely underrepresented in today's drug discovery. This dissertation describes novel computational approaches developed to facilitate the discovery/design of small-molecule inhibitors of PPIs, using the oncogenic c-Myc/Max interaction as a case study. First, we critically review current approaches and limitations to the discovery of small-molecule inhibitors of PPIs and we provide examples from the literature. Second, we examine the role of protein flexibility in molecular recognition and binding, and we review recent advances in the application of Elastic Network Models (ENMs) to modeling the global conformational changes of proteins observed upon ligand binding. The agreement between predicted soft modes of motions and structural changes experimentally observed upon ligand binding supports the view that ligand binding is facilitated, if not enabled, by the intrinsic (pre-existing) motions thermally accessible to the protein in the unliganded form. Third, we develop a new method for generating models of the bioactive conformations of molecules in the absence of protein structure, by identifying a set of conformations (from different molecules) that are most mutually similar in terms of both their shape and chemical features. We show how to solve the problem using an Integer Linear Programming formulation of the maximum-edge weight clique problem. In addition, we present the application of the method to known c-Myc/Max inhibitors. Fourth, we propose an innovative methodology for molecular mimicry design.
We show how the structure of the c-Myc/Max complex was exploited to design compounds that mimic the binding interactions that Max makes with the leucine zipper domain of c-Myc. In summary, the approaches described in this dissertation constitute important contributions to the fields of computational biology and computer-aided drug discovery, which combine biophysical insights and computational methods to expedite the discovery of novel inhibitors of PPIs.