Novel Regression Models For High-Dimensional Survival Analysis
Survival analysis aims to predict the occurrence of specific events of interest at future time points. The presence of incomplete observations due to censoring poses unique challenges in this domain and differentiates survival analysis techniques from other standard regression methods. In this thesis, we propose four models for high-dimensional survival analysis. First, we propose a regularized linear regression model with weighted least squares to handle survival prediction in the presence of censored instances. We employ the elastic net penalty to induce sparsity into the linear model and thereby handle high-dimensional data effectively. Unlike existing censored linear models, the parameter estimation of our model does not require any prior estimation of the survival times of censored instances. The second model is a unified model for regularized parametric survival regression for an arbitrary survival distribution. We employ a generalized linear model to approximate the negative log-likelihood and use the elastic net as a sparsity-inducing penalty to deal effectively with high-dimensional data. The proposed model is formulated as a penalized iteratively reweighted least squares problem and solved using a cyclical coordinate descent method. Because popular survival analysis methods such as the Cox proportional hazards model and parametric survival regression rest on strict assumptions and hypotheses that are unrealistic in many real-world applications, our third model reformulates survival analysis as a multi-task learning problem that predicts the survival time by estimating the survival status at each time interval during the study duration.
We propose an indicator matrix that enables the multi-task learning algorithm to handle censored instances, and we incorporate important characteristics of survival problems, such as the non-negative, non-increasing list structure, into our model through a max-heap projection. The resulting formulation is solved via an Alternating Direction Method of Multipliers (ADMM) based algorithm. Besides these three methods, which address the standard survival prediction problem, we also propose a transfer learning model for survival analysis. During our study, we noticed that obtaining sufficient labeled training instances for learning a robust prediction model is very time-consuming and can be extremely difficult in practice. We therefore propose a Cox-based model that uses the L2,1-norm penalty to encourage the source and target predictors to share similar sparsity patterns, thereby learning a shared representation across source and target domains that improves performance on the target task. We demonstrate the performance of the proposed models on several real-world high-dimensional biomedical benchmark datasets; our experimental results indicate that our models outperform related state-of-the-art competing methods and attain very competitive performance on various datasets.
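The indicator-matrix idea for the multi-task formulation can be sketched as follows. This construction is only illustrative: the interval edges, the convention of masking unknown post-censoring statuses with NaN, and the toy inputs are assumptions, not the thesis's exact formulation.

```python
import numpy as np

def survival_indicator(times, events, n_intervals, max_time):
    """Per-interval survival-status targets; NaN marks statuses that are
    unknown after a censoring time (an illustrative masking convention)."""
    edges = np.linspace(0.0, max_time, n_intervals + 1)[1:]  # interval endpoints
    Y = np.empty((len(times), n_intervals))
    for i, (t, e) in enumerate(zip(times, events)):
        for j, edge in enumerate(edges):
            if edge <= t:
                Y[i, j] = 1.0        # known to survive past this interval
            elif e:
                Y[i, j] = 0.0        # event already occurred
            else:
                Y[i, j] = np.nan     # censored: status unknown
    return Y

# Subject 1 fails at t=2; subject 2 is censored at t=5 (the study end)
Y = survival_indicator(np.array([2.0, 5.0]), np.array([1, 0]),
                       n_intervals=5, max_time=5.0)
```

A learning algorithm can then fit one task per column while simply skipping the masked entries, which is how censored instances enter the multi-task loss without imputing their survival times.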
Deep Learning Based Reliability Models For High Dimensional Data
The reliability estimation of products has crucial applications in various industries, particularly in current competitive markets, as it has high economic impacts. Hence, reliability analysis and failure prediction are receiving increasing attention. Reliability models based on lifetime data have been developed for different modern applications. These models are able to predict failure by incorporating the influence of covariates on time-to-failure. The covariates are factors that affect the subjects’ lifetime.
Modern technologies generate covariates that can be utilized to improve failure time prediction. However, incorporating covariates into reliability models poses several challenges. First, the covariates are generally high dimensional and topologically complex. Second, existing reliability models cannot efficiently model the effect of such complex covariates on failure time. Third, failure time information may not be available for all covariates, as collecting such information is a costly and time-consuming process.
To overcome the first challenge, we propose a statistical approach to model the complex data. The proposed model generalizes penalized logistic regression to capture the spatial properties of the data. An efficient parameter estimation method is developed to make the model practical for large sample sizes. To tackle the second challenge, a deep learning-based reliability model is proposed. The model can capture the complex effect of the data on failure time. A novel loss function based on the partial likelihood is developed to train the deep learning model. Furthermore, to overcome the third difficulty, we propose a transfer learning-based reliability model that estimates failure time from the failure times of similar covariates. The proposed model is based on a two-level autoencoder that minimizes the distribution distance between covariates. A new estimation method is developed to estimate the parameters of the proposed two-level autoencoder model.
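A partial-likelihood-based training loss of the kind described for the second model can be sketched in plain NumPy. This is the standard Cox negative log partial likelihood with risk-set sums computed by a cumulative log-sum-exp; it is shown only as an illustration of the family of losses involved, not the thesis's novel loss function.

```python
import numpy as np

def neg_log_partial_likelihood(risk_scores, times, events):
    """Cox negative log partial likelihood for model risk scores; the
    risk-set sums are accumulated over subjects sorted by descending time."""
    order = np.argsort(-times)                       # descending failure time
    scores = risk_scores[order]
    observed = events[order].astype(bool)
    log_risk_sets = np.logaddexp.accumulate(scores)  # log sum over each risk set
    return -np.sum((scores - log_risk_sets)[observed])

# Two subjects with equal risk scores, both failures: loss = log(2)
loss = neg_log_partial_likelihood(np.array([0.0, 0.0]),
                                  np.array([1.0, 2.0]),
                                  np.array([1, 1]))
```

In a deep model the `risk_scores` would be the network outputs, and this scalar would be minimized by backpropagation in an autodiff framework rather than computed with NumPy.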
Various simulation studies are conducted to evaluate the proposed models. The results show that the proposed models outperform traditional statistical and reliability models. Moreover, physical experiments on advanced high-strength steel are designed to demonstrate the proposed models. As microstructure images of the steels affect their failure time, the images are treated as covariates. The results show that the proposed models predict the failure time and hazard function of the materials more accurately than existing reliability models.
Personalized Medicine: Studies of Pharmacogenomics in Yeast and Cancer
Advances in microarray and sequencing technology have enabled the era of personalized medicine. With the increasing availability of genomic assays, clinicians have started to utilize patients' genetics and gene expression to guide clinical care. Signatures of gene expression and genetic variation have been associated with disease risk and response to clinical treatment. It is therefore not difficult to envision a future in which each patient receives clinical care optimized for his or her genetic background and genomic profile. However, many challenges stand in the way of fully realizing the potential of personalized medicine. First, the human genome is very complex, and we have yet to understand how to associate genomic data with phenotype: more than 50 million sequence variants and more than 20,000 genes have been reported. Many efforts have been devoted to genome-wide association studies (GWAS) in the last decade, associating common genetic variants with common complex traits and diseases. While many associations have been identified, most phenotypic variation remains unexplained, both at the level of the variants involved and at the level of the underlying mechanisms. Finally, the interaction between genetics and environment adds a further layer of complexity governing phenotypic variation. Much current research develops computational methods to associate genomic features with phenotypic variation, and modeling techniques such as machine learning have been useful in uncovering the intricate relationships between genome and phenotype. Despite some early successes, however, the performance of most models is disappointing: many lack robustness and their predictions do not replicate. In addition, many successful models work as black boxes, giving good predictions of phenotypic variation but unable to reveal the underlying mechanism.
In this thesis I propose two methods addressing this challenge. First, I describe an algorithm that focuses on identifying causal genomic features of phenotype. My approach assumes that genomic features predictive of a phenotype are more likely to be causal. The algorithm builds models that not only accurately predict the traits but also uncover the molecular mechanisms responsible for them. It gains its power by combining regularized linear regression, causality testing, and Bayesian statistics. I demonstrate the application of the algorithm on a yeast dataset, where genotype and gene expression are used to predict drug sensitivity and elucidate the underlying mechanisms. The accuracy and robustness of the algorithm are evaluated both statistically and through experimental validation. The second part of the thesis takes on a much more complicated system: cancer. Genomic and drug-sensitivity data for cancer cell lines have recently become available. The challenge here is not only the increased complexity of the system (e.g., the size of the genome) but also the fundamental differences between cancers and tissues. Different cancers or tissues provide different contexts influencing regulatory networks and signaling pathways. To account for this, I propose a method to associate contextual genomic features with drug sensitivity, based on information theory, Bayesian statistics, and transfer learning. The algorithm demonstrates the importance of context specificity in predictive modeling of cancer pharmacogenomics. The two complementary algorithms highlight the challenges faced in personalized medicine and potential solutions. This thesis details results and analyses demonstrating the importance of causality and context specificity in predictive modeling of drug response, which will be crucial for bringing personalized medicine into practice.
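The screening step of such a pipeline, using regularized regression to flag predictive (and hence candidate causal) features, can be sketched with scikit-learn. The synthetic data and the choice of `LassoCV` below are illustrative assumptions, not the thesis's algorithm, which additionally layers causality testing and Bayesian statistics on top of the regression.

```python
import numpy as np
from sklearn.linear_model import LassoCV

# Synthetic stand-in for the yeast setting: many genomic features, few of
# which truly drive the drug-sensitivity phenotype (illustrative data)
rng = np.random.default_rng(42)
X = rng.normal(size=(120, 200))                  # 120 samples, 200 features
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=120)

model = LassoCV(cv=5).fit(X, y)                  # sparsity-inducing regression
candidates = np.flatnonzero(model.coef_)         # retained candidate features
```

The sparse support returned here would then be the input to downstream causality tests rather than the final answer.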
Learning Theory and Approximation
The main goal of this workshop – the third one of this type at the MFO – has been to blend mathematical results from statistical learning theory and approximation theory to strengthen both disciplines and use synergistic effects to work on current research questions. Learning theory aims at modeling unknown function relations and data structures from samples in an automatic manner. Approximation theory is naturally used for, and closely connected to, the further development of learning theory, in particular for the exploration of new useful algorithms and for the theoretical understanding of existing methods. Conversely, the study of learning theory also gives rise to interesting theoretical problems for approximation theory, such as the approximation and sparse representation of functions or the construction of rich reproducing kernel Hilbert spaces on general metric spaces. This workshop concentrated on the following recent topics: pitchfork bifurcation of dynamical systems arising from mathematical foundations of cell development; regularized kernel-based learning in the Big Data situation; deep learning; convergence rates of learning and online learning algorithms; numerical refinement algorithms for learning; and statistical robustness of regularized kernel-based learning.
Biologically Interpretable, Integrative Deep Learning for Cancer Survival Analysis
Identifying the complex biological processes associated with patients' survival time at the cellular and molecular level is critical not only for developing new treatments but also for accurate survival prediction. However, highly nonlinear and high-dimension, low-sample-size (HDLSS) data cause computational challenges in survival analysis. We developed a novel family of pathway-based, sparse deep neural networks (PASNet) for cancer survival analysis. The PASNet family consists of biologically interpretable neural network models in which nodes correspond to specific genes and pathways, capturing the nonlinear and hierarchical effects of biological pathways associated with clinical outcomes. Furthermore, the integration of heterogeneous types of biological data from biospecimens holds promise for improving survival prediction and personalizing therapies in cancer. Specifically, integrating genomic data and histopathological images enhances survival prediction and personalized treatment while providing an in-depth understanding of the genetic mechanisms and phenotypic patterns of cancer. Two models are introduced for integrating multi-omics data and pathological images, respectively. Each model in the PASNet family was evaluated against current cutting-edge models on The Cancer Genome Atlas (TCGA) cancer data. In extensive experiments, the PASNet family outperformed the benchmark methods, and the improvement was statistically assessed. More importantly, the PASNet family showed the capability to interpret a multi-layered biological system, and its biological interpretations are supported by the GBM literature. The open-source software of the PASNet family, implemented in PyTorch, is publicly available at https://github.com/DataX-JieHao.
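The pathway-constrained connectivity idea can be illustrated with a masked layer in which a pathway node only receives input from its member genes. The tiny gene-to-pathway mask, the random weights, and the `tanh` activation below are assumptions for illustration, not PASNet's actual architecture or parameters.

```python
import numpy as np

# Hypothetical gene-to-pathway membership mask (1 = gene belongs to pathway);
# sizes and memberships are made up for illustration
mask = np.array([[1, 0],
                 [1, 1],
                 [0, 1]], dtype=float)   # 3 genes, 2 pathways

rng = np.random.default_rng(1)
W = rng.normal(size=mask.shape)          # learnable weights

def pathway_layer(x):
    """Sparse layer: each pathway node sees only its member genes,
    mirroring the biologically constrained connectivity described above."""
    return np.tanh(x @ (W * mask))       # masked weights zero out non-members

h = pathway_layer(np.ones(3))            # 2 pathway activations
```

Because each hidden node is tied to a named pathway, the magnitude of its activations and weights can be read back as a statement about that pathway, which is what makes this style of network interpretable.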
Variable Selection for Generalized Linear Mixed Models by L1-Penalized Estimation
Generalized linear mixed models are a widely used tool for modeling longitudinal data. However, their use is typically restricted to few covariates, because the presence of many predictors yields unstable estimates. The presented approach to fitting generalized linear mixed models includes an L1-penalty term that enforces variable selection and shrinkage simultaneously. A gradient ascent algorithm is proposed that maximizes the penalized log-likelihood, yielding models with reduced complexity. In contrast to common procedures, it can be used in high-dimensional settings where a large number of potentially influential explanatory variables is available. The method is investigated in simulation studies and illustrated using real data sets.
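The flavor of L1-penalized likelihood maximization can be sketched with proximal gradient steps on a plain fixed-effects logistic model. This is a simplified analogue of the idea (soft-thresholding after each gradient step), not the paper's gradient ascent algorithm for mixed models; the data, step size, and penalty weight are illustrative.

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of the L1 penalty."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def l1_logistic(X, y, lam=0.05, lr=0.1, iters=500):
    """Proximal gradient ascent on the L1-penalized logistic log-likelihood,
    a fixed-effects stand-in for penalized GLMM fitting."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        grad = X.T @ (y - p) / len(y)          # gradient of mean log-likelihood
        beta = soft_threshold(beta + lr * grad, lr * lam)
    return beta

# Illustrative data: only the first two covariates matter
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 10))
beta_star = np.zeros(10); beta_star[:2] = [2.0, -1.5]
y = (rng.random(200) < 1.0 / (1.0 + np.exp(-(X @ beta_star)))).astype(float)
beta_hat = l1_logistic(X, y)
```

The soft-thresholding step is what produces exact zeros, i.e. simultaneous shrinkage and variable selection; in the mixed-model setting the likelihood additionally involves the random-effects structure.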
Robust angle-based transfer learning in high dimensions
Transfer learning aims to improve the performance of a target model by leveraging data from related source populations, which is known to be especially helpful in cases with insufficient target data. In this paper, we study the problem of how to train a high-dimensional ridge regression model using limited target data and existing regression models trained in heterogeneous source populations. We consider a practical setting where only the parameter estimates of the fitted source models are accessible, instead of the individual-level source data. Under the setting with only one source model, we propose a novel flexible angle-based transfer learning (angleTL) method, which leverages the concordance between the source and the target model parameters. We show that angleTL unifies several benchmark methods by construction, including the target-only model trained using target data alone, the source model fitted on source data, and the distance-based transfer learning method that incorporates the source parameter estimates and the target data under a distance-based similarity constraint. We also provide algorithms to effectively incorporate multiple source models, accounting for the fact that some source models may be more helpful than others. Our high-dimensional asymptotic analysis provides interpretations and insights regarding when a source model can be helpful to the target model, and demonstrates the superiority of angleTL over other benchmark methods. We perform extensive simulation studies to validate our theoretical conclusions and show the feasibility of applying angleTL to transfer existing genetic risk prediction models across multiple biobanks.
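The distance-based benchmark that angleTL generalizes has a convenient closed form, sketched here under illustrative assumptions (the penalty weights `lam` and `eta` and the synthetic data are not from the paper; angleTL itself replaces the distance constraint with an angle-based one).

```python
import numpy as np

def distance_transfer_ridge(X, y, beta_src, lam=1.0, eta=1.0):
    """Ridge estimate shrunk toward a source estimate, solving
    min_b ||y - X b||^2 + lam ||b||^2 + eta ||b - beta_src||^2."""
    p = X.shape[1]
    A = X.T @ X + (lam + eta) * np.eye(p)
    return np.linalg.solve(A, X.T @ y + eta * beta_src)

# Toy check: with a concordant source model and a large eta,
# the estimate is pulled close to the source parameters
rng = np.random.default_rng(7)
X = rng.normal(size=(50, 5))
beta_src = np.array([1.0, 0.5, 0.0, 0.0, -0.5])
y = X @ beta_src + rng.normal(scale=0.1, size=50)
b = distance_transfer_ridge(X, y, beta_src, lam=1.0, eta=100.0)
```

Setting `eta=0` recovers the target-only ridge model, while letting `eta` grow recovers the source model, which mirrors the way the paper describes benchmark methods as special cases of a single family.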
Penalized Likelihood and Bayesian Function Selection in Regression Models
Challenging research in various fields has driven a wide range of methodological advances in variable selection for regression models with high-dimensional predictors. In comparison, the selection of nonlinear functions in models with additive predictors has been considered only more recently. Several competing suggestions have been developed at about the same time and often do not refer to each other. This article provides a state-of-the-art review of function selection, focusing on penalized likelihood and Bayesian concepts and relating various approaches to each other in a unified framework. In an empirical comparison that also includes boosting, we evaluate several methods through applications to simulated and real data, thereby providing some guidance on their performance in practice.
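One of the penalized-likelihood ideas in this area, group-lasso-type function selection, reduces to block soft-thresholding of basis coefficients: the whole basis expansion of one smooth function is kept or dropped together. The sketch below is illustrative of that proximal step, not any specific method from the review.

```python
import numpy as np

def group_soft_threshold(coefs, groups, t):
    """Block soft-thresholding: the proximal step behind group-lasso-type
    function selection, where one group (e.g. the spline basis of one smooth
    function) is shrunk or zeroed out as a whole."""
    out = coefs.copy()
    groups = np.asarray(groups)
    for g in np.unique(groups):
        idx = groups == g
        norm = np.linalg.norm(coefs[idx])
        out[idx] = 0.0 if norm <= t else coefs[idx] * (1.0 - t / norm)
    return out

# Two "functions" with two basis coefficients each: the weak group is dropped
shrunk = group_soft_threshold(np.array([3.0, 4.0, 0.1, 0.1]), [0, 0, 1, 1], t=1.0)
```

Selecting at the group level rather than coefficient-by-coefficient is exactly what turns variable selection into function selection for additive predictors.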