
    Using the Machine Learning Approach to Predict Patient Survival from High-Dimensional Survival Data

    Survival analysis with high-dimensional data deals with the prediction of patient survival from gene expression and clinical data. A crucial task for accurate survival analysis in this context is to select the features that are highly correlated with the patient's survival time. Since the information about class labels is hidden, existing feature selection methods in machine learning are not applicable. In contrast to classical statistical methods, which address this issue with the Cox score, we propose to tackle the problem by discretizing patients' survival times into a suitable number of subgroups using the silhouette clustering-validity index. To cope with censoring, we use a k-nearest-neighbour estimator based on clinical parameters that are truly associated with survival time; these parameters are selected using penalized logistic regression and the penalized proportional hazards model with the EM algorithm, and are then used to estimate the censored survival times. Next, the estimated class labels are combined with feature selection to identify a list of genes correlated with survival time, and classifiers are applied to this subset of genes to determine which subtype is present in a future patient. By doing so, we expect the identified subgroups to be not only biologically meaningful but also to differ in terms of survival. The effectiveness and efficiency of the proposed method are demonstrated through comparisons with classical statistical methods on real-world and simulated datasets.
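    As a concrete illustration of the discretization step, the minimal sketch below clusters survival times and picks the number of subgroups by silhouette score. It is a hypothetical scikit-learn example; the function name, cluster range, and synthetic data are assumptions, not taken from the paper.

```python
# Hypothetical sketch: cluster 1-D survival times and choose the number of
# subgroups via the silhouette clustering-validity index.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def discretize_survival_times(times, k_range=range(2, 7), random_state=0):
    """Return (best_k, labels) for a 1-D array of survival times."""
    X = np.asarray(times, dtype=float).reshape(-1, 1)
    best_k, best_score, best_labels = None, -1.0, None
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit_predict(X)
        score = silhouette_score(X, labels)
        if score > best_score:
            best_k, best_score, best_labels = k, score, labels
    return best_k, best_labels

# Example with synthetic survival times (in months):
rng = np.random.default_rng(0)
times = np.concatenate([rng.exponential(12, 50), rng.exponential(60, 50)])
k, subgroup = discretize_survival_times(times)
```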

    Deep learning cardiac motion analysis for human survival prediction

    Motion analysis is used in computer vision to understand the behaviour of moving objects in sequences of images. Optimising the interpretation of dynamic biological systems requires accurate and precise motion tracking as well as efficient representations of high-dimensional motion trajectories so that these can be used for prediction tasks. Here we use image sequences of the heart, acquired using cardiac magnetic resonance imaging, to create time-resolved three-dimensional segmentations using a fully convolutional network trained on anatomical shape priors. This dense motion model formed the input to a supervised denoising autoencoder (4Dsurvival), a hybrid network whose autoencoder learns a task-specific latent representation trained on observed outcome data, optimised for survival prediction. To handle right-censored survival outcomes, our network used a Cox partial likelihood loss function. In a study of 302 patients, the predictive accuracy (quantified by Harrell's C-index) was significantly higher (p < .0001) for our model (C = 0.73, 95% CI: 0.68-0.78) than for the human benchmark (C = 0.59, 95% CI: 0.53-0.65). This work demonstrates how a complex computer vision task using high-dimensional medical image data can efficiently predict human survival.
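    The loss that lets such a network handle right censoring is the negative Cox partial log-likelihood. The sketch below is a minimal PyTorch version for illustration only (not the authors' code); the function name and the no-ties simplification are assumptions.

```python
# Minimal negative Cox partial likelihood loss (ignores tied event times).
import torch

def cox_partial_likelihood_loss(risk, time, event):
    """risk: (n,) predicted log-risk; time: (n,) follow-up times; event: (n,) 1 = death, 0 = censored."""
    order = torch.argsort(time, descending=True)     # after sorting, the risk set of subject i is indices 0..i
    risk, event = risk[order], event[order]
    log_risk_set = torch.logcumsumexp(risk, dim=0)   # log sum_{j: t_j >= t_i} exp(risk_j)
    n_events = torch.clamp(event.sum(), min=1.0)
    return -torch.sum((risk - log_risk_set) * event) / n_events
```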

    Gene Expression based Survival Prediction for Cancer Patients: A Topic Modeling Approach

    Cancer is one of the leading causes of death worldwide. Many believe that genomic data will enable us to better predict the survival time of these patients, which will lead to better, more personalized treatment options and patient care. As standard survival prediction models have a hard time coping with the high dimensionality of such gene expression (GE) data, many projects use some dimensionality reduction technique to overcome this hurdle. We introduce a novel methodology, inspired by topic modeling from the natural language domain, to derive expressive features from the high-dimensional GE data. There, a document is represented as a mixture over a relatively small number of topics, where each topic corresponds to a distribution over the words; here, to accommodate the heterogeneity of a patient's cancer, we represent each patient (~document) as a mixture over cancer-topics, where each cancer-topic is a mixture over GE values (~words). This required some extensions to the standard LDA model, e.g., to accommodate the real-valued expression values, leading to our novel "discretized" Latent Dirichlet Allocation (dLDA) procedure. We initially focus on the METABRIC dataset, which describes breast cancer patients using r = 49,576 GE values from microarrays. Our results show that our approach provides survival estimates that are more accurate than standard models in terms of the standard concordance measure. We then validate the approach by running it on the Pan-kidney (KIPAN) dataset, over r = 15,529 GE values, here using the mRNAseq modality, and find that it again achieves excellent results. In both cases, we also show that the resulting model is calibrated, using the recent "D-calibrated" measure. These successes, in two different cancer types and expression modalities, demonstrate the generality and effectiveness of this approach.
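    A rough way to see the idea (a loose approximation only, not the authors' dLDA model) is to discretize each gene's expression into quantile bins, treat (gene, bin) pairs as "words", and fit a standard LDA so each patient becomes a mixture over cancer-topics. The bin count, topic count, and function below are illustrative assumptions.

```python
# Hypothetical approximation of the dLDA idea: quantile-bin expression values,
# build a (patient x "gene-bin word") count matrix, and fit standard LDA.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

def expression_to_topics(expr, n_bins=5, n_topics=10, random_state=0):
    """expr: (patients, genes) expression matrix -> (patients, n_topics) topic mixtures."""
    n_patients, n_genes = expr.shape
    # Rank each gene across patients and map the rank to one of n_bins levels.
    ranks = np.argsort(np.argsort(expr, axis=0), axis=0) / max(n_patients - 1, 1)
    bins = np.minimum((ranks * n_bins).astype(int), n_bins - 1)
    # One "word" per (gene, bin); each patient document holds one token per gene.
    # For tens of thousands of genes a sparse matrix would be needed instead.
    counts = np.zeros((n_patients, n_genes * n_bins), dtype=int)
    rows = np.repeat(np.arange(n_patients), n_genes)
    cols = (np.arange(n_genes) * n_bins + bins).ravel()
    counts[rows, cols] = 1
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=random_state)
    return lda.fit_transform(counts)
```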

    Predicting Pancreatic Cancer Using Support Vector Machine

    This report presents an approach to predicting pancreatic cancer using the Support Vector Machine classification algorithm. The research objective of this project is to predict pancreatic cancer from genomic data alone, clinical data alone, and a combination of genomic and clinical data. We used real genomic data with 22,763 samples and 154 features per sample. We also created synthetic clinical data with 400 samples and 7 features per sample in order to assess prediction accuracy on clinical data alone. To validate the hypothesis, we combined the synthetic clinical data with a subset of features from the real genomic data. In our results, we observed that prediction accuracy, precision, and recall with genomic data alone are 80.77%, 20%, and 4%. With synthetic clinical data alone they are 93.33%, 95%, and 30%, while for the combination of real genomic and synthetic clinical data they are 90.83%, 10%, and 5%. The combination of real genomic and synthetic clinical data decreased the accuracy, since the genomic data is only weakly correlated. We therefore conclude that combining genomic and clinical data does not improve pancreatic cancer prediction accuracy. A dataset with more significant genomic features might help to predict pancreatic cancer more accurately.
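    For reference, a minimal scikit-learn sketch of this kind of experiment (an assumed setup, not the project's code) that reports the same accuracy / precision / recall metrics:

```python
# Assumed SVM classification setup: standardize features, fit an RBF-kernel SVC,
# and report accuracy, precision, and recall on a held-out split.
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score

def evaluate_svm(X, y, test_size=0.3, random_state=0):
    """X: (samples, features); y: binary labels (1 = pancreatic cancer)."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=test_size, random_state=random_state, stratify=y)
    model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
    model.fit(X_tr, y_tr)
    y_pred = model.predict(X_te)
    return (accuracy_score(y_te, y_pred),
            precision_score(y_te, y_pred, zero_division=0),
            recall_score(y_te, y_pred, zero_division=0))
```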

    Advanced Learning Methodologies for Biomedical Applications

    University of Minnesota Ph.D. dissertation. October 2017. Major: Electrical/Computer Engineering. Advisor: Vladimir Cherkassky. 1 computer file (PDF); ix, 109 pages.

    There has been a dramatic increase in the application of statistical and machine learning methods for predictive data-analytic modeling of biomedical data. Most existing work in this area involves application of standard supervised learning techniques. Typical methods include standard classification or regression techniques, where the goal is to estimate an indicator function (classification decision rule) or a real-valued function of input variables from a finite training sample. However, real-world data often contain additional information besides labeled training samples. Incorporating this additional information into learning (model estimation) leads to nonstandard/advanced learning formalizations that extend standard supervised learning. Recent examples of such advanced methodologies include semi-supervised learning (or transduction) and learning through contradiction (or Universum learning). This thesis investigates two new advanced learning methodologies along with their biomedical applications.

    The first methodology is motivated by modeling complex survival data, which can incorporate future, censored, or unknown data in addition to (traditional) labeled training data. Here we propose an original formalization for predictive modeling of survival data under the framework of Learning Using Privileged Information (LUPI) proposed by Vapnik. Survival data represent a collection of time observations about events. Our modeling goal is to predict the state (alive/dead) of a subject at a pre-determined future time point. We explore modeling of survival data as a binary classification problem that incorporates additional information (such as time of death, censored/uncensored status, etc.) under the LUPI framework. We then propose two advanced constructive Support Vector Machine (SVM)-based formulations: SVM+ and Loss-Order SVM (LO-SVM). Empirical results using simulated and real-life survival data indicate that the proposed LUPI-based methods are very effective (versus classical Cox regression) when the survival time does not follow classical probabilistic assumptions.

    The second advanced methodology investigates a new learning paradigm for classification called Group Learning. This approach is motivated by modeling high-dimensional data when the number of input features is much larger than the number of training samples. There are two main approaches to solving such ill-posed problems: (a) selecting a small number of informative features via feature selection; (b) using all features but imposing additional complexity constraints, e.g., ridge regression, SVM, LASSO, etc. The proposed Group Learning method takes a different approach: it splits all d input features into many (t) groups and then estimates a classifier in the reduced space of dimensionality d/t. This effectively uses all features, but implements training in a lower-dimensional input space. Note that the formation of groups reflects application-domain knowledge. For example, when classifying two-dimensional images represented as a set of pixels (the original high-dimensional input space), appropriate groups can be formed from adjacent pixels or "local patches", because adjacent pixels are known to be highly correlated. We provide empirical validation of this new methodology for two real-life applications: (a) handwritten digit recognition, and (b) predictive classification of univariate signals, e.g., prediction of epileptic seizures from intracranial electroencephalogram (iEEG) signals. Prediction of epileptic seizures is particularly challenging due to highly unbalanced data (just 4–5 observed seizures) and patient-specific modeling. In a joint project with the Mayo Clinic, we incorporated the Group Learning approach into an SVM-based system for seizure prediction. The system performs subject-specific modeling and achieves robust prediction performance.
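    One plausible reading of the Group Learning idea, written as a hypothetical sketch (not the thesis implementation): split each d-dimensional sample into t groups of d/t features, treat every group as a training sample in the reduced space, and classify a new sample by averaging its t group-level SVM scores. The reshaping scheme and the vote aggregation are assumptions made for illustration.

```python
# Hypothetical Group Learning illustration for binary labels {0, 1}:
# each sample contributes t reduced-dimension "group samples" for training,
# and test-time predictions aggregate the t group-level decision values.
import numpy as np
from sklearn.svm import SVC

def group_fit_predict(X_train, y_train, X_test, t):
    n_train, d = X_train.shape
    assert d % t == 0, "feature dimension must be divisible by the number of groups"
    g = d // t
    # Reshape so each contiguous block of g features becomes its own sample.
    Xg_train = X_train.reshape(n_train * t, g)
    yg_train = np.repeat(y_train, t)
    clf = SVC(kernel="rbf").fit(Xg_train, yg_train)
    # Average the group-level decision values for each test sample.
    scores = clf.decision_function(X_test.reshape(len(X_test) * t, g)).reshape(len(X_test), t)
    return (scores.mean(axis=1) > 0).astype(int)
```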

    Boosting the concordance index for survival data - a unified framework to derive and evaluate biomarker combinations

    The development of molecular signatures for the prediction of time-to-event outcomes is a methodologically challenging task in bioinformatics and biostatistics. Although there are numerous approaches for the derivation of marker combinations and their evaluation, the underlying methodology often suffers from the problem that different optimization criteria are mixed during the feature selection, estimation, and evaluation steps. This can result in marker combinations that are only suboptimal with respect to the evaluation criterion of interest. To address this issue, we propose a unified framework to derive and evaluate biomarker combinations. Our approach is based on the concordance index for time-to-event data, a non-parametric measure that quantifies the discriminatory power of a prediction rule. Specifically, we propose a component-wise boosting algorithm that results in linear biomarker combinations that are optimal with respect to a smoothed version of the concordance index. We investigate the performance of our algorithm in a large-scale simulation study and in two molecular data sets for the prediction of survival in breast cancer patients. Our numerical results show that the new approach is not only methodologically sound but can also lead to higher discriminatory power than traditional approaches for the derivation of gene signatures.
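    To make the optimization target concrete, the simplified sketch below (not the authors' implementation; it omits the inverse-probability-of-censoring weights used in the paper) replaces the pairwise indicator in Harrell's C-index with a sigmoid, which is what makes the criterion smooth enough to drive gradient-based boosting.

```python
# Simplified sigmoid-smoothed concordance index: for each comparable pair
# (subject i fails before subject j, and i's event is observed), the indicator
# 1(risk_i > risk_j) is replaced by a sigmoid with bandwidth sigma.
import numpy as np

def smoothed_cindex(time, event, eta, sigma=0.1):
    """time: follow-up times; event: 1 = observed event; eta: predicted risk scores."""
    num, den = 0.0, 0.0
    n = len(time)
    for i in range(n):
        if not event[i]:
            continue                      # only pairs whose earlier time is an observed event are usable
        for j in range(n):
            if time[i] < time[j]:
                num += 1.0 / (1.0 + np.exp(-(eta[i] - eta[j]) / sigma))
                den += 1.0
    return num / den if den > 0 else np.nan
```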