    Bayesian Approach For Early Stage Event Prediction In Survival Data

    Predicting event occurrence at an early stage in longitudinal studies is an important and challenging problem with high practical value. In standard classification and regression problems, a domain expert can provide labels for the data in a reasonably short period of time; in longitudinal studies, by contrast, training data can be obtained only by waiting for a sufficient number of events to occur. Survival analysis aims to find the underlying distribution of data that measure the length of time until an event occurs. However, it cannot answer the open question: "How can we forecast whether a subject will experience an event by the end of a study, given only event occurrence information from the early stage of the survival data?" This problem poses two major challenges: 1) the absence of complete information about event occurrence (censoring) and 2) the availability of only a partial set of events, namely those that occurred during the initial phase of the study. Thus, the main objective of this work is to predict which subjects in the study will experience the event in the future, based on the few events observed at the initial stages of a longitudinal study. In this thesis, we propose a novel approach that addresses the first challenge by introducing a new method for handling censored data using the Kaplan-Meier estimator. The second challenge is tackled by integrating Bayesian methods with an Accelerated Failure Time (AFT) model, adapting the prior probability of event occurrence for future time points. In other words, we propose a novel Early Stage Prediction (ESP) framework for building event prediction models that are trained at the early stages of longitudinal studies. 
More specifically, we extend the Naive Bayes, Tree-Augmented Naive Bayes (TAN), and Bayesian Network methods based on the proposed framework and develop three algorithms, namely ESP-NB, ESP-TAN, and ESP-BN, to predict event occurrence effectively using training data obtained at the early stage of the study. The proposed framework is evaluated on a wide range of synthetic and real-world benchmark datasets. Our extensive experiments show that the proposed ESP framework predicts future event occurrences more accurately than alternative prediction methods while using only a limited amount of training data.
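For orientation, the Kaplan-Meier estimator mentioned in the abstract above can be sketched as follows. This is a minimal illustration of the standard product-limit estimator only, not the thesis's new censoring-handling method; the function name `kaplan_meier` and the toy follow-up data are assumptions made for this example.

```python
from collections import Counter


def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time where an event was observed.

    times  : observed follow-up time for each subject
    events : 1 if the event was observed at that time, 0 if censored
    """
    event_counts = Counter(t for t, e in zip(times, events) if e == 1)
    exit_counts = Counter(times)  # events plus censorings leaving the risk set
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(exit_counts):
        d = event_counts.get(t, 0)
        if d > 0:
            # Product-limit step: multiply by the fraction surviving time t.
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        # Subjects with an event or censoring at t are no longer at risk.
        at_risk -= exit_counts[t]
    return curve


# Toy data (hypothetical): 7 subjects, 4 observed events, 3 censored.
toy_times = [2, 3, 3, 5, 8, 8, 9]
toy_events = [1, 1, 0, 1, 0, 1, 0]
for t, s in kaplan_meier(toy_times, toy_events):
    print(f"t={t}: S(t)={s:.3f}")
```

Censored subjects contribute to the risk set up to their censoring time and then drop out without lowering the survival estimate, which is how the estimator uses incomplete follow-up information.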

    The risk of re-intervention after endovascular aortic aneurysm repair

    This thesis studies survival analysis techniques that deal with censoring in order to produce tools for predicting the risk of re-intervention after endovascular aortic aneurysm repair (EVAR). Censoring occurs when some patients do not continue follow-up, so their outcome class is unknown. Existing methods for dealing with censoring have drawbacks and cannot handle the high censoring of the two EVAR datasets collected. Therefore, this thesis presents a new solution to high censoring by modifying an approach that was incapable of differentiating between risk groups of aortic complications. Feature selection (FS) becomes complicated with censoring. Most survival FS methods depend on Cox's model; however, machine learning classifiers (MLC) are preferred. A few methods have adopted MLC to perform survival FS, but they cannot be used with high censoring. This thesis proposes two FS methods that use MLC to evaluate features. Both FS methods use the new solution to deal with censoring. They combine factor analysis with a greedy stepwise FS search that allows eliminated features to re-enter the FS process. The first FS method searches for the best neural network configuration and subset of features. The second approach combines support vector machine, neural network, and K-nearest-neighbor classifiers using simple and weighted majority voting to construct a multiple classifier system (MCS) that improves on the performance of the individual classifiers. It presents a new hybrid FS process that uses the MCS as a wrapper method and merges it with the iterated feature ranking filter method to further reduce the number of features. The proposed techniques outperformed FS methods based on Cox's model, such as the Akaike and Bayesian information criteria and the least absolute shrinkage and selection operator, in terms of the log-rank test's p-values, sensitivity, and concordance. This shows that the proposed techniques are more powerful in correctly predicting the risk of re-intervention. 
Consequently, they enable doctors to set appropriate future observation plans for patients.
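The simple and weighted majority voting used to build the multiple classifier system above can be illustrated with a short sketch. The function name `weighted_majority_vote`, the class labels, and the weights are hypothetical and chosen only for illustration; the abstract does not state how the thesis derives its classifier weights.

```python
from collections import defaultdict


def weighted_majority_vote(predictions, weights=None):
    """Combine the class labels predicted by several classifiers for one sample.

    predictions : one predicted label per classifier
    weights     : one non-negative weight per classifier; None means
                  simple (unweighted) majority voting
    """
    if weights is None:
        weights = [1.0] * len(predictions)
    if len(weights) != len(predictions):
        raise ValueError("need exactly one weight per prediction")
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    # Pick the label with the largest total weight. Ties go to the smallest
    # label, an arbitrary choice made here only to keep the result deterministic.
    return min(totals, key=lambda label: (-totals[label], label))


# Hypothetical votes from an SVM, a neural network, and a k-NN classifier.
votes = ["high", "low", "low"]
print(weighted_majority_vote(votes))                   # simple voting
print(weighted_majority_vote(votes, [0.9, 0.4, 0.4]))  # weighted voting
```

With equal weights the two "low" votes win; when the first classifier is trusted more (weight 0.9 against 0.4 + 0.4), its "high" vote prevails, which is the practical difference between the two voting schemes.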

    On the Reliability of Machine Learning Models for Survival Analysis When Cure Is a Possibility

    In classical survival analysis, it is assumed that all the individuals will experience the event of interest. However, if there is a proportion of subjects who will never experience the event, then a standard survival approach is not appropriate, and cure models should be considered instead. This paper deals with the problem of adapting a machine learning approach for classical survival analysis to a situation when cure (i.e., not suffering the event) is a possibility. Specifically, a brief review of cure models and recent machine learning methodologies is presented, and an adaptation of machine learning approaches to account for cured individuals is introduced. In order to validate the proposed methods, we present an extensive simulation study in which we compare the performance of the adapted machine learning algorithms with existing cure models. The results show the good behavior of the semiparametric or the nonparametric approaches, depending on the simulated scenario. The practical utility of the methodology is showcased through two real-world dataset illustrations. In the first one, the results show the gain of using the nonparametric mixture cure model approach. In the second example, the results show the poor performance of some machine learning methods for small sample sizes. This project was funded by the Xunta de Galicia (Axencia Galega de Innovación) Research projects COVID-19 presented in ISCIII IN845D 2020/26, Operational Program FEDER Galicia 2014–2020; by the Centro de Investigación de Galicia “CITIC”, funded by Xunta de Galicia and the European Union European Regional Development Fund (ERDF)-Galicia 2014–2020 Program, by grant ED431G 2019/01; and by the Spanish Ministerio de Economía y Competitividad (research projects PID2019-109238GB-C22 and PID2021-128045OA-I00). ALC was sponsored by the BEATRIZ GALINDO JUNIOR Spanish Grant from MICINN (Ministerio de Ciencia e Innovación) with code BGP18/00154. 
ALC was partially supported by the MICINN Grant PID2020-113578RB-I00 and by the Xunta de Galicia (Grupos de Referencia Competitiva ED431C-2020-14). We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research.
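For readers unfamiliar with the mixture cure model named in the abstract above, its usual textbook form is given here as a reference point. The symbols $p$, $S_u$, $x$, and $z$ are notation assumed for this note and are not taken from the paper.

```latex
% Population survival under a mixture cure model:
%   p(z)        probability that a subject with covariates z is susceptible (not cured)
%   S_u(t | x)  survival function of the susceptible subjects (the latency part)
S(t \mid x, z) = 1 - p(z) + p(z)\, S_u(t \mid x),
\qquad
\lim_{t \to \infty} S(t \mid x, z) = 1 - p(z).
```

Unlike in classical survival analysis, the population survival function levels off at the cure probability $1 - p(z)$ instead of decaying to zero, which is why standard methods are inappropriate when cure is possible.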

    A New Scalable, Portable, and Memory-Efficient Predictive Analytics Framework for Predicting Time-to-Event Outcomes in Healthcare

    Time-to-event outcomes are prevalent in medical research. To handle these outcomes, as well as censored observations, statistical and survival regression methods that assume linear associations are widely used; however, clinicopathological features often exhibit nonlinear correlations. Machine learning (ML) algorithms have recently been adapted to handle nonlinear correlations effectively. One drawback of ML models is that they can model idiosyncratic features of a training dataset. Because of this overfitting, ML models perform well on the training data but less well on test data. The features that we choose indirectly influence the performance of ML prediction models. With the expansion of big data in biomedical informatics, appropriate feature engineering and feature selection are vital to ML success. In addition, ensemble learning helps decrease bias and variance by combining the predictions of multiple models. In this study, we constructed a scalable, portable, and memory-efficient predictive analytics framework that fits four components (feature engineering, survival analysis, feature selection, and ensemble learning) together. Our framework first applies feature engineering techniques, such as binarization, discretization, transformation, and normalization, to the raw dataset. The normalized feature set is then passed to Cox survival regression, which identifies features that are highly correlated with and relevant to the outcome. The resulting feature set is passed to the "eXtreme gradient boosting" (XGBoost) ensemble learning and Recursive Feature Elimination algorithms. XGBoost uses a gradient-boosted decision tree algorithm in which new models are created sequentially to predict the residuals of prior models; their outputs are then added together to make the final prediction. In our experiments, we analyzed a cohort of cardiac surgery patients drawn from a multi-hospital academic health system. 
The model evaluated 72 perioperative variables that affect the event of readmission within 30 days of discharge, derived 48 significant features, and demonstrated optimum predictive ability with feature sets ranging from 16 to 24 features. The area under the receiver operating characteristic curve for the 16-feature set was 0.8816 and 0.9307 at the 35th and 151st iterations, respectively. Our model showed improved performance compared with state-of-the-art models and could be more useful for decision support in clinical settings.
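The residual-fitting idea described above, where each new model predicts the residuals of the models before it and the outputs are summed, can be sketched in a few lines. This is a deliberately simplified illustration using one-split decision stumps and squared error on a single feature; it is not XGBoost itself (which adds regularization and second-order gradient information) and not the authors' pipeline. The names `fit_stump` and `boost`, the hyperparameters, and the toy data are assumptions made for this example.

```python
def fit_stump(x, residuals):
    """Fit a one-split regression stump to the residuals by least squares."""
    best = None
    # Candidate thresholds exclude the largest value so both sides are non-empty.
    for threshold in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= threshold else right_mean


def boost(x, y, n_rounds=50, learning_rate=0.3):
    """Gradient boosting for squared loss: each stump fits the current residuals."""
    base = sum(y) / len(y)  # initial model: the mean of the targets
    stumps = []
    predictions = [base] * len(y)
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, predictions)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        predictions = [pi + learning_rate * stump(xi)
                       for pi, xi in zip(predictions, x)]
    # Final prediction: base value plus the shrunken sum of all stump outputs.
    return lambda xi: base + learning_rate * sum(s(xi) for s in stumps)


# Toy data (hypothetical): a step-like relationship with a little noise.
toy_x = [1, 2, 3, 4, 5, 6]
toy_y = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]
model = boost(toy_x, toy_y)
print([round(model(xi), 2) for xi in toy_x])
```

The learning rate shrinks each stump's contribution, so many small corrections accumulate instead of one model trying to fit everything at once.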
