
    Boosting the concordance index for survival data - a unified framework to derive and evaluate biomarker combinations

    The development of molecular signatures for the prediction of time-to-event outcomes is a methodologically challenging task in bioinformatics and biostatistics. Although there are numerous approaches for the derivation of marker combinations and their evaluation, the underlying methodology often suffers from the problem that different optimization criteria are mixed during the feature selection, estimation, and evaluation steps. This might result in marker combinations that are only suboptimal regarding the evaluation criterion of interest. To address this issue, we propose a unified framework to derive and evaluate biomarker combinations. Our approach is based on the concordance index for time-to-event data, which is a non-parametric measure to quantify the discriminatory power of a prediction rule. Specifically, we propose a component-wise boosting algorithm that results in linear biomarker combinations that are optimal with respect to a smoothed version of the concordance index. We investigate the performance of our algorithm in a large-scale simulation study and in two molecular data sets for the prediction of survival in breast cancer patients. Our numerical results show that the new approach is not only methodologically sound but can also lead to a higher discriminatory power than traditional approaches for the derivation of gene signatures.
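
    To make the optimization criterion concrete, here is a minimal Python sketch of a sigmoid-smoothed concordance index of the kind the abstract describes. The bandwidth parameter sigma and all names are illustrative assumptions, not the authors' implementation.

        import numpy as np

        def smoothed_cindex(eta, time, event, sigma=0.1):
            """Sigmoid-smoothed C-index for risk scores eta (higher = riskier).
            sigma is an assumed smoothing bandwidth; as sigma -> 0 this
            approaches the usual (non-differentiable) concordance index."""
            num, den = 0.0, 0.0
            n = len(eta)
            for i in range(n):
                if not event[i]:
                    continue                  # only observed events anchor comparable pairs
                for j in range(n):
                    if time[j] > time[i]:     # comparable pair: i failed before j
                        # smooth stand-in for the indicator I(eta_i > eta_j)
                        num += 1.0 / (1.0 + np.exp((eta[j] - eta[i]) / sigma))
                        den += 1.0
            return num / den if den else 0.0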

    An update on statistical boosting in biomedicine

    Statistical boosting algorithms have triggered a lot of research during the last decade. They combine a powerful machine-learning approach with classical statistical modelling, offering various practical advantages like automated variable selection and implicit regularization of effect estimates. They are extremely flexible, as the underlying base-learners (regression functions defining the type of effect for the explanatory variables) can be combined with any kind of loss function (the target function to be optimized, defining the type of regression setting). In this review article, we highlight the most recent methodological developments on statistical boosting regarding variable selection, functional regression, and advanced time-to-event modelling. Additionally, we provide a short overview of relevant applications of statistical boosting in biomedicine.
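
    As an illustration of the base-learner/loss modularity described above, the following is a hedged Python sketch of component-wise boosting with simple linear base-learners and a pluggable negative-gradient function; it is not code from any particular boosting package.

        import numpy as np

        def componentwise_boost(X, y, neg_grad, n_steps=100, nu=0.1):
            """Generic component-wise boosting; neg_grad(y, f) returns the
            negative gradient of the chosen loss at the current fit f.
            Assumes centered covariates; all names are illustrative."""
            n, p = X.shape
            coef = np.zeros(p)
            offset = y.mean()                 # crude offset for a regression-type loss
            f = np.full(n, offset)
            for _ in range(n_steps):
                u = neg_grad(y, f)            # working residuals
                # fit every univariate linear base-learner to u, keep the best
                slopes = [(X[:, j] @ u) / (X[:, j] @ X[:, j]) for j in range(p)]
                rss = [np.sum((u - b * X[:, j]) ** 2) for j, b in enumerate(slopes)]
                j_star = int(np.argmin(rss))
                coef[j_star] += nu * slopes[j_star]   # implicit variable selection
                f = offset + X @ coef
            return offset, coef

        # Squared-error loss recovers classical L2-boosting:
        # offset, coef = componentwise_boost(X, y, neg_grad=lambda y, f: y - f)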

    Comparison of Pre-processing Methods and Various Machine Learning Models for Survival Analysis on Cancer Data

    Colorectal cancer and cancers in the head and neck region still pose a major problem in medicine and in the healthcare sector. In 2021 alone, 11,121 deaths were attributed to various cancers, with colorectal and head and neck cancer among the more common types. In today's digital age, hospitals and researchers are collecting more data than ever before. Many studies include patients for whom the follow-up or study ended before the event of interest occurred. Instead of discarding those patients when applying machine learning methods, and thereby losing valuable information, survival analysis can be applied. Survival analysis utilizes the censoring variable, which indicates whether or not the event of interest took place before the study ended. In this thesis, several pre-processing techniques were applied, such as removal of outliers, feature distribution transformations, and feature selection. These techniques were combined with multiple machine learning algorithms from the scikit-learn and scikit-survival libraries. The survival algorithms used were the regularized Cox model with elastic net (Coxnet), random survival forest, tree-based gradient boosting, and gradient boosting with partial least squares as base learner. These algorithms take into account the information from the censoring variable in addition to the survival time. The other machine learning algorithms used were linear regression, ridge regression, and partial least squares regression (PLSR); these three use only the survival time as the target and do not account for the censoring variable. Two datasets were used in this thesis: one with patients diagnosed with colorectal cancer, and one with patients diagnosed with various head and neck cancers. Two experiments were carried out separately and validated using repeated stratified k-fold cross-validation. In the first experiment, the models were fitted to different feature transformations of the datasets in combination with feature selection techniques. The second experiment involved hyperparameter tuning for the survival models. There was little difference in performance between the transformations, with no improvement on the head and neck dataset; for the high-dimensional colorectal cancer dataset, however, power transformation led to a small increase of 0.02 in the concordance index. The feature selection techniques did improve the performance of four of the models: linear regression, ridge regression, PLSR, and Coxnet. For the more advanced survival models, gradient boosting and random survival forest, feature selection did not in general improve the metrics, as these models may have benefited from greedily selecting features and updating feature weights on their own. The best model in the first experiment for the colorectal cancer (OxyTarget) dataset was random survival forest with power transformation applied and all features available, which resulted in a concordance index of 0.83. For the head and neck dataset, component-wise gradient boosting, Coxnet, and PLSR all achieved the highest concordance index of 0.77, with Coxnet reaching that score across all three transformations. In the second experiment, all the survival models were tuned over different hyperparameters to see whether the various metrics would improve, and a small performance increase was seen for several models. For the colorectal cancer dataset, a Coxnet model tuned with a low regularization strength and a low l1_ratio penalty yielded a large increase in the concordance index and produced the best model, with a score of 0.827. For the head and neck dataset, tuning the random survival forest algorithm over min_weight_fraction_leaf and max_depth produced the best model, with a concordance index of 0.787. The research and the framework created for these experiments show that more promising ranking results, while maintaining robust models, can be achieved through pre-processing techniques and through utilizing all of the data with repeated stratified k-fold cross-validation. However, as the research shows, there is no universally best algorithm or method for survival analysis of cancer data, as performance depends on the data.
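
    For concreteness, here is a minimal sketch of the tuned-Coxnet evaluation step using the scikit-survival and scikit-learn APIs named in the abstract. The hyperparameter values and the choice to stratify on the event indicator are assumptions, not the thesis's exact setup.

        import numpy as np
        from sklearn.model_selection import RepeatedStratifiedKFold
        from sksurv.linear_model import CoxnetSurvivalAnalysis
        from sksurv.metrics import concordance_index_censored
        from sksurv.util import Surv

        def evaluate_coxnet(X, time, event, l1_ratio=0.1, alpha=0.01):
            """Repeated stratified k-fold CV, stratified on the event indicator;
            l1_ratio and alpha are placeholder values (low regularization)."""
            y = Surv.from_arrays(event=event, time=time)
            cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=42)
            scores = []
            for train, test in cv.split(X, event):
                model = CoxnetSurvivalAnalysis(l1_ratio=l1_ratio, alphas=[alpha])
                model.fit(X[train], y[train])
                risk = model.predict(X[test])   # higher score = higher predicted risk
                scores.append(concordance_index_censored(
                    event[test].astype(bool), time[test], risk)[0])
            return float(np.mean(scores))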

    Machine Learning Methods for Diagnosis, Prognosis and Prediction of Long-term Treatment Outcome of Major Depression

    Major Depression, clinically called Major Depressive Disorder, is a mood disorder that affects about one eighth of the population in the US and is projected to be the second leading cause of disability in the world by the year 2020. Recent advances in biotechnology have enabled us to collect a great variety of data which could potentially offer a deeper understanding of the disorder as well as advance personalized medicine. This dissertation focuses on developing methods for three aspects of predictive analytics related to the disorder: automatic diagnosis, prognosis, and prediction of long-term treatment outcome. The data used for each task have their own characteristics and pose distinct problems. Automatic diagnosis of melancholic depression is made on the basis of metabolic profiles and microarray gene expression profiles, where missing values and strong empirical correlation between the variables are not unusual. To deal with these problems, a method for generating a representative set of features is proposed. Prognosis is made on data collected from rating scales and questionnaires, which consist mainly of categorical and ordinal variables and thus favor decision-tree-based predictive models. Decision tree models are known for the notorious problem of overfitting. A decision tree pruning method is proposed that overcomes the greedy nature of, and reliance on heuristics in, traditional decision tree pruning approaches. The method is further extended to prune gradient boosting decision trees and is tested on the task of prognosis of treatment outcome. Follow-up studies evaluating the long-term effect of the treatments usually measure patients' depressive symptom severity monthly, so the actual time of relapse is only known to be upper-bounded by the observed time of relapse. To resolve such uncertainty in the response, a general loss function, in which the hypothesis can take different forms, is proposed to predict the risk of relapse in situations where only an interval for the time of relapse can be derived from the observed data.
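
    The interval-censoring idea can be illustrated with a small sketch: a loss that charges nothing when the predicted relapse time falls inside the observed interval and a squared penalty outside it. This is an assumed, simplified form, not the dissertation's actual loss function.

        import numpy as np

        def interval_loss(t_hat, lo, hi):
            """Zero loss inside the observed interval [lo, hi] bounding the
            true relapse time; squared distance to the nearest endpoint outside."""
            below = np.clip(lo - t_hat, 0.0, None)    # predicted too early
            above = np.clip(t_hat - hi, 0.0, None)    # predicted too late
            return np.mean(below ** 2 + above ** 2)

        # Example: monthly follow-up places relapse between visits 4 and 5, so
        # interval_loss(np.array([4.5]), np.array([4.0]), np.array([5.0])) == 0.0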

    A New Scalable, Portable, and Memory-Efficient Predictive Analytics Framework for Predicting Time-to-Event Outcomes in Healthcare

    Time-to-event outcomes are prevalent in medical research. To handle these outcomes, as well as censored observations, statistical and survival regression methods are widely used under assumptions of linear association; however, clinicopathological features often exhibit nonlinear correlations. Machine learning (ML) algorithms have recently been adapted to handle nonlinear correlations effectively. One drawback of ML models is that they can fit idiosyncratic features of a training dataset; because of this overfitting, ML models perform well on the training data but generalize poorly to test data. The features that we choose indirectly influence the performance of ML prediction models. With the expansion of big data in biomedical informatics, appropriate feature engineering and feature selection are vital to ML success. An ensemble learning algorithm also helps decrease bias and variance by combining the predictions of multiple models. In this study, we constructed a new scalable, portable, and memory-efficient predictive analytics framework that fits four components (feature engineering, survival analysis, feature selection, and ensemble learning) together. Our framework first employs feature engineering techniques, such as binarization, discretization, transformation, and normalization, on the raw dataset. The normalized feature set is then passed to a Cox survival regression, which screens for features strongly associated with the outcome. The resulting feature set is fed to the "eXtreme gradient boosting" (XGBoost) ensemble learner and a Recursive Feature Elimination algorithm. XGBoost uses a gradient boosting decision tree algorithm in which new models are created sequentially to predict the residuals of prior models and are then added together to make the final prediction. In our experiments, we analyzed a cohort of cardiac surgery patients drawn from a multi-hospital academic health system. The model evaluated 72 perioperative variables that influence readmission within 30 days of discharge, derived 48 significant features, and demonstrated optimal predictive ability with feature sets ranging from 16 to 24 features. The area under the receiver operating characteristic curve for the 16-feature set was 0.8816 at the 35th iteration and 0.9307 at the 151st. Our model showed improved performance compared to state-of-the-art models and could be useful for decision support in clinical settings.
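
    A hedged sketch of the XGBoost-plus-recursive-feature-elimination stage described above follows; the hyperparameters and data handling are placeholders, since the study's cardiac-surgery cohort is not public.

        import xgboost as xgb
        from sklearn.feature_selection import RFE
        from sklearn.metrics import roc_auc_score
        from sklearn.model_selection import train_test_split

        def xgb_rfe_readmission(X, y, n_features=16):
            """Rank features by XGBoost importance and recursively eliminate
            down to n_features; y is the binary 30-day readmission label."""
            X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
            base = xgb.XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)
            selector = RFE(base, n_features_to_select=n_features, step=4)
            selector.fit(X_tr, y_tr)
            auc = roc_auc_score(y_te, selector.predict_proba(X_te)[:, 1])
            return auc, selector.support_         # AUC and the surviving feature mask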

    Predicting Kidney Transplant Survival using Multiple Feature Representations for HLAs

    Kidney transplantation can significantly enhance living standards for people suffering from end-stage renal disease. A significant factor that affects graft survival time (the time until the transplant fails and the patient requires another transplant) is the compatibility of the Human Leukocyte Antigens (HLAs) between the donor and recipient. In this paper, we propose new biologically relevant feature representations for incorporating HLA information into machine-learning-based survival analysis algorithms. We evaluate our proposed HLA feature representations on a database of over 100,000 transplants and find that they improve prediction accuracy by about 1%, modest at the patient level but potentially significant at a societal level. Accurate prediction of survival times can improve transplant outcomes, enabling better allocation of donors to recipients and reducing the number of re-transplants due to graft failure with poorly matched donors.
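
    One plausible encoding in the spirit of the paper is a per-locus allele mismatch count between donor and recipient, a standard biologically motivated summary of HLA compatibility. The representations proposed in the paper are likely richer; the function below is purely hypothetical.

        def hla_mismatch_features(donor, recipient, loci=("A", "B", "DR")):
            """donor/recipient map each locus to the set of typed alleles, e.g.
            {"A": {"A1", "A2"}, "B": {"B7", "B8"}, "DR": {"DR15"}} (hypothetical)."""
            return {
                f"mm_{locus}": len(donor.get(locus, set()) - recipient.get(locus, set()))
                for locus in loci
            }

        # A donor allele absent from the recipient counts as one mismatch:
        # hla_mismatch_features({"A": {"A1", "A2"}}, {"A": {"A1"}})["mm_A"] == 1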