
    A Comparison of Methods for Modelling Survival Time for Cancer Patients

    In this thesis, we used three types of survival-analysis models to model the overall survival time of patients suffering from rectal cancer and patients suffering from head and neck cancer: Cox proportional hazards regression, Aalen's additive regression model and accelerated failure time models. The goal was to compare the performance of these models in terms of the measured concordance index and Brier scores. The performance metrics were estimated using a repeated stratified k-fold cross-validation scheme: with four splits and 25 repeats, we obtained 100 estimates of the performance for each model. This was done for both data sets. The Cox proportional hazards model achieved the highest concordance index on both data sets. To interpret the models' performance over the first five years, we visualised the measured Brier scores over the period of 12 to 60 months. All models showed a rising trend in the measured Brier score, indicating less accurate predictions over time. The models had similar Brier scores, with the exception of Aalen's additive regression model, which gave slightly poorer results as time increased.
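    A minimal sketch of the validation scheme described above, assuming scikit-survival: four stratified splits repeated 25 times give 100 concordance-index estimates for, e.g., the Cox model. The arrays X, time and event are synthetic placeholders for the thesis data.

```python
# Repeated stratified k-fold scheme: 4 splits x 25 repeats = 100 folds,
# i.e. 100 concordance-index estimates per model (placeholder data).
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold
from sksurv.linear_model import CoxPHSurvivalAnalysis
from sksurv.metrics import concordance_index_censored
from sksurv.util import Surv

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                      # placeholder covariates
time = rng.exponential(60.0, size=200)             # placeholder survival times
event = rng.integers(0, 2, size=200).astype(bool)  # True = event observed
y = Surv.from_arrays(event=event, time=time)

cv = RepeatedStratifiedKFold(n_splits=4, n_repeats=25, random_state=0)
scores = []
for train, test in cv.split(X, event):             # stratify on the event flag
    model = CoxPHSurvivalAnalysis().fit(X[train], y[train])
    risk = model.predict(X[test])                  # higher risk = shorter survival
    scores.append(concordance_index_censored(event[test], time[test], risk)[0])

print(f"mean C-index over {len(scores)} folds: {np.mean(scores):.3f}")
```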

    Comparison of Pre-processing Methods and Various Machine Learning Models for Survival Analysis on Cancer Data

    Colorectal cancer and cancers in the head and neck region remain a major problem in medicine and in the healthcare sector. In 2021 alone, 11 121 deaths were attributed to various cancers, with colorectal and head and neck cancer among the more common types. In today's digital age, hospitals and researchers are collecting more data than ever before. Many studies include patients for whom the follow-up or the study itself ended before the event of interest occurred. Instead of discarding those patients from the observed data when applying machine learning methods, and thereby losing valuable information, survival analysis can be applied. Survival analysis utilizes the censoring variable, which indicates whether or not the event of interest took place before the study ended. In this thesis several pre-processing techniques were utilized, such as removal of outliers, feature distribution transformations and feature selection techniques. These techniques were applied together with multiple machine learning algorithms from the scikit-learn and scikit-survival libraries. The survival algorithms used were the regularized Cox model with elastic net (Coxnet), random survival forest, tree-based gradient boosting and gradient boosting with partial least squares as base learner. These algorithms take into account the information from the censoring variable in addition to the survival time. The other machine learning algorithms used were linear regression, ridge regression and partial least squares regression (PLSR); these three use only the survival time as the target and do not account for the censoring variable. Two datasets were used in this thesis: one with patients diagnosed with colorectal cancer, and a second with patients diagnosed with various head and neck cancers. Two experiments were carried out separately and validated by the use of repeated stratified k-fold cross-validation. In the first experiment the models were fitted to different feature transformations of the datasets in combination with feature selection techniques. The second experiment involved hyperparameter tuning for the survival models. There was little difference in performance between the transformations, with no improvement on the head and neck dataset; however, for the high-dimensional colorectal cancer dataset, power transformation led to a small increase of 0.02 in the concordance index. The feature selection techniques did improve the performance of four of the models: linear regression, ridge regression, PLSR and Coxnet. For the more advanced survival models, gradient boosting and random survival forest, feature selection did not in general improve the metrics, as these models may have benefited from greedily selecting features and updating feature weights on their own. The best model in the first experiment for the OxyTarget dataset was random survival forest with power transformation applied and all features available, which resulted in a concordance index of 0.83. For the head and neck dataset, component-wise gradient boosting, Coxnet and PLSR all achieved the highest concordance index of 0.77, with Coxnet reaching that score across all three transformations. In the second experiment, all the survival models were tuned over different hyperparameters to see if the various metrics would improve, and a small performance increase could be seen for several models. However, for the colorectal cancer dataset, a Coxnet model tuned with a low regularization strength and a low l1_ratio penalty yielded a large increase in the concordance index and resulted in the best model, with a score of 0.827. For the head and neck dataset, tuning min_weight_fraction_leaf and max_depth for the random survival forest algorithm resulted in the best model, with a concordance index of 0.787. The research and the framework created to conduct these experiments show that pre-processing techniques, together with the use of all available data through repeated stratified k-fold cross-validation, can yield more promising ranking results while maintaining robust models. However, as the research shows, there is no universally best algorithm or method for survival analysis on cancer data; the choice depends on the data.
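    The Coxnet tuning from the second experiment can be sketched as a grid search over the regularization strength and l1_ratio, scored by the built-in concordance index of scikit-survival estimators. The grid values and the synthetic data below are illustrative, not the settings from the thesis.

```python
# Hyperparameter tuning sketch for Coxnet under repeated stratified k-fold CV.
import numpy as np
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold
from sksurv.linear_model import CoxnetSurvivalAnalysis
from sksurv.util import Surv

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))                     # placeholder features
time = rng.exponential(60.0, size=200)
event = rng.integers(0, 2, size=200).astype(bool)
y = Surv.from_arrays(event=event, time=time)

param_grid = {
    "alphas": [[0.001], [0.01], [0.1]],  # low alpha = weak regularization
    "l1_ratio": [0.01, 0.1, 0.5],        # low l1_ratio leans towards ridge
}
cv = RepeatedStratifiedKFold(n_splits=4, n_repeats=25, random_state=0)
search = GridSearchCV(CoxnetSurvivalAnalysis(), param_grid,
                      cv=list(cv.split(X, event)))  # stratify on the event flag
search.fit(X, y)                                   # default score = C-index
print(search.best_params_, f"best C-index: {search.best_score_:.3f}")
```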

    Robust Identification of Target Genes and Outliers in Triple-negative Breast Cancer Data

    Correct classification of breast cancer sub-types is of high importance as it directly affects the therapeutic options. We focus on triple-negative breast cancer (TNBC), which has the worst prognosis among breast cancer types. Using cutting-edge methods from the field of robust statistics, we analyze Breast Invasive Carcinoma (BRCA) transcriptomic data publicly available from The Cancer Genome Atlas (TCGA) data portal. Our analysis identifies statistical outliers that may correspond to misdiagnosed patients. Furthermore, it is illustrated that classical statistical methods may fail in the presence of these outliers, prompting the need for robust statistics. Using robust sparse logistic regression, we obtain 36 relevant genes, of which ca. 60% have previously been reported as biologically relevant to TNBC, reinforcing the validity of the method. The remaining 14 genes identified are new potential biomarkers for TNBC. Of these, JAM3, SFT2D2 and PAPSS1 were previously associated with breast tumors or other types of cancer. The relevance of these genes is confirmed by the new DetectDeviatingCells (DDC) outlier detection technique. A comparison of gene networks on the selected genes showed significant differences between TNBC and non-TNBC data; the individual role of FOXA1 in TNBC and non-TNBC, and the strong FOXA1-AGR2 connection in TNBC, stand out. Not only will our results contribute to the understanding of breast cancer and TNBC and ultimately its management, they also show that robust regression and outlier detection constitute key strategies for coping with high-dimensional clinical data such as omics data.
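    The study relies on robust sparse logistic regression; as a rough, non-robust stand-in, a plain L1-penalized logistic regression already shows how a small gene subset falls out of the fitted coefficients. X_genes and is_tnbc are hypothetical placeholders for the TCGA expression matrix and the TNBC labels.

```python
# Non-robust stand-in for the paper's robust sparse logistic regression:
# the L1 penalty drives most gene coefficients to exactly zero.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_genes = rng.normal(size=(150, 500))   # samples x genes (placeholder)
is_tnbc = rng.integers(0, 2, size=150)  # 1 = TNBC, 0 = non-TNBC (placeholder)

model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
model.fit(X_genes, is_tnbc)

selected = np.flatnonzero(model.coef_[0])  # genes with non-zero weights
print(f"{selected.size} genes selected out of {X_genes.shape[1]}")
```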

    Cross-study validation for the assessment of prediction algorithms

    Motivation: Numerous competing algorithms for prediction in high-dimensional settings have been developed in the statistical and machine-learning literature. Learning algorithms and the prediction models they generate are typically evaluated on the basis of cross-validation error estimates in a few exemplary datasets. However, in most applications, the ultimate goal of prediction modeling is to provide accurate predictions for independent samples obtained in different settings. Cross-validation within exemplary datasets may not adequately reflect performance in the broader application context. Methods: We develop and implement a systematic approach to 'cross-study validation', to replace or supplement conventional cross-validation when evaluating high-dimensional prediction models in independent datasets. We illustrate it via simulations and in a collection of eight estrogen-receptor-positive breast cancer microarray gene-expression datasets, where the objective is predicting distant metastasis-free survival (DMFS). We computed the C-index for all pairwise combinations of training and validation datasets, evaluated several alternatives for summarizing the pairwise validation statistics, and compared these to conventional cross-validation. Results: Our data-driven simulations and our application to survival prediction with eight breast cancer microarray datasets suggest that standard cross-validation produces inflated discrimination accuracy for all algorithms considered, when compared to cross-study validation. Furthermore, the ranking of learning algorithms differs, suggesting that algorithms performing best in cross-validation may be suboptimal when evaluated through independent validation. Availability: The survHD: Survival in High Dimensions package (http://www.bitbucket.org/lwaldron/survhd) will be made available through Bioconductor. Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.
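    The core of cross-study validation is training on each dataset and scoring on every other one, yielding a matrix of pairwise C-indices whose off-diagonal entries summarize independent validation. Below is a sketch with synthetic stand-in studies and a Cox model from scikit-survival; the paper's survHD package itself is not assumed here.

```python
# Pairwise train/validate loop: entry (i, j) is the C-index of a model
# trained on study i and evaluated on study j (diagonal left as NaN).
import numpy as np
from sksurv.linear_model import CoxPHSurvivalAnalysis
from sksurv.metrics import concordance_index_censored
from sksurv.util import Surv

rng = np.random.default_rng(0)

def make_study(n):
    """Synthetic stand-in for one gene-expression survival dataset."""
    X = rng.normal(size=(n, 5))
    time = rng.exponential(60.0, size=n)
    event = rng.integers(0, 2, size=n).astype(bool)
    return X, event, time

studies = [make_study(n) for n in (120, 150, 100)]
c = np.full((len(studies), len(studies)), np.nan)
for i, (Xi, ei, ti) in enumerate(studies):
    model = CoxPHSurvivalAnalysis().fit(Xi, Surv.from_arrays(ei, ti))
    for j, (Xj, ej, tj) in enumerate(studies):
        if i != j:                                 # off-diagonal: cross-study
            c[i, j] = concordance_index_censored(ej, tj, model.predict(Xj))[0]
print(np.round(c, 3))
```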

    An integrative approach unveils FOSL1 as an oncogene vulnerability in KRAS-driven lung and pancreatic cancer

    KRAS-mutated tumours represent a large fraction of human cancers, but the vast majority remain refractory to current clinical therapies. Thus, a deeper understanding of the molecular mechanisms triggered by the KRAS oncogene may yield alternative therapeutic strategies. Here we report the identification of a common transcriptional signature across mutant KRAS cancers of distinct tissue origin that includes the transcription factor FOSL1. High FOSL1 expression identifies mutant KRAS lung and pancreatic cancer patients with the worst survival outcome. Furthermore, FOSL1 genetic inhibition is detrimental to both KRAS-driven tumour types. Mechanistically, FOSL1 links the KRAS oncogene to components of the mitotic machinery, a pathway previously postulated to function orthogonally to oncogenic KRAS. FOSL1 targets include AURKA, whose inhibition impairs the viability of mutant KRAS cells. Lastly, the combination of AURKA and MEK inhibitors has a deleterious effect on mutant KRAS cells. Our findings unveil KRAS downstream effectors that provide opportunities to treat KRAS-driven cancers.

    A Comparison of Methods for Data-Driven Cancer Outlier Discovery, and An Application Scheme to Semisupervised Predictive Biomarker Discovery

    A core component in translational cancer research is biomarker discovery using gene expression profiling of clinical tumors. This is often based on cell line experiments: one population is sampled for inference in another. We present a semisupervised workflow focusing on binary (switch-like, bimodal) informative genes that are likely cancer-relevant, to mitigate this non-statistical problem. Outlier detection is a key enabling technology of the workflow and aids in identifying the focus genes.
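    One way to screen for such binary, switch-like genes, shown here as an illustration rather than the authors' exact criterion, is to fit a two-component Gaussian mixture per gene and keep genes whose component means are well separated:

```python
# Crude bimodality screen: a gene is a "switch-like" candidate when a
# two-component Gaussian mixture finds two well-separated means.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
expr = rng.normal(size=(100, 50))  # samples x genes (placeholder)
expr[:50, 0] += 4.0                # make gene 0 clearly bimodal

bimodal = []
for g in range(expr.shape[1]):
    gm = GaussianMixture(n_components=2, random_state=0).fit(expr[:, [g]])
    mu = gm.means_.ravel()
    sd = np.sqrt(gm.covariances_.ravel())
    if abs(mu[0] - mu[1]) > 2 * sd.mean():  # crude separation rule
        bimodal.append(g)
print("candidate switch-like genes:", bimodal)
```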

    Dynamic early identification of hip replacement implants with high revision rates. Study based on the NJR data from UK during 2004-2012

    BACKGROUND: Hip replacement and hip resurfacing are common surgical procedures, with an estimated risk of revision of 4% over a 10-year period. Approximately 58% of hip replacements last 25 years. Some implants have higher revision rates, so early identification of poorly performing hip replacement implant brands and cup/head brand combinations is vital. AIMS: Development of a dynamic monitoring method for the revision rates of hip implants. METHODS: Data on the outcomes following hip replacement surgery between 2004 and 2012 were obtained from the National Joint Registry (NJR) in the UK. A novel dynamic algorithm based on the CUmulative SUM (CUSUM) methodology, with adjustment for casemix and a random frailty for each operating unit, was developed and implemented to monitor the revision rates over time. The Benjamini-Hochberg FDR method was used to adjust for multiple testing of numerous hip replacement implant brands and cup/head combinations at each time point. RESULTS: Three poorly performing cup brands and two cup/head brand combinations were detected. The Wright Medical UK Ltd Conserve Plus Resurfacing Cup (cup o), the DePuy ASR Resurfacing Cup (cup e) and the Endo Plus (UK) Limited EP-Fit Plus Polyethylene cup (cup g) showed stable multiple alarms over a period of a year or longer. The addition of a random frailty term did not change the list of underperforming components. The model with the added random effect was more conservative, showing fewer and more delayed alarms. CONCLUSIONS: Our new algorithm is an efficient method for early detection of poorly performing components in hip replacement surgery. It can also be used for similar tasks of dynamic quality monitoring in healthcare.
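    At its core, the scheme accumulates evidence against an in-control revision rate. A minimal Bernoulli CUSUM conveys the idea; it deliberately omits the paper's casemix adjustment, frailty terms and Benjamini-Hochberg correction, and the rates p0, p1 and threshold h below are illustrative only.

```python
# Minimal Bernoulli CUSUM: the score adds the log-likelihood ratio of an
# elevated revision probability p1 against the in-control p0, floors at
# zero, and raises an alarm whenever it crosses the boundary h.
import numpy as np

def bernoulli_cusum(revised, p0=0.04, p1=0.08, h=4.0):
    """revised: 0/1 outcomes per operation, in chronological order."""
    llr_fail = np.log(p1 / p0)            # increment for a revision
    llr_ok = np.log((1 - p1) / (1 - p0))  # (negative) increment otherwise
    score, alarms = 0.0, []
    for t, r in enumerate(revised):
        score = max(0.0, score + (llr_fail if r else llr_ok))
        if score >= h:
            alarms.append(t)              # signal, then restart monitoring
            score = 0.0
    return alarms

rng = np.random.default_rng(0)
outcomes = rng.random(2000) < 0.09        # a poorly performing implant
print("alarm positions:", bernoulli_cusum(outcomes))
```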

    Risk-adjusted CUSUM control charts for shared frailty survival models with application to hip replacement outcomes: a study using the NJR dataset

    Background: Continuous monitoring of surgical outcomes after joint replacement is needed to detect which brands' components have a higher than expected failure rate and are therefore no longer recommended for use in surgical practice. We developed a monitoring method based on the cumulative sum (CUSUM) chart specifically for this application. Methods: Our method entails the use of a competing risks model with Weibull and Gompertz hazard functions, adjusted for observed covariates, to approximate the baseline time-to-revision and time-to-death distributions, respectively. Correlated shared frailty terms for the competing risks, corresponding to the operating unit, are also included in the model. A bootstrap-based boundary adjustment is then required for risk-adjusted CUSUM charts to guarantee a given false-alarm probability. We propose a method to evaluate the CUSUM scores and the adjusted boundary for a survival model with shared frailty terms. We also introduce a unit performance quality score based on the posterior frailty distribution. The method is illustrated using the 2003-2012 hip replacement data from the UK National Joint Registry (NJR). Results: We found that the best model included the shared frailty for revision but not for death, meaning that the competing risks of revision and death are independent in the NJR data. Our method was superior to the standard NJR methodology: for one of the two monitored components, it produced alarms four years before the increased failure rate came to the attention of the UK regulatory authorities. The hazard ratios of revision across the units varied from 0.38 to 2.28. Conclusions: The earlier detection of failure signals by our method, in comparison with the standard method used by the NJR, may be explained by proper risk adjustment and the ability to accommodate time-dependent hazards. The continuous monitoring of hip replacement outcomes should include risk adjustment at both the individual and the unit level.
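    The bootstrap boundary adjustment can be sketched by simulating in-control outcome sequences, recording each run's maximum CUSUM score, and placing the boundary at the quantile that caps the false-alarm probability. Frailty terms and competing risks from the paper are deliberately left out of this toy version.

```python
# Calibrate the CUSUM boundary h by simulation: under the in-control
# rate p0, pick h as the 95th percentile of the runs' peak scores, so
# roughly 5% of in-control units would ever alarm.
import numpy as np

def max_cusum(revised, p0=0.04, p1=0.08):
    llr_fail = np.log(p1 / p0)
    llr_ok = np.log((1 - p1) / (1 - p0))
    score = peak = 0.0
    for r in revised:
        score = max(0.0, score + (llr_fail if r else llr_ok))
        peak = max(peak, score)
    return peak

rng = np.random.default_rng(0)
peaks = [max_cusum(rng.random(2000) < 0.04)  # in-control: true rate = p0
         for _ in range(500)]
h = np.quantile(peaks, 0.95)                 # ~5% false-alarm probability
print(f"bootstrap boundary h = {h:.2f}")
```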