
    Application of Machine Learning in Cancer Research

    This dissertation revisits the problem of five-year survivability prediction for breast cancer using machine learning tools. This work is distinguished from past experiments by the size of the training data, the unbalanced distribution of data between the minority and majority classes, and modified data cleaning procedures. These experiments are also based on the principles of tidy data and reproducible research. To fine-tune the predictions, a set of experiments was run using naive Bayes, decision trees, and logistic regression. Of particular interest were strategies to improve the recall level for the minority class, as the cost of misclassification is prohibitive. One of the main contributions of this work is the finding that logistic regression with the proper predictors and class weight gives the highest precision/recall level for the minority class. In regression modeling with a large number of predictors, correlation among predictors is quite common, and the estimated model coefficients might not be very reliable. In these situations, the Variance Inflation Factor (VIF) and the Generalized Variance Inflation Factor (GVIF) are used to overcome the correlation problem. Our experiments are based on the Surveillance, Epidemiology, and End Results (SEER) database for the problem of survivability prediction. Some of the specific contributions of this thesis are: · A detailed process for data cleaning and binary classification of 338,596 breast cancer patients. · A computational approach for omitting predictors and categorical predictors based on VIF and GVIF. · Various applications of the Synthetic Minority Over-sampling Technique (SMOTE) to increase precision and recall. · An application of Edited Nearest Neighbor to obtain the highest F1-measure. In addition, this work provides precise algorithms and code for determining class membership and executing competing methods. This code can facilitate the reproduction and extension of our work by other researchers.
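The abstract above relies on the Variance Inflation Factor to flag correlated predictors. As a rough illustration only (not the dissertation's code), the sketch below computes VIFs for numeric predictors using the standard identity that the VIF of predictor j equals the j-th diagonal entry of the inverse correlation matrix. The function names and the toy data in the usage note are hypothetical, and GVIF for categorical predictors is not covered.

```python
from statistics import mean


def correlation_matrix(columns):
    """Pearson correlation matrix for a list of equal-length numeric columns."""
    centered = []
    for col in columns:
        m = mean(col)
        centered.append([v - m for v in col])
    norms = [sum(v * v for v in c) ** 0.5 for c in centered]
    p = len(columns)
    return [
        [
            sum(a * b for a, b in zip(centered[i], centered[j])) / (norms[i] * norms[j])
            for j in range(p)
        ]
        for i in range(p)
    ]


def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    # Augment the matrix with the identity: [A | I] is reduced to [I | A^-1].
    aug = [
        list(row) + [1.0 if i == j else 0.0 for j in range(n)]
        for i, row in enumerate(matrix)
    ]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("predictors are perfectly collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def vif(columns):
    """Return one VIF per predictor column (diagonal of the inverse correlation matrix)."""
    inverse = invert(correlation_matrix(columns))
    return [inverse[j][j] for j in range(len(columns))]
```

For two predictors with correlation r, both VIFs reduce to 1 / (1 - r^2), so calling `vif([[1, 2, 3, 4, 5], [2, 1, 4, 3, 5]])` on this made-up data (r = 0.8) gives values close to 2.78. A common rule of thumb, which is an assumption here and not a threshold stated in the abstract, treats values above roughly 5 to 10 as a sign that a predictor is a candidate for omission.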

    auton-survival: an Open-Source Package for Regression, Counterfactual Estimation, Evaluation and Phenotyping with Censored Time-to-Event Data

    Applications of machine learning in healthcare often require working with time-to-event prediction tasks, including prognostication of an adverse event, re-hospitalization, or death. Such outcomes are typically subject to censoring due to loss to follow-up. Standard machine learning methods cannot be applied in a straightforward manner to datasets with censored outcomes. In this paper, we present auton-survival, an open-source repository of tools to streamline working with censored time-to-event or survival data. auton-survival includes tools for survival regression, adjustment in the presence of domain shift, counterfactual estimation, phenotyping for risk stratification, evaluation, and estimation of treatment effects. Through real-world case studies employing a large subset of the SEER oncology incidence data, we demonstrate the ability of auton-survival to rapidly support data scientists in answering complex health and epidemiological questions.
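To make concrete why censored outcomes need special handling, here is a minimal sketch of the Kaplan-Meier survival estimator in plain Python. It is a generic textbook illustration and does not use or reproduce the auton-survival API; the function name and argument layout are assumptions made for this example.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier estimate of the survival function.

    times  -- follow-up time for each subject
    events -- 1 if the event was observed at that time, 0 if the subject was censored
    Returns a list of (time, survival_probability) pairs, one per distinct event time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        observed = 0  # events observed exactly at time t
        leaving = 0   # subjects leaving the risk set at time t (events + censored)
        while i < len(data) and data[i][0] == t:
            observed += data[i][1]
            leaving += 1
            i += 1
        if observed > 0:
            # Multiply by the conditional probability of surviving past t.
            survival *= 1 - observed / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve
```

A censored subject never triggers a drop in the curve, but it does shrink the risk set for later times. That is the information a standard classifier or regressor would discard if censored rows were simply dropped or treated as event-free.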

    Discrimination of healthy and colorectal cancer patients using FTIR and PLS-DA

    Spectroscopic methods have already been used as effective tools in several studies involving the detection of cancer. Fourier transform infrared spectroscopy (FTIR) has already been applied in the discrimination of cancer cells and tissues or blood of patients with the disease, although this technique requires the use of chemometric algorithms to obtain such results. The aim of this study was to employ partial least squares discriminant analysis (PLS-DA) with FTIR data in the discrimination of plasma samples from patients with colorectal cancer (CRC) and healthy individuals. Multivariate analysis was performed using PLS-DA of the sample triplicates (n=90) with different types of processing. The best PLS-DA condition was obtained using the 1st derivative, 1 orthogonal signal correction (OSC) and no pre-processing. With 1 factor only, the model presented a root mean square error of cross-validation (RMSECV) of 0.0004 and a coefficient of determination (r^2) of 1.0000. The accuracy, precision and sensitivity of the model were 100%.

    Predicting Other Cause Mortality Risk for Older Men with Localized Prostate Cancer: A Dissertation

    Background: Overtreatment of localized prostate cancer (PCa) is a concern as many men die of other causes prior to experiencing a treatment benefit. This dissertation characterizes the need for assessing other cause mortality (OCM) risk in older men with PCa and informs efforts to identify patients most likely to benefit from definitive PCa treatment. Methods: Using the linked Surveillance Epidemiology and End Results-Medicare Health Outcomes Survey database, 2,931 men (mean age=75) newly diagnosed with clinical stage T1a-T3a PCa from 1998-2009 were identified. Survival analysis methods were used to compare observed 10-year OCM by primary treatment type. Age and health factors predictive of primary treatment type were assessed with multinomial logistic regression. Predicted mortality estimates from Social Security life tables (recommended for life expectancy evaluation) and two OCM risk estimation tools were compared to observed rates. An improved OCM prediction model was developed fitting Fine and Gray competing risks models for 10-year OCM with age, sociodemographic, comorbidity, activities of daily living, and patient-reported health data as predictors. The tools’ ability to discriminate between patients who died and those who did not was evaluated with Harrell’s c-index (range 0.5-1), which also guided new model selection. Results: Fifty-four percent of older men with localized PCa underwent radiotherapy while 13% underwent prostatectomy. Twenty-three percent of those treated with radiotherapy and 12% of those undergoing prostatectomy experienced OCM within 10 years of treatment and thus were considered overtreated. Health factors indicative of a shorter life expectancy (increased comorbidity, worse physical health, smoking) had little to no association with radiotherapy assignment but were significantly related to reductions in the likelihood of undergoing prostatectomy. 
Social Security life tables overestimated mortality risk and discriminated poorly between men who died and those who did not over 10 years (c-index=0.59). Existing OCM risk estimation tools were less likely to overestimate OCM rates and had limited but improved discrimination (c-index=0.64). A risk model developed with self-reported age, Charlson comorbidity index score, overall health (excellent-good/fair/poor), smoking, and marital status predictors had improved discrimination (c-index=0.70). Conclusions: Overtreatment of older men with PCa is primarily attributable to radiotherapy and may be reduced by pretreatment assessment of mortality-related health factors. This dissertation provides a prognostic model which utilizes a set of five self-reported characteristics that better identify patients likely to die of OCM within 10 years of diagnosis than age and comorbidity-based assessments alone.
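The c-index values quoted above (0.59, 0.64, 0.70) measure how often a model ranks patients correctly by risk. A minimal sketch of Harrell's c-index for right-censored data follows. It is the basic textbook form and makes no adjustment for competing risks, so it should not be read as the dissertation's exact procedure; the function and argument names are hypothetical.

```python
def harrell_c_index(times, events, risk_scores):
    """Harrell's concordance index for right-censored data.

    A pair (i, j) is comparable when subject i had an observed event and a
    shorter follow-up time than subject j. The pair is concordant when the
    subject who failed earlier also received the higher risk score; tied
    scores count as half-concordant.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is the scale on which the reported c-indices sit.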

    Predicting the 10-year risk of death from other causes in men with localized prostate cancer using patient-reported factors: Development of a tool

    OBJECTIVE: To develop a tool for estimating the 10-year risk of death from other causes in men with localized prostate cancer. SUBJECTS AND METHODS: We identified 2,425 patients from the Surveillance Epidemiology and End Results-Medicare Health Outcomes Survey database, age < 80, newly diagnosed with clinical stage T1-T3a prostate cancer from 1/1/1998-12/31/2009, with follow-up through 2/28/2013. We developed a Fine and Gray competing-risks model for 10-year other cause mortality considering age, patient-reported comorbid medical conditions, component scores and items of the SF-36 Health Survey, activities of daily living, and sociodemographic characteristics. Model discrimination and calibration were compared to predictions from Social Security life table mortality risk estimates. RESULTS: Over a median follow-up of 7.7 years, 76 men died of prostate-specific causes and 465 died of other causes. The strongest predictors of 10-year other cause mortality risk included increasing age at diagnosis, higher approximated Charlson Comorbidity Index score, worse patient-reported general health (fair or poor vs. excellent-good), smoking at diagnosis, and marital status (all other vs. married) (all p < 0.05). Model discrimination improved over Social Security life tables (c-index of 0.70 vs. 0.59, respectively). Predictions were more accurate than predictions from the Social Security life tables, which overestimated risk in our population. CONCLUSIONS: We provide a tool for estimating the 10-year risk of dying from other causes when making decisions about treating prostate cancer using pre-treatment patient-reported characteristics.

    Prognostic utility of the breast cancer index and comparison to Adjuvant! Online in a clinical case series of early breast cancer

    Introduction: Breast Cancer Index (BCI) combines two independent biomarkers, HOXB13:IL17BR (H:I) and the 5-gene molecular grade index (MGI), that assess estrogen-mediated signalling and tumor grade, respectively. BCI stratifies early-stage estrogen-receptor positive (ER+), lymph-node negative (LN-) breast cancer patients into three risk groups and provides a continuous assessment of individual risk of distant recurrence. Objectives of the current study were to validate BCI in a clinical case series and to compare the prognostic utility of BCI and Adjuvant!Online (AO). Methods: Tumor samples from 265 ER+LN- tamoxifen-treated patients were identified from a single academic institution's cancer research registry. The BCI assay was performed and scores were assigned based on a pre-determined risk model. Risk was assessed by BCI and AO and correlated to clinical outcomes in the patient cohort. Results: BCI was a significant predictor of outcome in a cohort of 265 ER+LN- patients (median age: 56-y; median follow-up: 10.3-y), treated with adjuvant tamoxifen alone or tamoxifen with chemotherapy (32%). BCI categorized 55%, 21%, and 24% of patients as low, intermediate and high-risk, respectively. The 10-year rates of distant recurrence were 6.6%, 12.1% and 31.9% and of breast cancer-specific mortality were 3.8%, 3.6% and 22.1% in low, intermediate, and high-risk groups, respectively. In a multivariate analysis including clinicopathological factors, BCI was a significant predictor of distant recurrence (HR for 5-unit increase = 5.32 [CI 2.18-13.01; P = 0.0002]) and breast cancer-specific mortality (HR for a 5-unit increase = 9.60 [CI 3.20-28.80; P < 0.0001]). AO was significantly associated with risk of recurrence. In a separate multivariate analysis, both BCI and AO were significantly predictive of outcome. 
In a time-dependent (10-y) ROC curve accuracy analysis of recurrence risk, adding BCI to AO increased predictive accuracy in all patients from 66% (AO only) to 76% (AO+BCI) and in tamoxifen-only treated patients from 65% to 81%. Conclusions: This study validates the prognostic performance of BCI in ER+LN- patients. In this characteristically low-risk cohort, BCI classified high versus low-risk groups with ~5-fold difference in 10-year risk of distant recurrence and breast cancer-specific death. BCI and AO are independent predictors, with BCI having additive utility beyond the standard of care parameters that are encompassed in AO.

    Clinical prediction modelling in oral health: A review of study quality and empirical examples of model development

    Background: Substantial efforts have been made to improve the reproducibility and reliability of scientific findings in health research. These efforts include the development of guidelines for the design, conduct and reporting of preclinical studies (ARRIVE), clinical trials (ROBINS-I, CONSORT), observational studies (STROBE), and systematic reviews and meta-analyses (PRISMA). In recent years, the use of prediction modelling has increased in the health sciences. Clinical prediction models use information at the individual patient level to estimate the probability of a health outcome(s). Such models offer the potential to assist in clinical decision-making and to improve medical care. Guidelines such as PROBAST (Prediction model Risk Of Bias Assessment Tool) have been recently published to further inform the conduct of prediction modelling studies. Related guidelines for the reporting of these studies, such as the TRIPOD (Transparent Reporting of a multivariable prediction model for Individual Prognosis or Diagnosis) instrument, have also been developed. Since the early 2000s, oral health prediction models have been used to predict the risk of various types of oral conditions, including dental caries, periodontal diseases and oral cancers. However, there is a lack of information on the methodological quality and reporting transparency of the published oral health prediction modelling studies. As a consequence, and due to the unknown quality and reliability of these studies, it remains unclear to what extent it is possible to generalise their findings and to replicate their derived models. Moreover, there remains a need to demonstrate the conduct of prediction modelling studies in the oral health field following the contemporary guidelines. This doctoral project addresses these issues using two systematic reviews and two empirical analyses. 
This thesis is the first comprehensive and systematic project reviewing the study quality and demonstrating the use of registry data and longitudinal cohorts to develop clinical prediction models in oral health. Aims: • To identify and examine the quality of existing prediction modelling studies in the major fields of oral health. • To demonstrate the conduct and reporting of a prediction modelling study following current guidelines, incorporating machine learning algorithms and accounting for multiple sources of biases. Methods: As one of the most prevalent oral conditions, chronic periodontitis was chosen as the exemplar pathology for the first part of this thesis. A systematic review was conducted to investigate the existing prediction models for the incidence and progression of this condition. Based upon this initial overview, a more comprehensive critical review was conducted to assess the methodological quality and completeness of reporting for prediction modelling studies in the field of oral health. The risk of bias in the existing literature was assessed using the PROBAST criteria, and the quality of study reporting was measured in accordance with the TRIPOD guidelines. Following these two reviews, this research project demonstrated the conduct and reporting of a clinical prediction modelling study using two empirical examples. Two types of analyses that are commonly used for two different types of outcome data were adopted: survival analysis for censored outcomes and logistic regression analysis for binary outcomes. 
Models were developed to 1) predict the three- and five-year disease-specific survival of patients with oral and pharyngeal cancers, based on 21,154 cases collected by a large cancer registry program in the US, the Surveillance, Epidemiology and End Results (SEER) program, and 2) to predict the occurrence of acute and persistent pain following root canal treatment, based on the electronic dental records of 708 adult patients collected by the National Practice-Based Research Network. In these two case studies, all prediction models were developed in five steps: (i) framing the research question; (ii) data acquisition and pre-processing; (iii) model generation; (iv) model validation and performance evaluation; and (v) model presentation and reporting. In accordance with the PROBAST recommendations, the risk of bias during the modelling process was reduced in the following aspects: • In the first case study, three types of biases were taken into account: (i) bias due to missing data was reduced by adopting compatible methods to conduct imputation; (ii) bias due to unmeasured predictors was tested by sensitivity analysis; and (iii) bias due to the initial choice of modelling approach was addressed by comparing tree-based machine learning algorithms (survival tree, random survival forest and conditional inference forest) with the traditional statistical model (Cox regression). 
• In the second case study, the following strategies were employed: (i) missing data were addressed by multiple imputation with missing indicator methods; (ii) a multilevel logistic regression approach was adopted for model development in order to fit the hierarchical structure of the data; (iii) model complexity was reduced using the Least Absolute Shrinkage and Selection Operator (LASSO) for predictor selection; (iv) the models’ predictive performance was evaluated comprehensively by using the Area Under the Precision Recall Curve (AUPRC) in addition to the Area Under the Receiver Operating Characteristic curve (AUROC); and (v) most importantly, given the existing criticism in the research community concerning the gender-based and racial bias in risk prediction models, we compared the predictive performance of models built with different sets of predictors (including a clinical set, a sociodemographic set and a combination of both, the ‘general’ set). Results: The first and second review studies indicated that, in the field of oral health, the popularity of multivariable prediction models has increased in recent years. Bias and variance are two components of the uncertainty (e.g., the mean squared error) in model estimation. However, the majority of the existing studies did not account for various sources of bias, such as measurement error and inappropriate handling of missing data. Moreover, non-transparent reporting and lack of reproducibility of the models were also identified in the existing oral health prediction modelling studies. These findings provided motivation to conduct two case studies aimed at demonstrating adherence to the contemporary guidelines and to best practice. In the third study, comparable predictive capabilities between Cox regression and the non-parametric tree-based machine learning algorithms were observed for predicting the survival of patients with oral and pharyngeal cancers. 
For example, the C-index values for a Cox model and a random survival forest in predicting three-year survival were 0.82 and 0.84, respectively. A novelty of this study was the development of an online calculator designed to provide an open and transparent estimation of patients’ survival probability for up to five years after diagnosis. This calculator has clinical translational potential and could aid in patient stratification and treatment planning, at least in the context of ongoing research. In addition, the transparent reporting of this study was achieved by following the TRIPOD checklist and sharing all data and code. In the fourth study, LASSO regression suggested that pre-treatment clinical factors were important in the development of one-week and six-month postoperative pain following root canal treatment. Among all the developed multilevel logistic models, models with a clinical set of predictors yielded similar predictive performance to models with a general set of predictors, while the models with sociodemographic predictors showed the weakest predictive ability. For example, for predicting one-week postoperative pain, the AUROC for models with clinical, sociodemographic and general predictors were 0.82, 0.68 and 0.84, respectively, and the AUPRC were 0.66, 0.40 and 0.72, respectively. Conclusion: The significance of this research project is twofold. First, prediction models have been developed for potential clinical use in the context of various oral conditions. Second, this research represents the first attempt to standardise the conduct of this type of study in oral health research. This thesis presents three conclusions: 1) Adherence to contemporary best practice guidelines such as PROBAST and TRIPOD is limited in the field of oral health research. In response, this PhD project disseminates these guidelines and leverages their advantages to develop effective prediction models for use in dentistry and oral health. 
2) Use of appropriate procedures, accounting for and adapting to multiple sources of bias in model development, produces predictive tools of increased reliability and accuracy that hold the potential to be implemented in clinical practice. Therefore, for future prediction modelling research, it is important that data analysts work towards eliminating bias, regardless of the areas in which the models are employed. 3) Machine learning algorithms provide alternatives to traditional statistical models for clinical prediction purposes. Additionally, in the presence of clinical factors, sociodemographic characteristics contribute less to the improvement of models’ predictive performance or to providing cogent explanations of the variance in the models, regardless of the modelling approach. Therefore, it is timely to reconsider the use of sociodemographic characteristics in clinical prediction modelling research. It is suggested that this is a proportionate and evidence-based strategy aimed at reducing biases in healthcare risk prediction that may be derived from gender and racial characteristics inherent in sociodemographic data sets. Thesis (Ph.D.) -- University of Adelaide, School of Public Health, 202
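The thesis above reports both AUROC and AUPRC because the two can diverge when the positive class is rare. The sketch below shows one simple way to compute each for binary labels in plain Python. It is an illustration under stated assumptions, not the thesis code: AUROC is computed as the fraction of positive/negative pairs ranked correctly, and AUPRC is approximated by average precision, one common step-wise estimator that does not treat tied scores specially. The function names are hypothetical.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    # Each correctly ordered pair counts 1, each tied pair counts 0.5.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


def average_precision(labels, scores):
    """Average precision: mean of the precision values at each true positive's rank."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision among the top `rank` predictions
    if hits == 0:
        raise ValueError("need at least one positive label")
    return total / hits
```

A random classifier scores about 0.5 on AUROC regardless of prevalence, while its average precision falls toward the share of positives in the data. That difference is one reason to read an AUPRC such as 0.40 against the outcome's prevalence and not against 0.5.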