504 research outputs found

    Stratified Pathway Analysis to Identify Gene Sets Associated with Oral Contraceptive Use and Breast Cancer


    Improving Prognostic Models In Breast Cancer With Biostatistical Analysis Of The Phosphatidyl Inositol 3-Kinase Pathway

    IMPROVING PROGNOSTIC MODELS IN BREAST CANCER WITH BIOSTATISTICAL ANALYSIS OF THE PI3-KINASE PATHWAY. Elliot James Rapp, Jena P. Giltnane, David L. Rimm, Annette Molinaro. Department of Biostatistics, Yale School of Public Health, Yale University School of Medicine, New Haven, CT. Our hypothesis was that prognostic models for breast cancer that incorporate both clinical variables and biomarkers in the PI3 Kinase molecular pathway will improve upon the clinical models of TNM staging and the Nottingham Prognostic Index (NPI). Our specific aim was to develop models that misclassify fewer patients than TNM and NPI with the outcome of dead of disease at ten years. Our population cohort was the YTMA49 cohort, a series of 688 samples of invasive ductal breast carcinoma collected between 1961 and 1983 by the Yale University Department of Pathology. Tissue MicroArray (TMA) analysis was performed and biomarker expression level was determined using Automated Quantitative Analysis (AQUA) technology for thirteen biomarkers in the PI3 Kinase pathway, including an overall expression level and expression levels by subcellular compartment. Eleven clinical variables were also assembled from our cohort. Exhaustively searching the multivariate space, we used logistic regression to predict our outcome of dead of disease at ten years. Validation was performed using Leave One Out Cross Validation (LOOCV). Misclassification estimates provided the means to compare different models, with lower misclassification estimates indicating superior models. Confidence intervals were constructed using bootstrapping with one thousand iterations. We developed a helper computer program named Combination Magic to enable us to develop sophisticated models that included both interactions between variables and transformations of variables (e.g. logarithm). 
Overall, our best univariate models were NPI (misclassification estimate (ME): 0.326, confidence interval (CI): 0.292 to 0.359), nodal status (ME: 0.353, CI: 0.322 to 0.493), and TNM (ME: 0.367, CI: 0.313 to 0.447). Our best univariate models from the PI3 Kinase biomarkers were FOX01_NU (ME: 0.369, CI: 0.336 to 0.415), AKT1_TM (ME: 0.373, CI: 0.335 to 0.412), and PI3Kp110_TM (ME: 0.377, CI: 0.343 to 0.431). Our best bivariate models were pTumor*PathER (ME: 0.328, CI: 0.308 to 0.443), pNode + NuGrade (ME: 0.333, CI: 0.305 to 0.434), and AKT1_NN + Fox01_NU (ME: 0.338, CI: 0.307 to 0.391). Our best trivariate models were pTumor + mTOR_NN + PI3Kp110_TM + pTumor*PI3Kp110_TM (ME: 0.296, CI: 0.273 to 0.375), pTumor + AKT1_NU + Fox01_NU + pTumor*AKT1_NU (ME: 0.298, CI: 0.275 to 0.38), and pTumor + mTOR_TM + PI3Kp110_TM + pTumor*PI3Kp110_TM (ME: 0.299, CI: 0.276 to 0.378). Our best multivariate model was Fox01_NU + AKT1_NU + mTOR_MB + p70S6K_NU + AVG_BCL2_TM + Fox01_NU*AKT1_NU*mTOR_MB (ME: 0.295, CI: 0.274 to 0.393). None of these models was statistically superior to the clinical models of TNM and NPI.
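
The validation scheme this abstract describes can be sketched on synthetic data: leave-one-out cross-validated logistic regression gives a misclassification estimate, and a bootstrap gives its confidence interval. The cohort, biomarkers, and sizes below are illustrative assumptions (the YTMA49 data are not available here), and the inner bootstrap uses 5-fold CV and 200 replicates for speed rather than the abstract's LOOCV and 1,000 iterations.

```python
# Hypothetical sketch of LOOCV misclassification + bootstrap CI; synthetic data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_predict

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 2))                      # e.g. a clinical score and one biomarker
logit = 0.8 * X[:, 0] - 0.5 * X[:, 1]
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))    # outcome: dead of disease at 10 years

# Leave-one-out cross-validated predictions -> misclassification estimate
pred = cross_val_predict(LogisticRegression(), X, y, cv=LeaveOneOut())
me = np.mean(pred != y)

# Bootstrap CI for the misclassification estimate (200 reps, 5-fold CV for speed)
reps = []
for _ in range(200):
    idx = rng.integers(0, n, n)
    Xb, yb = X[idx], y[idx]
    pb = cross_val_predict(LogisticRegression(), Xb, yb, cv=5)
    reps.append(np.mean(pb != yb))
lo, hi = np.percentile(reps, [2.5, 97.5])
print(f"ME={me:.3f}, 95% CI=({lo:.3f}, {hi:.3f})")
```

Lower misclassification estimates indicate superior models, which is how the univariate, bivariate, and trivariate candidates above are ranked.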

    Statistical aspect of translational and correlative studies in clinical trials

    In this article, we describe statistical issues related to the conduct of translational and correlative studies in cancer clinical trials. In the era of personalized medicine, proper biomarker discovery and validation is crucial for producing groundbreaking research. In order to carry out the framework outlined in this article, a team effort between oncologists and statisticians is the key to success.

    Application and Extension of Weighted Quantile Sum Regression for the Development of a Clinical Risk Prediction Tool

    In clinical settings, the diagnosis of medical conditions is often aided by measurement of various serum biomarkers through the use of laboratory tests. These biomarkers provide information about different aspects of a patient’s health and the overall function of different organs. In this dissertation, we develop and validate a weighted composite index that aggregates the information from a variety of health biomarkers covering multiple organ systems. The index can be used for predicting all-cause mortality and could also be used as a holistic measure of overall physiological health status. We refer to it as the Health Status Metric (HSM). Validation analysis shows that the HSM is predictive of long-term mortality risk and exhibits a robust association with concurrent chronic conditions, recent hospital utilization, and self-rated health. We develop the HSM using Weighted Quantile Sum (WQS) regression (Gennings et al., 2013; Carrico, 2013), a novel penalized regression technique that imposes nonnegativity and unit-sum constraints on the coefficients used to weight index components. In this dissertation, we develop a number of extensions to the WQS regression technique and apply them to the construction of the HSM. We introduce a new guided approach for the standardization of index components which accounts for potential nonlinear relationships with the outcome of interest. An extended version of the WQS that accommodates interaction effects among index components is also developed and implemented. In addition, we demonstrate that ensemble learning methods borrowed from the field of machine learning can be used to improve the predictive power of the WQS index. Specifically, we show that the use of techniques such as weighted bagging, the random subspace method and stacked generalization in conjunction with the WQS model can produce an index with substantially enhanced predictive accuracy. Finally, practical applications of the HSM are explored. 
A comparative study is performed to evaluate the feasibility and effectiveness of a number of ‘real-time’ imputation strategies in potential software applications for computing the HSM. In addition, the efficacy of the HSM as a predictor of hospital readmission is assessed in a cohort of emergency department patients.
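
The defining constraint of WQS regression, as described above, is that component weights are nonnegative and sum to one, applied to quantile-scored components. A minimal sketch of that constraint on synthetic data (not the Gennings et al. implementation; the components, quartile scoring, and linear outcome are assumptions for illustration):

```python
# Hypothetical sketch of WQS-style constrained weight estimation; synthetic data.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n, p = 300, 4
X = rng.normal(size=(n, p))
# score each component into quartiles (0..3), as WQS does
Q = np.column_stack([
    np.digitize(x, np.quantile(x, [0.25, 0.5, 0.75])) for x in X.T
]).astype(float)
true_w = np.array([0.6, 0.3, 0.1, 0.0])
y = Q @ true_w + rng.normal(scale=0.5, size=n)

def loss(w):
    return np.mean((y - Q @ w) ** 2)

# nonnegativity via bounds, unit sum via an equality constraint
cons = ({'type': 'eq', 'fun': lambda w: w.sum() - 1.0},)
res = minimize(loss, x0=np.full(p, 1.0 / p), bounds=[(0, 1)] * p,
               constraints=cons)
w_hat = res.x
print(np.round(w_hat, 2))  # weights are >= 0 and sum to 1
```

The estimated weights form the index: components with no association with the outcome are driven toward zero weight, which is what makes the resulting composite interpretable as a weighted quantile sum.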

    Precision medicine methodology development with application to survival and genomics data

    Precision medicine and genomics data provide chances for better decision making in the public health domain. In this dissertation, we develop some important elements of precision medicine and address some aspects of genomics data. The first element is developing a nonparametric regression method for interval censored data. We develop a method called Interval Censored Recursive Forests (ICRF), an iterative random forest survival estimator for interval censored data. This method solves the splitting bias problem in tree-based methods for censored data. For this task, we develop consistent splitting rules and employ a recursion technique. This estimator is uniformly consistent and shows high prediction accuracy in simulations and data analyses. Second, we develop an estimator of the optimal dynamic treatment regime (DTR) for survival outcomes with dependent censoring. When one wants to maximize the survival time or the survival probability of cancer patients who go through multiple rounds of chemotherapies, finding the dynamic optimal treatment regime is complicated by the incompleteness of the survival information. Some patients may drop out or face failure before going through all the preplanned treatment stages, which results in a different number of treatment stages for different patients. To address this issue, we generalize the Q-learning approach and the random survival forest framework. This new method also overcomes limitations of the existing methods---independent censoring or a strong modeling structure of the failure time. We show consistency of the value of the estimator and illustrate the performance of the method through simulations and analysis of the leukemia patient data and the national mortality data. Third, we develop a method that measures gene-gene associations after adjusting for the dropout events in single cell RNA sequencing (scRNA-seq) data. 
Positing a bivariate zero-inflated negative binomial (BZINB) model, we estimate the dropout probability and measure the underlying correlation after controlling for the dropout effects. The gene-gene association measured in this way can serve as a building block of gene set testing methods. The BZINB model has a straightforward latent variable interpretation and is estimated using the EM algorithm.
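
The dropout problem the BZINB model addresses can be illustrated with a toy simulation: a shared gamma frailty induces correlated negative binomial counts for two genes, and independent dropout (excess zeros) attenuates the correlation computed from the observed counts. This is purely synthetic and is not the authors' estimator; all rates and dropout probabilities are invented.

```python
# Toy illustration of dropout attenuating gene-gene correlation; not the BZINB fit.
import numpy as np

rng = np.random.default_rng(2)
n = 5000
frailty = rng.gamma(shape=2.0, scale=1.0, size=n)   # shared latent rate
x = rng.poisson(3.0 * frailty)                      # gene 1 counts (NB marginal)
y = rng.poisson(3.0 * frailty)                      # gene 2 counts (NB marginal)
drop_x = rng.random(n) < 0.4                        # independent dropout events
drop_y = rng.random(n) < 0.4
x_obs = np.where(drop_x, 0, x)                      # observed zero-inflated counts
y_obs = np.where(drop_y, 0, y)

r_latent = np.corrcoef(x, y)[0, 1]
r_obs = np.corrcoef(x_obs, y_obs)[0, 1]
print(f"latent r={r_latent:.2f}, observed r={r_obs:.2f}")  # observed is smaller
```

Recovering something close to the latent correlation from the zero-inflated observations is exactly the role of the dropout-probability estimates in the BZINB model.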

    Applying Neural Network Models to Predict Recurrent Maltreatment in Child Welfare Cases with Static and Dynamic Risk Factors

    Risk assessment in child welfare has a long tradition of being based on models that assume the likelihood of recurrent maltreatment is a linear function of its various predictors (Gambrill & Shlonsky, 2000). Despite repeated testing of many child, parent, family, maltreatment incident, and service delivery variables, no consistent set of findings has emerged to describe the set of risk and protective factors that best account for increases and decreases in the likelihood of recurrent maltreatment. Shifts in predictors' statistical significance, strength, and direction of effects, coupled with evidence of risk assessment models' poor predictive accuracy, have led to questions regarding the fit between assumptions of linearity and the true relationship between the likelihood of recurrent maltreatment and its predictors (Gambrill & Shlonsky, 2000, 2001; Knoke & Trocmé, 2005). Hence, this dissertation study uses a distinctly nonlinear approach to modeling the likelihood of recurrent maltreatment by employing a combination of random forest and neural network models to identify the predictors that best explain the risk of recurrent maltreatment. The risk of recurrent maltreatment was assessed for a cohort of children living in a large Midwestern metropolitan area who were first reported for maltreatment between January 1, 1993 and January 1, 2002. Administrative child welfare records for 6,747 children were merged with administrative records from income maintenance, mental health, special education, juvenile justice, and criminal justice systems in order to identify the effects that various public sector service system contacts have on the risk of recurrent maltreatment. Each child was followed for a period of at least seven years to identify the risk of recurrent maltreatment in relationship to a second report for maltreatment.
Post-hoc analyses comparing the predictive validity of the neural network model and a binary logistic regression model with random intercepts show that the neural network model was superior, with an area under the ROC curve of 0.7825 versus 0.7552 for the logistic regression model. Additional post-hoc analyses provided empirical insight into the four prominent risk factors and four risk-moderating service variables that best explain variation in the risk of recurrent maltreatment. Specifically, the number of income maintenance spells received, community-level poverty, the child's age at the first maltreatment report, and the parent's status as the perpetrator of the first maltreatment incident defined 21 risk-based groups where the average probability of recurrent maltreatment was dependent upon values for the four primary risk factors, and the risk of maltreatment was moderated by juvenile court involvement, special education eligibility, receipt of CPS family centered services, and the child's receipt of a mental health/substance abuse service in the community. Findings are discussed within a Risk-Need-Responsivity theory of service delivery (Andrews & Bonta, 2006), which links the empiricism of risk assessment with the clinical implementation of a preventive service delivery plan through the identified modifiable risk factors that drive the likelihood of recurrent maltreatment.
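
The head-to-head AUC comparison described above can be sketched on synthetic data: when the outcome depends on an interaction a linear model cannot capture, a small neural network tends to achieve a higher held-out area under the ROC curve than logistic regression. The features, effect shapes, and sample sizes below are invented; the administrative child-welfare data are not reproduced here.

```python
# Hypothetical sketch of the NN-vs-logistic AUC comparison; synthetic data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(3)
n = 2000
X = rng.normal(size=(n, 4))
# a nonlinear (interaction) signal a linear model cannot fully capture
logit = X[:, 0] * X[:, 1] + 0.5 * X[:, 2]
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

lr = LogisticRegression().fit(Xtr, ytr)
nn = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000,
                   random_state=0).fit(Xtr, ytr)

auc_lr = roc_auc_score(yte, lr.predict_proba(Xte)[:, 1])
auc_nn = roc_auc_score(yte, nn.predict_proba(Xte)[:, 1])
print(f"logistic AUC={auc_lr:.3f}, neural net AUC={auc_nn:.3f}")
```

This mirrors the design logic of the comparison: the performance gap appears precisely when the risk surface is nonlinear in the predictors.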

    Geometry- and Accuracy-Preserving Random Forest Proximities with Applications

    Many machine learning algorithms use calculated distances or similarities between data observations to make predictions, cluster similar data, visualize patterns, or generally explore the data. Most distance or similarity measures do not incorporate known data labels and are thus considered unsupervised. Supervised methods for measuring distance exist which incorporate data labels and thereby exaggerate separation between data points of different classes. This approach tends to distort the natural structure of the data. Instead of following similar approaches, we leverage a popular algorithm used for making data-driven predictions, known as random forests, to naturally incorporate data labels into similarity measures known as random forest proximities. In this dissertation, we explore previously defined random forest proximities and demonstrate their weaknesses in popular proximity-based applications. Additionally, we develop a new proximity definition that can be used to recreate the random forest's predictions. We call these Random Forest Geometry- and Accuracy-Preserving proximities, or RF-GAP. We show by proof and empirical demonstration that RF-GAP proximities can be used to perfectly reconstruct the random forest's predictions and, as a result, we argue that RF-GAP proximities provide a truer representation of the random forest's learning when used in proximity-based applications. We provide evidence to suggest that RF-GAP proximities improve applications including imputing missing data, detecting outliers, and visualizing the data. We also introduce a new random forest proximity-based technique that can be used to generate 2- or 3-dimensional data representations which can be used as a tool to visually explore the data. We show that this method does well at portraying the relationship between data variables and the data labels. We show quantitatively and qualitatively that this method surpasses other existing methods for this task.
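
The RF-GAP definition itself is developed in the dissertation; as context, here is the classic previously defined proximity it improves upon, computed in-sample: the fraction of trees in which two observations land in the same terminal node.

```python
# Classic (Breiman-style) random forest proximity from leaf co-occurrence.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

leaves = rf.apply(X)                      # (n_samples, n_trees) terminal-node ids
n = X.shape[0]
prox = np.zeros((n, n))
for t in range(leaves.shape[1]):
    same = leaves[:, t][:, None] == leaves[:, t][None, :]
    prox += same                          # count trees where the pair co-occurs
prox /= leaves.shape[1]                   # fraction of trees -> proximity in [0, 1]
print(prox[0, 0], round(prox[0, 1], 2))   # self-proximity is 1.0
```

Because this version counts training points in their own trees, it does not reproduce the forest's out-of-bag predictions, which is one of the weaknesses the RF-GAP construction is designed to fix.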

    Classification of clinical outcomes using high-throughput and clinical informatics.

    It is widely recognized that many cancer therapies are effective only for a subset of patients. However, clinical studies are most often powered to detect an overall treatment effect. To address this issue, classification methods are increasingly being used to predict the subset of patients who respond differently to treatment. This study begins with a brief history of classification methods with an emphasis on applications involving melanoma. Nonparametric methods suitable for predicting subsets of patients responding differently to treatment are then reviewed. Each method has different ways of incorporating continuous, categorical, clinical, and high-throughput covariates. For nonparametric and parametric methods, distance measures specific to the method are used to make classification decisions. Approaches are outlined which employ these distances to measure treatment interactions and predict the patients more sensitive to treatment. Simulations are also carried out to examine the empirical power of some of these classification methods in an adaptive signature design. Results were compared with logistic regression models. It was found that parametric and nonparametric methods performed reasonably well; relative performance of the methods depends on the simulation scenario. Finally, a method was developed to evaluate the power and sample size needed for an adaptive signature design in order to predict the subset of patients sensitive to treatment. It is hoped that this study will stimulate more development of nonparametric and parametric methods to predict subsets of patients responding differently to treatment.
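
A minimal power-simulation sketch in the spirit of the evaluation described above: a biomarker-defined subset responds to treatment, and empirical power is the fraction of simulated trials in which the within-subset treatment effect is detected. Sizes, effect sizes, and the assumption that the sensitive subset is known are all illustrative; a full adaptive signature design would instead learn the subset from the data.

```python
# Hypothetical power simulation for a biomarker-defined sensitive subset.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

def one_trial(n=400, p_sens=0.3, effect=0.25):
    sens = rng.random(n) < p_sens           # biomarker-sensitive subset
    treat = rng.random(n) < 0.5             # 1:1 randomization
    p_resp = 0.3 + effect * (sens & treat)  # only sensitive + treated benefit
    resp = rng.random(n) < p_resp
    # test the treatment effect within the (assumed known) sensitive subset
    a = resp[sens & treat]
    b = resp[sens & ~treat]
    table = [[a.sum(), len(a) - a.sum()],
             [b.sum(), len(b) - b.sum()]]
    return stats.fisher_exact(table)[1]     # two-sided p-value

pvals = np.array([one_trial() for _ in range(200)])
power = np.mean(pvals < 0.05)
print(f"empirical power ~ {power:.2f}")
```

Repeating this over a grid of sample sizes and subset prevalences is the basic mechanism for the power and sample-size evaluation the abstract describes.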