264 research outputs found

    Comparison of techniques for handling missing covariate data within prognostic modelling studies: a simulation study

    Background: There is no consensus on the most appropriate approach to handle missing covariate data within prognostic modelling studies. Therefore a simulation study was performed to assess the effects of different missing data techniques on the performance of a prognostic model. Methods: Datasets were generated to resemble the skewed distributions seen in a motivating breast cancer example. Multivariate missing data were imposed on four covariates using four different mechanisms: missing completely at random (MCAR), missing at random (MAR), missing not at random (MNAR) and a combination of all three mechanisms. Five amounts of incomplete cases from 5% to 75% were considered. Complete case analysis (CC), single imputation (SI) and five multiple imputation (MI) techniques available within the R statistical software were investigated: a) data augmentation (DA) approach assuming a multivariate normal distribution, b) DA assuming a general location model, c) regression switching imputation, d) regression switching with predictive mean matching (MICE-PMM) and e) flexible additive imputation models. A Cox proportional hazards model was fitted and appropriate estimates for the regression coefficients and model performance measures were obtained. Results: Performing a CC analysis produced unbiased regression estimates, but inflated standard errors, which affected the significance of the covariates in the model with 25% or more missingness. SI underestimated the variability, resulting in poor coverage even with 10% missingness. Of the MI approaches, applying MICE-PMM produced, in general, the least biased estimates and better coverage for the incomplete covariates and better model performance for all mechanisms. However, this MI approach still produced biased regression coefficient estimates for the incomplete skewed continuous covariates when 50% or more cases had missing data imposed with an MCAR, MAR or combined mechanism. When the missingness depended on the incomplete covariates, i.e. MNAR, estimates were biased with more than 10% incomplete cases for all MI approaches. Conclusion: The results from this simulation study suggest that performing MICE-PMM may be the preferred MI approach provided that less than 50% of the cases have missing data and the missing data are not MNAR.
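The three mechanisms named in this abstract differ only in what drives the probability that a value goes missing. The following is a minimal sketch in Python, not the R code used in the study: the function name `impose_missingness`, the rank-based probability rule and the single fully observed covariate `z` are all assumptions made for illustration, and the sketch only supports target fractions up to 50% per covariate.

```python
import numpy as np

def impose_missingness(x, z, mechanism, target=0.25, seed=None):
    """Return a copy of x with roughly `target` of its entries set to NaN.

    x: covariate that receives missing values.
    z: a fully observed covariate (used only for MAR).
    mechanism: "MCAR", "MAR" or "MNAR".
    """
    if not 0.0 <= target <= 0.5:
        raise ValueError("this sketch supports target fractions up to 0.5")
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    n = x.shape[0]

    if mechanism == "MCAR":
        # Same probability for everyone, unrelated to any data.
        prob = np.full(n, target)
    else:
        if mechanism == "MAR":
            driver = np.asarray(z, dtype=float)  # depends on observed data only
        elif mechanism == "MNAR":
            driver = x                           # depends on the value that goes missing
        else:
            raise ValueError(f"unknown mechanism: {mechanism}")
        # Rank-based rule (an illustrative choice): probability grows linearly
        # with the rank of the driver and averages exactly `target`.
        ranks = np.argsort(np.argsort(driver))
        prob = 2.0 * target * (ranks + 0.5) / n

    out = x.copy()
    out[rng.random(n) < prob] = np.nan
    return out
```

Under MNAR the largest values of `x` are the most likely to be removed, which is why no method that relies only on the observed data can fully recover the original distribution.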

    Comparison of imputation methods for handling missing covariate data when fitting a Cox proportional hazards model: a resampling study

    Background: The appropriate handling of missing covariate data in prognostic modelling studies is yet to be conclusively determined. A resampling study was performed to investigate the effects of different missing data methods on the performance of a prognostic model. Methods: Observed data for 1000 cases were sampled with replacement from a large complete dataset of 7507 patients to obtain 500 replications. Five levels of missingness (ranging from 5% to 75%) were imposed on three covariates using a missing at random (MAR) mechanism. Five missing data methods were applied: a) complete case analysis (CC), b) single imputation using regression switching with predictive mean matching (SI), c) multiple imputation using regression switching imputation, d) multiple imputation using regression switching with predictive mean matching (MICE-PMM) and e) multiple imputation using flexible additive imputation models. A Cox proportional hazards model was fitted to each dataset and estimates for the regression coefficients and model performance measures obtained. Results: CC produced biased regression coefficient estimates and inflated standard errors (SEs) with 25% or more missingness. The underestimated SE after SI resulted in poor coverage with 25% or more missingness. Of the MI approaches investigated, MI using MICE-PMM produced the least biased estimates and better model performance measures. However, this MI approach still produced biased regression coefficient estimates with 75% missingness. Conclusions: Very few differences were seen between the results from all missing data approaches with 5% missingness. However, performing MI using MICE-PMM may be the preferred missing data approach for handling between 10% and 50% MAR missingness.
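One way to see why a single imputation understates the standard error is to look at how estimates from several imputed datasets are usually combined. The abstract does not state its pooling rule; the sketch below assumes the standard Rubin's rules, and the function name `pool_rubin` is hypothetical.

```python
import math

def pool_rubin(estimates, variances):
    """Pool one coefficient across m imputed datasets with Rubin's rules.

    estimates: the coefficient estimate from each imputed dataset.
    variances: the squared standard error from each imputed dataset.
    Returns (pooled estimate, pooled standard error).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations with matching lengths")
    q_bar = sum(estimates) / m                               # pooled estimate
    within = sum(variances) / m                              # average within-imputation variance
    between = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)  # between-imputation variance
    total = within + (1.0 + 1.0 / m) * between               # total variance
    return q_bar, math.sqrt(total)

# Three imputations of a log hazard ratio, each with SE 0.2 (illustrative numbers).
estimate, se = pool_rubin([0.5, 0.6, 0.7], [0.04, 0.04, 0.04])
```

With a single imputed dataset there is no between-imputation term, so the reported SE reflects only the within-imputation variance (0.2 in this illustration) and ignores the uncertainty about the imputed values.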

    Survival of patients with nonseminomatous germ cell cancer: a review of the IGCC classification by Cox regression and recursive partitioning

    The International Germ Cell Consensus (IGCC) classification identifies good, intermediate and poor prognosis groups among patients with metastatic nonseminomatous germ cell tumours (NSGCT). It uses the risk factors primary site, presence of nonpulmonary visceral metastases and tumour markers alpha-fetoprotein (AFP), human chorionic gonadotrophin (HCG) and lactic dehydrogenase (LDH). The IGCC classification is easy to use and remember, but lacks flexibility. We aimed to examine the extent of any loss in discrimination within the IGCC classification in comparison with alternative modelling by formal weighing of the risk factors. We analysed survival of 3048 NSGCT patients with Cox regression and recursive partitioning for alternative classifications. Good, intermediate and poor prognosis groups were based on predicted 5-year survival. Classifications were further refined by subgrouping within the poor prognosis group. Performance was measured primarily by a bootstrap corrected c-statistic to indicate discriminative ability for future patients. The weights of the risk factors in the alternative classifications differed slightly from the implicit weights in the IGCC classification. Discriminative ability, however, did not increase clearly (IGCC classification, c=0.732; Cox classification, c=0.730; Recursive partitioning classification, c=0.709). Three subgroups could be identified within the poor prognosis groups, resulting in classifications with five prognostic groups and slightly better discriminative ability (c = 0.740). In conclusion, the IGCC classification in three prognostic groups is largely supported by Cox regression and recursive partitioning. Cox regression was the most promising tool to define a more refined classification
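The c-statistic quoted in this abstract measures how often a model ranks two patients in the right order. Below is a minimal sketch of a Harrell-type concordance index for right-censored data, written under simplifying assumptions: pairs with tied times are ignored, and this is the apparent c only, without the bootstrap optimism correction the authors applied. The function name `concordance_index` is hypothetical.

```python
def concordance_index(times, events, risk):
    """Harrell-type c-statistic for right-censored survival data.

    times:  follow-up time per patient.
    events: 1 (or True) if the event was observed, 0 if censored.
    risk:   predicted risk score; higher means worse prognosis.
    """
    concordant = 0.0
    usable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored patient cannot be the earlier failure of a pair
        for j in range(n):
            # A pair is usable when patient i failed before patient j's last follow-up.
            if times[i] < times[j]:
                usable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0   # earlier failure had the higher predicted risk
                elif risk[i] == risk[j]:
                    concordant += 0.5   # tied predictions count as half
    if usable == 0:
        raise ValueError("no usable pairs")
    return concordant / usable
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, so the reported values of roughly 0.71 to 0.74 indicate moderate discrimination.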

    Multiple imputation for estimating hazard ratios and predictive abilities in case-cohort surveys

    Background: The weighted estimators generally used for analyzing case-cohort studies are not fully efficient and naive estimates of the predictive ability of a model from case-cohort data depend on the subcohort size. However, case-cohort studies represent a special type of incomplete data, and methods for analyzing incomplete data should be appropriate, in particular multiple imputation (MI). Methods: We performed simulations to validate the MI approach for estimating hazard ratios and the predictive ability of a model or of an additional variable in case-cohort surveys. As an illustration, we analyzed a case-cohort survey from the Three-City study to estimate the predictive ability of D-dimer plasma concentration on coronary heart disease (CHD) and on vascular dementia (VaD) risks. Results: When the imputation model of the phase-2 variable was correctly specified, MI estimates of hazard ratios and predictive abilities were similar to those obtained with full data. When the imputation model was misspecified, MI could provide biased estimates of hazard ratios and predictive abilities. In the Three-City case-cohort study, elevated D-dimer levels increased the risk of VaD (hazard ratio for two consecutive tertiles = 1.69, 95%CI: 1.63-1.74). However, D-dimer levels did not improve the predictive ability of the model. Conclusions: MI is a simple approach for analyzing case-cohort data and provides an easy evaluation of the predictive ability of a model or of an additional variable.

    Birth characteristics and the risk of childhood leukaemias and lymphomas in New Zealand: a case-control study

    BACKGROUND: Some studies have found that lower parity and higher or lower social class (depending on the study) are associated with increased risks of childhood acute lymphoblastic leukaemia (ALL). Such findings have led to suggestions that infection could play a role in the causation of this disease. An earlier New Zealand study found a protective effect of parental marriage on the risk of childhood ALL, and studies elsewhere have reported increased risks in relation to older parental ages. This study aimed to assess whether lower parity, lower social class, unmarried status and older parental ages increase the risk of childhood ALL (primarily). These variables were also assessed in relation to the risks of childhood acute non-lymphoblastic leukaemia, non-Hodgkin's lymphomas and Hodgkin's disease. METHODS: A case-control study was conducted. The cases were 585 children diagnosed with leukaemias or lymphomas throughout New Zealand over a 12 year period. The 585 age and sex matched controls were selected at random from birth records. Birth records from cases (via cancer registration record linkage) and from controls provided accurate data on maternal parity, social class derived from paternal occupation, maternal marital status, ages of both parents, and urban status based on the address on the birth certificate. Analysis was by conditional logistic regression. RESULTS: There were no statistically significant associations overall between childhood ALL and parity of the mother, social class, unmarried maternal status, increasing parental ages (continuous analysis), or urban status. We also found no statistically significant associations between the risks of childhood acute non-lymphoblastic leukaemia, non-Hodgkin's lymphomas, or Hodgkin's disease and the variables studied. CONCLUSION: Although of reasonable size, this study showed no positive results, and its record linkage design minimised bias. Descriptive studies (e.g. of time trends of ALL) show that environmental factors must be important for some diagnoses. Work has been done on the risk of ALL in relation to chemicals (e.g. pesticides) and drugs, dietary factors (e.g. vitamins), electromagnetic fields and infectious hypotheses (to name some); but whether these or other unknown factors are truly important remains to be seen.

    Optimizing the diagnostic work-up of acute uncomplicated urinary tract infections

    Background: Most diagnostic tests for acute uncomplicated urinary tract infections (UTIs) have been previously studied in so-called single-test evaluations. In practice, however, clinicians use more than one test in the diagnostic work-up. Since test results carry overlapping information, results from single-test studies may be confounded. The primary objective of the Amsterdam Cystitis/Urinary Tract Infection Study (ACUTIS) is to determine the (additional) diagnostic value of relevant tests from patient history and laboratory investigations, taking into account their mutual dependencies. Consequently, after suitable validation, an easy to use, multivariable diagnostic rule (clinical index) will be derived. Methods: Women who contact their GP with painful and/or frequent micturition undergo a series of possibly relevant tests, consisting of patient history questions and laboratory investigations. Using urine culture as the reference standard, two multivariable models (diagnostic indices) will be generated: a model which assumes that patients attend the GP surgery and a model based on telephone contact only. Models will be made more robust using the bootstrap. Discrimination will be visualized in high resolution histograms of the posterior UTI probabilities and summarized as the 5th, 10th, 25th, 50th, 75th, 90th, and 95th centiles of these probabilities, the Brier score and the area under the receiver operating characteristic (ROC) curve with 95% confidence intervals. Using the regression coefficients of the independent diagnostic indicators, a diagnostic rule will be derived, consisting of an efficient set of tests and their diagnostic values. The course of the presenting complaints is studied using 7-day patient diaries. To learn more about the natural history of UTIs, patients will be offered the opportunity to postpone the use of antibiotics. Discussion: We expect that our diagnostic rule will allow efficient diagnosis of UTIs, necessitating the collection of diagnostic indicators with proven added value. GPs may use the rule (preferably after suitable validation) to estimate UTI probabilities for women with different combinations of test results. Finally, in a subcohort, an attempt is made to identify which indicators (including antibiotic treatment) are useful to prognosticate recovery from painful and/or frequent micturition.

    Variable selection under multiple imputation using the bootstrap in a prognostic study

    Background: Missing data is a challenging problem in many prognostic studies. Multiple imputation (MI) accounts for imputation uncertainty, which allows for adequate statistical testing. We developed and tested a methodology combining MI with bootstrapping techniques for studying prognostic variable selection. Method: In our prospective cohort study we merged data from three different randomized controlled trials (RCTs) to assess prognostic variables for chronicity of low back pain. Among the outcome and prognostic variables, the proportion of missing data ranged from 0% to 48.1%. We used four methods to investigate the influence of sampling variation and imputation variation: MI only, bootstrap only, and two methods that combine MI and bootstrapping. Variables were selected based on the inclusion frequency of each prognostic variable, i.e. the proportion of times that the variable appeared in the model. The discriminative and calibrative abilities of prognostic models developed by the four methods were assessed at different inclusion levels. Results: We found that the effect of imputation variation on the inclusion frequency was larger than the effect of sampling variation. When MI and bootstrapping were combined at the range of 0% (full model) to 90% of variable selection, bootstrap corrected c-index values of 0.70 to 0.71 and slope values of 0.64 to 0.86 were found. Conclusion: We recommend accounting for both imputation and sampling variation in data sets with missing values. The new procedure of combining MI with bootstrapping for variable selection results in multivariable prognostic models with good performance and is therefore attractive to apply to data sets with missing values.
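The inclusion frequency defined in this abstract is simple to compute once each run (one bootstrap sample, one imputed dataset, or one combination of the two) has produced its selected model. The sketch below assumes those selections are already available as sets of variable names; the function names and the example variable names are hypothetical, and the selection procedure itself is not shown.

```python
from collections import Counter

def inclusion_frequencies(selected_sets):
    """Proportion of runs in which each variable appeared in the selected model."""
    n_runs = len(selected_sets)
    if n_runs == 0:
        raise ValueError("need at least one run")
    counts = Counter(var for chosen in selected_sets for var in set(chosen))
    return {var: count / n_runs for var, count in counts.items()}

def select_variables(selected_sets, threshold):
    """Keep variables whose inclusion frequency is at least `threshold`."""
    freqs = inclusion_frequencies(selected_sets)
    return sorted(var for var, freq in freqs.items() if freq >= threshold)

# Four illustrative runs with made-up variable names.
runs = [{"age", "pain"}, {"pain"}, {"pain", "duration"}, {"age", "pain"}]
final_model = select_variables(runs, threshold=0.5)
```

A threshold of 0 keeps every variable that was ever selected, while higher thresholds give progressively smaller models, which matches the range of inclusion levels the abstract evaluates.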

    Imputation strategies for missing binary outcomes in cluster randomized trials

    Background: Attrition, which leads to missing data, is a common problem in cluster randomized trials (CRTs), where groups of patients rather than individuals are randomized. Standard multiple imputation (MI) strategies may not be appropriate to impute missing data from CRTs since they assume independent data. In this paper, under the assumption of missing completely at random and covariate dependent missing, we compared six MI strategies which account for the intra-cluster correlation for missing binary outcomes in CRTs with the standard imputation strategies and complete case analysis approach using a simulation study. Method: We considered three within-cluster and three across-cluster MI strategies for missing binary outcomes in CRTs. The three within-cluster MI strategies are logistic regression method, propensity score method, and Markov chain Monte Carlo (MCMC) method, which apply standard MI strategies within each cluster. The three across-cluster MI strategies are propensity score method, random-effects (RE) logistic regression approach, and logistic regression with cluster as a fixed effect. Based on the community hypertension assessment trial (CHAT), which has complete data, we designed a simulation study to investigate the performance of the above MI strategies. Results: The estimated treatment effect and its 95% confidence interval (CI) from the generalized estimating equations (GEE) model based on the CHAT complete dataset are 1.14 (0.76, 1.70). When 30% of the binary outcomes are missing completely at random, a simulation study shows that the estimated treatment effects and the corresponding 95% CIs from the GEE model are 1.15 (0.76, 1.75) if complete case analysis is used, 1.12 (0.72, 1.73) if the within-cluster MCMC method is used, 1.21 (0.80, 1.81) if across-cluster RE logistic regression is used, and 1.16 (0.82, 1.64) if standard logistic regression which does not account for clustering is used. Conclusion: When the percentage of missing data is low or the intra-cluster correlation coefficient is small, different approaches for handling missing binary outcome data generate quite similar results. When the percentage of missing data is large, standard MI strategies, which do not take into account the intra-cluster correlation, underestimate the variance of the treatment effect. Within-cluster and across-cluster MI strategies (except for the random-effects logistic regression MI strategy), which take the intra-cluster correlation into account, seem to be more appropriate to handle the missing outcome from CRTs. Under the same imputation strategy and percentage of missingness, the estimates of the treatment effect from GEE and RE logistic regression models are similar.

    A New Calibrated Bayesian Internal Goodness-of-Fit Method: Sampled Posterior p-Values as Simple and General p-Values That Allow Double Use of the Data

    Background: Recent approaches mixing frequentist principles with Bayesian inference propose internal goodness-of-fit (GOF) p-values that might be valuable for critical analysis of Bayesian statistical models. However, GOF p-values developed to date only have known probability distributions under restrictive conditions. As a result, no known GOF p-value has a known probability distribution for any discrepancy function. Methodology/Principal Findings: We show mathematically that a new GOF p-value, called the sampled posterior p-value (SPP), asymptotically has a uniform probability distribution whatever the discrepancy function. In a moderate finite sample context, simulations also showed that the SPP appears stable to relatively uninformative misspecifications of the prior distribution. Conclusions/Significance: These reasons, together with its numerical simplicity, make the SPP a better canonical GOF p-value than existing GOF p-values
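The abstract does not spell out how the SPP is computed. As commonly described, a single parameter value is drawn from the posterior, and the observed discrepancy at that value is compared with discrepancies of data replicated under the same value. The sketch below is an illustration under that reading, not the authors' implementation: it assumes a normal model with known standard deviation, a conjugate normal prior on the mean, and a sum-of-squares discrepancy, and every name in it is hypothetical.

```python
import numpy as np

def sampled_posterior_p_value(y, sigma=1.0, prior_mean=0.0, prior_sd=10.0,
                              n_rep=2000, seed=None):
    """Monte Carlo sketch of a sampled posterior p-value for a normal mean model."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    n = y.size

    # Conjugate posterior for the mean when sigma is known.
    post_prec = n / sigma**2 + 1.0 / prior_sd**2
    post_mean = (y.sum() / sigma**2 + prior_mean / prior_sd**2) / post_prec

    # Step 1: draw ONE parameter value from the posterior.
    theta = rng.normal(post_mean, np.sqrt(1.0 / post_prec))

    # Discrepancy that depends on both the data and the sampled parameter.
    def discrepancy(data):
        return np.sum((data - theta) ** 2, axis=-1) / sigma**2

    # Step 2: compare the observed discrepancy with replicated data drawn at theta.
    d_obs = discrepancy(y)
    y_rep = rng.normal(theta, sigma, size=(n_rep, n))
    d_rep = discrepancy(y_rep)
    return float(np.mean(d_rep >= d_obs))
```

Because only one posterior draw is used, the result is itself random; repeating the call with different seeds gives different p-values for the same data.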