
    Comparison of methods for handling missing data on immunohistochemical markers in survival analysis of breast cancer

    Background: Tissue micro-arrays (TMAs) are increasingly used to generate data on the molecular phenotype of tumours in clinical epidemiology studies, such as studies of disease prognosis. However, TMA data are particularly prone to missingness. A variety of methods to deal with missing data are available, but the validity of the various approaches depends on the structure of the missing data, and there are few empirical studies dealing with missing data from molecular pathology. The purpose of this study was to investigate the results of four commonly used approaches to handling missing data in a large, multi-centre study of the molecular pathological determinants of prognosis in breast cancer. Patients and Methods: We pooled data from over 11 000 cases of invasive breast cancer from five studies that collected information on seven prognostic indicators together with survival time data. We compared the results of a multivariate Cox regression using four approaches to handling missing data: complete case analysis (CCA), mean substitution (MS), multiple imputation without inclusion of the outcome (MI) and multiple imputation with inclusion of the outcome (MI+). We also performed an analysis in which missing data were simulated under different assumptions and the results of the four methods were compared. Results: Over half the cases had missing data on at least one of the seven variables, and 11 percent had missing data on four or more. The multivariate hazard ratio estimates based on the multiple imputation models were very similar to those derived after using MS, with similar standard errors. Hazard ratio estimates based on CCA were only slightly different, but they were less precise because the standard errors were larger. However, in data simulated to be missing completely at random (MCAR) or missing at random (MAR), estimates from the multiple imputation approaches were the least biased and most accurate, whereas estimates from CCA were the most biased and least accurate. Conclusion: In this study, empirical results from analyses using CCA, MS, MI and MI+ were similar, although results from CCA were less precise. The simulation results suggest that, in general, multiple imputation is likely to perform best. Given the ease of implementing multiple imputation in standard statistical software, the results of MI and CCA should be compared in any multivariate analysis where missing data are a problem.
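    As a rough illustration of how such a comparison can be set up (not the authors' code), the sketch below contrasts complete case analysis, mean substitution and multiple imputation with the outcome included in the imputation model, using lifelines and scikit-learn in Python. The DataFrame `df`, the marker column names and the choice to feed `time`/`event` directly into the imputer are all illustrative assumptions.

```python
# Minimal sketch (not the authors' code): comparing complete case analysis,
# mean substitution, and multiple imputation for a Cox model.
# Assumes a hypothetical DataFrame `df` with survival time, event indicator,
# and partially missing prognostic markers.
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

markers = ["er", "pr", "her2", "grade"]          # hypothetical marker columns
surv_cols = ["time", "event"]

def fit_cox(data):
    cph = CoxPHFitter()
    cph.fit(data[surv_cols + markers], duration_col="time", event_col="event")
    return cph.params_, cph.standard_errors_

# 1) Complete case analysis (CCA): drop any row with a missing marker
beta_cc, se_cc = fit_cox(df.dropna(subset=markers))

# 2) Mean substitution (MS): replace missing values by the observed mean
ms = df.copy()
ms[markers] = ms[markers].fillna(ms[markers].mean())
beta_ms, se_ms = fit_cox(ms)

# 3) Multiple imputation, including the outcome in the imputation model
M = 20
betas, ses = [], []
for m in range(M):
    imp = IterativeImputer(sample_posterior=True, random_state=m)
    cols = markers + ["time", "event"]   # outcome enters the imputation model
    filled = df.copy()
    filled[cols] = imp.fit_transform(df[cols])
    b, s = fit_cox(filled)
    betas.append(b.values)
    ses.append(s.values)

# Pool the M sets of coefficients with Rubin's rules
betas, ses = np.array(betas), np.array(ses)
beta_mi = betas.mean(axis=0)
within, between = (ses ** 2).mean(axis=0), betas.var(axis=0, ddof=1)
se_mi = np.sqrt(within + (1 + 1 / M) * between)
print(np.exp(beta_cc.values), np.exp(beta_ms.values), np.exp(beta_mi))  # hazard ratios
```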

    Investigating the missing data mechanism in quality of life outcomes: a comparison of approaches

    Background: Missing data are classified as missing completely at random (MCAR), missing at random (MAR) or missing not at random (MNAR). Knowing the mechanism is useful in identifying the most appropriate analysis. The first aim was to compare different methods for identifying the missing data mechanism, to determine whether they gave consistent conclusions. The second aim was to investigate whether reminder-response data can be used to help identify the missing data mechanism. Methods: Five clinical trial datasets that employed a reminder system at follow-up were used. Some quality of life questionnaires were initially missing but were later recovered through reminders. Four methods of determining the missing data mechanism were applied. Two response data scenarios were considered: first, immediate responses only; second, all observed responses (including reminder responses). Results: In three of the five trials the hypothesis tests found evidence against the MCAR assumption. Logistic regression suggested MAR, but was able to use the reminder-collected data to highlight potential MNAR data in two trials. Conclusion: The four methods were consistent in determining the missingness mechanism. One hypothesis test was preferred as it is applicable with intermittent missingness. Some inconsistencies between the two data scenarios were found. Ignoring the reminder data could give a distorted view of the missingness mechanism; utilising reminder data allowed the possibility of MNAR to be considered. Funding: The Chief Scientist Office of the Scottish Government Health Directorate, Research Training Fellowship (CZF/1/31).
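    One of the simplest such checks can be sketched as follows (assumed column names, not the trial code): regress an indicator of missingness on observed baseline covariates. Predictors of missingness provide evidence against MCAR and are consistent with MAR, although MNAR can never be ruled out from the observed data alone.

```python
# Minimal sketch (assumed, not the trial code): probing the missingness mechanism
# by regressing a missingness indicator on observed baseline covariates.
import pandas as pd
import statsmodels.api as sm

# Hypothetical columns: `qol` (quality-of-life score, possibly missing),
# baseline covariates `age`, `sex`, `baseline_qol`.
df = pd.read_csv("trial.csv")                     # placeholder file name
df["missing_qol"] = df["qol"].isna().astype(int)

X = sm.add_constant(df[["age", "sex", "baseline_qol"]])
model = sm.Logit(df["missing_qol"], X).fit(disp=0)
print(model.summary())   # significant predictors argue against MCAR
```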

    Comparison of techniques for handling missing covariate data within prognostic modelling studies: a simulation study

    Background: There is no consensus on the most appropriate approach to handling missing covariate data within prognostic modelling studies. Therefore, a simulation study was performed to assess the effects of different missing data techniques on the performance of a prognostic model. Methods: Datasets were generated to resemble the skewed distributions seen in a motivating breast cancer example. Multivariate missing data were imposed on four covariates using four different mechanisms: missing completely at random (MCAR), missing at random (MAR), missing not at random (MNAR) and a combination of all three mechanisms. Five proportions of incomplete cases, ranging from 5% to 75%, were considered. Complete case analysis (CC), single imputation (SI) and five multiple imputation (MI) techniques available within the R statistical software were investigated: a) a data augmentation (DA) approach assuming a multivariate normal distribution, b) DA assuming a general location model, c) regression switching imputation, d) regression switching with predictive mean matching (MICE-PMM) and e) flexible additive imputation models. A Cox proportional hazards model was fitted and appropriate estimates for the regression coefficients and model performance measures were obtained. Results: Performing a CC analysis produced unbiased regression estimates but inflated standard errors, which affected the significance of the covariates in the model with 25% or more missingness. Using SI underestimated the variability, resulting in poor coverage even with 10% missingness. Of the MI approaches, MICE-PMM produced, in general, the least biased estimates, better coverage for the incomplete covariates and better model performance across all mechanisms. However, this MI approach still produced biased regression coefficient estimates for the incomplete skewed continuous covariates when 50% or more cases had missing data imposed with an MCAR, MAR or combined mechanism. When the missingness depended on the incomplete covariates, i.e. MNAR, estimates were biased with more than 10% incomplete cases for all MI approaches. Conclusion: The results from this simulation study suggest that MICE-PMM may be the preferred MI approach, provided that less than 50% of the cases have missing data and the missing data are not MNAR.
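    To make the missingness mechanisms concrete, here is a minimal sketch (with assumed variable names, not the authors' simulation code) of how roughly 30% missingness might be imposed on a skewed covariate under MCAR, MAR and MNAR:

```python
# Minimal sketch (assumptions, not the authors' simulation code): imposing
# roughly 30% missingness on a covariate x1 under MCAR, MAR and MNAR mechanisms.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 1000
df = pd.DataFrame({"x1": rng.lognormal(size=n),      # skewed covariate
                   "x2": rng.normal(size=n)})

def impose(df, mechanism, rate=0.30):
    out = df.copy()
    if mechanism == "MCAR":                 # missingness independent of everything
        p = np.full(len(df), rate)
    else:
        # MAR: depends on the fully observed x2; MNAR: depends on x1 itself
        z = df["x2"] if mechanism == "MAR" else df["x1"]
        lp = (z - z.mean()) / z.std()
        p = 1 / (1 + np.exp(-(lp + np.log(rate / (1 - rate)))))
    out.loc[rng.uniform(size=len(df)) < p, "x1"] = np.nan
    return out

mcar, mar, mnar = (impose(df, m) for m in ["MCAR", "MAR", "MNAR"])
print(mcar["x1"].isna().mean(), mar["x1"].isna().mean(), mnar["x1"].isna().mean())
```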

    Imputation of continuous variables missing at random using the method of simulated scores

    For multivariate datasets with missing values, we present a procedure for statistical inference and state its "optimal" properties. Two main assumptions are needed: (1) data are missing at random (MAR); (2) the data-generating process is a multivariate normal linear regression. By disentangling the problem of convergence of the iterative estimation/imputation procedure, we show that the estimator is a "method of simulated scores" (a particular case of McFadden's "method of simulated moments"); thus the estimator is equivalent to maximum likelihood if the number of replications is suitably large, and the whole procedure can be considered an optimal parametric technique for the imputation of missing data.
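    As a loose illustration of the iterative estimation/imputation idea (a simplified stochastic regression imputation loop, not the full method of simulated scores), consider a single outcome that is MAR given a fully observed regressor under a normal linear model:

```python
# Illustrative sketch only: iterate between re-estimating a normal linear
# regression on the completed data and redrawing the missing values from the
# fitted conditional distribution.
import numpy as np

rng = np.random.default_rng(0)
n = 2000
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=1.5, size=n)
miss = rng.uniform(size=n) < 1 / (1 + np.exp(-x))    # MAR: depends on x only
y_obs = np.where(miss, np.nan, y)

y_imp = np.where(miss, np.nanmean(y_obs), y_obs)     # crude starting values
for _ in range(50):
    # Estimation step: refit the regression on the current completed data
    X = np.column_stack([np.ones(n), x])
    beta, *_ = np.linalg.lstsq(X, y_imp, rcond=None)
    sigma = (y_imp - X @ beta).std(ddof=2)
    # Imputation step: redraw the missing y from the fitted conditional model
    y_imp[miss] = (X @ beta)[miss] + rng.normal(scale=sigma, size=miss.sum())

print(beta, sigma)   # should fluctuate around (1, 2) and 1.5 once the chain settles
```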

    Missing Data in Randomized Clinical Trials for Weight Loss: Scope of the Problem, State of the Field, and Performance of Statistical Methods

    BACKGROUND: Dropouts and missing data are nearly ubiquitous in obesity randomized controlled trials, threatening the validity and generalizability of conclusions. Herein, we meta-analytically evaluate the extent of missing data, the frequency with which various analytic methods are employed to accommodate dropouts, and the performance of multiple statistical methods. METHODOLOGY/PRINCIPAL FINDINGS: We searched PubMed and Cochrane databases (2000-2006) for articles published in English and manually searched bibliographic references. Articles on pharmaceutical randomized controlled trials with weight loss or weight gain prevention as major endpoints were included. Two authors independently reviewed each publication for inclusion. 121 articles met the inclusion criteria. Two authors independently extracted treatment, sample size, drop-out rates, study duration, and the statistical method used to handle missing data from all articles, and resolved disagreements by consensus. In the meta-analysis, drop-out rates were substantial, with the survival (non-dropout) rates being approximated by an exponential decay curve, exp(-λt), where λ was estimated to be 0.0088 (95% bootstrap confidence interval: 0.0076 to 0.0100) and t represents time in weeks. The estimated drop-out rate at 1 year was 37%. Most studies used last observation carried forward as the primary analytic method to handle missing data. We also obtained 12 raw obesity randomized controlled trial datasets for empirical analyses. Analyses of the raw randomized controlled trial data suggested that both mixed models and multiple imputation performed well, but that multiple imputation may be more robust when missing data are extensive. CONCLUSION/SIGNIFICANCE: Our analysis offers an equation for predicting dropout rates that is useful for future study planning. Our raw data analyses suggest that multiple imputation is better than other methods for handling missing data in obesity randomized controlled trials, followed closely by mixed models. We suggest these methods supplant last observation carried forward as the primary method of analysis.
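    The reported dropout equation can be checked with a few lines of arithmetic; the snippet below simply plugs the published λ of 0.0088 into exp(-λt) at t = 52 weeks.

```python
# Quick check of the reported dropout equation: survival (non-dropout) rate
# approximated by exp(-lambda * t) with lambda = 0.0088 and t in weeks.
import math

lam = 0.0088
t = 52                              # one year in weeks
retained = math.exp(-lam * t)       # ~0.63 still in the study
print(f"dropout at 1 year ~ {1 - retained:.0%}")   # ~37%, matching the abstract
```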

    Multiple Imputation Ensembles (MIE) for dealing with missing data

    Missing data is a significant issue in many real-world datasets, yet there are no robust methods for dealing with it appropriately. In this paper, we propose a robust approach to dealing with missing data in classification problems: Multiple Imputation Ensembles (MIE). Our method integrates two approaches, multiple imputation and ensemble methods, and compares two types of ensembles: bagging and stacking. We also propose a robust experimental set-up using 20 benchmark datasets from the UCI machine learning repository. For each dataset, we introduce increasing amounts of data missing completely at random. First, we use a number of single/multiple imputation methods to recover the missing values and then ensemble a number of different classifiers built on the imputed data. We assess the quality of the imputation by using dissimilarity measures. We also evaluate the MIE performance by comparing classification accuracy on the complete and imputed data. Furthermore, we use the accuracy of simple imputation as a benchmark for comparison. We find that our proposed approach, combining multiple imputation with ensemble techniques, outperforms the alternatives, particularly as the amount of missing data increases.
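    A minimal sketch of the MIE idea is shown below, under stated assumptions rather than the authors' implementation: impute the incomplete training data several times, fit one classifier per imputed copy, and combine the predictions by majority vote (a bagging-style ensemble). The scikit-learn components, the function name and the assumption of integer class labels are illustrative choices.

```python
# Minimal sketch of the MIE idea (not the authors' implementation): one
# classifier per imputed copy of the data, combined by majority vote.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestClassifier

def mie_fit_predict(X_train, y_train, X_test, n_imputations=10):
    votes = []
    for m in range(n_imputations):
        imp = IterativeImputer(sample_posterior=True, random_state=m)
        Xtr = imp.fit_transform(X_train)          # each imputed copy differs slightly
        Xte = imp.transform(X_test)
        clf = RandomForestClassifier(random_state=m).fit(Xtr, y_train)
        votes.append(clf.predict(Xte))
    votes = np.array(votes)
    # majority vote across the imputation-specific classifiers
    # (assumes non-negative integer class labels)
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
```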

    Multiple imputation for estimating hazard ratios and predictive abilities in case-cohort surveys

    Background: The weighted estimators generally used for analyzing case-cohort studies are not fully efficient, and naive estimates of the predictive ability of a model from case-cohort data depend on the subcohort size. However, case-cohort studies represent a special type of incomplete data, and methods for analyzing incomplete data should be appropriate, in particular multiple imputation (MI). Methods: We performed simulations to validate the MI approach for estimating hazard ratios and the predictive ability of a model or of an additional variable in case-cohort surveys. As an illustration, we analyzed a case-cohort survey from the Three-City study to estimate the predictive ability of D-dimer plasma concentration on coronary heart disease (CHD) and on vascular dementia (VaD) risks. Results: When the imputation model of the phase-2 variable was correctly specified, MI estimates of hazard ratios and predictive abilities were similar to those obtained with full data. When the imputation model was misspecified, MI could provide biased estimates of hazard ratios and predictive abilities. In the Three-City case-cohort study, elevated D-dimer levels increased the risk of VaD (hazard ratio for two consecutive tertiles = 1.69, 95% CI: 1.63-1.74). However, D-dimer levels did not improve the predictive ability of the model. Conclusions: MI is a simple approach for analyzing case-cohort data and provides an easy evaluation of the predictive ability of a model or of an additional variable.
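    The following rough sketch (hypothetical `cohort` DataFrame and column names, 20 imputations chosen arbitrarily, design weights ignored) illustrates the general idea: treat the unmeasured phase-2 exposure as missing data, impute it with the outcome included in the imputation model, and fit an ordinary Cox model on each completed cohort.

```python
# Rough sketch under stated assumptions (not the paper's code): treating the
# case-cohort design as a missing-data problem by imputing the phase-2
# exposure for subjects outside the subcohort, then fitting an ordinary
# Cox model on each completed full cohort.
import numpy as np
from lifelines import CoxPHFitter
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Hypothetical columns: `biomarker` observed only for cases and the subcohort,
# phase-1 covariates `age`, `sex`, plus `time` and `event`.
cols = ["biomarker", "age", "sex", "event", "time"]   # outcome in the imputation model
hrs = []
for m in range(20):
    imp = IterativeImputer(sample_posterior=True, random_state=m)
    completed = cohort.copy()
    completed[cols] = imp.fit_transform(cohort[cols])
    cph = CoxPHFitter().fit(completed[["biomarker", "age", "sex", "time", "event"]],
                            duration_col="time", event_col="event")
    hrs.append(cph.params_["biomarker"])
# pool the log-hazard ratios (and their variances) with Rubin's rules
print(np.exp(np.mean(hrs)))
```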

    Integrated multiple mediation analysis: A robustness–specificity trade-off in causal structure

    Recent methodological developments in causal mediation analysis have addressed several issues regarding multiple mediators. However, these developed methods differ in their definitions of causal parameters, assumptions for identification, and interpretations of causal effects, making it unclear which method ought to be selected when investigating a given causal effect. Thus, in this study, we construct an integrated framework, which unifies all existing methodologies, as a standard for mediation analysis with multiple mediators. To clarify the relationship between existing methods, we propose four strategies for effect decomposition: two-way, partially forward, partially backward, and complete decompositions. This study reveals how the direct and indirect effects of each strategy are explicitly and correctly interpreted as path-specific effects under different causal mediation structures. In the integrated framework, we further verify the utility of the interventional analogues of direct and indirect effects, especially when natural direct and indirect effects cannot be identified or when cross-world exchangeability is invalid. Consequently, this study yields a robustness–specificity trade-off in the choice of strategies. Inverse probability weighting is considered for estimation. The four strategies are further applied to a simulation study for performance evaluation and for analyzing the Risk Evaluation of Viral Load Elevation and Associated Liver Disease/Cancer data set from Taiwan to investigate the causal effect of hepatitis C virus infection on mortality
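    As a point of reference for the decomposition strategies discussed above, the display below writes out one common convention, the two-way decomposition with the two mediators treated jointly as M(a) = (M1(a), M2(a)). This is a standard identity for natural effects rather than the paper's full framework; the partially forward, partially backward and complete strategies further split the indirect term into path-specific components.

```latex
% Two-way decomposition of the total effect (TE), mediators treated jointly
\begin{aligned}
\mathrm{TE} &= E\!\left[Y\big(1, M(1)\big) - Y\big(0, M(0)\big)\right] \\
            &= \underbrace{E\!\left[Y\big(1, M(1)\big) - Y\big(1, M(0)\big)\right]}_{\text{natural indirect effect via } (M_1, M_2)}
             + \underbrace{E\!\left[Y\big(1, M(0)\big) - Y\big(0, M(0)\big)\right]}_{\text{natural direct effect}}
\end{aligned}
```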

    Sensitivity Analysis for Not-at-Random Missing Data in Trial-Based Cost-Effectiveness Analysis: A Tutorial

    Cost-effectiveness analyses (CEA) of randomised controlled trials are a key source of information for health care decision makers. Missing data are, however, a common issue that can seriously undermine their validity. A major concern is that the chance of data being missing may be directly linked to the unobserved value itself [missing not at random (MNAR)]. For example, patients with poorer health may be less likely to complete quality-of-life questionnaires. However, the extent to which this occurs cannot be ascertained from the data at hand. Guidelines recommend conducting sensitivity analyses to assess the robustness of conclusions to plausible MNAR assumptions, but this is rarely done in practice, possibly because of a lack of practical guidance. This tutorial aims to address that gap by presenting an accessible framework and practical guidance for conducting sensitivity analysis for MNAR data in trial-based CEA. We review some of the methods for conducting sensitivity analysis, but focus on one particularly accessible approach, in which the data are multiply imputed and then modified to reflect plausible MNAR scenarios. We illustrate the implementation of this approach on a weight-loss trial, providing the software code. We then explore further issues around its use in practice.
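    A minimal sketch of the delta-adjustment recipe described above is given below, with hypothetical trial columns, a 0/1 treatment indicator `arm`, and a simple linear model for the treatment effect on quality of life. It illustrates the general pattern (impute under MAR, shift the imputed values, re-analyse, compare across scenarios), not the tutorial's own code; for brevity only the point estimates are pooled.

```python
# Minimal sketch (assumed names, not the tutorial's code) of delta adjustment:
# multiply impute the missing quality-of-life scores under MAR, then shift the
# imputed values by a range of deltas to represent plausible MNAR scenarios,
# and re-estimate the treatment effect under each scenario.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
import statsmodels.api as sm

deltas = [0.0, -0.05, -0.10]                     # QoL decrements applied to imputed values
cols = ["qol", "baseline_qol", "arm", "cost"]    # hypothetical trial columns
for delta in deltas:
    effects = []
    for m in range(20):
        imp = IterativeImputer(sample_posterior=True, random_state=m)
        completed = trial.copy()                 # `trial` is a hypothetical DataFrame
        completed[cols] = imp.fit_transform(trial[cols])
        was_missing = trial["qol"].isna()
        completed.loc[was_missing, "qol"] += delta   # the MNAR modification step
        X = sm.add_constant(completed[["arm", "baseline_qol"]])
        effects.append(sm.OLS(completed["qol"], X).fit().params["arm"])
    print(delta, np.mean(effects))               # pooled treatment effect per scenario
```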