Comparison of techniques for handling missing covariate data within prognostic modelling studies: a simulation study
Background: There is no consensus on the most appropriate approach to handling missing covariate data within prognostic modelling studies. Therefore, a simulation study was performed to assess the effects of different missing data techniques on the performance of a prognostic model.
Methods: Datasets were generated to resemble the skewed distributions seen in a motivating breast cancer example. Multivariate missing data were imposed on four covariates using four different mechanisms: missing completely at random (MCAR), missing at random (MAR), missing not at random (MNAR) and a combination of all three mechanisms. Five proportions of incomplete cases, from 5% to 75%, were considered. Complete case analysis (CC), single imputation (SI) and five multiple imputation (MI) techniques available within the R statistical software were investigated: a) a data augmentation (DA) approach assuming a multivariate normal distribution, b) DA assuming a general location model, c) regression switching imputation, d) regression switching with predictive mean matching (MICE-PMM) and e) flexible additive imputation models. A Cox proportional hazards model was fitted and appropriate estimates for the regression coefficients and model performance measures were obtained.
Results: Performing a CC analysis produced unbiased regression estimates but inflated standard errors, which affected the significance of the covariates in the model with 25% or more missingness. Using SI underestimated the variability, resulting in poor coverage even with 10% missingness. Of the MI approaches, applying MICE-PMM generally produced the least biased estimates, better coverage for the incomplete covariates and better model performance for all mechanisms. However, this MI approach still produced biased regression coefficient estimates for the incomplete skewed continuous covariates when 50% or more cases had missing data imposed with an MCAR, MAR or combined mechanism. When the missingness depended on the incomplete covariates, i.e. MNAR, estimates were biased with more than 10% incomplete cases for all MI approaches.
Conclusion: The results from this simulation study suggest that MICE-PMM may be the preferred MI approach, provided that less than 50% of the cases have missing data and the missing data are not MNAR.
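Predictive mean matching imputes each missing value with an observed "donor" value whose model prediction is close to the missing case's own prediction, so imputed values always come from the observed data. The study used implementations in R; below is a minimal, hypothetical Python sketch of single-variable PMM (all names invented), not the paper's code:

```python
import numpy as np

def pmm_impute(y, X, k=5, seed=None):
    """Impute missing entries of y via predictive mean matching.

    Fits a linear regression of y on X using the observed cases, then
    replaces each missing y with the observed value of one of its k
    nearest neighbours in predicted-mean space.
    """
    rng = np.random.default_rng(seed)
    obs = ~np.isnan(y)
    Xd = np.column_stack([np.ones(len(y)), X])       # add intercept
    beta, *_ = np.linalg.lstsq(Xd[obs], y[obs], rcond=None)
    pred = Xd @ beta                                 # predictions for all cases
    imputed = y.copy()
    for i in np.flatnonzero(~obs):
        # k observed cases whose predictions are closest to case i's
        d = np.abs(pred[obs] - pred[i])
        donors = y[obs][np.argsort(d)[:k]]
        imputed[i] = rng.choice(donors)              # draw one donor value
    return imputed
```

Repeating the draw over several seeds yields multiple completed datasets, which is what makes PMM usable inside a multiple imputation scheme such as MICE.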
MissForest - nonparametric missing value imputation for mixed-type data
Modern data acquisition based on high-throughput technology often faces the problem of missing data. Algorithms commonly used in the analysis of such large-scale data often depend on a complete data set. Missing value imputation offers a solution to this problem. However, the majority of available imputation methods are restricted to one type of variable only: continuous or categorical. For mixed-type data the different types are usually handled separately, so these methods ignore possible relations between variable types. We propose a nonparametric method which can cope with different types of variables simultaneously. We compare several state-of-the-art methods for the imputation of missing values. We propose and evaluate an iterative imputation method (missForest) based on a random forest. By averaging over many unpruned classification or regression trees, random forest intrinsically constitutes a multiple imputation scheme. Using the built-in out-of-bag error estimates of random forest, we are able to estimate the imputation error without the need for a test set. Evaluation is performed on multiple data sets coming from a diverse selection of biological fields with artificially introduced missing values ranging from 10% to 30%. We show that missForest can successfully handle missing values, particularly in data sets including different types of variables. In our comparative study, missForest outperforms other methods of imputation, especially in data settings where complex interactions and nonlinear relations are suspected. The out-of-bag imputation error estimates of missForest prove to be adequate in all settings. Additionally, missForest exhibits attractive computational efficiency and can cope with high-dimensional data.
Comment: Submitted to Oxford Journal's Bioinformatics on 3rd of May 201
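missForest itself is an R package. A common Python approximation of the same idea is scikit-learn's IterativeImputer with a random-forest estimator, which cycles through the variables, regressing each on the others, until the imputed matrix stabilises. A minimal sketch under that assumption (continuous variables only; missForest additionally handles categorical variables and reports an out-of-bag error estimate):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[rng.random(X.shape) < 0.2] = np.nan   # knock out ~20% of values at random

# Iterated random-forest imputation: each variable is regressed on the
# others in turn for up to max_iter rounds.
imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=100, random_state=0),
    max_iter=10,
    random_state=0,
)
X_imputed = imputer.fit_transform(X)
```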
Missing... presumed at random: cost-analysis of incomplete data
When collecting patient-level resource use data for statistical analysis, for some patients and in some categories of resource use the required count will not be observed. Although this problem must arise in most reported economic evaluations containing patient-level data, it is rare for authors to detail how the problem was overcome. Statistical packages may default to handling missing data through a so-called complete case analysis, while some recent cost-analyses have appeared to favour an available case approach. Both of these methods are problematic: complete case analysis is inefficient and is likely to be biased; available case analysis, by employing different numbers of observations for each resource use item, generates severe problems for standard statistical inference. Instead, we explore imputation methods for generating replacement values for missing data that permit a complete case analysis using the whole data set, and we illustrate these methods using two data sets that had incomplete resource use information.
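To illustrate the contrast, the hypothetical sketch below drops patients under complete case analysis but retains all of them after regression-based imputation. The column names and counts are invented, and the paper's particular imputation methods are not reproduced here:

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Hypothetical patient-level resource-use counts with missing entries.
costs = pd.DataFrame({
    "gp_visits":      [3, np.nan, 1, 4, 2],
    "inpatient_days": [0, 2, np.nan, 5, 1],
    "prescriptions":  [7, 4, 3, np.nan, 6],
})

# Complete case analysis silently discards every row with any gap.
complete_cases = costs.dropna()            # keeps only 2 of 5 patients

# Regression-based imputation keeps every patient in the analysis.
filled = IterativeImputer(random_state=0).fit_transform(costs)
totals = filled.sum(axis=1)                # per-patient total resource use
```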
Multilevel Weighted Support Vector Machine for Classification on Healthcare Data with Missing Values
This work is motivated by the needs of predictive analytics on healthcare data as represented by Electronic Medical Records. Such data is invariably problematic: noisy, with missing entries, and with imbalance in the classes of interest, leading to serious bias in predictive modeling. Since standard data mining methods often produce poor performance measures, we argue for the development of specialized techniques of data preprocessing and classification. In this paper, we propose a new method to simultaneously classify large datasets and reduce the effects of missing values. It is based on a multilevel framework of the cost-sensitive SVM and the expectation-maximization (EM) imputation method for missing values, which relies on iterated regression analyses. We compare the classification results of multilevel SVM-based algorithms on public benchmark datasets with imbalanced classes and missing values, as well as on real data in health applications, and show that our multilevel SVM-based method produces fast, more accurate and more robust classification results.
Comment: arXiv admin note: substantial text overlap with arXiv:1503.0625
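A single-level analogue of this pipeline can be pieced together in scikit-learn: iterated-regression (EM-style) imputation feeding a cost-sensitive SVM via class weights. This is a sketch under those assumptions, not the paper's multilevel framework:

```python
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Impute by iterated regressions, standardise, then fit an SVM whose
# class weights are inversely proportional to class frequencies,
# penalising errors on the rare class more heavily.
clf = make_pipeline(
    IterativeImputer(max_iter=10, random_state=0),
    StandardScaler(),
    SVC(kernel="rbf", class_weight="balanced"),
)
# Usage: clf.fit(X_train, y_train); clf.predict(X_test)
```

The multilevel aspect of the paper (coarsening the training set and refining the SVM across levels for speed) is not captured by this single-level sketch.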
Data Imputation through the Identification of Local Anomalies
We introduce a comprehensive statistical framework in a model-free setting for a complete treatment of localized data corruptions due to severe noise sources, e.g., an occluder in the case of a visual recording. Within this framework, we propose i) a novel algorithm to efficiently separate, i.e., detect and localize, possible corruptions from a given suspicious data instance and ii) a Maximum A Posteriori (MAP) estimator to impute the corrupted data. As a generalization of the Euclidean distance, we also propose a novel distance measure, which is based on the ranked deviations among the data attributes and is empirically shown to be superior in separating the corruptions. Our algorithm first splits the suspicious instance into parts through a binary partitioning tree in the space of data attributes and iteratively tests those parts to detect local anomalies using the nominal statistics extracted from an uncorrupted (clean) reference data set. Once each part is labeled as anomalous or normal, the corresponding binary patterns over this tree that characterize corruptions are identified and the affected attributes are imputed. Under a certain conditional independence structure assumed for the binary patterns, we analytically show that the false alarm rate of the introduced algorithm in detecting the corruptions is independent of the data and can be set directly without any parameter tuning. The proposed framework is tested on several well-known machine learning data sets with synthetically generated corruptions and is experimentally shown to produce remarkable improvements in classification performance, with strong corruption separation capabilities. Our experiments also indicate that the proposed algorithms outperform typical approaches and are robust to varying training phase conditions.
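The abstract does not give the distance measure's formula. One plausible reading, sketched below with hypothetical names, sorts the per-attribute deviations and sums only the smallest ones, so that a few heavily corrupted attributes cannot dominate the comparison the way they do under the Euclidean norm; the paper's exact definition may differ:

```python
import numpy as np

def ranked_deviation_distance(x, y, m=None):
    """Rank-based alternative to Euclidean distance (illustrative).

    Sorts the per-attribute absolute deviations between x and y and
    sums only the m smallest, discarding the largest deviations that
    localized corruptions would otherwise inflate.
    """
    dev = np.sort(np.abs(np.asarray(x, float) - np.asarray(y, float)))
    m = len(dev) if m is None else m
    return dev[:m].sum()

# With m = len(x) every attribute contributes; smaller m trims the
# attributes most likely to be corrupted out of the distance.
```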
Lung Segmentation from Chest X-rays using Variational Data Imputation
Pulmonary opacification is inflammation of the lungs caused by many respiratory ailments, including the novel coronavirus disease 2019 (COVID-19). Chest X-rays (CXRs) with such opacifications render regions of the lungs imperceptible, making it difficult to perform automated image analysis on them. In this work, we focus on segmenting lungs from such abnormal CXRs as part of a pipeline aimed at automated risk scoring of COVID-19 from CXRs. We treat the high-opacity regions as missing data and present a modified CNN-based image segmentation network that utilizes a deep generative model for data imputation. We train this model on normal CXRs with extensive data augmentation and demonstrate the usefulness of this model in extending to cases with extreme abnormalities.
Comment: Accepted to be presented at the first Workshop on the Art of Learning with Missing Values (Artemiss) hosted by the 37th International Conference on Machine Learning (ICML). Source code, training data and the trained models are available here: https://github.com/raghavian/lungVAE
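The core idea, treating opaque pixels as missing and filling them from a generative model trained on normal CXRs, can be sketched as below. The function and the vae interface are hypothetical; the authors' released code at the linked repository is authoritative:

```python
import torch

def impute_opacities(cxr, opacity_mask, vae):
    """Replace high-opacity pixels with a generative reconstruction.

    cxr:          (1, 1, H, W) chest X-ray tensor
    opacity_mask: (1, 1, H, W) boolean tensor, True where opacity
                  renders the lung imperceptible
    vae:          generative model trained on normal CXRs, assumed to
                  return a reconstruction of its input
    """
    with torch.no_grad():
        reconstruction = vae(cxr)      # a plausible "normal-looking" image
    # Keep observed pixels; impute only the masked (missing) ones.
    return torch.where(opacity_mask, reconstruction, cxr)
```

The imputed image can then be passed to a standard segmentation network, since the regions that confused it have been replaced with anatomically plausible content.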