Variable selection with Random Forests for missing data
Variable selection has been suggested for Random Forests to improve the efficiency of data prediction and interpretation. However, its basic element, the variable importance measure, cannot be computed in a straightforward way when data are missing. Therefore, an extensive simulation study was conducted to explore possible solutions, namely multiple imputation, complete case analysis and a newly suggested importance measure, for several missing data generating processes. The ability of these procedures to distinguish relevant from non-relevant variables was investigated in combination with two popular variable selection methods. Findings and recommendations: Complete case analysis should not be applied, as it led to inaccurate variable selection and to models with the worst prediction accuracy. Multiple imputation is a good means to select variables that would be of relevance in fully observed data, and it produced the best prediction accuracy. By contrast, the application of the new importance measure causes a selection of variables that reflects the actual data situation, i.e. one that takes the occurrence of missing values into account. Its error was only negligibly worse than that of imputation.
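To make the two baseline strategies above concrete, here is a minimal sketch contrasting complete case analysis with an imputation-based workflow before computing Random Forest permutation importances. The data are simulated, and sklearn's IterativeImputer stands in for full multiple imputation (which would pool importances over several imputed datasets); the study's newly proposed importance measure is not implemented here.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=500) > 0).astype(int)
X[rng.random(X.shape) < 0.2] = np.nan  # inject ~20% missing values at random

# Complete case analysis: drop every row containing a missing value.
cc = ~np.isnan(X).any(axis=1)
rf_cc = RandomForestClassifier(random_state=0).fit(X[cc], y[cc])
imp_cc = permutation_importance(rf_cc, X[cc], y[cc], random_state=0)

# Imputation: fill in missing values, then fit on all rows.
X_imp = IterativeImputer(random_state=0).fit_transform(X)
rf_mi = RandomForestClassifier(random_state=0).fit(X_imp, y)
imp_mi = permutation_importance(rf_mi, X_imp, y, random_state=0)

print("complete case:", imp_cc.importances_mean.round(3))
print("imputed:      ", imp_mi.importances_mean.round(3))
```

In line with the abstract's findings, the complete-case fit discards most rows here (roughly two thirds, given independent 20% missingness across five columns), which is one reason its variable selection degrades.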
Random Forest variable importance with missing data
Random Forests are commonly applied for data prediction and interpretation. The latter purpose is supported by variable importance measures that rate the relevance of predictors. Yet existing measures cannot be computed when the data contain missing values. Possible solutions are given by imputation methods, complete case analysis and a newly suggested importance measure. However, it is unknown to what extent these approaches are able to provide a reliable estimate of a variable's relevance. An extensive simulation study was performed to investigate this property for a variety of missing data generating processes. Findings and recommendations: Complete case analysis should not be applied, as it inappropriately penalized variables that were completely observed. The new importance measure is much better able to reflect decreased information exclusively for variables with missing values and should therefore be used to evaluate actual data situations. By contrast, multiple imputation allows for an estimation of the importances one would potentially observe in complete data situations.
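For readers unfamiliar with the importance measures the abstract refers to, here is a minimal sketch of classical permutation importance: a variable's importance is the drop in out-of-sample accuracy after its column is shuffled. Names and data are purely illustrative; note that this computation breaks down as soon as the test matrix contains missing values, which is exactly the gap the abstract addresses.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 4))
y = (X[:, 0] - X[:, 2] > 0).astype(int)  # only x0 and x2 are relevant
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

rf = RandomForestClassifier(random_state=1).fit(X_tr, y_tr)
base = rf.score(X_te, y_te)  # accuracy with intact predictors

for j in range(X.shape[1]):
    X_perm = X_te.copy()
    X_perm[:, j] = rng.permutation(X_perm[:, j])  # destroy x_j's signal
    print(f"x{j}: importance = {base - rf.score(X_perm, y_te):.3f}")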
Autostrata: Improved Automatic Stratification for Coarsened Exact Matching
We commonly adjust for confounding factors in analytical observational epidemiology to reduce biases that distort the results. Stratification and matching are standard methods for reducing confounder bias. Coarsened exact matching (CEM) is a recent method using stratification to coarsen variables into categorical variables to enable exact matching of exposed and nonexposed subjects. CEM’s standard approach to stratifying variables is histogram binning. However, histogram binning creates strata of uniform widths and does not distinguish between exposed and nonexposed. We present Autostrata, a novel algorithmic approach to stratification producing improved results in CEM and providing more control to the researcher.
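A minimal sketch of the histogram-binning baseline that Autostrata improves on: each continuous confounder is coarsened into equal-width bins, and only strata containing both exposed and unexposed subjects are retained for exact matching. Column names, the bin count and the simulated data are assumptions for illustration, not part of the original work.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
df = pd.DataFrame({
    "age": rng.normal(50, 12, 1000),
    "bmi": rng.normal(26, 4, 1000),
    "exposed": rng.integers(0, 2, 1000),
})

# Coarsen each continuous confounder into equal-width histogram bins.
for col in ["age", "bmi"]:
    df[col + "_bin"] = pd.cut(df[col], bins=5, labels=False)

# Exact matching on the coarsened strata: keep only strata that contain
# both exposed and unexposed subjects (the region of common support).
strata = df.groupby(["age_bin", "bmi_bin"])["exposed"]
ok = strata.transform(lambda s: s.nunique() == 2)
matched = df[ok]
print(f"retained {len(matched)} of {len(df)} subjects in matched strata")
```

The abstract's critique is visible in this sketch: the bin edges come from the marginal distribution alone and ignore where exposed and unexposed subjects actually sit, which is what a smarter stratification algorithm can exploit.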
Analysis of missing data with random forests
Random Forests are widely used for data prediction and interpretation purposes. They show many appealing characteristics, such as the ability to deal with high-dimensional data, complex interactions and correlations. Furthermore, missing values can easily be processed by the built-in procedure of surrogate splits. However, little is known about the properties of recursive partitioning in missing data situations. Therefore, extensive simulation studies and empirical evaluations have been conducted to gain deeper insight. In addition, new methods have been developed to enhance the methodology and solve current issues of data interpretation, prediction and variable selection.

A variable's relevance in a Random Forest can be assessed by means of importance measures. Unfortunately, existing methods cannot be applied when the data contain missing values. Thus, one of the most appreciated properties of Random Forests, their ability to handle missing values, is lost for the computation of such measures. This work presents a new approach that is designed to deal with missing values in an intuitive and straightforward way, yet retains widely appreciated qualities of existing methods. Results indicate that it meets sensible requirements and shows good variable ranking properties.

Random Forests provide variable selection that is usually based on importance measures. An extensive review of the corresponding literature led to the development of a new approach that is based on a profound theoretical framework and meets important statistical properties. A comparison to eight other popular methods showed that it controls the test-wise and family-wise error rate, provides a higher power to distinguish relevant from non-relevant variables and leads to models located among the best performing ones.

Alternative ways to handle missing values are the application of imputation methods and complete case analysis. Yet it is unknown to what extent these approaches are able to provide sensible variable rankings and meaningful variable selections. Investigations showed that complete case analysis leads to inaccurate variable selection, as it may inappropriately penalize the importance of fully observed variables. By contrast, the new importance measure decreases for variables with missing values and therefore causes selections that accurately reflect the information given in actual data situations. Multiple imputation leads to an assessment of a variable's importance, and to selection frequencies, that would be expected for data that was completely observed. In several performance evaluations the best prediction accuracy emerged from multiple imputation, closely followed by the application of surrogate splits. Complete case analysis clearly performed worst.
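Since surrogate splits are central to the abstract above, here is a minimal sketch of the idea: when the primary split variable is missing for an observation, a correlated surrogate variable whose split best agrees with the primary one routes the observation instead. Variables, thresholds and data are illustrative only, not taken from the original work.

```python
import numpy as np

rng = np.random.default_rng(3)
x1 = rng.normal(size=300)                            # primary split variable
x2 = 0.9 * x1 + rng.normal(scale=0.4, size=300)      # correlated surrogate

primary = x1 <= 0.0  # primary split rule: x1 <= 0 sends the observation left

# Choose the surrogate threshold on x2 that best reproduces the primary split.
cands = np.quantile(x2, np.linspace(0.05, 0.95, 19))
agree = [np.mean((x2 <= c) == primary) for c in cands]
best = cands[int(np.argmax(agree))]
print(f"surrogate rule: x2 <= {best:.2f}, agreement {max(agree):.2%}")

def route(x1_val, x2_val):
    """Route an observation, falling back to the surrogate if x1 is missing."""
    if not np.isnan(x1_val):
        return "left" if x1_val <= 0.0 else "right"
    return "left" if x2_val <= best else "right"

print(route(np.nan, -0.3))  # x1 missing: the surrogate decides (likely "left")
```

This fallback is what lets a tree score observations with missing values at prediction time, and it explains why surrogate splits came in close behind multiple imputation in the performance evaluations above.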
Towards evidence-based computational statistics: lessons from clinical research on the role and design of real-data benchmark studies
Background: The goal of medical research is to develop interventions that are in some sense superior, with respect to patient outcome, to interventions currently in use. Similarly, the goal of research in methodological computational statistics is to develop data analysis tools that are themselves superior to the existing tools. The methodology of the evaluation of medical interventions continues to be discussed extensively in the literature, and it is now well accepted that medicine should be at least partly “evidence-based”. Although we statisticians are convinced of the importance of unbiased, well-thought-out study designs and evidence-based approaches in the context of clinical research, we tend to ignore these principles when designing our own studies for evaluating statistical methods in the context of our methodological research.

Main message: In this paper, we draw an analogy between clinical trials and real-data-based benchmarking experiments in methodological statistical science, with datasets playing the role of patients and methods playing the role of medical interventions. Through this analogy, we suggest directions for improvement in the design and interpretation of studies which use real data to evaluate statistical methods, in particular with respect to dataset inclusion criteria and the reduction of various forms of bias. More generally, we discuss the concept of “evidence-based” statistical research, its limitations and its impact on the design and interpretation of real-data-based benchmark experiments.

Conclusion: We suggest that benchmark studies, i.e. assessments of statistical methods using real-world datasets, might benefit from adopting (some) concepts from evidence-based medicine towards the goal of more evidence-based statistical research.
Vaccination frequency in people newly diagnosed with multiple sclerosis
Background: Infections are discussed as a risk factor for multiple sclerosis (MS) development and relapses. This may lead to decreased vaccination frequency in newly diagnosed patients.

Objective: The aim of this study was to evaluate the relation of MS diagnosis to subsequent vaccination frequency.

Methods: Based on German ambulatory claims data from 2005 to 2019, regression models were used to assess the relation of MS diagnosis (n = 12,270) to vaccination. A cohort of patients with MS was compared to control cohorts with Crohn's disease, psoriasis, and without these autoimmune diseases (total n = 198,126) in the 5 years after and before diagnosis.

Results: Patients with MS were less likely to be vaccinated compared to persons without the autoimmune diseases 5 years after diagnosis (odds ratio = 0.91, p < 0.001). Exceptions were vaccinations against influenza (1.29, p < 0.001) and pneumococci (1.41, p < 0.001). Differences were strong but less pronounced after than before diagnosis (p < 0.001). The likelihood of vaccination was also lower compared to patients with Crohn's disease or psoriasis.

Conclusions: Patients with MS were not adequately vaccinated despite guideline recommendations. Increasing awareness about the importance of vaccination is warranted to reduce the risk of infection, in particular in patients with MS receiving immunotherapies.
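For readers less familiar with how an odds ratio like the 0.91 above is obtained, here is a minimal sketch of a logistic regression on a group indicator, with the odds ratio recovered by exponentiating the coefficient. The simulated data is purely illustrative and assumes the statsmodels package; it is not the study's claims data.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
ms = rng.integers(0, 2, 5000)             # 1 = MS cohort, 0 = control cohort
b0, b1 = np.log(0.3 / 0.7), np.log(0.91)  # true odds ratio for MS set to 0.91
p = 1 / (1 + np.exp(-(b0 + b1 * ms)))     # vaccination probability per person
vaccinated = rng.binomial(1, p)

# Logistic regression of vaccination status on the MS indicator.
model = sm.Logit(vaccinated, sm.add_constant(ms)).fit(disp=0)
print("estimated odds ratio:", np.exp(model.params[1]).round(3))
```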
Essential guidelines for computational method benchmarking
In computational biology and other sciences, researchers are frequently faced with a choice between several computational methods for performing data analyses. Benchmarking studies aim to rigorously compare the performance of different methods using well-characterized benchmark datasets, to determine the strengths of each method or to provide recommendations regarding suitable choices of methods for an analysis. However, benchmarking studies must be carefully designed and implemented to provide accurate, unbiased, and informative results. Here, we summarize key practical guidelines and recommendations for performing high-quality benchmarking analyses, based on our experiences in computational biology
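A minimal sketch of the benchmark design the guidelines describe: every method is run on every benchmark dataset and scored with the same metric, so comparisons are systematic rather than anecdotal. The methods, datasets and metric here are placeholders chosen for the sketch, not recommendations from the paper.

```python
from sklearn.datasets import load_breast_cancer, load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

methods = {
    "random_forest": RandomForestClassifier(random_state=0),
    "logistic_reg": make_pipeline(StandardScaler(),
                                  LogisticRegression(max_iter=1000)),
}
datasets = {"breast_cancer": load_breast_cancer(), "wine": load_wine()}

# Full methods-by-datasets grid, evaluated with one shared metric (accuracy).
for d_name, d in datasets.items():
    for m_name, m in methods.items():
        scores = cross_val_score(m, d.data, d.target, cv=5)
        print(f"{d_name:>13} | {m_name:>13} | "
              f"accuracy {scores.mean():.3f} +/- {scores.std():.3f}")
```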