Variable selection with Random Forests for missing data
Variable selection has been suggested for Random Forests to improve the efficiency of data prediction and interpretation. However, its basic element, i.e. variable importance measures, cannot be computed in a straightforward way when there is missing data. Therefore an extensive simulation study has been conducted to explore possible solutions, i.e. multiple imputation, complete case analysis and a newly suggested importance measure, for several missing data generating processes. The ability to distinguish relevant from non-relevant variables has been investigated for these procedures in combination with two popular variable selection methods. Findings and recommendations: Complete case analysis should not be applied, as it led to inaccurate variable selection and to models with the worst prediction accuracy. Multiple imputation is a good means to select variables that would be of relevance in fully observed data, and it produced the best prediction accuracy. By contrast, the application of the new importance measure leads to a selection of variables that reflects the actual data situation, i.e. one that takes the occurrence of missing values into account. Its error was only negligibly worse than that of imputation
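The study itself relies on R-based implementations and proper multiple imputation; as a rough Python sketch of the imputation route described above, the following code pools permutation importances over several stochastic imputations and then applies a naive selection rule. The data generator, the missingness mechanism, the number of imputations and the threshold are all illustrative assumptions, not the settings or selection methods evaluated in the paper.

```python
# Sketch only (not the study's setup): pool permutation importances over
# several stochastic imputations, then apply a naive selection rule.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(1)
X, y = make_regression(n_samples=400, n_features=8, n_informative=3,
                       noise=1.0, random_state=1)
X[rng.random(X.shape) < 0.2] = np.nan         # assumed MCAR: 20% of cells missing

importances = []
for m in range(5):                             # five imputed data sets
    X_m = IterativeImputer(sample_posterior=True, random_state=m).fit_transform(X)
    rf = RandomForestRegressor(n_estimators=300, random_state=m).fit(X_m, y)
    res = permutation_importance(rf, X_m, y, n_repeats=10, random_state=m)
    importances.append(res.importances_mean)

pooled = np.mean(importances, axis=0)
selected = np.where(pooled > pooled.std())[0]  # naive threshold, for illustration only
print("pooled importances:", np.round(pooled, 3))
print("selected variables:", selected)
```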
Random Forest variable importance with missing data
Random Forests are commonly applied for data prediction and interpretation. The latter purpose is supported by variable importance measures that rate the relevance of predictors. Yet existing measures cannot be computed when the data contain missing values. Possible solutions are given by imputation methods, complete case analysis and a newly suggested importance measure. However, it is unknown to what extent these approaches are able to provide a reliable estimate of a variable's relevance. An extensive simulation study was performed to investigate this property for a variety of missing data generating processes. Findings and recommendations: Complete case analysis should not be applied, as it inappropriately penalized variables that were completely observed. The new importance measure is much better able to reflect the reduced information exclusively for variables with missing values and should therefore be used to evaluate actual data situations. By contrast, multiple imputation allows for an estimation of the importances one would potentially observe in complete data situations
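The authors' new importance measure is not available in scikit-learn; as a minimal illustration of the two baseline strategies contrasted above, the sketch below computes permutation importance under complete case analysis versus simple mean imputation. The dataset, the single affected variable and the missingness rate are assumptions made for the example.

```python
# Illustrative sketch (not the paper's simulation design): permutation importance
# under complete-case analysis vs. simple mean imputation.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X, y = make_regression(n_samples=500, n_features=6, n_informative=3,
                       noise=1.0, random_state=0)

# Inject MCAR missingness into the first predictor (assumed mechanism).
X_miss = X.copy()
X_miss[rng.random(len(X_miss)) < 0.3, 0] = np.nan

def rf_importance(X_in, y_in):
    rf = RandomForestRegressor(n_estimators=300, random_state=0).fit(X_in, y_in)
    return permutation_importance(rf, X_in, y_in, n_repeats=20,
                                  random_state=0).importances_mean

# Complete-case analysis: drop every row with a missing value.
cc_rows = ~np.isnan(X_miss).any(axis=1)
imp_cc = rf_importance(X_miss[cc_rows], y[cc_rows])

# Imputation: fill missing entries, keep all rows.
X_imp = SimpleImputer(strategy="mean").fit_transform(X_miss)
imp_imputed = rf_importance(X_imp, y)

for j, (a, b) in enumerate(zip(imp_cc, imp_imputed)):
    print(f"X{j}: complete-case={a:.3f}  imputed={b:.3f}")
```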
Essential guidelines for computational method benchmarking
In computational biology and other sciences, researchers are frequently faced with a choice between several computational methods for performing data analyses. Benchmarking studies aim to rigorously compare the performance of different methods using well-characterized benchmark datasets, to determine the strengths of each method or to provide recommendations regarding suitable choices of methods for an analysis. However, benchmarking studies must be carefully designed and implemented to provide accurate, unbiased, and informative results. Here, we summarize key practical guidelines and recommendations for performing high-quality benchmarking analyses, based on our experiences in computational biology
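A minimal sketch of the kind of study design these guidelines address: every method is run on every well-characterized dataset under one common evaluation protocol, and all results are reported. The datasets, methods and metric used here are placeholders chosen for brevity, not recommendations from the paper.

```python
# Minimal benchmarking skeleton (illustrative only): each method is evaluated on
# each dataset with the same cross-validation protocol and metric.
import pandas as pd
from sklearn.datasets import load_breast_cancer, load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

datasets = {"breast_cancer": load_breast_cancer(return_X_y=True),
            "wine": load_wine(return_X_y=True)}
methods = {"logistic_regression": make_pipeline(StandardScaler(),
                                                LogisticRegression(max_iter=1000)),
           "random_forest": RandomForestClassifier(n_estimators=200,
                                                   random_state=0)}

rows = []
for dname, (X, y) in datasets.items():
    for mname, model in methods.items():
        scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
        rows.append({"dataset": dname, "method": mname,
                     "mean_accuracy": scores.mean(), "sd": scores.std()})

print(pd.DataFrame(rows))   # report all method-dataset combinations, not a selection
```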
Statistically reinforced machine learning for nonlinear patterns and variable interactions
Most statistical models assume linearity and few variable interactions, even though real‐world ecological patterns often result from nonlinear and highly interactive processes. We here introduce a set of novel empirical modeling techniques which can address this mismatch: statistically reinforced machine learning. We demonstrate the behaviors of three techniques (conditional inference tree, model‐based tree, and permutation‐based random forest) by analyzing an artificially generated example dataset that contains patterns based on nonlinearity and variable interactions. The results show the potential of statistically reinforced machine learning algorithms to detect nonlinear relationships and higher‐order interactions. Estimation reliability for any technique, however, depended on sample size. The applications of statistically reinforced machine learning approaches would be particularly beneficial for investigating (1) novel patterns for which shapes cannot be assumed a priori, (2) higher‐order interactions which are often overlooked in parametric statistics, (3) context dependency where patterns change depending on other conditions, (4) significance and effect sizes of variables while taking nonlinearity and variable interactions into account, and (5) a hypothesis using parametric statistics after identifying patterns using statistically reinforced machine learning techniques
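Conditional inference trees and model-based trees are typically fitted in R (e.g. with partykit) rather than Python; as a rough stand-in for the permutation-based random forest component, the sketch below generates data with a known nonlinear and interactive structure (the Friedman #1 benchmark, assumed here as an analogue of the article's artificial dataset) and checks whether permutation importance recovers the relevant variables. Rerunning with a smaller sample size illustrates the sample-size dependence noted above.

```python
# Sketch: can a random forest with permutation importance recover variables that
# act only through nonlinear terms and an interaction?
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Friedman #1: y = 10*sin(pi*x0*x1) + 20*(x2 - 0.5)**2 + 10*x3 + 5*x4 + noise;
# features x5..x9 are pure noise.
X, y = make_friedman1(n_samples=1000, n_features=10, noise=1.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestRegressor(n_estimators=500, random_state=0).fit(X_tr, y_tr)
imp = permutation_importance(rf, X_te, y_te, n_repeats=30, random_state=0)

for j in np.argsort(imp.importances_mean)[::-1]:
    print(f"x{j}: importance = {imp.importances_mean[j]:.3f}")
print("test R^2:", round(rf.score(X_te, y_te), 3))
```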
The concept of conditional method agreement trees with single measurements per subject
The concept of conditional method agreement is introduced, and corresponding approaches are proposed to define homogeneous subgroups in terms of the mean and variance of the differences between measurements
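The abstract does not spell out the estimation procedure, so the following is only a loose illustration of the underlying idea rather than the authors' algorithm: take the paired differences between two measurement methods, let a tree partition subjects by covariates, and summarize the mean and standard deviation of the differences per leaf. The data, covariates and tree type are all assumptions for the example.

```python
# Loose illustration of subgroup-wise agreement (not the authors' method):
# a regression tree on the paired differences, with per-leaf mean and SD.
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
n = 600
covariates = pd.DataFrame({"age": rng.uniform(20, 80, n),
                           "device_batch": rng.integers(0, 2, n)})
true_value = rng.normal(100, 15, n)
method_a = true_value + rng.normal(0, 2, n)
# Assumed scenario: method B drifts with age, creating a covariate-dependent bias.
method_b = true_value + 0.05 * covariates["age"] + rng.normal(0, 2, n)
diff = method_a - method_b

tree = DecisionTreeRegressor(max_depth=2, min_samples_leaf=50, random_state=0)
tree.fit(covariates, diff)

leaves = tree.apply(covariates)
summary = pd.DataFrame({"leaf": leaves, "diff": diff}).groupby("leaf")["diff"]
print(summary.agg(["count", "mean", "std"]))   # agreement per subgroup
```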
Autostrata: Improved Automatic Stratification for Coarsened Exact Matching
We commonly adjust for confounding factors in analytical observational epidemiology to reduce biases that distort the results. Stratification and matching are standard methods for reducing confounder bias. Coarsened exact matching (CEM) is a recent method using stratification to coarsen variables into categorical variables to enable exact matching of exposed and nonexposed subjects. CEM’s standard approach to stratifying variables is histogram binning. However, histogram binning creates strata of uniform widths and does not distinguish between exposed and nonexposed. We present Autostrata, a novel algorithmic approach to stratification producing improved results in CEM and providing more control to the researcher
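The Autostrata algorithm itself is not described in this abstract, so the sketch below only reproduces the baseline it improves upon: standard coarsened exact matching with histogram (equal-width) binning of continuous confounders and exact matching of exposed and nonexposed subjects within strata. The confounders, bin counts and data are illustrative assumptions.

```python
# Baseline CEM with histogram binning (illustrative; not the Autostrata algorithm):
# coarsen confounders into bins, then keep only strata containing both groups.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 2000
df = pd.DataFrame({"age": rng.normal(50, 12, n),
                   "bmi": rng.normal(27, 4, n),
                   "exposed": rng.integers(0, 2, n)})

# Coarsen each continuous confounder with equal-width (histogram) bins.
df["age_bin"] = pd.cut(df["age"], bins=5, labels=False)
df["bmi_bin"] = pd.cut(df["bmi"], bins=5, labels=False)

# A stratum is a combination of coarsened values; keep strata with both groups present.
strata = df.groupby(["age_bin", "bmi_bin"])["exposed"]
keep = strata.transform(lambda s: s.nunique() == 2)
matched = df[keep]

print(f"retained {len(matched)} of {n} subjects in "
      f"{matched.groupby(['age_bin', 'bmi_bin']).ngroups} matched strata")
```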
Towards evidence-based computational statistics: lessons from clinical research on the role and design of real-data benchmark studies
Background: The goal of medical research is to develop interventions that are in some sense superior, with respect to patient outcome, to interventions currently in use. Similarly, the goal of research in methodological computational statistics is to develop data analysis tools that are themselves superior to the existing tools. The methodology of the evaluation of medical interventions continues to be discussed extensively in the literature and it is now well accepted that medicine should be at least partly “evidence-based”. Although we statisticians are convinced of the importance of unbiased, well-thought-out study designs and evidence-based approaches in the context of clinical research, we tend to ignore these principles when designing our own studies for evaluating statistical methods in the context of our methodological research. Main message: In this paper, we draw an analogy between clinical trials and real-data-based benchmarking experiments in methodological statistical science, with datasets playing the role of patients and methods playing the role of medical interventions. Through this analogy, we suggest directions for improvement in the design and interpretation of studies which use real data to evaluate statistical methods, in particular with respect to dataset inclusion criteria and the reduction of various forms of bias. More generally, we discuss the concept of “evidence-based” statistical research, its limitations and its impact on the design and interpretation of real-data-based benchmark experiments. Conclusion: We suggest that benchmark studies (a method of assessing statistical methods using real-world datasets) might benefit from adopting some concepts from evidence-based medicine towards the goal of more evidence-based statistical research
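One concrete recommendation, pre-specified dataset inclusion criteria, can be made mechanical: candidate datasets are screened before any method is run, so inclusion cannot depend on results. The criteria and the dataset catalogue in this sketch are invented for illustration and are not taken from the paper.

```python
# Illustrative pre-specified inclusion criteria for a real-data benchmark,
# mirroring trial eligibility criteria. The catalogue below is made up.
candidate_datasets = [
    {"name": "cohort_A", "n_samples": 812, "n_features": 40, "missing_rate": 0.02},
    {"name": "cohort_B", "n_samples": 95, "n_features": 300, "missing_rate": 0.00},
    {"name": "registry_C", "n_samples": 5000, "n_features": 25, "missing_rate": 0.35},
]

def eligible(ds, min_n=200, max_missing=0.20):
    """Pre-registered criteria: enough samples, limited missingness."""
    return ds["n_samples"] >= min_n and ds["missing_rate"] <= max_missing

included = [ds["name"] for ds in candidate_datasets if eligible(ds)]
excluded = [ds["name"] for ds in candidate_datasets if not eligible(ds)]
print("included:", included)   # every included dataset is then analysed by every method
print("excluded:", excluded)   # exclusions are reported, not silently dropped
```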
Analysis of missing data with random forests
Random Forests are widely used for data prediction and interpretation purposes. They show many appealing characteristics, such as the ability to deal with high-dimensional data, complex interactions and correlations. Furthermore, missing values can easily be processed by the built-in procedure of surrogate splits. However, little is known about the properties of recursive partitioning in missing data situations. Therefore, extensive simulation studies and empirical evaluations have been conducted to gain deeper insight. In addition, new methods have been developed to enhance the methodology and solve current issues of data interpretation, prediction and variable selection.
A variable’s relevance in a Random Forest can be assessed by means of importance measures. Unfortunately, existing methods cannot be applied when the data contain missing values. Thus, one of the most appreciated properties of Random Forests, their ability to handle missing values, gets lost for the computation of such measures. This work presents a new approach that is designed to deal with missing values in an intuitive and straightforward way, yet retains the widely appreciated qualities of existing methods. Results indicate that it meets sensible requirements and shows good variable ranking properties.
Random Forests provide variable selection that is usually based on importance measures. An extensive review of the corresponding literature led to the development of a new approach that is based on a profound theoretical framework and satisfies important statistical properties. A comparison with eight other popular methods showed that it controls the test-wise and family-wise error rate, provides higher power to distinguish relevant from non-relevant variables and leads to models located among the best performing ones.
Alternative ways to handle missing values are the application of imputation methods and complete case analysis. Yet it is unknown to what extent these approaches are able to provide sensible variable rankings and meaningful variable selections. Investigations showed that complete case analysis leads to inaccurate variable selection, as it may inappropriately penalize the importance of fully observed variables. By contrast, the new importance measure decreases for variables with missing values and therefore causes selections that accurately reflect the information given in actual data situations. Multiple imputation leads to an assessment of a variable’s importance and to selection frequencies that would be expected for data that was completely observed. In several performance evaluations the best prediction accuracy emerged from multiple imputation, closely followed by the application of surrogate splits. Complete case analysis clearly performed worst
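Surrogate splits are available in R's rpart and party but not in scikit-learn, so the sketch below only contrasts the other two strategies compared above, complete case analysis and imputation, on the prediction side. The data, missingness mechanism and all settings are assumptions made for illustration, not the thesis' simulation design.

```python
# Sketch: prediction accuracy of a random forest under complete-case analysis
# vs. imputation, with MCAR missingness injected into the training data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=1500, n_features=10, n_informative=5,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

X_tr = X_tr.astype(float)
X_tr[rng.random(X_tr.shape) < 0.25] = np.nan      # 25% of training cells missing

# Complete-case analysis: train only on fully observed rows.
cc = ~np.isnan(X_tr).any(axis=1)
rf_cc = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr[cc], y_tr[cc])

# Imputation: fill in the training data, keep every row.
imputer = SimpleImputer(strategy="mean").fit(X_tr)
rf_imp = RandomForestClassifier(n_estimators=300, random_state=0).fit(
    imputer.transform(X_tr), y_tr)

print("complete-case accuracy:", round(rf_cc.score(X_te, y_te), 3))
print("imputation accuracy:   ", round(rf_imp.score(X_te, y_te), 3))
```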
The stationary phase-specific sRNA FimR2 is a multifunctional regulator of bacterial motility, biofilm formation and virulence
Bacterial pathogens employ a plethora of virulence factors for host invasion, and their use is tightly regulated to maximize infection efficiency and manage resources in a nutrient-limited environment. Here we show that during Escherichia coli stationary phase the 3' UTR-derived small non-coding RNA FimR2 regulates fimbrial and flagellar biosynthesis at the post-transcriptional level, leading to biofilm formation as the dominant mode of survival under conditions of nutrient depletion. FimR2 interacts with the translational regulator CsrA, antagonizing its functions and firmly tightening control over motility and biofilm formation. Generated through RNase E cleavage, FimR2 regulates stationary phase biology by fine-tuning target mRNA levels independently of the chaperones Hfq and ProQ. The Salmonella enterica orthologue of FimR2 induces effector protein secretion by the type III secretion system and stimulates infection, thus linking the sRNA to virulence. This work reveals the importance of bacterial sRNAs in modulating various aspects of bacterial physiology including stationary phase and virulence
