Variable selection with Random Forests for missing data
Variable selection has been suggested for Random Forests to improve the efficiency of data prediction and interpretation. However, its basic element, i.e. variable importance measures, cannot be computed in a straightforward way when there is missing data. Therefore an extensive simulation study has been conducted to explore possible solutions, i.e. multiple imputation, complete case analysis and a newly suggested importance measure, for several missing data generating processes. The ability to distinguish relevant from non-relevant variables has been investigated for these procedures in combination with two popular variable selection methods. Findings and recommendations: Complete case analysis should not be applied, as it led to inaccurate variable selection and to models with the worst prediction accuracy. Multiple imputation is a good means to select variables that would be of relevance in fully observed data, and it produced the best prediction accuracy. By contrast, the application of the new importance measure leads to a selection of variables that reflects the actual data situation, i.e. one that takes the occurrence of missing values into account. Its error was only negligibly worse than that of imputation
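The study itself relies on R-based implementations and proper multiple imputation; as a rough Python sketch of the imputation route described above, the following code pools permutation importances over several stochastic imputations and then applies a naive selection rule. The data generator, the missingness mechanism, the number of imputations and the threshold are all illustrative assumptions, not the settings or selection methods evaluated in the paper.

```python
# Sketch only (not the study's setup): pool permutation importances over
# several stochastic imputations, then apply a naive selection rule.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(1)
X, y = make_regression(n_samples=400, n_features=8, n_informative=3,
                       noise=1.0, random_state=1)
X[rng.random(X.shape) < 0.2] = np.nan         # assumed MCAR: 20% of cells missing

importances = []
for m in range(5):                             # five imputed data sets
    X_m = IterativeImputer(sample_posterior=True, random_state=m).fit_transform(X)
    rf = RandomForestRegressor(n_estimators=300, random_state=m).fit(X_m, y)
    res = permutation_importance(rf, X_m, y, n_repeats=10, random_state=m)
    importances.append(res.importances_mean)

pooled = np.mean(importances, axis=0)
selected = np.where(pooled > pooled.std())[0]  # naive threshold, for illustration only
print("pooled importances:", np.round(pooled, 3))
print("selected variables:", selected)
```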
Random Forest variable importance with missing data
Random Forests are commonly applied for data prediction and interpretation. The latter purpose is supported by variable importance measures that rate the relevance of predictors. Yet existing measures cannot be computed when the data contain missing values. Possible solutions are given by imputation methods, complete case analysis and a newly suggested importance measure. However, it is unknown to what extent these approaches are able to provide a reliable estimate of a variable's relevance. An extensive simulation study was performed to investigate this property for a variety of missing data generating processes. Findings and recommendations: Complete case analysis should not be applied, as it inappropriately penalized variables that were completely observed. The new importance measure is much better able to reflect the reduced information exclusively for variables with missing values and should therefore be used to evaluate actual data situations. By contrast, multiple imputation allows for an estimation of the importances one would potentially observe in complete data situations
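The authors' new importance measure is not available in scikit-learn; as a minimal illustration of the two baseline strategies contrasted above, the sketch below computes permutation importance under complete case analysis versus simple mean imputation. The dataset, the single affected variable and the missingness rate are assumptions made for the example.

```python
# Illustrative sketch (not the paper's simulation design): permutation importance
# under complete-case analysis vs. simple mean imputation.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X, y = make_regression(n_samples=500, n_features=6, n_informative=3,
                       noise=1.0, random_state=0)

# Inject MCAR missingness into the first predictor (assumed mechanism).
X_miss = X.copy()
X_miss[rng.random(len(X_miss)) < 0.3, 0] = np.nan

def rf_importance(X_in, y_in):
    rf = RandomForestRegressor(n_estimators=300, random_state=0).fit(X_in, y_in)
    return permutation_importance(rf, X_in, y_in, n_repeats=20,
                                  random_state=0).importances_mean

# Complete-case analysis: drop every row with a missing value.
cc_rows = ~np.isnan(X_miss).any(axis=1)
imp_cc = rf_importance(X_miss[cc_rows], y[cc_rows])

# Imputation: fill missing entries, keep all rows.
X_imp = SimpleImputer(strategy="mean").fit_transform(X_miss)
imp_imputed = rf_importance(X_imp, y)

for j, (a, b) in enumerate(zip(imp_cc, imp_imputed)):
    print(f"X{j}: complete-case={a:.3f}  imputed={b:.3f}")
```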
Essential guidelines for computational method benchmarking
In computational biology and other sciences, researchers are frequently faced with a choice between several computational methods for performing data analyses. Benchmarking studies aim to rigorously compare the performance of different methods using well-characterized benchmark datasets, to determine the strengths of each method or to provide recommendations regarding suitable choices of methods for an analysis. However, benchmarking studies must be carefully designed and implemented to provide accurate, unbiased, and informative results. Here, we summarize key practical guidelines and recommendations for performing high-quality benchmarking analyses, based on our experiences in computational biology
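A minimal sketch of the kind of study design these guidelines address: every method is run on every well-characterized dataset under one common evaluation protocol, and all results are reported. The datasets, methods and metric used here are placeholders chosen for brevity, not recommendations from the paper.

```python
# Minimal benchmarking skeleton (illustrative only): each method is evaluated on
# each dataset with the same cross-validation protocol and metric.
import pandas as pd
from sklearn.datasets import load_breast_cancer, load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

datasets = {"breast_cancer": load_breast_cancer(return_X_y=True),
            "wine": load_wine(return_X_y=True)}
methods = {"logistic_regression": make_pipeline(StandardScaler(),
                                                LogisticRegression(max_iter=1000)),
           "random_forest": RandomForestClassifier(n_estimators=200,
                                                   random_state=0)}

rows = []
for dname, (X, y) in datasets.items():
    for mname, model in methods.items():
        scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
        rows.append({"dataset": dname, "method": mname,
                     "mean_accuracy": scores.mean(), "sd": scores.std()})

print(pd.DataFrame(rows))   # report all method-dataset combinations, not a selection
```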
Statistically reinforced machine learning for nonlinear patterns and variable interactions
Most statistical models assume linearity and few variable interactions, even though real‐world ecological patterns often result from nonlinear and highly interactive processes. We here introduce a set of novel empirical modeling techniques which can address this mismatch: statistically reinforced machine learning. We demonstrate the behaviors of three techniques (conditional inference tree, model‐based tree, and permutation‐based random forest) by analyzing an artificially generated example dataset that contains patterns based on nonlinearity and variable interactions. The results show the potential of statistically reinforced machine learning algorithms to detect nonlinear relationships and higher‐order interactions. Estimation reliability for any technique, however, depended on sample size. The applications of statistically reinforced machine learning approaches would be particularly beneficial for investigating (1) novel patterns for which shapes cannot be assumed a priori, (2) higher‐order interactions which are often overlooked in parametric statistics, (3) context dependency where patterns change depending on other conditions, (4) significance and effect sizes of variables while taking nonlinearity and variable interactions into account, and (5) a hypothesis using parametric statistics after identifying patterns using statistically reinforced machine learning techniques
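Conditional inference trees and model-based trees are typically fitted in R (e.g. with partykit) rather than Python; as a rough stand-in for the permutation-based random forest component, the sketch below generates data with a known nonlinear and interactive structure (the Friedman #1 benchmark, assumed here as an analogue of the article's artificial dataset) and checks whether permutation importance recovers the relevant variables. Rerunning with a smaller sample size illustrates the sample-size dependence noted above.

```python
# Sketch: can a random forest with permutation importance recover variables that
# act only through nonlinear terms and an interaction?
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Friedman #1: y = 10*sin(pi*x0*x1) + 20*(x2 - 0.5)**2 + 10*x3 + 5*x4 + noise;
# features x5..x9 are pure noise.
X, y = make_friedman1(n_samples=1000, n_features=10, noise=1.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestRegressor(n_estimators=500, random_state=0).fit(X_tr, y_tr)
imp = permutation_importance(rf, X_te, y_te, n_repeats=30, random_state=0)

for j in np.argsort(imp.importances_mean)[::-1]:
    print(f"x{j}: importance = {imp.importances_mean[j]:.3f}")
print("test R^2:", round(rf.score(X_te, y_te), 3))
```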
The concept of conditional method agreement trees with single measurements per subject
The concept of conditional method agreement is introduced, and corresponding approaches are proposed to define homogeneous subgroups in terms of the mean and variance of the differences between measurements
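The abstract does not spell out the estimation procedure, so the following is only a loose illustration of the underlying idea rather than the authors' algorithm: take the paired differences between two measurement methods, let a tree partition subjects by covariates, and summarize the mean and standard deviation of the differences per leaf. The data, covariates and tree type are all assumptions for the example.

```python
# Loose illustration of subgroup-wise agreement (not the authors' method):
# a regression tree on the paired differences, with per-leaf mean and SD.
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
n = 600
covariates = pd.DataFrame({"age": rng.uniform(20, 80, n),
                           "device_batch": rng.integers(0, 2, n)})
true_value = rng.normal(100, 15, n)
method_a = true_value + rng.normal(0, 2, n)
# Assumed scenario: method B drifts with age, creating a covariate-dependent bias.
method_b = true_value + 0.05 * covariates["age"] + rng.normal(0, 2, n)
diff = method_a - method_b

tree = DecisionTreeRegressor(max_depth=2, min_samples_leaf=50, random_state=0)
tree.fit(covariates, diff)

leaves = tree.apply(covariates)
summary = pd.DataFrame({"leaf": leaves, "diff": diff}).groupby("leaf")["diff"]
print(summary.agg(["count", "mean", "std"]))   # agreement per subgroup
```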
Autostrata: Improved Automatic Stratification for Coarsened Exact Matching
We commonly adjust for confounding factors in analytical observational epidemiology to reduce biases that distort the results. Stratification and matching are standard methods for reducing confounder bias. Coarsened exact matching (CEM) is a recent method using stratification to coarsen variables into categorical variables to enable exact matching of exposed and nonexposed subjects. CEM’s standard approach to stratifying variables is histogram binning. However, histogram binning creates strata of uniform widths and does not distinguish between exposed and nonexposed. We present Autostrata, a novel algorithmic approach to stratification producing improved results in CEM and providing more control to the researcher
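The Autostrata algorithm itself is not described in this abstract, so the sketch below only reproduces the baseline it improves upon: standard coarsened exact matching with histogram (equal-width) binning of continuous confounders and exact matching of exposed and nonexposed subjects within strata. The confounders, bin counts and data are illustrative assumptions.

```python
# Baseline CEM with histogram binning (illustrative; not the Autostrata algorithm):
# coarsen confounders into bins, then keep only strata containing both groups.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 2000
df = pd.DataFrame({"age": rng.normal(50, 12, n),
                   "bmi": rng.normal(27, 4, n),
                   "exposed": rng.integers(0, 2, n)})

# Coarsen each continuous confounder with equal-width (histogram) bins.
df["age_bin"] = pd.cut(df["age"], bins=5, labels=False)
df["bmi_bin"] = pd.cut(df["bmi"], bins=5, labels=False)

# A stratum is a combination of coarsened values; keep strata with both groups present.
strata = df.groupby(["age_bin", "bmi_bin"])["exposed"]
keep = strata.transform(lambda s: s.nunique() == 2)
matched = df[keep]

print(f"retained {len(matched)} of {n} subjects in "
      f"{matched.groupby(['age_bin', 'bmi_bin']).ngroups} matched strata")
```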
Towards evidence-based computational statistics: lessons from clinical research on the role and design of real-data benchmark studies
Background: The goal of medical research is to develop interventions that are in some sense superior, with respect to patient outcome, to interventions currently in use. Similarly, the goal of research in methodological computational statistics is to develop data analysis tools that are themselves superior to the existing tools. The methodology of the evaluation of medical interventions continues to be discussed extensively in the literature and it is now well accepted that medicine should be at least partly “evidence-based”. Although we statisticians are convinced of the importance of unbiased, well-thought-out study designs and evidence-based approaches in the context of clinical research, we tend to ignore these principles when designing our own studies for evaluating statistical methods in the context of our methodological research. Main message: In this paper, we draw an analogy between clinical trials and real-data-based benchmarking experiments in methodological statistical science, with datasets playing the role of patients and methods playing the role of medical interventions. Through this analogy, we suggest directions for improvement in the design and interpretation of studies which use real data to evaluate statistical methods, in particular with respect to dataset inclusion criteria and the reduction of various forms of bias. More generally, we discuss the concept of “evidence-based” statistical research, its limitations and its impact on the design and interpretation of real-data-based benchmark experiments. Conclusion: We suggest that benchmark studies (a method of assessing statistical methods using real-world datasets) might benefit from adopting some concepts from evidence-based medicine towards the goal of more evidence-based statistical research
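One concrete recommendation, pre-specified dataset inclusion criteria, can be made mechanical: candidate datasets are screened before any method is run, so inclusion cannot depend on results. The criteria and the dataset catalogue in this sketch are invented for illustration and are not taken from the paper.

```python
# Illustrative pre-specified inclusion criteria for a real-data benchmark,
# mirroring trial eligibility criteria. The catalogue below is made up.
candidate_datasets = [
    {"name": "cohort_A", "n_samples": 812, "n_features": 40, "missing_rate": 0.02},
    {"name": "cohort_B", "n_samples": 95, "n_features": 300, "missing_rate": 0.00},
    {"name": "registry_C", "n_samples": 5000, "n_features": 25, "missing_rate": 0.35},
]

def eligible(ds, min_n=200, max_missing=0.20):
    """Pre-registered criteria: enough samples, limited missingness."""
    return ds["n_samples"] >= min_n and ds["missing_rate"] <= max_missing

included = [ds["name"] for ds in candidate_datasets if eligible(ds)]
excluded = [ds["name"] for ds in candidate_datasets if not eligible(ds)]
print("included:", included)   # every included dataset is then analysed by every method
print("excluded:", excluded)   # exclusions are reported, not silently dropped
```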
Analysis of missing data with random forests
Random Forests are widely used for data prediction and interpretation purposes. They show many appealing characteristics, such as the ability to deal with high-dimensional data, complex interactions and correlations. Furthermore, missing values can easily be processed by the built-in procedure of surrogate splits. However, little is known about the properties of recursive partitioning in missing data situations. Therefore, extensive simulation studies and empirical evaluations have been conducted to gain deeper insight. In addition, new methods have been developed to enhance the methodology and solve current issues of data interpretation, prediction and variable selection.
A variable’s relevance in a Random Forest can be assessed by means of importance measures. Unfortunately, existing methods cannot be applied when the data contain missing values. Thus, one of the most appreciated properties of Random Forests, their ability to handle missing values, gets lost for the computation of such measures. This work presents a new approach that is designed to deal with missing values in an intuitive and straightforward way, yet retains the widely appreciated qualities of existing methods. Results indicate that it meets sensible requirements and shows good variable ranking properties.
Random Forests provide variable selection that is usually based on importance measures. An extensive review of the corresponding literature led to the development of a new approach that is based on a profound theoretical framework and satisfies important statistical properties. A comparison with eight other popular methods showed that it controls the test-wise and family-wise error rate, provides higher power to distinguish relevant from non-relevant variables and leads to models located among the best performing ones.
Alternative ways to handle missing values are the application of imputation methods and complete case analysis. Yet it is unknown to what extent these approaches are able to provide sensible variable rankings and meaningful variable selections. Investigations showed that complete case analysis leads to inaccurate variable selection, as it may inappropriately penalize the importance of fully observed variables. By contrast, the new importance measure decreases for variables with missing values and therefore causes selections that accurately reflect the information given in actual data situations. Multiple imputation leads to an assessment of a variable’s importance and to selection frequencies that would be expected for data that was completely observed. In several performance evaluations the best prediction accuracy emerged from multiple imputation, closely followed by the application of surrogate splits. Complete case analysis clearly performed worst
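Surrogate splits are available in R's rpart and party but not in scikit-learn, so the sketch below only contrasts the other two strategies compared above, complete case analysis and imputation, on the prediction side. The data, missingness mechanism and all settings are assumptions made for illustration, not the thesis' simulation design.

```python
# Sketch: prediction accuracy of a random forest under complete-case analysis
# vs. imputation, with MCAR missingness injected into the training data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=1500, n_features=10, n_informative=5,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

X_tr = X_tr.astype(float)
X_tr[rng.random(X_tr.shape) < 0.25] = np.nan      # 25% of training cells missing

# Complete-case analysis: train only on fully observed rows.
cc = ~np.isnan(X_tr).any(axis=1)
rf_cc = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr[cc], y_tr[cc])

# Imputation: fill in the training data, keep every row.
imputer = SimpleImputer(strategy="mean").fit(X_tr)
rf_imp = RandomForestClassifier(n_estimators=300, random_state=0).fit(
    imputer.transform(X_tr), y_tr)

print("complete-case accuracy:", round(rf_cc.score(X_te, y_te), 3))
print("imputation accuracy:   ", round(rf_imp.score(X_te, y_te), 3))
```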
The stationary phase-specific sRNA FimR2 is a multifunctional regulator of bacterial motility, biofilm formation and virulence
Bacterial pathogens employ a plethora of virulence factors for host invasion, and their use is tightly regulated to maximize infection efficiency and manage resources in a nutrient-limited environment. Here we show that during Escherichia coli stationary phase the 3' UTR-derived small non-coding RNA FimR2 regulates fimbrial and flagellar biosynthesis at the post-transcriptional level, leading to biofilm formation as the dominant mode of survival under conditions of nutrient depletion. FimR2 interacts with the translational regulator CsrA, antagonizing its functions and firmly tightening control over motility and biofilm formation. Generated through RNase E cleavage, FimR2 regulates stationary phase biology by fine-tuning target mRNA levels independently of the chaperones Hfq and ProQ. The Salmonella enterica orthologue of FimR2 induces effector protein secretion by the type III secretion system and stimulates infection, thus linking the sRNA to virulence. This work reveals the importance of bacterial sRNAs in modulating various aspects of bacterial physiology including stationary phase and virulence
