Reliable ABC model choice via random forests
Approximate Bayesian computation (ABC) methods provide an elaborate approach
to Bayesian inference on complex models, including model choice. Both
theoretical arguments and simulation experiments indicate, however, that model
posterior probabilities may be poorly evaluated by standard ABC techniques. We
propose a novel approach based on a machine learning tool named random forests
to conduct selection among the highly complex models covered by ABC algorithms.
We thus modify the way Bayesian model selection is both understood and
operated, in that we rephrase the inferential goal as a classification problem,
first predicting the model that best fits the data with random forests and
postponing the approximation of the posterior probability of the predicted MAP
for a second stage also relying on random forests. Compared with earlier
implementations of ABC model choice, the ABC random forest approach offers
several potential improvements: (i) it often has a larger discriminative power
among the competing models, (ii) it is more robust against the number and
choice of statistics summarizing the data, (iii) the computing effort is
drastically reduced (computational efficiency improves by a factor of at least fifty),
and (iv) it includes an approximation of the posterior probability of the
selected model. The use of random forests will undoubtedly extend the range of
dataset sizes and model complexities that ABC can handle. We illustrate
the power of this novel methodology by analyzing controlled experiments as well
as genuine population genetics datasets. The proposed methodologies are
implemented in the R package abcrf, available on CRAN.
Comment: 39 pages, 15 figures, 6 tables
Narrowing the Gap: Random Forests In Theory and In Practice
Despite widespread interest and practical use, the theoretical properties of
random forests are still not well understood. In this paper we contribute to
this understanding in two ways. We present a new theoretically tractable
variant of random regression forests and prove that our algorithm is
consistent. We also provide an empirical evaluation, comparing our algorithm
and other theoretically tractable random forest models to the random forest
algorithm used in practice. Our experiments provide insight into the relative
importance of different simplifications that theoreticians have made to obtain
tractable models for analysis.
Comment: Under review by the International Conference on Machine Learning
(ICML) 201
Stacking for machine learning redshifts applied to SDSS galaxies
We present an analysis of a general machine learning technique called
'stacking' for the estimation of photometric redshifts. Stacking techniques can
feed the photometric redshift estimate, as output by a base algorithm, back
into the same algorithm as an additional input feature in a subsequent learning
round. We show that all tested base algorithms benefit from at least one
additional stacking round (or layer). To demonstrate the benefit of stacking,
we apply the method to both unsupervised machine learning techniques based on
self-organising maps (SOMs), and supervised machine learning methods based on
decision trees. We explore a range of stacking architectures, such as the
number of layers and the number of base learners per layer. Finally we explore
the effectiveness of stacking even when using a successful algorithm such as
AdaBoost. We observe a significant improvement of between 1.9% and 21% on all
computed metrics when stacking is applied to weak learners (such as SOMs and
decision trees). When applied to strong learning algorithms (such as AdaBoost),
the improvement shrinks but remains positive, between 0.4% and 2.5% for the
explored metrics, and comes at almost no additional computational cost.
Comment: 13 pages, 3 tables, 7 figures; version accepted by MNRAS, minor text
updates. Results and conclusions unchanged
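The stacking idea in this abstract (feeding a base algorithm's estimate back into the same algorithm as an extra input feature) can be illustrated with a minimal sketch. It rests on several assumptions that are not in the abstract: the base learner is a plain k-nearest-neighbour regressor standing in for the SOMs and decision trees, the names `knn_predict` and `stack` are hypothetical, the data are synthetic, and training-row estimates are computed leave-one-out so that a row's own target does not leak into its new feature.

```python
import random


def knn_predict(train_X, train_y, query, k=5, skip=None):
    """Mean target of the k nearest training rows (squared Euclidean distance)."""
    dists = []
    for i, row in enumerate(train_X):
        if i == skip:  # leave this training row out (used for leave-one-out estimates)
            continue
        d = sum((a - b) ** 2 for a, b in zip(row, query))
        dists.append((d, train_y[i]))
    dists.sort(key=lambda t: t[0])
    nearest = dists[:k]
    return sum(y for _, y in nearest) / len(nearest)


def stack(train_X, train_y, test_X, rounds=1, k=5):
    """Apply the base learner, appending its estimate as a new feature each round."""
    tr = [list(r) for r in train_X]
    te = [list(r) for r in test_X]
    for _ in range(rounds):
        # Estimates from the current feature set, computed before it is augmented.
        tr_est = [knn_predict(tr, train_y, row, k, skip=i) for i, row in enumerate(tr)]
        te_est = [knn_predict(tr, train_y, row, k) for row in te]
        # Feed the estimates back in as an additional input feature.
        tr = [row + [e] for row, e in zip(tr, tr_est)]
        te = [row + [e] for row, e in zip(te, te_est)]
    # Final learning round on the augmented features.
    return [knn_predict(tr, train_y, row, k) for row in te]


# Synthetic toy data (an assumption for illustration only).
random.seed(0)
X = [[random.random(), random.random()] for _ in range(40)]
y = [a + 2 * b for a, b in X]
queries = [[0.5, 0.5], [0.1, 0.9]]
print(stack(X, y, queries, rounds=0))  # base learner alone
print(stack(X, y, queries, rounds=1))  # one additional stacking round
```

Whether an extra round helps depends on the base learner and the data; the sketch only shows the mechanics of a stacking round, not the gains reported in the abstract.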
Cross-validation and Peeling Strategies for Survival Bump Hunting using Recursive Peeling Methods
We introduce a framework to build a survival/risk bump hunting model with a
censored time-to-event response. Our Survival Bump Hunting (SBH) method is
based on a recursive peeling procedure that uses a specific survival peeling
criterion derived from non/semi-parametric statistics such as the
hazards-ratio, the log-rank test or the Nelson-Aalen estimator. To optimize the
tuning parameter of the model and validate it, we introduce an objective
function based on survival or prediction-error statistics, such as the log-rank
test and the concordance error rate. We also describe two alternative
cross-validation techniques adapted to the joint task of decision-rule making
by recursive peeling and survival estimation. Numerical analyses show the
importance of replicated cross-validation and the differences between criteria
and techniques in both low and high-dimensional settings. Although several
non-parametric survival models exist, none addresses the problem of directly
identifying local extrema. We show that SBH, unlike other models, efficiently
estimates extreme survival/risk subgroups. This provides an insight into the
behavior of commonly used models and suggests alternatives to be adopted in
practice. Finally, our SBH framework was applied to a clinical dataset. In it,
we identified subsets of patients characterized by clinical and demographic
covariates with a distinct extreme survival outcome, for which tailored medical
interventions could be made. An R package `PRIMsrc` is available on CRAN and
GitHub.
Comment: Keywords: Exploratory Survival/Risk Analysis, Survival/Risk
Estimation & Prediction, Non-Parametric Method, Cross-Validation, Bump
Hunting, Rule-Induction Method
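A recursive peeling procedure of the kind this abstract describes can be sketched in a few lines. This is a simplified illustration, not the SBH method or the `PRIMsrc` API: the peeling criterion is a crude events-per-follow-up-time rate rather than the hazards ratio, log-rank test or Nelson-Aalen estimator, peeling stops at the first step that does not improve the criterion (no cross-validation), ties in covariate values are assumed absent, and the names `hazard_rate` and `peel` and the toy data are hypothetical.

```python
def hazard_rate(times, events, idx):
    """Events per unit of follow-up time among rows idx (a crude, censoring-aware risk score)."""
    total_time = sum(times[i] for i in idx)
    return sum(events[i] for i in idx) / total_time if total_time > 0 else 0.0


def peel(X, times, events, alpha=0.1, min_support=0.2):
    """Greedy peeling: shrink a box around a high-risk subgroup, one face at a time."""
    n, p = len(X), len(X[0])
    idx = list(range(n))
    box = [[float("-inf"), float("inf")] for _ in range(p)]
    while True:
        best = None
        n_drop = max(1, int(alpha * len(idx)))
        for j in range(p):
            order = sorted(idx, key=lambda i: X[i][j])
            # Candidate peels: drop the lowest or the highest alpha-fraction on variable j.
            for side, kept in (("low", order[n_drop:]), ("high", order[:-n_drop])):
                if len(kept) < min_support * n:
                    continue  # the box would fall below the minimum support
                score = hazard_rate(times, events, kept)
                if best is None or score > best[0]:
                    best = (score, j, side, kept)
        # Stop when no admissible peel strictly improves the criterion.
        if best is None or best[0] <= hazard_rate(times, events, idx):
            return box, idx
        _, j, side, kept = best
        if side == "low":
            box[j][0] = min(X[i][j] for i in kept)
        else:
            box[j][1] = max(X[i][j] for i in kept)
        idx = kept


# Toy data (an assumption): rows with index above 70 fail early, the rest fail late.
X = [[i / 100, (i * 37 % 100) / 100] for i in range(100)]
times = [1.0 if i > 70 else 10.0 for i in range(100)]
events = [1] * 100  # no censoring in this toy example
box, idx = peel(X, times, events)
print(box, len(idx))
```

The returned box is a set of per-covariate bounds, which is what makes the subgroup readable as a decision rule.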
Detection of Uniform and Non-Uniform Differential Item Functioning by Item Focussed Trees
Detection of differential item functioning by use of the logistic modelling
approach has a long tradition. One big advantage of the approach is that it can
be used to investigate non-uniform DIF as well as uniform DIF. The classical
approach detects DIF by distinguishing between multiple groups. We
propose an alternative method that is a combination of recursive partitioning
methods (or trees) and logistic regression methodology to detect uniform and
non-uniform DIF in a nonparametric way. The method outputs trees that
visualize, in a simple way, the structure of DIF in an item, showing which
variables interact, and how, to generate DIF. In addition, we
consider a logistic regression method in which DIF can be induced by a vector
of covariates, which may include categorical but also continuous covariates.
The methods are investigated in simulation studies and illustrated by two
applications.
Comment: 32 pages, 13 figures, 7 tables
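For orientation, the distinction between uniform and non-uniform DIF in the logistic modelling approach is usually written as below. The notation is an assumption and not taken from the abstract: $Y_{pi}$ is the response of person $p$ to item $i$, $\theta_p$ an ability or matching score, and $g_p$ a binary group indicator.

```latex
\[
\operatorname{logit} P(Y_{pi} = 1)
  = \beta_{0i} + \beta_{1i}\,\theta_p + \beta_{2i}\,g_p + \beta_{3i}\,\theta_p\,g_p
\]
```

In this form, item $i$ shows uniform DIF when $\beta_{2i} \neq 0$ and $\beta_{3i} = 0$ (a group shift that is constant across ability levels), and non-uniform DIF when $\beta_{3i} \neq 0$ (a group effect that varies with ability).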