Robustness of Random Forest-based gene selection methods
Gene selection is an important part of microarray data analysis because it
provides information that can lead to a better mechanistic understanding of an
investigated phenomenon. At the same time, gene selection is very difficult
because of the noisy nature of microarray data. As a consequence, gene
selection is often performed with machine learning methods. The Random Forest
method is particularly well suited for this purpose. In this work, four
state-of-the-art Random Forest-based feature selection methods were compared in
a gene selection context. The analysis focused on the stability of selection
because, although it is necessary for determining the significance of results,
it is often ignored in similar studies.
The comparison of post-selection accuracy in the validation of Random Forest
classifiers revealed that all investigated methods were equivalent in this
context. However, the methods substantially differed with respect to the number
of selected genes and the stability of selection. Of the analysed methods, the
Boruta algorithm predicted the most genes as potentially important.
The post-selection classifier error rate, which is a frequently used measure,
was found to be a potentially deceptive measure of gene selection quality. When
the number of consistently selected genes was considered, the Boruta algorithm
was clearly the best. Although it was also the most computationally intensive
method, the Boruta algorithm's computational demands could be reduced to levels
comparable to those of other algorithms by replacing the Random Forest
importance with a comparable measure from Random Ferns (a similar but
simplified classifier). Despite their design assumptions, the minimal-optimal
selection methods were found to select a high fraction of false positives.
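The notion of selection stability used above can be made concrete with a small sketch. The abstract does not say which stability measure the authors used, so the mean pairwise Jaccard index below is an assumption, and the function names and gene labels are hypothetical. The idea is to run a selector on several resamples of the data and check how much the selected gene sets overlap.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index of two gene sets (1.0 when both are empty)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def selection_stability(selections):
    """Mean pairwise Jaccard index over repeated selection runs.

    Assumption: this is one common stability measure, not necessarily
    the one used in the paper.
    """
    pairs = list(combinations(selections, 2))
    if not pairs:
        raise ValueError("need at least two selection runs")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def consistently_selected(selections, min_fraction=1.0):
    """Genes selected in at least `min_fraction` of the runs."""
    counts = {}
    for selected in selections:
        for gene in set(selected):
            counts[gene] = counts.get(gene, 0) + 1
    n_runs = len(selections)
    return {g for g, c in counts.items() if c / n_runs >= min_fraction}


# Hypothetical output of one selector on three resamples of the same data.
runs = [{"g1", "g2", "g3"}, {"g1", "g2", "g4"}, {"g1", "g2", "g3"}]
stability = selection_stability(runs)
core_genes = consistently_selected(runs)
```

Two selectors with the same post-selection error rate can differ widely on these quantities, which is why the error rate alone can be a deceptive measure of selection quality.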
Synthetic learner: model-free inference on treatments over time
Understanding the effect of a particular treatment or policy matters in many
areas of interest, ranging from political economics and marketing to health
care and personalized treatment studies. In this paper, we develop a
non-parametric, model-free test for detecting the effects of treatment over
time that extends widely used Synthetic Control tests. The test is built on
counterfactual predictions arising from many learning algorithms. In the
Neyman-Rubin potential outcome framework with possible carry-over effects, we
show that the proposed test is asymptotically consistent for stationary,
beta-mixing processes. We do not assume that the class of learners necessarily
captures the correct model. We also discuss estimates of the average treatment effect,
and we provide regret bounds on the predictive performance. To the best of our
knowledge, this is the first set of results that allow, for example, any Random
Forest to be used for provably valid statistical inference in the Synthetic
Control setting. In experiments, we show that our Synthetic Learner is
substantially more powerful than classical methods based on Synthetic Control
or Difference-in-Differences, especially in the presence of non-linear outcome
models.
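For orientation, the classical Difference-in-Differences baseline mentioned above can be sketched in a few lines, alongside the generic "observed minus counterfactual prediction" gap that prediction-based approaches rely on. This is not the Synthetic Learner test from the paper; the function names and the numbers are hypothetical and serve only as illustration.

```python
from statistics import mean


def did_estimate(treated_pre, treated_post, control_pre, control_post):
    """Classical difference-in-differences estimate of a treatment effect.

    Change in the treated unit minus the change in the control unit,
    which assumes both would have followed parallel trends.
    """
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change


def mean_counterfactual_gap(observed_post, predicted_post):
    """Average gap between observed post-treatment outcomes and
    counterfactual predictions supplied by some learner (assumed given)."""
    if len(observed_post) != len(predicted_post):
        raise ValueError("series must have the same length")
    return mean(o - p for o, p in zip(observed_post, predicted_post))


# Hypothetical outcome series before and after treatment.
effect = did_estimate(
    treated_pre=[10, 11, 12],
    treated_post=[15, 16, 17],
    control_pre=[8, 9, 10],
    control_post=[10, 11, 12],
)
gap = mean_counterfactual_gap([15, 16, 17], [12, 13, 14])
```

The Difference-in-Differences estimate depends on a linear, parallel-trends structure, which is the kind of assumption a model-free test is meant to avoid.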
Particle mesh simulations of the Lyman-alpha forest and the signature of Baryon Acoustic Oscillations in the intergalactic medium
We present a set of ultra-large particle-mesh simulations of the Lyman-alpha
(LyA) forest targeted at understanding the imprint of baryon acoustic
oscillations (BAO) in the intergalactic medium. We use nine dark-matter-only
simulations that can, for
the first time, simultaneously resolve the Jeans scale of the intergalactic gas
while covering the large volumes required to adequately sample the acoustic
feature. Mock absorption spectra generated using the fluctuating
Gunn-Peterson approximation have approximately correct flux probability
density functions (PDFs) and small-scale power spectra. On larger scales there
is clear evidence in the redshift-space correlation function for an acoustic
feature, which matches a linear theory template with constant bias. These
spectra, which we make publicly available, can be used to test pipelines, plan
future experiments and model various physical effects. As an illustration we
discuss the basic properties of the acoustic signal in the forest, the scaling
of errors with noise and source number density, modified statistics to treat
mean flux evolution and misestimation, and non-gravitational sources such as
fluctuations in the photo-ionizing background and temperature fluctuations due
to HeII reionization.
Comment: 11 pages, 10 figures, minor changes to address referee report
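As a rough illustration of the fluctuating Gunn-Peterson approximation named above, the sketch below maps a density contrast along a sightline to transmitted flux using the standard form tau = A (1 + delta)^(2 - 0.7 (gamma - 1)) and F = exp(-tau). The amplitude `A` and the temperature-density slope `gamma` used here are illustrative assumptions, not values from the paper, and the sketch ignores redshift-space distortions and thermal broadening.

```python
import math


def fgpa_flux(delta, amplitude=0.3, gamma=1.6):
    """Transmitted flux from density contrast under the FGPA.

    delta     : iterable of density contrasts (rho / mean_rho - 1), each >= -1
    amplitude : optical depth at mean density (illustrative value)
    gamma     : slope of the temperature-density relation (illustrative value)
    """
    beta = 2.0 - 0.7 * (gamma - 1.0)
    flux = []
    for d in delta:
        if d < -1.0:
            raise ValueError("density contrast cannot be below -1")
        tau = amplitude * (1.0 + d) ** beta  # optical depth
        flux.append(math.exp(-tau))          # transmitted flux fraction
    return flux


# Hypothetical sightline: empty, underdense, mean and overdense cells.
sightline = [-1.0, -0.5, 0.0, 2.0]
flux = fgpa_flux(sightline)
```

Denser cells absorb more, so the flux decreases monotonically with density contrast, and an empty cell transmits all of the light.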
