
    Robustness of Random Forest-based gene selection methods

    Gene selection is an important part of microarray data analysis because it provides information that can lead to a better mechanistic understanding of an investigated phenomenon. At the same time, gene selection is very difficult because of the noisy nature of microarray data. As a consequence, gene selection is often performed with machine learning methods. The Random Forest method is particularly well suited for this purpose. In this work, four state-of-the-art Random Forest-based feature selection methods were compared in a gene selection context. The analysis focused on the stability of selection because, although it is necessary for determining the significance of results, it is often ignored in similar studies. The comparison of post-selection accuracy in the validation of Random Forest classifiers revealed that all investigated methods were equivalent in this context. However, the methods differed substantially with respect to the number of selected genes and the stability of selection. Of the analysed methods, the Boruta algorithm predicted the most genes as potentially important. The post-selection classifier error rate, which is a frequently used measure, was found to be a potentially deceptive measure of gene selection quality. When the number of consistently selected genes was considered, the Boruta algorithm was clearly the best. Although it was also the most computationally intensive method, the Boruta algorithm's computational demands could be reduced to levels comparable to those of other algorithms by replacing the Random Forest importance with a comparable measure from Random Ferns (a similar but simplified classifier). Despite their design assumptions, the minimal optimal selection methods were found to select a high fraction of false positives.
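As an illustration of what "stability of selection" can mean in practice, the sketch below scores repeated selection runs by their mean pairwise Jaccard similarity and lists the genes chosen in every run. This is an assumed, generic measure and not necessarily the one used in the paper; the function names and the example gene sets are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two gene sets (1.0 if both are empty)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def selection_stability(selections):
    """Mean pairwise Jaccard similarity over all pairs of selection runs."""
    pairs = list(combinations(selections, 2))
    if not pairs:
        raise ValueError("need at least two selection runs")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def consistently_selected(selections, min_fraction=1.0):
    """Genes selected in at least `min_fraction` of the runs."""
    counts = {}
    for selection in selections:
        for gene in set(selection):
            counts[gene] = counts.get(gene, 0) + 1
    n_runs = len(selections)
    return {g for g, c in counts.items() if c / n_runs >= min_fraction}


# Hypothetical gene sets returned by one selector on three resamples.
runs = [{"g1", "g2", "g3"}, {"g1", "g2", "g4"}, {"g1", "g2", "g3"}]
stability = selection_stability(runs)
core_genes = consistently_selected(runs)
```

A selector can reach a low post-selection error while returning a different gene set on every resample; a score like this makes that instability visible.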

    Synthetic learner: model-free inference on treatments over time

    Understanding the effect of a particular treatment or policy matters in many areas of interest, ranging from political economics and marketing to health care and personalized treatment studies. In this paper, we develop a non-parametric, model-free test for detecting the effects of treatment over time that extends widely used Synthetic Control tests. The test is built on counterfactual predictions arising from many learning algorithms. In the Neyman-Rubin potential outcome framework with possible carry-over effects, we show that the proposed test is asymptotically consistent for stationary, beta mixing processes. We do not assume that the class of learners necessarily captures the correct model. We also discuss estimates of the average treatment effect, and we provide regret bounds on the predictive performance. To the best of our knowledge, this is the first set of results that allows, for example, any Random Forest to be useful for provably valid statistical inference in the Synthetic Control setting. In experiments, we show that our Synthetic Learner is substantially more powerful than classical methods based on Synthetic Control or Difference-in-Differences, especially in the presence of non-linear outcome models.
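For context, the classical Difference-in-Differences baseline named in the abstract can be sketched in a few lines. This is the textbook two-group, two-period estimator, not the Synthetic Learner test itself; the function name and the toy outcome series are hypothetical.

```python
from statistics import mean


def difference_in_differences(treated_pre, treated_post, control_pre, control_post):
    """Textbook DiD estimate of the average treatment effect.

    Subtracts the control group's pre/post change from the treated
    group's pre/post change, which removes a shared time trend under
    the parallel-trends assumption.
    """
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change


# Hypothetical outcomes: the treated unit rises by 5, the control by 2.
effect = difference_in_differences(
    treated_pre=[10, 11, 12],
    treated_post=[15, 16, 17],
    control_pre=[8, 9, 10],
    control_post=[10, 11, 12],
)
```

The estimator relies on a linear, additive trend shared by both groups, which is the kind of restriction the abstract says its test is designed to avoid.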

    Particle mesh simulations of the Lyman-alpha forest and the signature of Baryon Acoustic Oscillations in the intergalactic medium

    We present a set of ultra-large particle-mesh simulations of the Lyman-alpha (LyA) forest targeted at understanding the imprint of baryon acoustic oscillations (BAO) in the intergalactic medium. We use nine dark-matter-only simulations which can, for the first time, simultaneously resolve the Jeans scale of the intergalactic gas while covering the large volumes required to adequately sample the acoustic feature. Mock absorption spectra, generated using the fluctuating Gunn-Peterson approximation, have approximately correct flux probability density functions (PDFs) and small-scale power spectra. On larger scales there is clear evidence in the redshift-space correlation function for an acoustic feature, which matches a linear theory template with constant bias. These spectra, which we make publicly available, can be used to test pipelines, plan future experiments and model various physical effects. As an illustration we discuss the basic properties of the acoustic signal in the forest, the scaling of errors with noise and source number density, modified statistics to treat mean flux evolution and misestimation, and non-gravitational sources such as fluctuations in the photo-ionizing background and temperature fluctuations due to HeII reionization. Comment: 11 pages, 10 figures, minor changes to address referee repor
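The fluctuating Gunn-Peterson approximation mentioned above maps a density field to transmitted flux. The sketch below uses one commonly quoted form, tau = A (1 + delta)^beta with beta = 2 - 0.7 (gamma - 1) and F = exp(-tau). The amplitude `A`, the temperature-density slope `gamma` and the function name are illustrative assumptions, and the paper's exact prescription (including redshift-space effects) may differ.

```python
import math


def fgpa_flux(overdensity, amplitude=0.3, gamma=1.5):
    """Transmitted flux F = exp(-tau) for each overdensity delta.

    tau = amplitude * (1 + delta) ** beta, where
    beta = 2 - 0.7 * (gamma - 1) follows from a power-law
    temperature-density relation T ~ (1 + delta) ** (gamma - 1).
    The default parameter values are illustrative only.
    """
    beta = 2.0 - 0.7 * (gamma - 1.0)
    flux = []
    for delta in overdensity:
        if delta < -1.0:
            raise ValueError("overdensity must satisfy delta >= -1")
        tau = amplitude * (1.0 + delta) ** beta
        flux.append(math.exp(-tau))
    return flux


# Hypothetical sightline: a void, mean density, and an overdense pixel.
sightline_flux = fgpa_flux([-1.0, 0.0, 3.0])
```

Denser pixels absorb more, so the flux falls monotonically with overdensity; in practice the amplitude is usually tuned so the mock spectra reproduce an observed mean flux.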