3,792 research outputs found

    Robustness of Random Forest-based gene selection methods

    Gene selection is an important part of microarray data analysis because it provides information that can lead to a better mechanistic understanding of an investigated phenomenon. At the same time, gene selection is very difficult because of the noisy nature of microarray data. As a consequence, gene selection is often performed with machine learning methods. The Random Forest method is particularly well suited for this purpose. In this work, four state-of-the-art Random Forest-based feature selection methods were compared in a gene selection context. The analysis focused on the stability of selection because, although it is necessary for determining the significance of results, it is often ignored in similar studies. The comparison of post-selection accuracy in the validation of Random Forest classifiers revealed that all investigated methods were equivalent in this context. However, the methods substantially differed with respect to the number of selected genes and the stability of selection. Of the analysed methods, the Boruta algorithm predicted the most genes as potentially important. The post-selection classifier error rate, which is a frequently used measure, was found to be a potentially deceptive measure of gene selection quality. When the number of consistently selected genes was considered, the Boruta algorithm was clearly the best. Although it was also the most computationally intensive method, the Boruta algorithm's computational demands could be reduced to levels comparable to those of other algorithms by replacing the Random Forest importance with a comparable measure from Random Ferns (a similar but simplified classifier). Despite their design assumptions, the minimal optimal selection methods were found to select a high fraction of false positives.
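
The abstract above judges methods by the stability of selection and by the number of consistently selected genes, but it does not say how either was computed. The following is a minimal sketch of one common way to quantify both, assuming that each resampling run yields a set of selected gene names. The mean pairwise Jaccard index, the function names, and the toy gene sets are all illustrative assumptions, not the study's procedure or data.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index of two sets: |A ∩ B| / |A ∪ B| (1.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 1.0
    return len(a & b) / len(union)


def selection_stability(selections):
    """Mean pairwise Jaccard index over all pairs of selection runs."""
    pairs = list(combinations(selections, 2))
    if not pairs:
        raise ValueError("need at least two selection runs")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def consistently_selected(selections, min_fraction=1.0):
    """Genes selected in at least `min_fraction` of the runs."""
    counts = {}
    for sel in selections:
        for gene in set(sel):
            counts[gene] = counts.get(gene, 0) + 1
    needed = min_fraction * len(selections)
    return {gene for gene, c in counts.items() if c >= needed}


# Toy example: three hypothetical resampling runs (made-up gene names).
runs = [
    {"g1", "g2", "g3"},
    {"g1", "g2", "g4"},
    {"g1", "g2", "g3", "g5"},
]
stability = selection_stability(runs)      # mean of 0.5, 0.75 and 0.4
core_genes = consistently_selected(runs)   # genes present in every run
```

On this toy input the mean pairwise Jaccard index is 0.55 and only `g1` and `g2` are selected in every run. A method can reach a low post-selection error while scoring poorly on both quantities, which is the kind of discrepancy the abstract warns about.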

    Feature Selection with the Boruta Package

    This article describes an R package, Boruta, implementing a novel feature selection algorithm for finding all relevant variables. The algorithm is designed as a wrapper around a Random Forest classification algorithm. It iteratively removes the features which are proved by a statistical test to be less relevant than random probes. The Boruta package provides a convenient interface to the algorithm. A short description of the algorithm and examples of its application are presented.
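
The idea of comparing each feature against random probes can be illustrated with a small sketch. This is not the Boruta package or its API: to keep the example dependency-free, absolute Pearson correlation is swapped in for the Random Forest importance, the number of rounds is fixed, and rejected features are not removed between rounds. The function names, parameters, and toy data are hypothetical.

```python
import math
import random


def pearson_abs(xs, ys):
    """Absolute Pearson correlation; a stand-in for Random Forest importance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy / math.sqrt(sxx * syy))


def binom_tail(hits, n, p=0.5):
    """P(X >= hits) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, k) * p**k * (1 - p) ** (n - k)
               for k in range(hits, n + 1))


def shadow_feature_selection(columns, y, n_rounds=50, alpha=0.01, seed=0):
    """Label each feature by how often it beats the best shuffled probe."""
    rng = random.Random(seed)
    hits = {name: 0 for name in columns}
    for _ in range(n_rounds):
        # Build one shuffled ("shadow") copy per feature and keep the best score.
        shadow_max = 0.0
        for values in columns.values():
            shadow = values[:]
            rng.shuffle(shadow)
            shadow_max = max(shadow_max, pearson_abs(shadow, y))
        # A real feature scores a hit when it beats the best shadow.
        for name, values in columns.items():
            if pearson_abs(values, y) > shadow_max:
                hits[name] += 1
    decisions = {}
    for name, h in hits.items():
        if binom_tail(h, n_rounds) < alpha:               # far more hits than chance
            decisions[name] = "confirmed"
        elif binom_tail(n_rounds - h, n_rounds) < alpha:  # far fewer hits than chance
            decisions[name] = "rejected"
        else:
            decisions[name] = "tentative"
    return decisions


# Toy data: "signal" tracks the label, "noise" is uncorrelated by construction.
y = [0] * 20 + [1] * 20
columns = {
    "signal": [label + 0.01 * (i % 5) for i, label in enumerate(y)],
    "noise": [float(i % 2) for i in range(40)],
}
decisions = shadow_feature_selection(columns, y)
```

Here `signal` beats the best shuffled probe in every round and is confirmed, while `noise` never does and is rejected. With a stochastic importance measure such as Random Forest importance, the real features' scores would also vary between rounds, which is what makes the repeated test meaningful.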

    Randomized lasso links microbial taxa with aquatic functional groups inferred from flow cytometry

    High-nucleic-acid (HNA) and low-nucleic-acid (LNA) bacteria are two operational groups identified by flow cytometry (FCM) in aquatic systems. A number of reports have shown that HNA cell density correlates strongly with heterotrophic production, while LNA cell density does not. However, which taxa are specifically associated with these groups, and by extension, productivity has remained elusive. Here, we addressed this knowledge gap by using a machine learning-based variable selection approach that integrated FCM and 16S rRNA gene sequencing data collected from 14 freshwater lakes spanning a broad range in physicochemical conditions. There was a strong association between bacterial heterotrophic production and HNA absolute cell abundances (R² = 0.65), but not with the more abundant LNA cells. This solidifies findings, mainly from marine systems, that HNA and LNA bacteria could be considered separate functional groups, the former contributing a disproportionately large share of carbon cycling. Taxa selected by the models could predict HNA and LNA absolute cell abundances at all taxonomic levels. Selected operational taxonomic units (OTUs) ranged from low to high relative abundance and were mostly lake system specific (89.5% to 99.2%). A subset of selected OTUs was associated with both LNA and HNA groups (12.5% to 33.3%), suggesting either phenotypic plasticity or within-OTU genetic and physiological heterogeneity. These findings may lead to the identification of system-specific putative ecological indicators for heterotrophic productivity. Generally, our approach allows for the association of OTUs with specific functional groups in diverse ecosystems in order to improve our understanding of (microbial) biodiversity-ecosystem functioning relationships. IMPORTANCE: A major goal in microbial ecology is to understand how microbial community structure influences ecosystem functioning. Various methods to directly associate bacterial taxa to functional groups in the environment are being developed. In this study, we applied machine learning methods to relate taxonomic data obtained from marker gene surveys to functional groups identified by flow cytometry. This allowed us to identify the taxa that are associated with heterotrophic productivity in freshwater lakes and indicated that the key contributors were highly system specific, regularly rare members of the community, and that some could possibly switch between being low and high contributors. Our approach provides a promising framework to identify taxa that contribute to ecosystem functioning and can be further developed to explore microbial contributions beyond heterotrophic production.

    Yield and predicted feed quality of different German cultivars of blue lupins (Lupinus angustifolius)

    In the present work, different cultivars of blue lupins were tested at two sites: the experimental farm of the Institute of Organic Farming (IOF-site) at Trenthorst near Hamburg and the experimental station of the Institute of Plant and Soil Science (ICSS-site) at Braunschweig (conventional farming). The field experiments were conducted from 2003 to 2005 at the IOF-site and in 2006 and 2007 at the ICSS-site. At the IOF-site, yield was 2.95 t ha⁻¹ on average, whereas the mean yield at the ICSS-site was lower, at 2.0 t ha⁻¹. However, a significant interaction between cultivar and year was observed for yield (P<0.001 and P<0.01 for IOF-site and ICSS-site, respectively). At the ICSS-site, the cultivars Vitabor, Boltensia, Borlu and Sonet showed the lowest yield. Yield was similar between the branched and determinate cultivars at both sites, but the crude protein (CP) content was in the majority of cases higher in the branched cultivars. The CP content ranged between 28.2% and 37.8% DM at the IOF-site and between 34.7% and 39.2% DM at the ICSS-site. The newer cultivars Idefix and Probor, which were tested at the ICSS-site in 2006 and 2007, had the highest CP content (39.2% and 38.8% DM). Additionally, the predicted Net Energy for Lactation (NEL) in dairy cows and the predicted Metabolized Energy (ME) for pigs showed interactions between year and cultivar, with the exception of ME at the ICSS-site. Cultivars with a high NEL or ME were Bora, Boruta, Bolivio and Borlu at the IOF-site and Probor, Borlu, Idefix, Boregine and Boltensia at the ICSS-site.

    Development of a multivariable risk model integrating urinary cell DNA methylation and cell-free RNA data for the detection of significant prostate cancer

    Background: Prostate cancer exhibits severe clinical heterogeneity and there is a critical need for clinically implementable tools able to precisely and noninvasively identify patients that can either be safely removed from treatment pathways or those requiring further follow up. Our objectives were to develop a multivariable risk prediction model through the integration of clinical, urine-derived cell-free messenger RNA (cf-RNA) and urine cell DNA methylation data capable of noninvasively detecting significant prostate cancer in biopsy naïve patients. Methods: Post-digital rectal examination urine samples previously analyzed separately for both cellular methylation and cf-RNA expression within the Movember GAP1 urine biomarker cohort were selected for a fully integrated analysis (n = 207). A robust feature selection framework, based on bootstrap resampling and permutation, was utilized to find the optimal combination of clinical and urinary markers in a random forest model, deemed ExoMeth. Out-of-bag predictions from ExoMeth were used for diagnostic evaluation in men with a clinical suspicion of prostate cancer (PSA ≥ 4 ng/mL, adverse digital rectal examination, age, or lower urinary tract symptoms). Results: As ExoMeth risk score (range, 0-1) increased, the likelihood of high-grade disease being detected on biopsy was significantly greater (odds ratio = 2.04 per 0.1 ExoMeth increase, 95% confidence interval [CI]: 1.78-2.35). On an initial TRUS biopsy, ExoMeth accurately predicted the presence of Gleason score ≥3 + 4, area under the receiver-operator characteristic curve (AUC) = 0.89 (95% CI: 0.84-0.93) and was additionally capable of detecting any cancer on biopsy, AUC = 0.91 (95% CI: 0.87-0.95). Application of ExoMeth provided a net benefit over current standards of care and has the potential to reduce unnecessary biopsies by 66% when a risk threshold of 0.25 is accepted. 
Conclusion: Integration of urinary biomarkers across multiple assay methods has greater diagnostic ability than either method in isolation, providing superior predictive ability of biopsy outcomes. ExoMeth represents a more holistic view of urinary biomarkers and has the potential to result in substantial changes to how patients suspected of harboring prostate cancer are diagnosed.
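
The abstract reports a net benefit at a risk threshold of 0.25 without stating the formula. A minimal sketch follows, assuming the standard decision-curve definition, net benefit = TP/n − (FP/n) × pt/(1 − pt), where pt is the risk threshold. The function names and the eight toy patients are illustrative assumptions and are not taken from the study.

```python
def net_benefit(y_true, risk, threshold):
    """Decision-curve net benefit of biopsying everyone with risk >= threshold."""
    if not 0 < threshold < 1:
        raise ValueError("threshold must lie strictly between 0 and 1")
    n = len(y_true)
    tp = sum(1 for y, r in zip(y_true, risk) if r >= threshold and y == 1)
    fp = sum(1 for y, r in zip(y_true, risk) if r >= threshold and y == 0)
    # Each false positive is weighted by the odds of the threshold probability.
    weight = threshold / (1 - threshold)
    return tp / n - (fp / n) * weight


def net_benefit_biopsy_all(y_true, threshold):
    """Net benefit of the reference strategy that biopsies every patient."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)


# Toy cohort: 1 = significant cancer on biopsy, 0 = none (made-up values).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
risk = [0.9, 0.6, 0.3, 0.4, 0.2, 0.1, 0.1, 0.05]

nb_model = net_benefit(y_true, risk, threshold=0.25)
nb_all = net_benefit_biopsy_all(y_true, threshold=0.25)
avoided = sum(1 for r in risk if r < 0.25) / len(risk)  # share of biopsies avoided
```

In this toy cohort the model's net benefit at a threshold of 0.25 is 1/3, against 1/6 for biopsying everyone, and half of the biopsies would be avoided. These numbers only illustrate the arithmetic and say nothing about ExoMeth's actual performance.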