Feature Selection with the Boruta Package
This article describes an R package, Boruta, implementing a novel feature selection algorithm for finding all relevant variables. The algorithm is designed as a wrapper around a Random Forest classification algorithm. It iteratively removes the features which are proved by a statistical test to be less relevant than random probes. The Boruta package provides a convenient interface to the algorithm. A short description of the algorithm and examples of its application are presented.
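The idea of comparing real features against random probes can be illustrated with a minimal Python sketch. It is not the Boruta package's implementation, and it makes several simplifying assumptions: absolute Pearson correlation stands in for Random Forest importance, the probes ("shadow" features) are shuffled copies of the real columns, the number of rounds is fixed, and rejected features are not removed between rounds. The names shadow_select, binom_tail, and pearson_abs, as well as the toy data, are hypothetical.

```python
import math
import random


def pearson_abs(xs, ys):
    """Absolute Pearson correlation; a stand-in for Random Forest importance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy) / math.sqrt(sxx * syy)


def binom_tail(hits, n, p=0.5):
    """P(X >= hits) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, k) * p**k * (1 - p) ** (n - k)
               for k in range(hits, n + 1))


def shadow_select(columns, target, rounds=50, alpha=0.01, seed=0):
    """Label each feature by how often it beats the best shuffled probe."""
    rng = random.Random(seed)
    real_scores = {name: pearson_abs(vals, target)
                   for name, vals in columns.items()}
    hits = {name: 0 for name in columns}
    for _ in range(rounds):
        # Shuffling a column destroys any relation to the target,
        # so its score reflects chance alone.
        shadow_scores = []
        for vals in columns.values():
            shuffled = list(vals)
            rng.shuffle(shuffled)
            shadow_scores.append(pearson_abs(shuffled, target))
        threshold = max(shadow_scores)
        for name, score in real_scores.items():
            if score > threshold:
                hits[name] += 1
    decisions = {}
    for name, h in hits.items():
        if binom_tail(h, rounds) < alpha:
            decisions[name] = "confirmed"   # beats the probes too often for chance
        elif binom_tail(rounds - h, rounds) < alpha:
            decisions[name] = "rejected"    # loses to the probes too often for chance
        else:
            decisions[name] = "tentative"
    return decisions


# Toy data (hypothetical): one informative column, one pure-noise column.
data_rng = random.Random(42)
target = [i % 2 for i in range(200)]
columns = {
    "signal": [t + data_rng.gauss(0, 0.3) for t in target],
    "noise": [data_rng.gauss(0, 1) for _ in target],
}
decisions = shadow_select(columns, target)
print(decisions)
```

The informative column is confirmed because it outscores every shuffled probe in every round. The label given to the noise column depends on the random draw, which is why the sketch keeps a "tentative" outcome.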
Robustness of Random Forest-based gene selection methods
Gene selection is an important part of microarray data analysis because it
provides information that can lead to a better mechanistic understanding of an
investigated phenomenon. At the same time, gene selection is very difficult
because of the noisy nature of microarray data. As a consequence, gene
selection is often performed with machine learning methods. The Random Forest
method is particularly well suited for this purpose. In this work, four
state-of-the-art Random Forest-based feature selection methods were compared in
a gene selection context. The analysis focused on the stability of selection
because, although it is necessary for determining the significance of results,
it is often ignored in similar studies.
The comparison of post-selection accuracy in the validation of Random Forest
classifiers revealed that all investigated methods were equivalent in this
context. However, the methods substantially differed with respect to the number
of selected genes and the stability of selection. Of the analysed methods, the
Boruta algorithm predicted the most genes as potentially important.
The post-selection classifier error rate, which is a frequently used measure,
was found to be a potentially deceptive measure of gene selection quality. When
the number of consistently selected genes was considered, the Boruta algorithm
was clearly the best. Although it was also the most computationally intensive
method, the Boruta algorithm's computational demands could be reduced to levels
comparable to those of other algorithms by replacing the Random Forest
importance with a comparable measure from Random Ferns (a similar but
simplified classifier). Despite their design assumptions, the minimal optimal
selection methods were found to select a high fraction of false positives.
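The abstract does not say how stability of selection was measured. One common way to quantify it, shown here purely as an assumed illustration, is the mean pairwise Jaccard similarity between the gene sets selected on different resamples of the data, together with the set of genes selected in every run. The function names and the gene labels g1 to g4 are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two collections, treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty selections are treated as identical
    return len(a & b) / len(a | b)


def selection_stability(selections):
    """Mean Jaccard similarity over all pairs of selection runs."""
    pairs = list(combinations(selections, 2))
    if not pairs:
        raise ValueError("need at least two selection runs")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def consistently_selected(selections, min_fraction=1.0):
    """Genes selected in at least min_fraction of the runs."""
    counts = {}
    for selected in selections:
        for gene in set(selected):
            counts[gene] = counts.get(gene, 0) + 1
    return {g for g, c in counts.items() if c / len(selections) >= min_fraction}


# Hypothetical selections from three resampling runs.
runs = [{"g1", "g2", "g3"}, {"g1", "g2"}, {"g1", "g2", "g4"}]
stability = selection_stability(runs)
core = consistently_selected(runs)
print(stability, sorted(core))
```

A value near 1 means the method picks almost the same genes on every resample. A low value suggests that much of any single selection is driven by noise, even when post-selection accuracy looks similar.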
Influence of chronic azithromycin treatment on the composition of the oropharyngeal microbial community in patients with severe asthma
Background: This study of the oropharyngeal microbiome complements the previously published AZIthromycin in Severe ASThma (AZISAST) clinical trial, where the use of azithromycin was assessed in subjects with exacerbation-prone severe asthma. Here, we determined the composition of the oropharyngeal microbial community by means of deep sequencing of the amplified 16S rRNA gene in oropharyngeal swabs from patients with exacerbation-prone severe asthma, at baseline and during and after 6 months of treatment with azithromycin or placebo.
Results: A total of 1429 OTUs were observed, of which only 59 were represented by more than 0.02% of the reads. Firmicutes, Bacteroidetes, Fusobacteria, Proteobacteria and Actinobacteria were the most abundant phyla and Streptococcus and Prevotella were the most abundant genera in all the samples. Thirteen species alone accounted for two thirds of the reads, and two species alone, i.e. Prevotella melaninogenica and Streptococcus mitis/pneumoniae, accounted for one fourth of the reads. We found that the overall composition of the oropharyngeal microbiome in patients with severe asthma is comparable to that of the healthy population, confirming the results of previous studies. Long-term treatment (6 months) with azithromycin increased the species Streptococcus salivarius approximately 5-fold and decreased the species Leptotrichia wadei approximately 5-fold. This was confirmed by Boruta feature selection, which also indicated a significant decrease of L. buccalis/L. hofstadtii and of Fusobacterium nucleatum. Four of the 8 treated patients regained their initial microbial composition within one month after cessation of treatment.
Conclusions: Despite the large diversity of the oropharyngeal microbiome, only a few species predominate. We confirm the absence of significant differences between the oropharyngeal microbiomes of people with and without severe asthma. Long-term azithromycin treatment may have long-term effects on the composition of the oropharyngeal microbiome in half of the patients.
Randomized lasso links microbial taxa with aquatic functional groups inferred from flow cytometry
High-nucleic-acid (HNA) and low-nucleic-acid (LNA) bacteria are two operational groups identified by flow cytometry (FCM) in aquatic systems. A number of reports have shown that HNA cell density correlates strongly with heterotrophic production, while LNA cell density does not. However, which taxa are specifically associated with these groups and, by extension, with productivity has remained elusive. Here, we addressed this knowledge gap by using a machine learning-based variable selection approach that integrated FCM and 16S rRNA gene sequencing data collected from 14 freshwater lakes spanning a broad range in physicochemical conditions. There was a strong association between bacterial heterotrophic production and HNA absolute cell abundances (R² = 0.65), but not with the more abundant LNA cells. This solidifies findings, mainly from marine systems, that HNA and LNA bacteria could be considered separate functional groups, the former contributing a disproportionately large share of carbon cycling. Taxa selected by the models could predict HNA and LNA absolute cell abundances at all taxonomic levels. Selected operational taxonomic units (OTUs) ranged from low to high relative abundance and were mostly lake system specific (89.5% to 99.2%). A subset of selected OTUs was associated with both LNA and HNA groups (12.5% to 33.3%), suggesting either phenotypic plasticity or within-OTU genetic and physiological heterogeneity. These findings may lead to the identification of system-specific putative ecological indicators for heterotrophic productivity. Generally, our approach allows for the association of OTUs with specific functional groups in diverse ecosystems in order to improve our understanding of (microbial) biodiversity-ecosystem functioning relationships.
IMPORTANCE A major goal in microbial ecology is to understand how microbial community structure influences ecosystem functioning. Various methods to directly associate bacterial taxa to functional groups in the environment are being developed. In this study, we applied machine learning methods to relate taxonomic data obtained from marker gene surveys to functional groups identified by flow cytometry. This allowed us to identify the taxa that are associated with heterotrophic productivity in freshwater lakes and indicated that the key contributors were highly system specific, regularly rare members of the community, and that some could possibly switch between being low and high contributors. Our approach provides a promising framework to identify taxa that contribute to ecosystem functioning and can be further developed to explore microbial contributions beyond heterotrophic production.
Development of Neurofuzzy Architectures for Electricity Price Forecasting
In the 20th century, many countries liberalized their electricity markets. This liberalization of power markets has led generation companies as well as wholesale buyers to take on greater risk exposure than under the old centralized framework. In this setting, electricity price prediction has become crucial for any market player in their decision‐making process as well as strategic planning. In this study, a prototype asymmetric‐based neuro‐fuzzy network (AGFINN) architecture has been implemented for short‐term electricity price forecasting for the ISO New England market. The AGFINN framework has been designed with two different defuzzification schemes. Fuzzy clustering has been explored as an initial step for defining the fuzzy rules, while an asymmetric Gaussian membership function has been utilized in the fuzzification part of the model. Results for the minimum and maximum electricity prices for ISO New England emphasize the superiority of the proposed model over well‐established learning‐based models.
Prediction of peptide and protein propensity for amyloid formation
Understanding which peptides and proteins have the potential to undergo amyloid formation and what driving forces are responsible for amyloid-like fiber formation and stabilization remains limited. This is mainly because proteins that can undergo structural changes, which lead to amyloid formation, are quite diverse and share no obvious sequence or structural homology, despite the structural similarity found in the fibrils. To address these issues, a novel approach based on recursive feature selection and feed-forward neural networks was undertaken to identify key features highly correlated with the self-assembly problem. This approach allowed the identification of seven physicochemical and biochemical properties of the amino acids highly associated with the self-assembly of peptides and proteins into amyloid-like fibrils (normalized frequency of β-sheet, normalized frequency of β-sheet from LG, weights for β-sheet at the window position of 1, isoelectric point, atom-based hydrophobic moment, helix termination parameter at position j+1 and ΔG° values for peptides extrapolated in 0 M urea). Moreover, these features enabled the development of a new predictor (available at http://cran.r-project.org/web/packages/appnn/index.html) capable of accurately and reliably predicting the amyloidogenic propensity from the polypeptide sequence alone, with a prediction accuracy of 84.9 % against an external validation dataset of sequences with experimental in vitro evidence of amyloid formation.
