66 research outputs found

    Robustness of Random Forest-based gene selection methods

    Gene selection is an important part of microarray data analysis because it provides information that can lead to a better mechanistic understanding of an investigated phenomenon. At the same time, gene selection is very difficult because of the noisy nature of microarray data. As a consequence, gene selection is often performed with machine learning methods. The Random Forest method is particularly well suited for this purpose. In this work, four state-of-the-art Random Forest-based feature selection methods were compared in a gene selection context. The analysis focused on the stability of selection because, although it is necessary for determining the significance of results, it is often ignored in similar studies. The comparison of post-selection accuracy in the validation of Random Forest classifiers revealed that all investigated methods were equivalent in this context. However, the methods differed substantially with respect to the number of selected genes and the stability of selection. Of the analysed methods, the Boruta algorithm predicted the most genes as potentially important. The post-selection classifier error rate, which is a frequently used measure, was found to be a potentially deceptive measure of gene selection quality. When the number of consistently selected genes was considered, the Boruta algorithm was clearly the best. Although it was also the most computationally intensive method, the Boruta algorithm's computational demands could be reduced to levels comparable to those of other algorithms by replacing the Random Forest importance with a comparable measure from Random Ferns (a similar but simplified classifier). Despite their design assumptions, the minimal optimal selection methods were found to select a high fraction of false positives.
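    The abstract's notion of selection stability can be illustrated with a small sketch. It assumes stability is scored as the mean pairwise Jaccard index between the gene sets selected on different resamples, and that "consistently selected" means selected in every run. Both choices, the function names, and the toy gene labels are illustrative assumptions, not the measures used in the paper.

```python
from collections import Counter
from itertools import combinations


def jaccard(a, b):
    """Jaccard index of two gene sets (1.0 when both are empty)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def mean_pairwise_jaccard(selections):
    """Average Jaccard index over all pairs of selection runs."""
    pairs = list(combinations(selections, 2))
    if not pairs:
        raise ValueError("need at least two selection runs")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def consistently_selected(selections, min_fraction=1.0):
    """Genes selected in at least min_fraction of the runs."""
    counts = Counter(gene for run in selections for gene in set(run))
    n_runs = len(selections)
    return {gene for gene, c in counts.items() if c / n_runs >= min_fraction}


# Hypothetical output of one selection method on three resamples.
runs = [
    {"g1", "g2", "g3"},
    {"g1", "g2", "g4"},
    {"g1", "g2"},
]
stability = mean_pairwise_jaccard(runs)
stable_genes = consistently_selected(runs)
```

    Two methods with the same post-selection error rate can produce very different values of `stability`, which is one way to see why the error rate alone can mislead.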

    Improved Decision Tree Performance using Information Gain for Classification of Covid-19 Surveillance Datasets

    One of the most feared infectious diseases today is COVID-19. The disease spreads quickly, and patients do not always show the same symptoms. Efforts to curb the spread of the pandemic have been made throughout the world. Alongside medical approaches, many other methods are used, including computational ones. Data mining is a discipline that can turn data into new knowledge, and one of its main functions is classification. The decision tree is one of the best models for solving classification problems. The number of data attributes can affect the performance of an algorithm. This study uses information gain to select the attribute features of the Covid-19 surveillance dataset. It shows that adding information gain feature selection increases the accuracy of the decision tree algorithm. Previously, the decision tree had an accuracy rate of only 65% for the classification of the Covid-19 surveillance dataset. After pre-processing using information gain, the accuracy rate increased to 75%.
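    As a rough illustration of information gain as a feature-selection score, the sketch below computes it for categorical attributes and ranks them. The attribute names and toy records are hypothetical and are not taken from the Covid-19 surveillance dataset. The study's exact pre-processing is not described in the abstract.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus their entropy conditional on the feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def rank_features(columns, labels):
    """Return (name, gain) pairs sorted from most to least informative."""
    gains = {name: information_gain(vals, labels) for name, vals in columns.items()}
    return sorted(gains.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy records: two attributes and a binary class label.
columns = {
    "fever": ["yes", "yes", "no", "no"],
    "travel": ["yes", "no", "yes", "no"],
}
labels = ["pos", "pos", "neg", "neg"]
ranking = rank_features(columns, labels)
```

    Attributes whose gain falls below a chosen threshold would be dropped before the decision tree is trained. The threshold itself is a tuning choice.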

    Novel chromatin texture features for the classification of Pap smears

    This paper presents a set of novel structural texture features for quantifying nuclear chromatin patterns in cells on a conventional Pap smear. The features are derived from an initial segmentation of the chromatin into blob-like texture primitives. The results of a comprehensive feature selection experiment, including the set of proposed structural texture features and a range of different cytology features drawn from the literature, show that two of the four top-ranking features are structural texture features. They also show that a combination of structural and conventional features yields a classification performance of 0.954±0.019 (AUC±SE) for the discrimination of normal (NILM) and abnormal (LSIL and HSIL) slides. The results of a second classification experiment, using only normal-appearing cells from both normal and abnormal slides, demonstrate that a single structural texture feature measuring chromatin margination yields a classification performance of 0.815±0.019. Overall, the results demonstrate the efficacy of the proposed structural approach and show that it is possible to detect malignancy-associated changes (MACs) in Papanicolaou stain.

    Variable selection for BART: An application to gene regulation

    We consider the task of discovering gene regulatory networks, which are defined as sets of genes and the corresponding transcription factors which regulate their expression levels. This can be viewed as a variable selection problem, potentially with high dimensionality. Variable selection is especially challenging in high-dimensional settings, where it is difficult to detect subtle individual effects and interactions between predictors. Bayesian Additive Regression Trees [BART, Ann. Appl. Stat. 4 (2010) 266-298] provides a novel nonparametric alternative to parametric regression approaches, such as the lasso or stepwise regression, especially when the number of relevant predictors is sparse relative to the total number of available predictors and the fundamental relationships are nonlinear. We develop a principled permutation-based inferential approach for determining when the effect of a selected predictor is likely to be real. Going further, we adapt the BART procedure to incorporate informed prior information about variable importance. We present simulations demonstrating that our method compares favorably to existing parametric and nonparametric procedures in a variety of data settings. To demonstrate the potential of our approach in a biological context, we apply it to the task of inferring the gene regulatory network in yeast (Saccharomyces cerevisiae). We find that our BART-based procedure is best able to recover the subset of covariates with the largest signal compared to other variable selection methods. The methods developed in this work are readily available in the R package bartMachine. Comment: Published at http://dx.doi.org/10.1214/14-AOAS755 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org)
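    The general idea of a permutation-based test for variable importance can be sketched as follows: shuffle the response to break any real association, recompute each predictor's importance, and compare the observed value with this null distribution. This sketch does not use BART or the bartMachine package. It substitutes absolute Pearson correlation as a stand-in importance measure, and all names and the simulated data are illustrative assumptions.

```python
import random
from math import sqrt


def abs_correlation(xs, ys):
    """Stand-in importance measure: absolute Pearson correlation."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy) / sqrt(sxx * syy)


def permutation_p_values(columns, response, importance=abs_correlation,
                         n_permutations=200, seed=0):
    """Per-predictor p-values from a null built by permuting the response."""
    rng = random.Random(seed)
    observed = [importance(col, response) for col in columns]
    exceed = [0] * len(columns)
    shuffled = list(response)
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # breaks any predictor-response link
        for j, col in enumerate(columns):
            if importance(col, shuffled) >= observed[j]:
                exceed[j] += 1
    # The +1 terms keep the estimated p-value strictly positive.
    return [(e + 1) / (n_permutations + 1) for e in exceed]


# Simulated data: one predictor drives the response, one is pure noise.
data_rng = random.Random(1)
signal = [data_rng.gauss(0, 1) for _ in range(40)]
noise = [data_rng.gauss(0, 1) for _ in range(40)]
response = [2.0 * s + data_rng.gauss(0, 0.1) for s in signal]
p_values = permutation_p_values([signal, noise], response)
```

    A predictor would be retained when its p-value falls below a chosen cutoff. With many predictors, the cutoff would also need to account for multiple comparisons.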