66 research outputs found

Robustness of Random Forest-based gene selection methods

Author: Kursa Miron B.
Publication date: 18/10/2013

Gene selection is an important part of microarray data analysis because it provides information that can lead to a better mechanistic understanding of an investigated phenomenon. At the same time, gene selection is very difficult because of the noisy nature of microarray data. As a consequence, gene selection is often performed with machine learning methods. The Random Forest method is particularly well suited for this purpose. In this work, four state-of-the-art Random Forest-based feature selection methods were compared in a gene selection context. The analysis focused on the stability of selection because, although it is necessary for determining the significance of results, it is often ignored in similar studies. The comparison of post-selection accuracy in the validation of Random Forest classifiers revealed that all investigated methods were equivalent in this context. However, the methods differed substantially with respect to the number of selected genes and the stability of selection. Of the analysed methods, the Boruta algorithm predicted the most genes as potentially important. The post-selection classifier error rate, which is a frequently used measure, was found to be a potentially deceptive measure of gene selection quality. When the number of consistently selected genes was considered, the Boruta algorithm was clearly the best. Although it was also the most computationally intensive method, the Boruta algorithm's computational demands could be reduced to levels comparable to those of other algorithms by replacing the Random Forest importance with a comparable measure from Random Ferns (a similar but simplified classifier). Despite their design assumptions, the minimal optimal selection methods were found to select a high fraction of false positives.
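
The abstract above argues that stability of selection matters as much as post-selection accuracy. As a rough illustration only, the sketch below scores stability as the mean pairwise Jaccard similarity between the gene sets selected in repeated runs, and lists the genes selected in every run. The Jaccard measure, the function names, and the toy gene sets are assumptions made for this example; the abstract does not say which stability measure the paper uses.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two collections, treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty selections are identical
    return len(a & b) / len(a | b)


def mean_pairwise_jaccard(selections):
    """Average Jaccard similarity over all pairs of selection runs."""
    pairs = list(combinations(selections, 2))
    if not pairs:
        raise ValueError("need at least two selection runs")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def consistently_selected(selections, min_fraction=1.0):
    """Genes selected in at least `min_fraction` of the runs."""
    counts = {}
    for selection in selections:
        for gene in set(selection):
            counts[gene] = counts.get(gene, 0) + 1
    n_runs = len(selections)
    return {g for g, c in counts.items() if c / n_runs >= min_fraction}


# Hypothetical output of one selection method on three resampled datasets.
runs = [{"g1", "g2", "g3"}, {"g1", "g2", "g4"}, {"g1", "g2"}]

stability = mean_pairwise_jaccard(runs)
stable_genes = consistently_selected(runs)
print(f"stability = {stability:.3f}, stable genes = {sorted(stable_genes)}")
```

A method that returns many genes but a different set on each resample would score low here even if its classifier error rate looked good, which is the kind of discrepancy the abstract warns about.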

arXiv.org e-Print Archive

Springer - Publisher Connector

Improved Decision Tree Performance using Information Gain for Classification of Covid-19 Surveillance Datasets

Author: Al Karomi M. Adib
Ivandari Ivandari
Maulana Much. Rifqi
Publication venue: 'Politeknik Negeri Semarang'
Publication date: 18/04/2022

One of the most feared infectious diseases today is COVID-19. The disease spreads quickly, and patients do not always show the same symptoms. Efforts to contain the pandemic have been made throughout the world. Besides medical approaches, many other methods have been applied, including computational ones. Data mining is a discipline that turns data into new knowledge, and classification is one of its main functions. The decision tree is one of the best models for solving classification problems. The number of data attributes can affect the performance of an algorithm. This study uses information gain to select attributes of the Covid-19 surveillance dataset. The results show that adding information gain feature selection increases the accuracy of the decision tree algorithm. Without feature selection, the decision tree reached an accuracy of only 65% on the Covid-19 surveillance dataset. After pre-processing with information gain, the accuracy increased to 75%.
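
For readers unfamiliar with the criterion named in the abstract above, the following is a minimal sketch of how information gain can rank categorical attributes: the gain of an attribute is the entropy of the class labels minus the weighted entropy that remains after splitting on that attribute. The attribute names and the four-row toy dataset are hypothetical and are not taken from the Covid-19 surveillance dataset or from the paper's pipeline.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus the weighted entropy after the split."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical toy data: four patients, two categorical attributes.
labels = ["pos", "pos", "neg", "neg"]
features = {
    "fever": ["yes", "yes", "no", "no"],   # separates the classes perfectly
    "cough": ["yes", "no", "yes", "no"],   # carries no class information
}

gains = {name: information_gain(values, labels)
         for name, values in features.items()}
ranked = sorted(gains, key=gains.get, reverse=True)
print(ranked, gains)
```

In a feature selection step of this kind, attributes with a gain below some chosen threshold would be dropped before the decision tree is trained; the threshold itself is a tuning choice that the abstract does not specify.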

Portal Jurnal Politeknik Negeri Semarang

Novel chromatin texture features for the classification of Pap smears

Author: Bengtsson Ewert
Ehteshami Bejnordi Babak
Malm Patrik
Mehnert Andrew
Moshavegh Ramin
Sujathan K
Publication date: 01/01/2013

This paper presents a set of novel structural texture features for quantifying nuclear chromatin patterns in cells on a conventional Pap smear. The features are derived from an initial segmentation of the chromatin into blob-like texture primitives. The results of a comprehensive feature selection experiment, including the set of proposed structural texture features and a range of different cytology features drawn from the literature, show that two of the four top-ranking features are structural texture features. They also show that a combination of structural and conventional features yields a classification performance of 0.954±0.019 (AUC±SE) for the discrimination of normal (NILM) and abnormal (LSIL and HSIL) slides. The results of a second classification experiment, using only normal-appearing cells from both normal and abnormal slides, demonstrate that a single structural texture feature measuring chromatin margination yields a classification performance of 0.815±0.019. Overall, the results demonstrate the efficacy of the proposed structural approach and that it is possible to detect malignancy-associated changes (MACs) in Papanicolaou stain

Chalmers Research

Chalmers Publication Library

Variable selection for BART: An application to gene regulation

Author: Bleich Justin
George Edward I.
Jensen Shane T.
Kapelner Adam
Publication venue: 'Institute of Mathematical Statistics'
Publication date: 01/01/2014

We consider the task of discovering gene regulatory networks, which are defined as sets of genes and the corresponding transcription factors which regulate their expression levels. This can be viewed as a variable selection problem, potentially with high dimensionality. Variable selection is especially challenging in high-dimensional settings, where it is difficult to detect subtle individual effects and interactions between predictors. Bayesian Additive Regression Trees [BART, Ann. Appl. Stat. 4 (2010) 266-298] provides a novel nonparametric alternative to parametric regression approaches, such as the lasso or stepwise regression, especially when the number of relevant predictors is sparse relative to the total number of available predictors and the fundamental relationships are nonlinear. We develop a principled permutation-based inferential approach for determining when the effect of a selected predictor is likely to be real. Going further, we adapt the BART procedure to incorporate informed prior information about variable importance. We present simulations demonstrating that our method compares favorably to existing parametric and nonparametric procedures in a variety of data settings. To demonstrate the potential of our approach in a biological context, we apply it to the task of inferring the gene regulatory network in yeast (Saccharomyces cerevisiae). We find that our BART-based procedure is best able to recover the subset of covariates with the largest signal compared to other variable selection methods. The methods developed in this work are readily available in the R package bartMachine. Comment: Published in the Annals of Applied Statistics (http://www.imstat.org/aoas/) at http://dx.doi.org/10.1214/14-AOAS755 by the Institute of Mathematical Statistics (http://www.imstat.org
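
The general idea behind a permutation-based test of variable importance, as mentioned in the abstract above, can be sketched as follows: compute an importance score for a predictor, then recompute it many times after shuffling the response to build a null distribution, and check how often the null scores reach the observed one. This sketch is not the bartMachine procedure. It swaps in the absolute Pearson correlation as a stand-in importance score (BART itself is not fitted here), and the function names and toy data are assumptions made for illustration.

```python
import random


def abs_correlation(x, y):
    """Absolute Pearson correlation, used here as a stand-in importance score."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant variable carries no signal
    return abs(sxy) / (sxx * syy) ** 0.5


def permutation_p_value(x, y, n_permutations=1000, seed=0):
    """Share of response permutations whose score reaches the observed score."""
    rng = random.Random(seed)
    observed = abs_correlation(x, y)
    y_permuted = list(y)
    exceed = 0
    for _ in range(n_permutations):
        rng.shuffle(y_permuted)  # breaks any real link between x and y
        if abs_correlation(x, y_permuted) >= observed:
            exceed += 1
    # The +1 terms count the observed arrangement itself, so p is never zero.
    return (exceed + 1) / (n_permutations + 1)


# Hypothetical toy data: the response depends linearly on x_signal only.
x_signal = [float(i) for i in range(20)]
x_noise = [(i - 9.5) ** 2 for i in range(20)]  # uncorrelated with y by symmetry
y = [2.0 * v for v in x_signal]

p_signal = permutation_p_value(x_signal, y)
p_noise = permutation_p_value(x_noise, y)
print(f"p_signal = {p_signal:.4f}, p_noise = {p_noise:.4f}")
```

A predictor would be retained when its p-value falls below a chosen significance level. Permuting the response rather than a single predictor keeps the relationships among predictors intact while destroying their link to the outcome.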

arXiv.org e-Print Archive

CiteSeerX

ScholarlyCommons@Penn

Store Attribute Weighting for Clustering in Fast Fashion

Author: Ana Cristina Neto Andrade
Publication date: 08/07/2019

Repositório Aberto da Universidade do Porto