2 research outputs found

    Application of an efficient Bayesian discretization method to biomedical data

    Background: Several data mining methods require data that are discrete, and other methods often perform better with discrete data. We introduce an efficient Bayesian discretization (EBD) method for optimal discretization of variables that runs efficiently on high-dimensional biomedical datasets. The EBD method consists of two components: a Bayesian score to evaluate discretizations and a dynamic programming search procedure to efficiently search the space of possible discretizations. We compared the performance of EBD to Fayyad and Irani's (FI) discretization method, which is commonly used for discretization.

    Results: On 24 biomedical datasets obtained from high-throughput transcriptomic and proteomic studies, the classification performance of the C4.5 classifier and the naïve Bayes classifier was statistically significantly better when the predictor variables were discretized using EBD rather than FI. EBD was statistically significantly more stable to the variability of the datasets than FI. However, EBD was less robust than FI, though not statistically significantly so, and produced slightly more complex discretizations.

    Conclusions: On a range of biomedical datasets, the Bayesian discretization method (EBD) yielded better classification performance and stability, but was less robust, than the widely used FI discretization method. EBD is easy to implement, permits the incorporation of prior knowledge and belief, and is sufficiently fast for application to high-dimensional data.
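    The abstract describes EBD only at a high level: a Bayesian score over candidate intervals combined with a dynamic programming search over cut points. As a rough, hypothetical sketch of that general pattern, and not the paper's actual algorithm or scoring function, one could pair a Dirichlet-multinomial interval score with a standard optimal-partition dynamic program:

```python
import math

def interval_score(class_counts, alpha=1.0):
    """Log marginal likelihood of one interval under a Dirichlet-multinomial
    model. This is a stand-in Bayesian interval score for illustration only,
    not the exact EBD score defined in the paper."""
    n = sum(class_counts)
    k = len(class_counts)
    score = math.lgamma(k * alpha) - math.lgamma(k * alpha + n)
    for c in class_counts:
        score += math.lgamma(alpha + c) - math.lgamma(alpha)
    return score

def dp_discretize(values, labels, classes):
    """Choose cut points for one numeric variable by dynamic programming:
    best[i] is the best total score of any discretization of the first i
    samples (after sorting by value)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    v = [values[i] for i in order]
    y = [labels[i] for i in order]
    n = len(v)
    class_index = {c: j for j, c in enumerate(classes)}

    best = [0.0] + [-math.inf] * n   # best[i]: score of the first i samples
    back = [0] * (n + 1)             # back[i]: start index of the last interval
    for i in range(1, n + 1):
        counts = [0] * len(classes)
        # grow the candidate last interval backwards from position i
        for j in range(i, 0, -1):
            counts[class_index[y[j - 1]]] += 1
            s = best[j - 1] + interval_score(counts)
            if s > best[i]:
                best[i], back[i] = s, j - 1
    # recover cut points by walking back through the chosen interval starts
    cuts, i = [], n
    while i > 0:
        j = back[i]
        if j > 0:
            cuts.append((v[j - 1] + v[j]) / 2.0)
        i = j
    return sorted(cuts)
```

    For example, dp_discretize([0.1, 0.3, 0.8, 0.9], [0, 0, 1, 1], classes=[0, 1]) would return a single threshold near 0.55. The actual EBD score, its prior structure, and its complexity guarantees are given in the paper itself; this sketch only shows how a per-interval score and a dynamic program fit together.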

    Cost-sensitive Discretization of Numeric Attributes

    Many algorithms in decision tree learning have not been designed to handle numerically valued attributes well, so the continuous feature space has to be discretized. In this article we introduce the concept of cost-sensitive discretization as a preprocessing step to the induction of a classifier and as an elaboration of error-based discretization, yielding an optimal multi-interval splitting for each numeric attribute. A transparent description of the method and the steps involved in cost-sensitive discretization is given. We also assess its performance against two other well-known methods, entropy-based discretization and pure error-based discretization, on an authentic financial dataset. From an algorithmic perspective, we show that an important deficiency of error-based discretization methods can be solved by introducing costs. From the application perspective, we discovered that using a discretization method is …
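    Only as an illustrative sketch of the general idea (the paper's multi-interval, cost-sensitive procedure is not reproduced here), a single cost-sensitive cut point could be chosen by charging each candidate interval the cost of its cheapest class label, given an assumed misclassification-cost matrix cost[predicted][actual] and integer-coded labels:

```python
def interval_cost(class_counts, cost):
    """Cost of labelling one interval with its cheapest class, given a
    cost[predicted][actual] matrix (illustrative, not the paper's scheme)."""
    return min(
        sum(cost[pred][actual] * n for actual, n in enumerate(class_counts))
        for pred in range(len(class_counts))
    )

def cost_sensitive_split(values, labels, n_classes, cost):
    """Pick the single cut point that minimises total misclassification cost;
    a multi-interval version would apply the same idea recursively or via
    dynamic programming. Labels are assumed to be 0..n_classes-1."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    v = [values[i] for i in order]
    y = [labels[i] for i in order]
    n = len(v)
    total = [0] * n_classes
    for c in y:
        total[c] += 1
    left = [0] * n_classes
    best_cut, best_cost = None, interval_cost(total, cost)  # cost of no split
    for i in range(1, n):
        left[y[i - 1]] += 1
        if v[i - 1] == v[i]:
            continue  # no valid threshold between identical values
        right = [t - l for t, l in zip(total, left)]
        c = interval_cost(left, cost) + interval_cost(right, cost)
        if c < best_cost:
            best_cost, best_cut = c, (v[i - 1] + v[i]) / 2.0
    return best_cut, best_cost
```

    With a symmetric 0/1 cost matrix this reduces to plain error-based splitting; asymmetric costs shift the chosen threshold toward the cheaper kind of mistake, which is the deficiency of pure error-based discretization that the abstract refers to.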