249 research outputs found
Global Entropy Based Greedy Algorithm for discretization
Discretization is a crucial step, not only for summarizing continuous attributes but also for improving performance in classifiers that require discrete values as input. In this thesis, I propose a supervised discretization method, the Global Entropy Based Greedy algorithm, which is based on Information Entropy Minimization. Experimental results show that the proposed method outperforms state-of-the-art methods on well-known benchmark datasets. To further improve the proposed method, a new stop criterion based on the rate of change of entropy was also explored. The experimental analysis suggests that a threshold based on the decreasing rate of entropy can be more effective than a constant number of intervals in classifiers such as C5.0.
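As a rough illustration of the information-entropy-minimization step such greedy methods share, here is a minimal Python sketch (illustrative helper names; not the thesis's actual algorithm):

    import numpy as np

    def entropy(labels):
        # Shannon entropy of a class-label array.
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))

    def best_cut(values, labels):
        # Greedily pick the cut point that minimizes the weighted class entropy.
        order = np.argsort(values)
        values, labels = values[order], labels[order]
        n = len(values)
        best_point, best_h = None, np.inf
        for i in range(1, n):
            if values[i] == values[i - 1]:
                continue  # candidate cuts lie between distinct values only
            h = (i / n) * entropy(labels[:i]) + ((n - i) / n) * entropy(labels[i:])
            if h < best_h:
                best_point, best_h = (values[i] + values[i - 1]) / 2, h
        return best_point, best_h

Applied recursively to the resulting sub-intervals, this yields a multi-interval discretization; the stop criterion then decides when to quit splitting.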
Merging of Numerical Intervals in Entropy-Based Discretization
As previous research indicates, a multiple-scanning methodology for the discretization of numerical datasets, based on entropy, is very competitive. Discretization is the process of converting the numerical values of data records into discrete values associated with numerical intervals defined over the domains of the data records. In multiple-scanning discretization, the last step is the merging of neighboring intervals in the discretized datasets, a kind of postprocessing. Our objective is to check how the error rate, measured by tenfold cross-validation within the C4.5 system, is affected by such merging. We conducted experiments on 17 numerical datasets, using the same setup of multiple scanning, with three different options for merging: no merging at all, merging based on the smallest entropy, and merging based on the biggest entropy. By the Friedman rank sum test (5% significance level), we concluded that the differences between all three approaches are statistically insignificant; there is no universally best approach. We then repeated all experiments 30 times, recording averages and standard deviations. The test of the difference between averages shows that, for a comparison of no merging with merging based on the smallest entropy, there are statistically highly significant differences (1% significance level). In some cases the smaller error rate is associated with no merging, in others with merging based on the smallest entropy. A comparison of no merging with merging based on the biggest entropy showed similar results. Our final conclusion is that there are highly significant differences between no merging and merging, depending on the dataset, and that the best approach should be chosen by trying all three.
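To make the merging options concrete, a minimal sketch of entropy-guided merging of neighboring intervals (assumed representation: one label array per interval; this is not the authors' implementation):

    import numpy as np

    def interval_entropy(labels):
        # Shannon entropy of the class labels falling into one interval.
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))

    def merge_step(intervals, smallest=True):
        # Merge the adjacent pair whose union has the smallest (or biggest) entropy.
        scores = [interval_entropy(np.concatenate([intervals[i], intervals[i + 1]]))
                  for i in range(len(intervals) - 1)]
        i = int(np.argmin(scores)) if smallest else int(np.argmax(scores))
        merged = np.concatenate([intervals[i], intervals[i + 1]])
        return intervals[:i] + [merged] + intervals[i + 2:]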
A Max-relevance-min-divergence Criterion for Data Discretization with Applications on Naive Bayes
In many classification models, data is discretized to better estimate its distribution. Existing discretization methods often aim to maximize the discriminant power of the discretized data, while overlooking the fact that the primary goal of discretization in classification is to improve generalization performance. As a result, the data tend to be over-split into many small bins, since undiscretized data retain maximal discriminant information. We therefore propose a Max-Dependency-Min-Divergence (MDmD) criterion that maximizes both the discriminant information and the generalization ability of the discretized data. More specifically, the Max-Dependency criterion maximizes the statistical dependency between the discretized data and the classification variable, while the Min-Divergence criterion explicitly minimizes the JS-divergence between the training data and the validation data for a given discretization scheme. The proposed MDmD criterion is technically appealing, but it is difficult to reliably estimate the high-order joint distributions of attributes and the classification variable. We hence further propose a more practical solution, the Max-Relevance-Min-Divergence (MRmD) discretization scheme, in which each attribute is discretized separately by simultaneously maximizing the discriminant information and the generalization ability of the discretized data. The proposed MRmD is compared with state-of-the-art discretization algorithms under the naive Bayes classification framework on 45 machine-learning benchmark datasets, and it significantly outperforms all the compared methods on most of them.
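The Min-Divergence half of the criterion is easy to picture: for a candidate binning of one attribute, compare the bin-occupancy histograms of the training and validation data. A sketch using the standard JS-divergence (the paper's exact estimation procedure may differ):

    import numpy as np

    def js_divergence(p, q, eps=1e-12):
        # Jensen-Shannon divergence between two discrete distributions.
        p, q = p + eps, q + eps
        p, q = p / p.sum(), q / q.sum()
        m = 0.5 * (p + q)
        kl = lambda a, b: np.sum(a * np.log2(a / b))
        return 0.5 * kl(p, m) + 0.5 * kl(q, m)

    def divergence_of_binning(train_vals, valid_vals, edges):
        # Histogram both samples with the same bin edges, then compare.
        p, _ = np.histogram(train_vals, bins=edges)
        q, _ = np.histogram(valid_vals, bins=edges)
        return js_divergence(p.astype(float), q.astype(float))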
A Comparison of Four Approaches to Discretization Based on Entropy
We compare four discretization methods, all based on entropy: the original C4.5 approach to discretization; two globalized methods, known as equal interval width and equal frequency per interval; and a relatively new discretization method called multiple scanning, all evaluated using the C4.5 decision tree generation system. The main objective of our research is to compare the quality of these four methods using two criteria: the error rate evaluated by ten-fold cross-validation and the size of the decision tree generated by C4.5. Our results show that multiple scanning is the best discretization method in terms of error rate, and that decision trees generated from datasets discretized by multiple scanning are simpler than decision trees generated directly by C4.5 or from datasets discretized by either globalized method.
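For reference, the two globalized baselines have one-line definitions (standard formulations, not tied to the paper's code):

    import numpy as np

    def equal_width_edges(values, k):
        # k intervals of equal width spanning the attribute's range.
        return np.linspace(values.min(), values.max(), k + 1)

    def equal_frequency_edges(values, k):
        # k intervals each holding roughly the same number of records.
        return np.quantile(values, np.linspace(0.0, 1.0, k + 1))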
Scalable CAIM Discretization on Multiple GPUs Using Concurrent Kernels
CAIM (Class-Attribute Interdependence Maximization) is one of the state-of-the-art algorithms for discretizing data for which classes are known. However, it may take a long time when run on high-dimensional, large-scale data with a large number of attributes and/or instances. This paper presents a solution to this problem by introducing a GPU-based implementation of the CAIM algorithm that significantly speeds up the discretization process on big, complex data sets. The GPU-based implementation is scalable to multiple GPU devices and exploits the concurrent kernel execution capabilities of modern GPUs. The GPU-based CAIM model is evaluated and compared with the original CAIM using single- and multi-threaded parallel configurations on 40 data sets with different characteristics. The results show great speedup, up to 139 times faster using 4 GPUs, which makes discretization of big data efficient and manageable. For example, the discretization time of one big data set is reduced from 2 hours to less than 2 minutes.
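The quantity being parallelized is the standard CAIM criterion over the class-interval quanta matrix (Kurgan and Cios's definition; the GPU kernels themselves are beyond the scope of a sketch):

    import numpy as np

    def caim(quanta):
        # quanta: 2-D array, rows = classes, columns = intervals.
        # CAIM = (1/n) * sum over intervals r of max_r^2 / M_r, where max_r is
        # the largest class count in interval r and M_r its total count.
        max_r = quanta.max(axis=0)
        M_r = quanta.sum(axis=0)
        n = quanta.shape[1]
        return np.sum(max_r.astype(float) ** 2 / M_r) / n

The algorithm greedily adds the boundary that most increases this score, which is why evaluating many candidate boundaries in parallel maps well to GPUs.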
A Global Discretization Approach to Handle Numerical Attributes as Preprocessing
Discretization is a common technique for handling numerical attributes in data mining; it divides continuous values into several intervals by defining multiple thresholds. Decision tree learning algorithms, such as C4.5 and random forests, deal with numerical attributes by applying discretization and transforming them into nominal attributes based on an impurity-based criterion, such as information gain or Gini gain. However, a considerable number of distinct values inevitably end up in the same interval after discretization, so information carried by the original continuous values is lost. In this thesis, we propose a global discretization method that keeps the information within the original numerical attributes by expanding each of them into multiple nominal attributes, one for each candidate cut-point value. The discretized data set, which includes only nominal attributes, evolves from the original data set. We analyzed the problem by applying two decision tree learning algorithms (C4.5 and random forests) to each of twelve pairs of data sets (original and discretized) and evaluating the performance (prediction accuracy) of the resulting classification models in the Weka Experimenter. This was followed by two separate Wilcoxon tests (one per learning algorithm) to decide whether there is a statistically significant difference between the paired data sets. Results of both tests indicate no clear difference in performance between the discretized data sets and the original ones, although in some cases the discretized models of both classifiers slightly outperform their paired original models.
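The expansion the thesis describes can be sketched directly: one binary nominal attribute per candidate cut point (illustrative encoding; the actual Weka preprocessing may differ):

    import numpy as np

    def expand_attribute(values):
        # Candidate cut points: midpoints between consecutive distinct values.
        distinct = np.unique(values)
        cuts = (distinct[:-1] + distinct[1:]) / 2
        if len(cuts) == 0:
            return np.empty((len(values), 0), dtype=int), cuts
        # Column j encodes the nominal answer to "is the value above cut j?".
        expanded = np.stack([(values > c).astype(int) for c in cuts], axis=1)
        return expanded, cuts

Because every candidate cut point gets its own column, no ordering information from the original attribute is discarded, at the cost of a much wider data set.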
ur-CAIM: Improved CAIM Discretization for Unbalanced and Balanced Data
Supervised discretization is one of the basic data preprocessing techniques used in data mining. CAIM (Class-Attribute Interdependence Maximization) is a discretization algorithm for data for which the classes are known. However, newly arising challenges, such as the presence of unbalanced data sets, call for new algorithms capable of handling them in addition to balanced data. This paper presents a new discretization algorithm named ur-CAIM, which improves on the CAIM algorithm in three important ways. First, it generates more flexible discretization schemes while producing a small number of intervals. Second, the quality of the intervals is improved based on the data class distribution, which leads to better classification performance on balanced and, especially, unbalanced data. Third, the runtime of the algorithm is lower than CAIM's. The algorithm is parameter-free and self-adapts to the problem complexity and the data class distribution. ur-CAIM was compared with 9 well-known discretization methods on 28 balanced and 70 unbalanced data sets. The results obtained were contrasted through non-parametric statistical tests, which show that our proposal outperforms CAIM and many of the other methods on both types of data, but especially on unbalanced data, which is its significant advantage.
Reduction of Irrelevant Features in Oceanic Satellite Images by means of Bayesian Networks
This paper describes the use of Bayesian networks for the reduction of irrelevant features [1,2] in the recognition of oceanic structures in satellite images. Bayesian networks are used to validate both the symbolic knowledge (provided by neuro-symbolic nets, or HLKPs, High Level Knowledge Processors) and the numeric knowledge, which provides an automatic interpretation of images. The main objective of this work is the construction of an automatic recognition system for processing AVHRR (Advanced Very High Resolution Radiometer) images from NOAA (National Oceanic and Atmospheric Administration) satellites to detect and locate oceanic phenomena of interest such as upwellings, eddies, and island wakes. With this aim, the paper reports on a methodology for knowledge selection and validation. In knowledge selection, filter measures are used; for knowledge validation, Bayesian networks (Naïve Bayes, TAN, and KDB) are evaluated.
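As an example of the filter-measure step in knowledge selection, ranking features by mutual information with the class is a typical choice (scikit-learn used for brevity; the paper's exact filter measures are not specified here):

    import numpy as np
    from sklearn.feature_selection import mutual_info_classif

    def rank_features(X, y, keep):
        # Score each feature against the class, keep the `keep` strongest.
        scores = mutual_info_classif(X, y)
        return np.argsort(scores)[::-1][:keep]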
Event-Level Pattern Discovery for Large Mixed-Mode Database
For a large mixed-mode database, discretizing its continuous data into interval events is still a practical approach. If there are no class labels for the database, we have no helpful correlation reference for such a task. In fact, a large relational database may contain various correlated attribute clusters. To handle these kinds of problems, we first have to partition the database into sub-groups of attributes that contain some sort of correlated relationship. This process is known as attribute clustering, and it is an important way to reduce the search space when looking for or discovering patterns. Furthermore, once correlated attribute groups are obtained, we can find within each of them the most representative attribute, the one with the strongest interdependence with all other attributes in that cluster, and use it as a candidate class label for that group. That sets up a correlation attribute to drive the discretization of the other continuous data in each attribute cluster, as sketched below. This thesis provides the theoretical framework, the methodology, and the computational system to achieve that goal.
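The "most representative attribute" step can be sketched as picking, within each cluster, the attribute with the largest total interdependence with the rest (here, pairwise mutual information over already-discretized codes; illustrative, not the thesis's exact measure):

    import numpy as np
    from sklearn.metrics import mutual_info_score

    def representative(cluster_cols):
        # cluster_cols: list of 1-D discrete arrays, one per attribute.
        k = len(cluster_cols)
        totals = [sum(mutual_info_score(cluster_cols[i], cluster_cols[j])
                      for j in range(k) if j != i)
                  for i in range(k)]
        return int(np.argmax(totals))  # index of the cluster's driving attribute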