
    A Global Discretization Approach to Handle Numerical Attributes as Preprocessing

    Discretization is a common technique for handling numerical attributes in data mining: it divides continuous values into several intervals by defining multiple thresholds. Decision tree learning algorithms such as C4.5 and random forests deal with numerical attributes by applying a discretization technique and transforming them into nominal attributes based on an impurity-based criterion such as information gain or Gini gain. However, many distinct values inevitably end up in the same interval after discretization, so information carried by the original continuous values is lost. In this thesis, we propose a global discretization method that preserves the information in the original numerical attributes by expanding each of them into multiple nominal attributes, one per candidate cut-point value. The resulting discretized data set, derived from the original one, contains only nominal attributes. We evaluated the approach by applying two decision tree learning algorithms (C4.5 and random forests) to each of twelve pairs of data sets (original and discretized) and comparing the prediction accuracy of the resulting classification models in the Weka Experimenter. Two separate Wilcoxon tests (one per learning algorithm) were then run to decide whether the differences between the paired data sets are statistically significant. Both tests indicate no clear difference in performance between the discretized data sets and the original ones, although in some cases the discretized models of both classifiers slightly outperform their paired original models.
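
    As a rough illustration of the expansion idea, the Python sketch below (function and column names are ours, not the thesis's) turns one numeric column into a set of binary nominal columns, one per candidate cut-point taken as the midpoint between consecutive distinct sorted values, as C4.5 does; the thesis's actual procedure may differ in detail.

```python
import pandas as pd

def expand_numeric_attribute(df: pd.DataFrame, col: str) -> pd.DataFrame:
    """Replace one numeric column with binary nominal columns, one per
    candidate cut-point (midpoints between consecutive distinct values)."""
    values = sorted(df[col].dropna().unique())
    cut_points = [(a + b) / 2 for a, b in zip(values, values[1:])]
    out = df.drop(columns=[col])
    for cp in cut_points:
        # Each new nominal attribute answers: is the value <= this cut-point?
        out[f"{col}<={cp:g}"] = (df[col] <= cp).map({True: "yes", False: "no"})
    return out

# Example: distinct values 1, 2, 4 yield cut-points 1.5 and 3.
data = pd.DataFrame({"x": [1, 2, 4, 2], "class": ["a", "a", "b", "b"]})
print(expand_numeric_attribute(data, "x"))
```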

    Learning decision trees in continuous space

    Two problems of the ID3 and C4.5 decision tree building methods are discussed and solutions are suggested for them. First, both methods compare the applicability of possible tests with a gain-type criterion derived from the entropy function. We propose a new measure in place of the entropy function, derived from a measure of fuzziness using a monotone fuzzy operator. It is more natural and much simpler to compute in the case of concept learning (when elements belong to only two classes: positive and negative). Second, the well-known extension of the ID3 method for handling continuous attributes (C4.5) is based on discretization of attribute values, and it separates the decision space with axis-parallel hyperplanes. In our proposed new method (CDT), continuous attributes are handled without discretization, and arbitrary geometric figures are used to separate the decision space: hyperplanes in general position, spheres, and ellipsoids. The power of the new method is demonstrated on a few examples.
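
    For the two-class case, the contrast between the entropy criterion and a fuzziness-style measure can be made concrete. The sketch below uses the simple linear fuzziness measure 2·min(p, 1−p) as a stand-in; the paper's actual measure, built from a monotone fuzzy operator, is not reproduced here.

```python
import math

def entropy(p: float) -> float:
    """Shannon entropy of a binary class distribution, as used by ID3/C4.5."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def fuzziness(p: float) -> float:
    """A simple linear fuzziness measure: 0 for crisp sets (p = 0 or 1),
    maximal at p = 0.5; a stand-in for the paper's operator-based measure."""
    return 2.0 * min(p, 1.0 - p)

# Both peak at p = 0.5, but the fuzziness measure needs no logarithms.
for p in (0.1, 0.3, 0.5):
    print(f"p={p}: entropy={entropy(p):.3f}, fuzziness={fuzziness(p):.3f}")
```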

    Improving the Evolutionary Coding for Machine Learning Tasks

    The most influential factors in the quality of the solutions found by an evolutionary algorithm are a correct coding of the search space and an appropriate evaluation function for the potential solutions. We address the coding of the search space for obtaining decision rules, i.e., the representation of the individuals of the genetic population. Two new methods for encoding discrete and continuous attributes are presented. Our “natural coding” uses one gene per attribute (continuous or discrete), leading to a reduction in the search space. Genetic operators for this natural coding are formally described, and the reduction of the size of the search space is analysed for several databases from the UCI machine learning repository.
    Comisión Interministerial de Ciencia y Tecnología TIC1143–C03–0
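
    A minimal sketch of what a one-gene-per-attribute coding could look like (a hypothetical encoding, not the paper's exact scheme): a gene for a discrete attribute is read as a bitmask over its value set, while a gene for a continuous attribute is read as a pair of cut-point indices delimiting an interval, so each individual is just one natural number per attribute.

```python
def decode_discrete(gene: int, values: list) -> set:
    """Read one natural number as a bitmask selecting a subset of the
    discrete attribute's values (hypothetical encoding)."""
    return {v for i, v in enumerate(values) if gene >> i & 1}

def decode_continuous(gene: int, cuts: list) -> tuple:
    """Read one natural number as a (lower, upper) pair of indices into the
    sorted cut-point list, i.e. an interval condition (hypothetical encoding)."""
    n = len(cuts)
    lo, hi = divmod(gene, n)
    return cuts[min(lo, hi)], cuts[max(lo, hi)]

# One gene per attribute: e.g. gene 5 = 0b101 selects the 1st and 3rd values.
print(decode_discrete(5, ["red", "green", "blue"]))   # {'red', 'blue'}
print(decode_continuous(7, [0.5, 1.5, 2.5]))          # interval (1.5, 2.5)
```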

    On the usage of the probability integral transform to reduce the complexity of multi-way fuzzy decision trees in Big Data classification problems

    We present a new distributed fuzzy partitioning method to reduce the complexity of multi-way fuzzy decision trees in Big Data classification problems. The proposed algorithm builds a fixed number of fuzzy sets for all variables and adjusts their shape and position to the real distribution of the training data. A two-step process is applied: 1) transformation of the original distribution into a standard uniform distribution by means of the probability integral transform; since the original distribution is generally unknown, the cumulative distribution function is approximated by computing the q-quantiles of the training set; 2) construction of a Ruspini strong fuzzy partition in the transformed attribute space using a fixed number of equally distributed triangular membership functions. Despite the aforementioned transformation, the definition of every fuzzy set in the original space can be recovered by applying the inverse cumulative distribution function (also known as the quantile function). The experimental results reveal that the proposed methodology allows the state-of-the-art multi-way fuzzy decision tree (FMDT) induction algorithm to maintain classification accuracy with up to 6 million fewer leaves.
    Comment: Appeared in 2018 IEEE International Congress on Big Data (BigData Congress). arXiv admin note: text overlap with arXiv:1902.0935
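
    A compact sketch of the two steps (in Python with NumPy; function names are ours, not the paper's): the empirical CDF is approximated by q-quantiles and used as the probability integral transform, and a Ruspini strong partition of equally spaced triangular fuzzy sets is then laid over the transformed [0, 1] space. The inverse transform back to the original space is linear interpolation of the quantile function.

```python
import numpy as np

def fit_quantile_cdf(x: np.ndarray, q: int = 100):
    """Approximate the empirical CDF by its q-quantiles."""
    probs = np.linspace(0.0, 1.0, q + 1)
    return np.quantile(x, probs), probs

def pit(x, knots, probs):
    """Probability integral transform: map raw values to ~Uniform[0, 1]
    via piecewise-linear interpolation of the approximated CDF."""
    return np.interp(x, knots, probs)

def ruspini_memberships(u: np.ndarray, n_sets: int = 5) -> np.ndarray:
    """Equally spaced triangular fuzzy sets on [0, 1]; at any point the
    memberships of the two overlapping triangles sum to 1 (strong partition)."""
    centers = np.linspace(0.0, 1.0, n_sets)
    width = centers[1] - centers[0]
    return np.clip(1.0 - np.abs(u[:, None] - centers[None, :]) / width, 0.0, 1.0)

x = np.random.lognormal(size=10_000)            # a skewed attribute
knots, probs = fit_quantile_cdf(x)
mu = ruspini_memberships(pit(x, knots, probs))
assert np.allclose(mu.sum(axis=1), 1.0)         # Ruspini property holds
# Fuzzy-set centers back in the original space: the inverse (quantile) function.
centers_original = np.interp(np.linspace(0, 1, 5), probs, knots)
```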

    Determinants of Long-term Economic Development: An Empirical Cross-country Study Involving Rough Sets Theory and Rule Induction

    Empirical findings on the determinants of long-term economic growth are numerous, sometimes inconsistent, highly exciting, and still incomplete. The empirical analysis has been carried out almost exclusively with standard econometrics. This study compares results gained by cross-country regressions, as reported in the literature, with those gained by rough sets theory and rule induction. The main advantages of using rough sets are the ability to classify and to discretize: we do not have to deal with distributional, independence, (log-)linearity, and many other assumptions, but can keep the data as they are. The main difference between the regression results and rough sets is that most education and human capital indicators can be labeled as robust attributes. In addition, we find that political indicators enter in a non-linear fashion with respect to growth.
    Keywords: Economic growth, Rough sets, Rule induction
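
    To make the rough-sets machinery behind such a study concrete, the sketch below (a minimal illustration with invented attribute names, not the study's actual toolchain) computes indiscernibility classes on discretized data and the lower approximation of a decision class; objects in the lower approximation support certain, exception-free decision rules.

```python
from collections import defaultdict

def indiscernibility(rows: list[dict], attrs: list[str]) -> list[list[int]]:
    """Group object indices that are indistinguishable on the chosen
    (discretized) condition attributes."""
    blocks = defaultdict(list)
    for i, row in enumerate(rows):
        blocks[tuple(row[a] for a in attrs)].append(i)
    return list(blocks.values())

def lower_approximation(rows, attrs, decision, label):
    """Objects whose entire indiscernibility class carries the target label;
    they support certain (exception-free) decision rules."""
    return [i
            for block in indiscernibility(rows, attrs)
            if all(rows[j][decision] == label for j in block)
            for i in block]

# Illustrative only: a tiny discretized country table with invented attributes.
rows = [
    {"education": "high", "openness": "high", "growth": "fast"},
    {"education": "high", "openness": "high", "growth": "fast"},
    {"education": "low",  "openness": "high", "growth": "fast"},
    {"education": "low",  "openness": "high", "growth": "slow"},
]
print(lower_approximation(rows, ["education", "openness"], "growth", "fast"))
# -> [0, 1]: only the first block certainly implies fast growth.
```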