269 research outputs found

    Variable Selection Bias in Classification Trees Based on Imprecise Probabilities

    Classification trees based on imprecise probabilities are an advancement of classical classification trees. The Gini Index is the default splitting criterion in classical classification trees, while in classification trees based on imprecise probabilities an extension of the Shannon entropy has been introduced as the splitting criterion. However, the use of these empirical entropy measures as split selection criteria can lead to a bias in variable selection, such that variables are preferred for features other than their information content. This bias is not eliminated by the imprecise probability approach. The source of the variable selection bias for the estimated Shannon entropy, as well as possible corrections, are outlined. The variable selection performance of the biased and corrected estimators is evaluated in a simulation study. Additional results from research on variable selection bias in classical classification trees are incorporated, suggesting that alternative split selection criteria for classification trees based on imprecise probabilities deserve further investigation.
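    As a rough illustration of the estimation bias discussed above (not the paper's own correction), the sketch below contrasts the plug-in Shannon entropy estimator with the well-known Miller-Madow bias correction; the sample data are made up.

```python
import numpy as np

def plugin_entropy(counts):
    """Plug-in (maximum-likelihood) estimate of Shannon entropy in nats."""
    counts = np.asarray(counts, dtype=float)
    p = counts[counts > 0] / counts.sum()
    return -np.sum(p * np.log(p))

def miller_madow_entropy(counts):
    """Miller-Madow correction: add (K - 1) / (2n), where K is the number of
    observed categories and n the sample size. The plug-in estimator is
    negatively biased, and the bias grows with the number of categories."""
    counts = np.asarray(counts, dtype=float)
    k = np.count_nonzero(counts)
    return plugin_entropy(counts) + (k - 1) / (2.0 * counts.sum())

# A uniform variable with 8 categories and only 50 observations: the plug-in
# estimate falls short of log(8) ~= 2.079; the correction narrows the gap.
rng = np.random.default_rng(0)
counts = np.bincount(rng.integers(0, 8, size=50), minlength=8)
print(plugin_entropy(counts), miller_madow_entropy(counts))
```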

    Completing an uncertainty criterion of classification

    We present a variation of a classification method based on uncertainty measures on credal sets. As in the original method, it uses the imprecise Dirichlet model to build the credal set and employs the same uncertainty measures. It takes sets of two variables into account in order to reduce uncertainty and to find direct relations between the variables in the database and the variable to be classified. The success rates are equivalent to those of the original method, except on datasets where some variables are directly related to the variable to be classified and determine its value; in those cases we obtain a notable improvement.
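    For readers unfamiliar with the imprecise Dirichlet model (IDM) mentioned above, the following sketch shows how a credal set is obtained from class counts and how the maximum-entropy distribution inside it can be computed by spreading the prior mass s over the least frequent classes. The counts are invented and the code is a generic illustration, not the paper's implementation.

```python
import numpy as np

def idm_intervals(counts, s=1.0):
    """IDM probability intervals: lower_i = n_i / (N + s), upper_i = (n_i + s) / (N + s)."""
    counts = np.asarray(counts, dtype=float)
    total = counts.sum() + s
    return counts / total, (counts + s) / total

def idm_max_entropy(counts, s=1.0):
    """Maximum-entropy distribution in the IDM credal set: the extra mass s is
    poured onto the smallest counts first, making the distribution as uniform
    as the credal set allows (a water-filling scheme)."""
    counts = np.asarray(counts, dtype=float).copy()
    remaining = s
    while remaining > 1e-12:
        m = counts.min()
        idx = np.where(np.isclose(counts, m))[0]
        higher = counts[counts > m + 1e-12]
        # Mass needed to raise all current minima up to the next distinct level.
        gap = (higher.min() - m) * len(idx) if len(higher) else np.inf
        add = min(remaining, gap)
        counts[idx] += add / len(idx)
        remaining -= add
    return counts / counts.sum()

lower, upper = idm_intervals([12, 3, 0], s=1.0)
print(lower, upper)
print(idm_max_entropy([12, 3, 0], s=1.0))  # the prior mass goes to the empty class first
```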

    Upgrading the Fusion of Imprecise Classifiers

    Imprecise classification is a relatively new task within Machine Learning. The difference with standard classification is that not just a single state of the variable under study is determined: a set of states that do not have enough information against them, and therefore cannot be ruled out, is returned as well. For imprecise classification, a model called the Imprecise Credal Decision Tree (ICDT), which uses imprecise probabilities and maximum entropy as the information measure, has been presented. A difficult and interesting task is to show how to combine this type of imprecise classifiers. A procedure based on the minimum level of dominance has been presented; although it represents a very strong way of combining, it has the drawback of a considerable risk of erroneous predictions. In this research, we use the second-best theory to argue that this type of combination can be improved through a new procedure built by relaxing the constraints. The new procedure is compared with the original one in an experimental study on a large set of datasets, and shows an improvement. This work has been supported by UGR-FEDER funds under Project A-TIC-344-UGR20 and by FEDER/Junta de Andalucía - Consejería de Transformación Económica, Industria, Conocimiento y Universidades under Project P20_0015.
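    The abstract does not spell out the dominance-based combination, so the toy sketch below only illustrates the general idea of fusing set-valued predictions with a tunable strictness parameter; the predictions, the min_support rule and the fallback are hypothetical and are not the procedure proposed in the paper.

```python
from collections import Counter

# Hypothetical set-valued predictions from three imprecise classifiers for one
# instance: each classifier returns the class states it cannot rule out.
predictions = [{"a", "b"}, {"a", "b"}, {"b", "c"}]

def combine_by_support(predictions, min_support):
    """Keep every state retained by at least `min_support` base classifiers.
    min_support = len(predictions) behaves like a strict (intersection-style)
    rule; lowering it relaxes the combination."""
    votes = Counter(state for pred in predictions for state in pred)
    combined = {state for state, v in votes.items() if v >= min_support}
    # Fall back to the most supported states if the threshold empties the set.
    if not combined:
        top = max(votes.values())
        combined = {state for state, v in votes.items() if v == top}
    return combined

print(combine_by_support(predictions, min_support=3))  # strict: {'b'}
print(combine_by_support(predictions, min_support=2))  # relaxed: {'a', 'b'}
```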

    Improving the Naive Bayes Classifier via a Quick Variable Selection Method Using Maximum of Entropy

    Variable selection methods play an important role in the field of attribute mining. The Naive Bayes (NB) classifier is a very simple and popular classification method that yields good results in a short processing time. Hence, it is a very appropriate classifier for very large datasets. The method has a high dependence on the relationships between the variables. The Info-Gain (IG) measure, which is based on general entropy, can be used as a quick variable selection method. This measure ranks the importance of the attribute variables with respect to a variable under study via the information obtained from a dataset. Its main drawback is that it is always non-negative, so it requires setting an information threshold to select the set of most important variables for each dataset. We introduce here a new quick variable selection method that generalizes the method based on the Info-Gain measure. It uses imprecise probabilities and the maximum entropy measure to select the most informative variables without setting a threshold. This new variable selection method, combined with the Naive Bayes classifier, improves the original method and provides a valuable tool for handling datasets with a very large number of features and a huge amount of data, where more complex methods are not computationally feasible. This work has been supported by the Spanish "Ministerio de Economía y Competitividad" and by "Fondo Europeo de Desarrollo Regional" (FEDER) under Project TEC2015-69496-R.
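    The classical Info-Gain ranking described above can be sketched in a few lines; note the explicit threshold, which is the step the proposed imprecise-probability variant avoids. The toy labels, features and threshold value below are made up for illustration.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a discrete label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def info_gain(feature, labels):
    """Info-Gain IG(C; X) = H(C) - sum_x P(X = x) * H(C | X = x)."""
    gain = entropy(labels)
    values, counts = np.unique(feature, return_counts=True)
    for value, count in zip(values, counts):
        gain -= (count / len(feature)) * entropy(labels[feature == value])
    return gain

# Rank two candidate attributes and keep those above a user-chosen threshold.
labels = np.array(["y", "y", "n", "n", "y", "n"])
features = {"x1": np.array([0, 0, 1, 1, 0, 1]),
            "x2": np.array([0, 1, 0, 1, 1, 0])}
threshold = 0.05
selected = [name for name, col in features.items()
            if info_gain(col, labels) > threshold]
print(selected)
```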

    Maximum of entropy for belief intervals under Evidence Theory

    The Dempster-Shafer Theory (DST), or Evidence Theory, has been commonly used to deal with uncertainty. It is based on the concept of a basic probability assignment (BPA). The upper entropy on the credal set associated with a BPA is the only uncertainty measure in DST that verifies all the necessary mathematical properties and behaviors. Nonetheless, its computation is notably complex. For this reason, many alternatives to this measure have been proposed recently, but they do not satisfy most of the mathematical requirements and present some undesirable behaviors. Belief intervals have frequently been employed to quantify uncertainty in DST in recent years, and they can represent the uncertainty-based information better than a BPA. In this research, we develop a new uncertainty measure that consists of the maximum of entropy on the credal set corresponding to the belief intervals for singletons. It verifies all the crucial mathematical requirements and presents good behavior, solving most of the shortcomings found in recently proposed uncertainty measures. Moreover, its calculation is notably easier than that of the upper entropy on the credal set associated with the BPA. Therefore, our proposed uncertainty measure is more suitable for use in practical applications. Spanish Ministerio de Economía y Competitividad TIN2016-77902-C3-2-P; European Union (EU) TEC2015-69496-
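    As a rough numerical illustration of the ingredients of such a measure, the sketch below derives the singleton belief intervals [Bel({x}), Pl({x})] from a hypothetical BPA and then maximises Shannon entropy over the distributions constrained to those intervals with a generic optimiser; the paper's own computation of the measure may proceed differently.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical BPA on the frame {a, b, c}: focal sets mapped to masses.
frame = ("a", "b", "c")
bpa = {frozenset({"a"}): 0.3,
       frozenset({"b"}): 0.1,
       frozenset({"a", "c"}): 0.4,
       frozenset(frame): 0.2}

def singleton_intervals(bpa, frame):
    """Belief/plausibility interval [Bel({x}), Pl({x})] for each singleton."""
    bel = {x: sum(m for s, m in bpa.items() if s == frozenset({x})) for x in frame}
    pl = {x: sum(m for s, m in bpa.items() if x in s) for x in frame}
    return bel, pl

def max_entropy_in_intervals(bel, pl, frame):
    """Maximise Shannon entropy over {p : Bel({x}) <= p(x) <= Pl({x}), sum p = 1}."""
    lo = np.array([bel[x] for x in frame])
    hi = np.array([pl[x] for x in frame])
    neg_entropy = lambda p: np.sum(p * np.log(np.clip(p, 1e-12, 1.0)))
    res = minimize(neg_entropy,
                   x0=(lo + hi) / 2,
                   bounds=list(zip(lo, hi)),
                   constraints=[{"type": "eq", "fun": lambda p: p.sum() - 1.0}],
                   method="SLSQP")
    return dict(zip(frame, res.x)), -res.fun

bel, pl = singleton_intervals(bpa, frame)
probs, h = max_entropy_in_intervals(bel, pl, frame)
print(bel, pl)
print(probs, h)
```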

    Application of the Average Gain, Threshold Pruning and Cost Complexity Pruning Methods for Attribute Splitting in the C4.5 Algorithm

    C4.5 is a supervised learning classifier that builds a decision tree from data. Attribute splitting is the main process in the formation of a decision tree in C4.5. The attribute split in C4.5 cannot take misclassification cost into account, which affects the performance of the classifier. After the attributes are split, the next process is pruning. Pruning cuts or eliminates unnecessary branches; branches or nodes that are not needed can make the decision tree very large, a situation called over-fitting, which remains a major open problem. Common methods for attribute splitting are the Gini Index, Information Gain, Gain Ratio, and the Average Gain proposed by Mitchell. Average Gain not only overcomes a weakness of Information Gain but also helps to solve the problems of Gain Ratio. The attribute split method proposed in this research uses the average gain value multiplied by the difference in misclassification cost, while pruning is performed by combining threshold pruning and cost complexity pruning. In this research, the proposed method is applied to several datasets and its performance is compared with that of split methods using the Gini Index, Information Gain and Gain Ratio. Selecting split attributes using average gain multiplied by the difference in misclassification cost improves the classification performance of C4.5. This is demonstrated through the Friedman test: the proposed split attribute method, combined with threshold pruning and cost complexity pruning, ranks first in accuracy, and the decision trees formed by the proposed method are smaller.
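    The abstract does not define the Average Gain criterion or the misclassification-cost term in detail, so no attempt is made to reproduce them here; instead, the sketch below illustrates the standard cost-complexity (weakest-link) pruning step that the proposed method combines with threshold pruning, using a made-up toy subtree.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    """Minimal decision-tree node: error_as_leaf is the resubstitution
    (misclassification) error the node would incur if collapsed to a leaf."""
    error_as_leaf: float
    children: List["Node"] = field(default_factory=list)

def leaves(node):
    """All leaf nodes of the subtree rooted at `node`."""
    return [node] if not node.children else [l for c in node.children for l in leaves(c)]

def subtree_error(node):
    """Total error of the fully grown subtree (sum over its leaves)."""
    return sum(l.error_as_leaf for l in leaves(node))

def weakest_link_alpha(node):
    """Cost-complexity 'effective alpha' g(t) = (R(t) - R(T_t)) / (|leaves(T_t)| - 1).
    The internal node with the smallest g(t) is pruned first."""
    return (node.error_as_leaf - subtree_error(node)) / (len(leaves(node)) - 1)

# Toy subtree: collapsing the root would cost 10 errors; its leaves make 4 + 3.
root = Node(error_as_leaf=10.0, children=[Node(4.0), Node(3.0)])
print(weakest_link_alpha(root))  # (10 - 7) / (2 - 1) = 3.0
```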
    • …