46 research outputs found

    Upgrading the Fusion of Imprecise Classifiers

    Imprecise classification is a relatively new task within Machine Learning. It differs from standard classification in that it does not determine a single state of the variable under study; instead, it determines the set of states that cannot be ruled out because there is not enough information against them. For imprecise classification, a model called the Imprecise Credal Decision Tree (ICDT), which uses imprecise probabilities and maximum entropy as the information measure, has been presented. A difficult and interesting task is to combine imprecise classifiers of this type. A procedure based on the minimum level of dominance has been presented; although it is a very strong way of combining, it carries a considerable risk of erroneous prediction. In this research, we use second-best theory to argue that this type of combination can be improved through a new procedure built by relaxing the constraints. The new procedure is compared with the original one in an experimental study on a large set of datasets, and shows improvement. Funding: UGR-FEDER funds under Project A-TIC-344-UGR20; FEDER/Junta de Andalucía-Consejería de Transformación Económica, Industria, Conocimiento y Universidades under Project P20_0015.
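To give a concrete flavour of the kind of combination being relaxed, the sketch below pools set-valued predictions by counting how many base classifiers leave each class in play; the `min_votes` parameter plays the role of the dominance level, with the strict setting requiring unanimity. The function name, the voting rule, and the fallback are illustrative stand-ins, not the paper's actual procedure.

```python
from collections import Counter

def combine_imprecise(predictions, min_votes):
    """Combine set-valued predictions from several imprecise classifiers.

    predictions : list of sets, one set of non-dominated classes per classifier.
    min_votes   : relaxed dominance level -- a class is kept if at least this
                  many base classifiers did not rule it out.
    """
    votes = Counter(c for pred in predictions for c in pred)
    combined = {c for c, n in votes.items() if n >= min_votes}
    # Never return an empty prediction: fall back to the most-voted classes.
    if not combined:
        top = max(votes.values())
        combined = {c for c, n in votes.items() if n == top}
    return combined

preds = [{"a", "b"}, {"a"}, {"a", "c"}]
print(combine_imprecise(preds, min_votes=3))  # strict (unanimity): {'a'}
print(combine_imprecise(preds, min_votes=1))  # fully relaxed: {'a', 'b', 'c'}
```

Lowering `min_votes` trades precision (smaller predicted sets) for a lower risk of excluding the true class, which is the tension the paper's relaxation addresses.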

    Ensemble methods for classification trees under imprecise probabilities


    Direct Nonparametric Predictive Inference Classification Trees

    Classification is the task of assigning a new instance to one of a set of predefined categories based on the attributes of the instance. A classification tree is one of the most commonly used techniques in the area of classification. In recent years, many statistical methodologies have been developed to make inferences using imprecise probability theory, one of which is nonparametric predictive inference (NPI). NPI has been developed for different types of data and has been successfully applied to several fields, including classification. Due to the predictive nature of NPI, it is well suited for classification, as the nature of classification is explicitly predictive as well. In this thesis, we introduce a novel classification tree algorithm which we call the Direct Nonparametric Predictive Inference (D-NPI) classification algorithm. The D-NPI algorithm is completely based on the NPI approach, and it does not use any other assumptions. As a first step for developing the D-NPI classification method, we restrict our focus to binary and multinomial data types. The D-NPI algorithm uses a new split criterion called Correct Indication (CI), which is completely based on NPI and does not use any additional concepts such as entropy. CI reflects how informative an attribute variable is: a highly informative attribute variable yields high NPI lower and upper probabilities for CI. In addition, CI reports the strength of the evidence, based on the data, that an attribute variable provides about the possible class state of future instances. The performance of the D-NPI classification algorithm is compared against several classification algorithms from the literature, including some imprecise probability algorithms, using different evaluation measures. The experimental results indicate that the D-NPI classification algorithm performs well and tends to slightly outperform other classification algorithms.
Finally, a study of the D-NPI classification tree algorithm with noisy data is presented. Noisy data are data that contain incorrect values for the attribute variables or the class variable. The performance of the D-NPI classification tree algorithm with noisy data is studied and compared to other classification tree algorithms when different levels of random noise are added to the class variable or to attribute variables. The results indicate that the D-NPI classification algorithm performs well with class noise and slightly outperforms other classification algorithms, while there is no single classification algorithm that acts as the best performing algorithm with attribute noise.
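To make the NPI quantities above concrete, the sketch below uses Coolen's NPI bounds for Bernoulli data, under which s successes among n observations give the interval [s/(n+1), (s+1)/(n+1)] for the next observation being a success, and scores a binary split by the size-weighted NPI lower probabilities of each branch's majority class. This scoring rule is only an illustrative stand-in for the Correct Indication criterion, whose actual definition is given in the thesis.

```python
def npi_bounds(successes, n):
    """NPI lower and upper probabilities that the next observation is a
    'success', given `successes` among n binary observations:
    [s/(n+1), (s+1)/(n+1)]."""
    return successes / (n + 1), (successes + 1) / (n + 1)

def split_score(branches):
    """Toy split score in the spirit of Correct Indication: weight each
    branch's NPI lower probability of its majority class by branch size.
    `branches` is a list of (branch size, majority-class count) pairs."""
    total = sum(n for n, _ in branches)
    score = 0.0
    for n, majority_count in branches:
        lower, _ = npi_bounds(majority_count, n)
        score += (n / total) * lower
    return score

# Two candidate binary splits of 10 instances: (branch size, majority count).
pure_split = [(5, 5), (5, 5)]    # each branch contains a single class
mixed_split = [(5, 3), (5, 3)]   # each branch is a 3/2 mixture
print(split_score(pure_split) > split_score(mixed_split))  # True
```

Note that even a pure branch of 5 observations only gets a lower probability of 5/6, not 1: the NPI interval keeps some imprecision for the unseen future observation, which is the cautious behaviour the thesis exploits.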

    Contributions to reasoning on imprecise data

    This thesis contains four contributions which advocate cautious statistical modelling and inference. They achieve it by taking sets of models into account, either directly or indirectly by looking at compatible data situations. Special care is taken to avoid assumptions which are technically convenient, but reduce the uncertainty involved in an unjustified manner. This thesis provides methods for cautious statistical modelling and inference, which are able to exhaust the potential of precise and vague data, motivated by different fields of application, ranging from political science to official statistics. First, the inherently imprecise Nonparametric Predictive Inference model is involved in the cautious selection of splitting variables in the construction of imprecise classification trees, which are able to describe a structure and allow for a reasonably high predictive power. Depending on the interpretation of vagueness, different strategies for vague data are then discussed in terms of finite random closed sets: On the one hand, the data to be analysed are regarded as set-valued answers of an item in a questionnaire, where each possible answer corresponding to a subset of the sample space is interpreted as a separate entity. In this way the finite random set is reduced to an (ordinary) random variable on a transformed sample space. The context of application is the analysis of voting intentions, where it is shown that the presented approach is able to characterise the undecided in a more detailed way, something common approaches cannot do. Although the presented analysis, regarded as a first step, is carried out on set-valued data, which are suitably self-constructed with respect to the scientific research question, it still clearly demonstrates that the full potential of this quite general framework is not exhausted. It is capable of dealing with more complex applications.
On the other hand, the vague data are produced by set-valued single imputation (imprecise imputation), where the finite random sets are interpreted as being the result of some (unspecified) coarsening. The approach is presented within the context of statistical matching, which is used to gain joint knowledge on features that were not jointly collected in the initial data production. This is especially relevant in data production, e.g. in official statistics, as it allows fusing the information of already accessible data sets into a new one, without the requirement of actual data collection in the field. Finally, in order to share data, they need to be suitably anonymised. For the specific class of anonymisation techniques of microaggregation, its ability to support inference on generalised linear regression models is evaluated. To this end, the microaggregated data are regarded as a set of compatible, unobserved underlying data situations. Two strategies are proposed. First, a maximax-like optimisation strategy is pursued, in which the underlying unobserved data are incorporated into the regression model as nuisance parameters, providing a concise yet over-optimistic estimation of the regression coefficients. Secondly, an approach in terms of partial identification, which is inherently more cautious than the previous one, is applied to estimate the set of all regression coefficients that are obtained by performing the estimation on each compatible data situation. Vague data are deemed favourable to precise data as they additionally encompass the uncertainty of the individual observation, and therefore they have a higher informational value. However, to the present day, there are few (credible) statistical models that are able to deal with vague or set-valued data. For this reason, the collection of such data is neglected in data production, disallowing such models to exhaust their full potential.
This in turn prevents a thorough evaluation, negatively affecting the (further) development of such models. This situation is a variant of the chicken-or-egg dilemma. The ambition of this thesis is to break this cycle by providing actual methods for dealing with vague data in relevant situations in practice, to stimulate the required data production.
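The partial-identification strategy for microaggregated data can be sketched in a few lines: treat each observation's attribute value as a set of compatible values, fit the regression on every compatible data situation, and report the range of resulting coefficients. The toy below does this for the slope of a simple linear regression; the data, the function names, and the two-element compatible sets are invented for illustration, and the thesis itself treats generalised linear models and realistic microaggregation schemes.

```python
import itertools

def ols_slope(x, y):
    """Ordinary least-squares slope of y on x (closed form)."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(v * v for v in x)
    sxy = sum(a * b for a, b in zip(x, y))
    return (n * sxy - sx * sy) / (n * sxx - sx * sx)

def slope_identification_region(x_sets, y):
    """Fit OLS on every data situation compatible with the set-valued
    x-observations and return the extreme slope estimates."""
    slopes = [ols_slope(xs, y) for xs in itertools.product(*x_sets)]
    return min(slopes), max(slopes)

# Each x is only known up to a two-element compatible set.
x_sets = [(1.0, 1.5), (2.0, 2.5), (3.0, 3.5)]
y = [1.0, 2.0, 3.0]
lo, hi = slope_identification_region(x_sets, y)
print(lo <= 1.0 <= hi)  # True: the precise-data slope lies in the region
```

The width of the interval [lo, hi] quantifies how much the anonymisation genuinely costs in terms of inference, which is exactly what the cautious approach makes visible and the maximax strategy hides.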

    Statistical Issues in Machine Learning

    Recursive partitioning methods from machine learning are being widely applied in many scientific fields such as genetics and bioinformatics. The present work is concerned with the two main problems that arise in recursive partitioning, instability and biased variable selection, from a statistical point of view. With respect to the first issue, instability, the entire scope of methods from standard classification trees over robustified classification trees and ensemble methods such as TWIX, bagging and random forests is covered in this work. While ensemble methods prove to be much more stable than single trees, they also lose most of their interpretability. Therefore an adaptive cutpoint selection scheme is suggested with which a TWIX ensemble reduces to a single tree if the partition is sufficiently stable. With respect to the second issue, variable selection bias, the statistical sources of this artifact in single trees and a new form of bias inherent in ensemble methods based on bootstrap samples are investigated. For single trees, one unbiased split selection criterion is evaluated and another is newly introduced here. Based on the results for single trees and further findings on the effects of bootstrap sampling on association measures, it is shown that, in addition to using an unbiased split selection criterion, subsampling instead of bootstrap sampling should be employed in ensemble methods to be able to reliably compare the variable importance scores of predictor variables of different types. The statistical properties and the null hypothesis of a test for the random forest variable importance are critically investigated. Finally, a new, conditional importance measure is suggested that allows for a fair comparison in the case of correlated predictor variables and better reflects the null hypothesis of interest.
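The recommendation to replace bootstrap sampling by subsampling is easy to state in code. The sketch below contrasts the two resampling schemes: with replacement, an observation can enter a resample several times, which is the property shown in this work to distort association measures and hence variable importance comparisons. Function names and sizes are illustrative.

```python
import random

def bootstrap_sample(data, rng):
    """Draw n observations WITH replacement (standard bagging)."""
    return [rng.choice(data) for _ in data]

def subsample(data, frac, rng):
    """Draw a fraction of observations WITHOUT replacement, as recommended
    in this work for unbiased comparison of variable importances."""
    k = round(frac * len(data))
    return rng.sample(data, k)

rng = random.Random(0)
data = list(range(100))
boot = bootstrap_sample(data, rng)
sub = subsample(data, 0.632, rng)
print(len(set(boot)) < len(boot))  # True: a bootstrap sample repeats observations
print(len(set(sub)) == len(sub))   # True: a subsample never does
```

The fraction 0.632 matches the expected share of distinct observations in a bootstrap sample, a common choice when swapping one scheme for the other so that the effective sample sizes stay comparable.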

    Small margin ensembles can be robust to class-label noise

    This is the author's version of a work that was accepted for publication in Neurocomputing. Changes resulting from the publishing process, such as peer review, editing, corrections, structural formatting, and other quality control mechanisms may not be reflected in this document. Changes may have been made to this work since it was submitted for publication. A definitive version was subsequently published in Neurocomputing, Vol. 160 (2015), DOI 10.1016/j.neucom.2014.12.086. Subsampling is used to generate bagging ensembles that are accurate and robust to class-label noise. The effect of using smaller bootstrap samples to train the base learners is to make the ensemble more diverse. As a result, the classification margins tend to decrease. In spite of having small margins, these ensembles can be robust to class-label noise. The validity of these observations is illustrated in a wide range of synthetic and real-world classification tasks. In the problems investigated, subsampling significantly outperforms standard bagging for different amounts of class-label noise. By contrast, the effectiveness of subsampling in random forest is problem dependent. In these types of ensembles the best overall accuracy is obtained when the random trees are built on bootstrap samples of the same size as the original training data. Nevertheless, subsampling becomes more effective as the amount of class-label noise increases. The authors acknowledge financial support from Spanish Plan Nacional I+D+i Grant TIN2013-42351-P and from Comunidad de Madrid Grant S2013/ICE-2845 CASI-CAM-CM.
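The margin notion at the heart of this paper can be made concrete: for a voting ensemble, an instance's margin is the fraction of base-learner votes for the true class minus the largest fraction received by any other class. The sketch below computes it for two hypothetical ensembles; the more diverse one (as produced by training on small subsamples) has the smaller margin, which is the regime the paper shows can still be robust to label noise. The vote data are invented for illustration.

```python
def margins(vote_lists, true_labels):
    """Classification margin per instance: fraction of base-learner votes
    for the true class minus the largest fraction for any other class."""
    out = []
    for votes, truth in zip(vote_lists, true_labels):
        n = len(votes)
        frac_true = votes.count(truth) / n
        others = [votes.count(c) / n for c in set(votes) if c != truth]
        out.append(frac_true - max(others, default=0.0))
    return out

# Ten base learners voting on two instances whose true class is "a".
votes = [["a"] * 9 + ["b"],      # a confident, homogeneous ensemble
         ["a"] * 6 + ["b"] * 4]  # a diverse ensemble, e.g. from subsampling
m = margins(votes, ["a", "a"])
print(m[0] > m[1])  # True: the diverse ensemble has the smaller margin
```

Both instances are still classified correctly (the margin stays positive), which is the paper's point: small margins and robustness to class-label noise are not mutually exclusive.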

    Optimal Thresholds for Classification Trees using Nonparametric Predictive Inference

    In data mining, classification is used to assign a new observation to one of a set of predefined classes based on the attributes of the observation. Classification trees are one of the most commonly used methods in the area of classification because their rules are easy to understand and interpret. Classification trees are constructed recursively by a top-down scheme using repeated splits of the training data set, which is a subset of the data. When the data set involves a continuous-valued attribute, there is a need to select an appropriate threshold value to determine the classes and split the data. In recent years, Nonparametric Predictive Inference (NPI) has been introduced for selecting optimal thresholds for two- and three-class classification problems, where the inferences are explicitly in terms of a given number of future observations and target proportions. These target proportions enable one to choose weights that reflect the relative importance of one class over another. The NPI-based threshold selection method has previously been implemented in the context of Receiver Operating Characteristic (ROC) analysis, but not for building classification trees. Due to the predictive nature of the NPI-based threshold selection method, it is well suited for the classification tree method, as the end goal of building classification trees is to use them for prediction as well. In this thesis, we present new classification algorithms for building classification trees using the NPI approach for selecting the optimal thresholds. We first present a new classification algorithm, which we call the NPI2-Tree algorithm, for building binary classification trees; we then extend it to build classification trees with three ordered classes, which we call the NPI3-Tree algorithm. In order to build classification trees using our algorithms, we introduce a new procedure for selecting the optimal values of target proportions by optimising classification performance on test data. 
We use different measures to evaluate and compare the performance of the NPI2-Tree and the NPI3-Tree classification algorithms with other classification algorithms from the literature. The experimental results show that our classification algorithms perform well compared to other algorithms. Finally, we present applications of the NPI2-Tree and NPI3-Tree classification algorithms on noisy data sets. Noise refers to situations that occur when the data sets used for classification tasks have incorrect values in the attribute variables or the class variable. The performances of the NPI2-Tree and NPI3-Tree classification algorithms in the case of noisy data are evaluated using different levels of noise added to the class variable. The results show that our classification algorithms perform well in the case of noisy data and tend to be quite robust for most noise levels, compared to other classification algorithms.
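A minimal sketch of the threshold-selection idea, assuming Coolen's NPI lower probability s/(n+1) that the next binary observation is a success: candidate thresholds are midpoints between consecutive sorted attribute values, and each is scored by a weighted sum of the NPI lower probabilities of classifying a future observation correctly on either side of the cut. The weights `w0`/`w1` stand in for the thesis's target proportions; the actual NPI2-Tree criterion, phrased in terms of a given number of future observations, is more elaborate than this.

```python
def npi_lower(s, n):
    """NPI lower probability that the next observation is a 'success',
    given s successes among n observations: s / (n + 1)."""
    return s / (n + 1)

def best_threshold(xs, ys, w0=0.5, w1=0.5):
    """Pick a threshold for a continuous attribute in a two-class (0/1)
    problem. Candidates are midpoints between consecutive sorted values;
    each is scored by a weighted sum of NPI lower probabilities that a
    future observation on either side belongs to the intended class."""
    order = sorted(zip(xs, ys))
    best, best_score = None, -1.0
    for i in range(len(order) - 1):
        t = (order[i][0] + order[i + 1][0]) / 2
        left = [y for x, y in order if x <= t]
        right = [y for x, y in order if x > t]
        score = (w0 * npi_lower(left.count(0), len(left))
                 + w1 * npi_lower(right.count(1), len(right)))
        if score > best_score:
            best, best_score = t, score
    return best

xs = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
ys = [0, 0, 0, 1, 1, 1]
print(best_threshold(xs, ys))  # 6.5, the midpoint separating the two classes
```

Shifting weight from `w0` to `w1` moves the chosen threshold so that fewer future class-1 observations are misclassified, which is how target proportions let one class be treated as more important than the other.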

    The Node Selection Method for Split Attribute in C4.5 Algorithm Using the Coefficient of Determination Values for Multivariate Data Set

    The split attribute in a decision tree algorithm, especially C4.5, has an important influence on producing a decision tree with high predictive performance. This study performs the attribute split in the C4.5 algorithm using the coefficient of determination (R²), incorporated with the aim of improving the performance of the models produced by the C4.5 algorithm itself. The data used in this research are public and private datasets. This study extends the C4.5 algorithm developed by Quinlan with the R² value. The results of this study indicate that using the R² value in the C4.5 algorithm performs well in terms of accuracy and recall, because three of the four datasets used achieve higher values than the C4.5 algorithm without R². In terms of precision, the performance is moderately good, because only two datasets achieve higher values than the algorithm without R².
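The abstract does not spell out how the R² value enters C4.5's split selection, so the sketch below only computes the two ingredients: C4.5's information gain for a categorical attribute, and the coefficient of determination of a 0/1-coded class on a numeric attribute. Any rule for combining the two scores (e.g. averaging) would be a placeholder assumption on top of this, and the example data are invented.

```python
import math

def entropy(labels):
    """Shannon entropy of a class-label list, in bits."""
    n = len(labels)
    return -sum((k / n) * math.log2(k / n)
                for k in (labels.count(v) for v in set(labels)))

def info_gain(attr, labels):
    """C4.5-style information gain of a categorical attribute."""
    n = len(labels)
    gain = entropy(labels)
    for v in set(attr):
        branch = [y for x, y in zip(attr, labels) if x == v]
        gain -= len(branch) / n * entropy(branch)
    return gain

def r_squared(attr, labels):
    """Coefficient of determination (R²) of a simple linear regression of
    the 0/1-coded class on a numeric attribute."""
    n = len(attr)
    mx, my = sum(attr) / n, sum(labels) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(attr, labels))
    sxx = sum((x - mx) ** 2 for x in attr)
    syy = sum((y - my) ** 2 for y in labels)
    return (sxy * sxy) / (sxx * syy) if sxx and syy else 0.0

ys = [0, 0, 0, 1, 1, 1]
print(round(info_gain(["a", "a", "b", "b", "c", "c"], ys), 3))   # 0.667
print(round(r_squared([1.0, 2.0, 3.0, 10.0, 11.0, 12.0], ys), 3))  # 0.968
```

Here R² near 1 signals that the numeric attribute separates the classes almost linearly, which is the kind of signal the paper adds alongside C4.5's entropy-based criterion.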