3,492 research outputs found

    Unbiased split selection for classification trees based on the Gini Index

    Get PDF
    The Gini gain is one of the most common variable selection criteria in machine learning. We derive the exact distribution of the maximally selected Gini gain in the context of binary classification using continuous predictors by means of a combinatorial approach. This distribution provides a formal support for variable selection bias in favor of variables with a high amount of missing values when the Gini gain is used as split selection criterion, and we suggest to use the resulting p-value as an unbiased split selection criterion in recursive partitioning algorithms. We demonstrate the efficiency of our novel method in simulation- and real data- studies from veterinary gynecology in the context of binary classification and continuous predictor variables with different numbers of missing values. Our method is extendible to categorical and ordinal predictor variables and to other split selection criteria such as the cross-entropy criterion

    Some Remarks about the Usage of Asymmetric Correlation Measurements for the Induction of Decision Trees

    Get PDF
    Decision trees are used very successfully for the identification resp. classification task of objects in many domains like marketing (e.g. Decker, Temme (2001)) or medicine. Other procedures to classify objects are for instance the logistic regression, the logit- or probit analysis, the linear or squared discriminant analysis, the nearest neighbour procedure or some kernel density estimators. The common aim of all these classification procedures is to generate classification rules which describe the correlation between some independent exogenous variables resp. attributes and at least one endogenous variable, the so called class membership variable. --

    Hyper-heuristic decision tree induction

    Get PDF
    A hyper-heuristic is any algorithm that searches or operates in the space of heuristics as opposed to the space of solutions. Hyper-heuristics are increasingly used in function and combinatorial optimization. Rather than attempt to solve a problem using a fixed heuristic, a hyper-heuristic approach attempts to find a combination of heuristics that solve a problem (and in turn may be directly suitable for a class of problem instances). Hyper-heuristics have been little explored in data mining. This work presents novel hyper-heuristic approaches to data mining, by searching a space of attribute selection criteria for decision tree building algorithm. The search is conducted by a genetic algorithm. The result of the hyper-heuristic search in this case is a strategy for selecting attributes while building decision trees. Most hyper-heuristics work by trying to adapt the heuristic to the state of the problem being solved. Our hyper-heuristic is no different. It employs a strategy for adapting the heuristic used to build decision tree nodes according to some set of features of the training set it is working on. We introduce, explore and evaluate five different ways in which this problem state can be represented for a hyper-heuristic that operates within a decisiontree building algorithm. In each case, the hyper-heuristic is guided by a rule set that tries to map features of the data set to be split by the decision tree building algorithm to a heuristic to be used for splitting the same data set. We also explore and evaluate three different sets of low-level heuristics that could be employed by such a hyper-heuristic. This work also makes a distinction between specialist hyper-heuristics and generalist hyper-heuristics. The main difference between these two hyperheuristcs is the number of training sets used by the hyper-heuristic genetic algorithm. Specialist hyper-heuristics are created using a single data set from a particular domain for evolving the hyper-heurisic rule set. Such algorithms are expected to outperform standard algorithms on the kind of data set used by the hyper-heuristic genetic algorithm. Generalist hyper-heuristics are trained on multiple data sets from different domains and are expected to deliver a robust and competitive performance over these data sets when compared to standard algorithms. We evaluate both approaches for each kind of hyper-heuristic presented in this thesis. We use both real data sets as well as synthetic data sets. Our results suggest that none of the hyper-heuristics presented in this work are suited for specialization – in most cases, the hyper-heuristic’s performance on the data set it was specialized for was not significantly better than that of the best performing standard algorithm. On the other hand, the generalist hyper-heuristics delivered results that were very competitive to the best standard methods. In some cases we even achieved a significantly better overall performance than all of the standard methods

    Contributions to reasoning on imprecise data

    Get PDF
    This thesis contains four contributions which advocate cautious statistical modelling and inference. They achieve it by taking sets of models into account, either directly or indirectly by looking at compatible data situations. Special care is taken to avoid assumptions which are technically convenient, but reduce the uncertainty involved in an unjustified manner. This thesis provides methods for cautious statistical modelling and inference, which are able to exhaust the potential of precise and vague data, motivated by different fields of application, ranging from political science to official statistics. At first, the inherently imprecise Nonparametric Predictive Inference model is involved in the cautious selection of splitting variables in the construction of imprecise classification trees, which are able to describe a structure and allow for a reasonably high predictive power. Dependent on the interpretation of vagueness, different strategies for vague data are then discussed in terms of finite random closed sets: On the one hand, the data to be analysed are regarded as set-valued answers of an item in a questionnaire, where each possible answer corresponding to a subset of the sample space is interpreted as a separate entity. By this the finite random set is reduced to an (ordinary) random variable on a transformed sample space. The context of application is the analysis of voting intentions, where it is shown that the presented approach is able to characterise the undecided in a more detailed way, which common approaches are not able to. Altough the presented analysis, regarded as a first step, is carried out on set-valued data, which are suitably self-constructed with respect to the scientific research question, it still clearly demonstrates that the full potential of this quite general framework is not exhausted. It is capable of dealing with more complex applications. On the other hand, the vague data are produced by set-valued single imputation (imprecise imputation) where the finite random sets are interpreted as being the result of some (unspecified) coarsening. The approach is presented within the context of statistical matching, which is used to gain joint knowledge on features that were not jointly collected in the initial data production. This is especially relevant in data production, e.g. in official statistics, as it allows to fuse the information of already accessible data sets into a new one, without the requirement of actual data collection in the field. Finally, in order to share data, they need to be suitably anonymised. For the specific class of anonymisation techniques of microaggregation, its ability to infer on generalised linear regression models is evaluated. Therefore, the microaggregated data are regarded as a set of compatible, unobserved underlying data situations. Two strategies to follow are proposed. At first, a maximax-like optimisation strategy is pursued, in which the underlying unobserved data are incorporated into the regression model as nuisance parameters, providing a concise yet over-optimistic estimation of the regression coefficients. Secondly, an approach in terms of partial identification, which is inherently more cautious than the previous one, is applied to estimate the set of all regression coefficients that are obtained by performing the estimation on each compatible data situation. Vague data are deemed favourable to precise data as they additionally encompass the uncertainty of the individual observation, and therefore they have a higher informational value. However, to the present day, there are few (credible) statistical models that are able to deal with vague or set-valued data. For this reason, the collection of such data is neglected in data production, disallowing such models to exhaust their full potential. This in turn prevents a throughout evaluation, negatively affecting the (further) development of such models. This situation is a variant of the chicken or egg dilemma. The ambition of this thesis is to break this cycle by providing actual methods for dealing with vague data in relevant situations in practice, to stimulate the required data production.Diese Schrift setzt sich in vier Beiträgen für eine vorsichtige statistische Modellierung und Inferenz ein. Dieses wird erreicht, indem man Mengen von Modellen betrachtet, entweder direkt oder indirekt über die Interpretation der Daten als Menge zugrunde liegender Datensituationen. Besonderer Wert wird dabei darauf gelegt, Annahmen zu vermeiden, die zwar technisch bequem sind, aber die zugrunde liegende Unsicherheit der Daten in ungerechtfertigter Weise reduzieren. In dieser Schrift werden verschiedene Methoden der vorsichtigen Modellierung und Inferenz vorgeschlagen, die das Potential von präzisen und unscharfen Daten ausschöpfen können, angeregt von unterschiedlichen Anwendungsbereichen, die von Politikwissenschaften bis zur amtlichen Statistik reichen. Zuerst wird das Modell der Nonparametrischen Prädiktiven Inferenz, welches per se unscharf ist, in der vorsichtigen Auswahl von Split-Variablen bei der Erstellung von Klassifikationsbäumen verwendet, die auf Methoden der Imprecise Probabilities fußen. Diese Bäume zeichnen sich dadurch aus, dass sie sowohl eine Struktur beschreiben, als auch eine annehmbar hohe Prädiktionsgüte aufweisen. In Abhängigkeit von der Interpretation der Unschärfe, werden dann verschiedene Strategien für den Umgang mit unscharfen Daten im Rahmen von finiten Random Sets erörtert. Einerseits werden die zu analysierenden Daten als mengenwertige Antwort auf eine Frage in einer Fragebogen aufgefasst. Hierbei wird jede mögliche (multiple) Antwort, die eine Teilmenge des Stichprobenraumes darstellt, als eigenständige Entität betrachtet. Somit werden die finiten Random Sets auf (gewöhnliche) Zufallsvariablen reduziert, die nun in einen transformierten Raum abbilden. Im Rahmen einer Analyse von Wahlabsichten hat der vorgeschlagene Ansatz gezeigt, dass die Unentschlossenen mit ihm genauer charakterisiert werden können, als es mit den gängigen Methoden möglich ist. Obwohl die vorgestellte Analyse, betrachtet als ein erster Schritt, auf mengenwertige Daten angewendet wird, die vor dem Hintergrund der wissenschaftlichen Forschungsfrage in geeigneter Weise selbst konstruiert worden sind, zeigt diese dennoch klar, dass die Möglichkeiten dieses generellen Ansatzes nicht ausgeschöpft sind, so dass er auch in komplexeren Situationen angewendet werden kann. Andererseits werden unscharfe Daten durch eine mengenwertige Einfachimputation (imprecise imputation) erzeugt. Hier werden die finiten Random Sets als Ergebnis einer (unspezifizierten) Vergröberung interpretiert. Der Ansatz wird im Rahmen des Statistischen Matchings vorgeschlagen, das verwendet wird, um gemeinsame Informationen über ursprünglich nicht zusammen erhobene Merkmale zur erhalten. Dieses ist insbesondere relevant bei der Datenproduktion, beispielsweise in der amtlichen Statistik, weil es erlaubt, die verschiedenartigen Informationen aus unterschiedlichen bereits vorhandenen Datensätzen zu einen neuen Datensatz zu verschmelzen, ohne dass dafür tatsächlich Daten neu erhoben werden müssen. Zudem müssen die Daten für den Datenaustausch in geeigneter Weise anonymisiert sein. Für die spezielle Klasse der Anonymisierungstechnik der Mikroaggregation wird ihre Eignung im Hinblick auf die Verwendbarkeit in generalisierten linearen Regressionsmodellen geprüft. Hierfür werden die mikroaggregierten Daten als eine Menge von möglichen, unbeobachtbaren zu Grunde liegenden Datensituationen aufgefasst. Es werden zwei Herangehensweisen präsentiert: Als Erstes wird eine maximax-ähnliche Optimisierungsstrategie verfolgt, dabei werden die zu Grunde liegenden unbeobachtbaren Daten als Nuisance Parameter in das Regressionsmodell aufgenommen, was eine enge, aber auch über-optimistische Schätzung der Regressionskoeffizienten liefert. Zweitens wird ein Ansatz im Sinne der partiellen Identifikation angewendet, der per se schon vorsichtiger ist (als der vorherige), indem er nur die Menge aller möglichen Regressionskoeffizienten schätzt, die erhalten werden können, wenn die Schätzung auf jeder zu Grunde liegenden Datensituation durchgeführt wird. Unscharfe Daten haben gegenüber präzisen Daten den Vorteil, dass sie zusätzlich die Unsicherheit der einzelnen Beobachtungseinheit umfassen. Damit besitzen sie einen höheren Informationsgehalt. Allerdings gibt es zur Zeit nur wenige glaubwürdige statistische Modelle, die mit unscharfen Daten umgehen können. Von daher wird die Erhebung solcher Daten bei der Datenproduktion vernachlässigt, was dazu führt, dass entsprechende statistische Modelle ihr volles Potential nicht ausschöpfen können. Dies verhindert eine vollumfängliche Bewertung, wodurch wiederum die (Weiter-)Entwicklung jener Modelle gehemmt wird. Dies ist eine Variante des Henne-Ei-Problems. Diese Schrift will durch Vorschlag konkreter Methoden hinsichtlich des Umgangs mit unscharfen Daten in relevanten Anwendungssituationen Lösungswege aus der beschriebenen Situation aufzeigen und damit die entsprechende Datenproduktion anregen

    Distributional Random Forests: Heterogeneity Adjustment and Multivariate Distributional Regression

    Full text link
    Random Forests (Breiman, 2001) is a successful and widely used regression and classification algorithm. Part of its appeal and reason for its versatility is its (implicit) construction of a kernel-type weighting function on training data, which can also be used for targets other than the original mean estimation. We propose a novel forest construction for multivariate responses based on their joint conditional distribution, independent of the estimation target and the data model. It uses a new splitting criterion based on the MMD distributional metric, which is suitable for detecting heterogeneity in multivariate distributions. The induced weights define an estimate of the full conditional distribution, which in turn can be used for arbitrary and potentially complicated targets of interest. The method is very versatile and convenient to use, as we illustrate on a wide range of examples. The code is available as Python and R packages drf
    corecore