
    A System for Induction of Oblique Decision Trees

    This article describes a new system for induction of oblique decision trees. This system, OC1, combines deterministic hill-climbing with two forms of randomization to find a good oblique split (in the form of a hyperplane) at each node of a decision tree. Oblique decision tree methods are tuned especially for domains in which the attributes are numeric, although they can be adapted to symbolic or mixed symbolic/numeric attributes. We present extensive empirical studies, using both real and artificial data, that analyze OC1's ability to construct oblique trees that are smaller and more accurate than their axis-parallel counterparts. We also examine the benefits of randomization for the construction of oblique decision trees. Comment: See http://www.jair.org/ for an online appendix and other files accompanying this article.
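
    The following is a minimal sketch, not OC1's actual code, of the kind of per-node search the abstract describes: coordinate-wise hill-climbing on the coefficients of a splitting hyperplane, combined with random restarts. Impurity here is the weighted Gini index; OC1 itself supports several impurity measures and a more structured perturbation scheme, and all names below are illustrative.

```python
# Illustrative sketch only: randomized hill-climbing for one oblique split.
import numpy as np

def weighted_gini(y, mask):
    """Size-weighted Gini impurity of the two sides of a split."""
    total = 0.0
    for side in (y[mask], y[~mask]):
        if side.size:
            _, counts = np.unique(side, return_counts=True)
            p = counts / side.size
            total += side.size / y.size * (1.0 - np.sum(p ** 2))
    return total

def oblique_split_hill_climb(X, y, restarts=5, steps=200, seed=0):
    """Search for w minimizing impurity of the split X @ w[:-1] + w[-1] > 0."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    best_w, best_imp = None, np.inf
    for _ in range(restarts):                       # randomization: restarts
        w = rng.normal(size=d + 1)
        imp = weighted_gini(y, X @ w[:-1] + w[-1] > 0)
        for _ in range(steps):                      # hill-climb one coefficient
            cand = w.copy()
            cand[rng.integers(d + 1)] += rng.normal(scale=0.5)
            cand_imp = weighted_gini(y, X @ cand[:-1] + cand[-1] > 0)
            if cand_imp < imp:                      # keep improving moves only
                w, imp = cand, cand_imp
        if imp < best_imp:
            best_w, best_imp = w, imp
    return best_w, best_imp
```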

    Differential Evolution Algorithm in the Construction of Interpretable Classification Models

    In this chapter, the application of a differential evolution-based approach to induce oblique decision trees (DTs) is described. This type of decision tree uses a linear combination of attributes to build oblique hyperplanes dividing the instance space. Oblique decision trees are more compact and accurate than traditional univariate decision trees. Since differential evolution (DE) is an efficient evolutionary algorithm (EA) designed to solve optimization problems with real-valued parameters, and since finding an optimal hyperplane is a hard computational task, this metaheuristic (MH) is chosen to conduct an intelligent search for a near-optimal solution. Two methods are described in this chapter: one implements a recursive partitioning strategy to find the most suitable oblique hyperplane at each internal node of a decision tree, and the other conducts a global search for a near-optimal oblique decision tree. A statistical analysis of the experimental results suggests that these methods perform better as decision tree induction procedures than other supervised learning approaches.
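
    As a hedged illustration of the per-node variant, the sketch below uses SciPy's general-purpose differential_evolution optimizer to search for hyperplane coefficients minimizing split impurity; the chapter's own DE implementation and objective will differ. The bounds assume roughly standardized features.

```python
# Illustrative sketch only: DE over hyperplane coefficients for one split.
import numpy as np
from scipy.optimize import differential_evolution

def weighted_gini(y, mask):
    """Size-weighted Gini impurity of the two sides of a split."""
    total = 0.0
    for side in (y[mask], y[~mask]):
        if side.size:
            _, counts = np.unique(side, return_counts=True)
            p = counts / side.size
            total += side.size / y.size * (1.0 - np.sum(p ** 2))
    return total

def oblique_split_de(X, y, seed=0):
    d = X.shape[1]
    impurity = lambda w: weighted_gini(y, X @ w[:-1] + w[-1] > 0)
    # Hyperplanes are scale-invariant; [-1, 1] bounds assume standardized X.
    bounds = [(-1.0, 1.0)] * (d + 1)
    res = differential_evolution(impurity, bounds, seed=seed, maxiter=200)
    return res.x, res.fun                           # best hyperplane, impurity
```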

    Cluster Analysis of Panel Datasets using Non-Standard Optimisation of Information Criteria

    Panel datasets have been increasingly used in economics to analyse complex economic phenomena. One of the attractions of panel datasets is the ability to use an extended dataset to obtain information about parameters of interest which are assumed to have common values across panel units. However, the assumption of poolability has not been studied extensively beyond tests that determine whether a given dataset is poolable. We propose an information criterion method that enables the separation of a set of series into a set of poolable series, for which the hypothesis of a common parameter subvector cannot be rejected, and a set of series for which the poolability hypothesis fails. The method can be extended to analyse datasets with multiple clusters of series with similar characteristics. We discuss the theoretical properties of the method and investigate its small-sample performance in a Monte Carlo study. Keywords: panel datasets, poolability, information criteria, genetic algorithm, simulated annealing.
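
    A toy sketch of the mechanics, not the paper's criterion or algorithm: simulated annealing searches over binary poolable/non-poolable assignments of panel units, scoring each assignment with a BIC-type information criterion in which pooled units share a single mean. The exact form of the criterion and the move set are illustrative assumptions.

```python
# Toy sketch only: SA over poolable/non-poolable assignments of panel units.
import numpy as np

def partition_bic(series, pooled):
    """BIC-type score: pooled units share one mean, others get their own."""
    bic = 0.0
    if pooled.any():
        data = np.concatenate([s for s, p in zip(series, pooled) if p])
        rss, n = np.sum((data - data.mean()) ** 2), data.size
        bic += n * np.log(rss / n) + np.log(n)      # one shared parameter
    for s, p in zip(series, pooled):
        if not p:                                   # one parameter per unit
            rss = np.sum((s - s.mean()) ** 2)
            bic += s.size * np.log(rss / s.size) + np.log(s.size)
    return bic

def anneal_poolability(series, iters=2000, t0=1.0, seed=0):
    rng = np.random.default_rng(seed)
    pooled = np.ones(len(series), dtype=bool)       # start fully pooled
    cur = partition_bic(series, pooled)
    for k in range(iters):
        temp = t0 * (1.0 - k / iters) + 1e-9        # linear cooling schedule
        cand = pooled.copy()
        cand[rng.integers(len(series))] ^= True     # flip one unit's status
        val = partition_bic(series, cand)
        if val < cur or rng.random() < np.exp((cur - val) / temp):
            pooled, cur = cand, val                 # accept (maybe uphill) move
    return pooled, cur
```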

    evtree: Evolutionary Learning of Globally Optimal Classification and Regression Trees in R

    Commonly used classification and regression tree methods like the CART algorithm are recursive partitioning methods that build the model in a forward stepwise search. Although this approach is known to be an efficient heuristic, the results of recursive tree methods are only locally optimal, as splits are chosen to maximize homogeneity at the next step only. An alternative way to search over the parameter space of trees is to use global optimization methods like evolutionary algorithms. This paper describes the "evtree" package, which implements an evolutionary algorithm for learning globally optimal classification and regression trees in R. Computationally intensive tasks are fully computed in C++, while the "partykit" package (Hothorn and Zeileis 2011) is leveraged for representing the resulting trees in R, providing unified infrastructure for summaries, visualizations, and predictions. "evtree" is compared to "rpart" (Therneau and Atkinson 1997), the open-source CART implementation, and to conditional inference trees ("ctree"; Hothorn, Hornik, and Zeileis 2006). The usefulness of "evtree" is illustrated in a textbook customer classification task and in a benchmark study of predictive accuracy, in which "evtree" achieved results at least similar to, and most of the time better than, the recursive algorithms "rpart" and "ctree". Keywords: machine learning, classification trees, regression trees, evolutionary algorithms, R.
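
    The sketch below illustrates the global-search idea in Python rather than R (evtree itself is driven from R with the heavy lifting in C++): a simple evolutionary loop mutates whole fixed-shape, depth-2 trees and keeps the fittest, instead of choosing splits greedily one level at a time. Unlike evtree, it uses plain training accuracy with no complexity penalty and no crossover, and it assumes binary 0/1 labels.

```python
# Toy sketch only: evolve fixed-shape depth-2 trees by mutation + selection.
import numpy as np

def leaf_ids(A, tree):
    """Leaf index (0..3) reached by each row of A in a depth-2 tree."""
    f0, t0, f1, t1, f2, t2 = tree
    return np.where(A[:, f0] <= t0,
                    np.where(A[:, f1] <= t1, 0, 1),
                    np.where(A[:, f2] <= t2, 2, 3))

def accuracy(tree, X, y):
    """Training accuracy; each leaf predicts its majority label (y in {0,1})."""
    ids = leaf_ids(X, tree)
    labels = np.array([np.bincount(y[ids == k], minlength=2).argmax()
                       for k in range(4)])
    return np.mean(labels[ids] == y)

def evolve_tree(X, y, pop_size=50, gens=100, seed=0):
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    lo, hi = X.min(axis=0), X.max(axis=0)

    def random_tree():
        fs = rng.integers(d, size=3)                # features for the 3 nodes
        ts = rng.uniform(lo[fs], hi[fs])            # thresholds in data range
        return (fs[0], ts[0], fs[1], ts[1], fs[2], ts[2])

    pop = [random_tree() for _ in range(pop_size)]
    for _ in range(gens):
        pop.sort(key=lambda t: accuracy(t, X, y), reverse=True)
        elite = pop[:pop_size // 2]                 # truncation selection
        children = []
        for parent in elite:                        # mutate one node per child
            c = list(parent)
            i = rng.integers(3)
            c[2 * i] = rng.integers(d)
            c[2 * i + 1] = rng.uniform(lo[c[2 * i]], hi[c[2 * i]])
            children.append(tuple(c))
        pop = elite + children
    return max(pop, key=lambda t: accuracy(t, X, y))
```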

    Fisher’s decision tree

    Univariate decision trees are classifiers currently used in many data mining applications. This type of classifier discovers partitions of the input space via hyperplanes that are orthogonal to the attribute axes, producing a model that can be understood by human experts. One disadvantage of univariate decision trees is that they produce complex and inaccurate models when decision boundaries are not orthogonal to the axes. In this paper we introduce Fisher's tree, a classifier that takes advantage of the dimensionality reduction of Fisher's linear discriminant and uses the decomposition strategy of decision trees to produce an oblique decision tree. Our proposal generates an artificial attribute that is used to split the data in a recursive way. Fisher's decision tree induces oblique trees whose accuracy, size, number of leaves and training time are competitive with respect to other decision trees reported in the literature. We use more than ten publicly available datasets to demonstrate the effectiveness of our method.
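
    A minimal sketch of the core step described above, for the two-class case: compute Fisher's linear discriminant direction, project the data onto it to obtain the artificial attribute, and scan that one-dimensional attribute for the best threshold. The paper's full method wraps this in the usual recursive partitioning; the within-class scatter matrix is assumed nonsingular here.

```python
# Illustrative two-class sketch: FLD projection as an artificial attribute.
import numpy as np

def fisher_attribute(X, y):
    """Return the discriminant direction w and the projected attribute X @ w."""
    X0, X1 = X[y == 0], X[y == 1]
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    # Within-class scatter matrix (assumed nonsingular).
    Sw = np.cov(X0, rowvar=False) * (len(X0) - 1) \
       + np.cov(X1, rowvar=False) * (len(X1) - 1)
    w = np.linalg.solve(Sw, m1 - m0)                # w proportional to Sw^-1 (m1 - m0)
    return w, X @ w

def best_threshold(z, y):
    """Scan midpoints of the sorted projections for the lowest 0/1 error."""
    order = np.argsort(z)
    z, y = z[order], y[order]
    best_err, best_t = np.inf, None
    for t in (z[:-1] + z[1:]) / 2.0:
        # Try both orientations of the split before scoring it.
        err = min(np.mean((z <= t) != y), np.mean((z > t) != y))
        if err < best_err:
            best_err, best_t = err, t
    return best_t
```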

    SPODT: An R Package to Perform Spatial Partitioning

    Spatial cluster detection is a classical question in epidemiology: Are cases located near other cases? In order to classify a study area into zones of different risks and determine their boundaries, we have developed a spatial partitioning method based on oblique decision trees, called the spatial oblique decision tree (SpODT). This non-parametric method is based on the classification and regression tree (CART) approach introduced by Leo Breiman. Applied to epidemiological spatial data, the algorithm recursively searches among the coordinates for a threshold or a boundary between zones, so that the risks estimated in these zones are as different as possible. While the CART algorithm leads to rectangular zones, providing perpendicular splits of longitudes and latitudes, the SpODT algorithm provides oblique splitting of the study area, which is more appropriate and accurate for spatial epidemiology. Oblique decision trees can be considered as non-parametric regression models. Beyond the basic function, we have developed a set of functions that enable extended analyses of spatial data, providing inference, graphical representations, spatio-temporal analysis, adjustment for covariates, spatially weighted partitions, and the merging of similar adjacent final classes. In this paper, we propose a new R package, SPODT, which provides an extensible set of functions for partitioning spatial and spatio-temporal data. The implementation and extensions of the algorithm are described. Function usage examples are proposed, searching for clusters of malaria episodes in Bandiagara, Mali, and on samples showing three different cluster shapes.
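
    A skeletal sketch of the recursive driver such a partitioning needs, not the SPODT package's code: split the point set with an oblique line, recurse on each side, and stop when a zone is too small or the split gains too little. The split_fn argument is a placeholder; a concrete pairwise-direction search in this spirit is sketched after the next abstract.

```python
# Skeletal sketch only: recursive oblique partitioning of spatial points.
import numpy as np

def spatial_partition(coords, z, split_fn, min_size=20, min_gain=1e-3):
    """Recursively split (lon, lat) points into zones with differing mean risk z.

    split_fn(coords, z) must return (w, t, gain); the zone boundary is the
    oblique line coords @ w == t.
    """
    if len(z) < 2 * min_size:                       # zone too small to split
        return {"leaf": True, "risk": float(np.mean(z)), "n": len(z)}
    w, t, gain = split_fn(coords, z)
    if gain < min_gain:                             # split not worth making
        return {"leaf": True, "risk": float(np.mean(z)), "n": len(z)}
    side = coords @ w <= t
    return {"leaf": False, "w": w, "t": t,
            "left":  spatial_partition(coords[side], z[side], split_fn,
                                       min_size, min_gain),
            "right": spatial_partition(coords[~side], z[~side], split_fn,
                                       min_size, min_gain)}
```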

    Oblique decision trees for spatial pattern detection: optimal algorithm and application to malaria risk

    BACKGROUND: In order to detect potential disease clusters where a putative source cannot be specified, classical procedures scan the geographical area with circular windows through a specified grid imposed on the map. However, the choice of the windows' shapes, sizes and centers is critical, and different choices may not provide exactly the same results. The aim of our work was to use an oblique decision tree (ODT) model, which provides potential clusters without pre-specifying shapes, sizes or centers. For this purpose, we have developed an ODT algorithm to find an oblique partition of the space defined by the geographic coordinates. METHODS: ODT is based on the classification and regression tree (CART). Whereas CART finds rectangular partitions of the covariate space, ODT provides oblique partitions maximizing the interclass variance of the independent variable. Since this is an NP-hard problem in R^N, classical ODT algorithms use evolutionary procedures or heuristics. We have developed an optimal ODT algorithm in R^2, based on the directions defined by each pair of point locations. This partition provided potential clusters which can be tested with Monte-Carlo inference. We applied the ODT model to a dataset in order to identify potential high-risk clusters of malaria in a village in Western Africa during the dry season. The ODT results were compared with those of Kulldorff's SaTScan™. RESULTS: The ODT procedure provided four classes of risk of infection. In the first high-risk class, 60% (95% confidence interval, CI95% [52.22–67.55]) of the children were infected. Monte-Carlo inference showed that the spatial pattern obtained from the ODT model was significant (p < 0.0001). SaTScan results yielded one significant cluster with a high risk of disease and an infection rate of 54.21%, CI95% [47.51–60.75]. Its center was located within the first high-risk ODT class. Both procedures provided similar results, identifying a high-risk cluster in the western part of the village where a mosquito breeding point was located. CONCLUSION: ODT models improve the classical scanning procedures by detecting potential disease clusters independently of any specification of the shapes, sizes or centers of the clusters.
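
    A sketch of the exhaustive R^2 search the abstract describes, under the assumption that each pair of point locations defines the direction of a candidate splitting line (so projections are taken along its normal) and that thresholds lie midway between consecutive projected points. Each candidate is scored by the interclass variance of the outcome; the nested loops make this roughly O(n^3 log n), so it is illustrative only.

```python
# Illustrative sketch only: exhaustive pairwise-direction split search in R^2.
import numpy as np
from itertools import combinations

def interclass_variance(z, mask):
    """Between-zone variance of z for a two-zone split."""
    m, out = z.mean(), 0.0
    for g in (z[mask], z[~mask]):
        if g.size:
            out += g.size * (g.mean() - m) ** 2
    return out / z.size

def best_pairwise_split(coords, z):
    best = (None, None, -np.inf)                    # (w, t, gain)
    for i, j in combinations(range(len(coords)), 2):
        seg = coords[j] - coords[i]
        w = np.array([-seg[1], seg[0]], dtype=float)  # normal to the segment
        norm = np.linalg.norm(w)
        if norm == 0.0:                             # coincident points
            continue
        w = w / norm
        proj = np.sort(coords @ w)
        for t in (proj[:-1] + proj[1:]) / 2.0:      # thresholds between points
            gain = interclass_variance(z, coords @ w <= t)
            if gain > best[2]:
                best = (w, t, gain)
    return best
```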

    Oblique Decision Tree Algorithm with Minority Condensation for Class Imbalanced Problem

    In recent years, a significant issue in classification has been handling datasets that contain an imbalanced number of instances in each class. Classifier modification is one of the well-known techniques for dealing with this issue. In this paper, an effective classification model based on an oblique decision tree is enhanced to work with imbalanced datasets; it is called the oblique minority condensed decision tree (OMCT). Initially, it selects the best axis-parallel hyperplane based on a decision tree algorithm using the minority entropy of instances within the minority inner fence. Then it perturbs this hyperplane along each axis to improve its minority entropy. Finally, it stochastically perturbs this hyperplane to escape local optima. In the experimental results, OMCT significantly outperforms six state-of-the-art decision tree algorithms, namely CART, C4.5, OC1, AE, DCSM and ME, on 18 real-world datasets from UCI in terms of precision, recall and F1 score. Moreover, the decision trees produced by OMCT are significantly smaller than those of the other algorithms.
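
    A heavily hedged sketch of a minority-focused split score in the spirit of the abstract, not OMCT's actual definition: "inner fence" is assumed here to follow Tukey's boxplot convention (Q1 - 1.5 IQR, Q3 + 1.5 IQR) on the minority class's projected values, and the weighted entropy of a threshold split is computed after condensing the data to majority instances plus fenced minority instances.

```python
# Hedged sketch only; OMCT's actual definitions are given in the paper.
import numpy as np

def entropy(labels):
    if labels.size == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def minority_condensed_entropy(proj, y, minority_label, threshold):
    """Weighted entropy of a threshold split on projected values, after
    keeping majority instances plus minority instances inside Tukey's
    inner fences (an assumption, not OMCT's stated rule)."""
    q1, q3 = np.percentile(proj[y == minority_label], [25, 75])
    lo, hi = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
    keep = (y != minority_label) | ((proj >= lo) & (proj <= hi))
    p, labels = proj[keep], y[keep]
    left = p <= threshold
    return (left.sum() * entropy(labels[left]) +
            (~left).sum() * entropy(labels[~left])) / labels.size
```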