
    Hyper-heuristic decision tree induction

    A hyper-heuristic is any algorithm that searches or operates in the space of heuristics, as opposed to the space of solutions. Hyper-heuristics are increasingly used in function and combinatorial optimization. Rather than attempting to solve a problem using a fixed heuristic, a hyper-heuristic approach attempts to find a combination of heuristics that solves a problem (and may in turn be directly suitable for a class of problem instances). Hyper-heuristics have been little explored in data mining. This work presents novel hyper-heuristic approaches to data mining that search a space of attribute selection criteria for a decision-tree building algorithm. The search is conducted by a genetic algorithm. The result of the hyper-heuristic search in this case is a strategy for selecting attributes while building decision trees. Most hyper-heuristics work by trying to adapt the heuristic to the state of the problem being solved, and ours is no different: it employs a strategy for adapting the heuristic used to build decision tree nodes according to a set of features of the training set it is working on. We introduce, explore and evaluate five different ways in which this problem state can be represented for a hyper-heuristic that operates within a decision-tree building algorithm. In each case, the hyper-heuristic is guided by a rule set that tries to map features of the data set to be split by the decision-tree building algorithm to a heuristic to be used for splitting that data set. We also explore and evaluate three different sets of low-level heuristics that could be employed by such a hyper-heuristic. This work also makes a distinction between specialist and generalist hyper-heuristics; the main difference between the two is the number of training sets used by the hyper-heuristic genetic algorithm. Specialist hyper-heuristics are created using a single data set from a particular domain for evolving the hyper-heuristic rule set; such algorithms are expected to outperform standard algorithms on the kind of data set used by the hyper-heuristic genetic algorithm. Generalist hyper-heuristics are trained on multiple data sets from different domains and are expected to deliver robust and competitive performance over these data sets when compared to standard algorithms. We evaluate both approaches for each kind of hyper-heuristic presented in this thesis, using both real and synthetic data sets. Our results suggest that none of the hyper-heuristics presented in this work are suited for specialization: in most cases, the hyper-heuristic's performance on the data set it was specialized for was not significantly better than that of the best-performing standard algorithm. On the other hand, the generalist hyper-heuristics delivered results that were very competitive with the best standard methods, and in some cases even achieved a significantly better overall performance than all of the standard methods.
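    To make the idea concrete, here is a minimal Python sketch (not the thesis's actual implementation) of how an evolved rule set might map features of the data reaching a node to a low-level splitting heuristic. The rules, thresholds, and function names below are illustrative assumptions; NumPy is assumed.

```python
import numpy as np

def gini(y):
    """Gini impurity of a label vector."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(y):
    """Shannon entropy of a label vector."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def choose_heuristic(X, y):
    """Hypothetical evolved rule set: map simple features of the data
    reaching a node to the low-level heuristic used to split it."""
    n_samples, _ = X.shape
    n_classes = len(np.unique(y))
    if n_samples < 50:        # illustrative rule: small nodes -> Gini
        return gini
    if n_classes > 2:         # illustrative rule: multi-class -> entropy
        return entropy
    return gini               # default heuristic

def split_score(X, y, feature, threshold, heuristic):
    """Score a candidate split with whichever heuristic the rule set chose."""
    mask = X[:, feature] <= threshold
    left, right = y[mask], y[~mask]
    if len(left) == 0 or len(right) == 0:
        return 0.0
    n = len(y)
    return (heuristic(y)
            - (len(left) / n) * heuristic(left)
            - (len(right) / n) * heuristic(right))
```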

    Introduction in IND and recursive partitioning

    This manual describes the IND package for learning tree classifiers from data. The package is an integrated C and C shell re-implementation of tree learning routines such as CART, C4, and various MDL and Bayesian variations. The package includes routines for experiment control, interactive operation, and analysis of tree building. The manual introduces the system and its many options, gives a basic review of tree learning, contains a guide to the literature and a glossary, lists the manual pages for the routines, and gives instructions on installation.

    Winsorize tree algorithm for handling outliers in classification problem

    Classification and Regression Tree (CART) is designed to predict or classify objects into predetermined classes from a set of predictors. However, outliers can affect the structure of the CART, its purity, and its predictive accuracy in classification. Some researchers opt to perform pre-pruning or post-pruning of the CART to handle outliers. This study proposes a modified classification tree algorithm, called the Winsorize tree, based on the distribution of classes in the training dataset. The Winsorize tree investigates all possible outliers from node to node before checking each potential splitting point, so as to obtain nodes with the highest purity. The upper and lower fences of a boxplot are used to detect potential outliers, namely values falling outside Q1 − (1.5 × interquartile range) and Q3 + (1.5 × interquartile range). The identified outliers are neutralized using the Winsorize method, and the Winsorize Gini index is then used to compute the divergences among probability distributions of the target predictor's values until stopping criteria are met. This study uses three stopping rules: a node reaches the minimum of 10% of the total training set
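    A brief sketch of the fence-and-winsorize step described above, using NumPy; the function names and the way winsorized values feed into a Gini-based split score are assumptions, not the paper's exact procedure.

```python
import numpy as np

def winsorize_feature(x):
    """Neutralize potential outliers using boxplot fences: values outside
    [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are clipped to the fences."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return np.clip(x, lower, upper)

def gini(y):
    """Gini impurity of a label vector."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def winsorize_gini_split(x, y, threshold):
    """Hypothetical split evaluation: winsorize the node's values before
    scoring the candidate splitting point with the Gini index."""
    xw = winsorize_feature(x)
    left, right = y[xw <= threshold], y[xw > threshold]
    if len(left) == 0 or len(right) == 0:
        return 0.0
    n = len(y)
    return gini(y) - (len(left) / n) * gini(left) - (len(right) / n) * gini(right)
```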

    Introduction to IND and recursive partitioning, version 1.0

    This manual describes the IND package for learning tree classifiers from data. The package is an integrated C and C shell re-implementation of tree learning routines such as CART, C4, and various MDL and Bayesian variations. The package includes routines for experiment control, interactive operation, and analysis of tree building. The manual introduces the system and its many options, gives a basic review of tree learning, contains a guide to the literature and a glossary, lists the manual pages for the routines, and gives instructions on installation.

    Random Forests: An Application to Tumour Classification

    In this thesis, machine learning approaches, namely decision trees and random forests, are discussed. A mathematical foundation of decision trees is given, followed by a discussion of their advantages and disadvantages. Further, the application of decision trees as part of random forests is presented. A real-life study of brain tumours using random forests is then discussed. The data consist of six different types of brain tumours and were acquired by Raman spectroscopy. After the data have been curated, a random forest model is used to classify the tumour type. At present, the results seem promising but require further experimentation.
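    As a hedged illustration of the modelling step (the Raman data itself is not public), a six-class random forest might be fit as below with scikit-learn; the array shapes and labels are placeholders standing in for the curated spectra.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Placeholder for curated Raman spectra: rows are spectra, columns are
# wavenumber bins; labels stand in for the six tumour types.
rng = np.random.default_rng(0)
X = rng.random((300, 500))
y = rng.integers(0, 6, size=300)

forest = RandomForestClassifier(n_estimators=500, random_state=0)
scores = cross_val_score(forest, X, y, cv=5)
print(f"mean cross-validated accuracy: {scores.mean():.3f}")
```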

    Customer retention

    A research report submitted to the Faculty of Engineering and the Built Environment, University of the Witwatersrand, Johannesburg, in partial fulfillment of the requirements for the degree of Master of Science in Engineering. Johannesburg, May 2018.
    The aim of this study is to model the probability of a customer attriting/defecting from a bank where, for example, the bank is not their preferred/primary bank for salary deposits. The termination of deposit inflow serves as the outcome parameter, and the random forest modelling technique was used to predict this outcome, with new data sources (transactional data) explored to add predictive power. The conventional logistic regression modelling technique was used to benchmark the random forest's results. It was found that the random forest model slightly overfitted during training and lost some predictive power on validation data and on data outside the training period. The random forest model nevertheless remains predictive and performs better than logistic regression at a cut-off probability of 20%.
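    A rough sketch of the benchmark described above, assuming scikit-learn; the features, class balance, and sample sizes are placeholders, but the 20% cut-off mirrors the report.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score

# Stand-in for the bank's transactional features and attrition outcome
# (termination of deposit inflow); the real data is not public.
rng = np.random.default_rng(0)
X = rng.random((5000, 20))
y = (rng.random(5000) < 0.1).astype(int)   # ~10% attrition rate assumed

X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=0)

rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)
lr = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Classify as attrition when the predicted probability exceeds the 20% cut-off.
for name, model in [("random forest", rf), ("logistic regression", lr)]:
    p = model.predict_proba(X_va)[:, 1]
    pred = (p >= 0.20).astype(int)
    print(name,
          "recall:", recall_score(y_va, pred, zero_division=0),
          "precision:", precision_score(y_va, pred, zero_division=0))
```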

    PIDT: A Novel Decision Tree Algorithm Based on Parameterised Impurities and Statistical Pruning Approaches

    In the process of constructing a decision tree, the criteria for selecting the splitting attributes influence the performance of the model produced by the decision tree algorithm. The most well-known criteria, such as Shannon entropy and the Gini index, suffer from a lack of adaptability to the dataset. This paper presents novel splitting-attribute selection criteria based on families of parameterised impurities that we propose for use in the construction of optimal decision trees. These criteria rely on families of strictly concave functions that define new generalised parameterised impurity measures, which we applied in devising and implementing our novel PIDT decision tree algorithm. This paper also proposes the S-condition, based on statistical permutation tests, whose purpose is to ensure that the reduction in impurity, or gain, for the selected attribute is statistically significant. We implemented the S-pruning procedure, based on the S-condition, to prevent model overfitting. These methods were evaluated on a number of simulated and benchmark datasets. Experimental results suggest that by tuning the parameters of the impurity measures and by using our S-pruning method, we obtain better decision tree classifiers with the PIDT algorithm.
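    The paper's exact impurity families are not reproduced here, but the Tsallis family below is one standard parameterised, strictly concave generalisation that recovers the Gini index (q = 2) and approaches Shannon entropy (q → 1); the permutation-test gate is a sketch in the spirit of the S-condition, not its published form.

```python
import numpy as np

def tsallis_impurity(y, q=2.0):
    """Illustrative parameterised, strictly concave impurity family:
    q = 2 recovers the Gini index; q -> 1 approaches Shannon entropy."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    if np.isclose(q, 1.0):
        return -np.sum(p * np.log(p))
    return (1.0 - np.sum(p ** q)) / (q - 1.0)

def gain(x, y, threshold, q):
    """Impurity reduction for a candidate split at the given threshold."""
    left, right = y[x <= threshold], y[x > threshold]
    if len(left) == 0 or len(right) == 0:
        return 0.0
    n = len(y)
    return (tsallis_impurity(y, q)
            - (len(left) / n) * tsallis_impurity(left, q)
            - (len(right) / n) * tsallis_impurity(right, q))

def s_condition(x, y, threshold, q, n_perm=1000, alpha=0.05, rng=None):
    """Permutation-test gate (sketch): accept the split only if its gain is
    statistically significant against random label permutations."""
    if rng is None:
        rng = np.random.default_rng(0)
    observed = gain(x, y, threshold, q)
    null = np.array([gain(x, rng.permutation(y), threshold, q)
                     for _ in range(n_perm)])
    p_value = (np.sum(null >= observed) + 1) / (n_perm + 1)
    return p_value < alpha
```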

    Multivariate classification with random forests for gravitational wave searches of black hole binary coalescence

    Searches for gravitational waves produced by coalescing black hole binaries with total masses ≳25 M⊙ use matched filtering with templates of short duration. Non-Gaussian noise bursts in gravitational wave detector data can mimic short signals and limit the sensitivity of these searches. Previous searches have relied on empirically designed statistics incorporating signal-to-noise ratio and signal-based vetoes to separate gravitational wave candidates from noise candidates. We report on sensitivity improvements achieved using a multivariate candidate ranking statistic derived from a supervised machine learning algorithm. We apply the random forest of bagged decision trees technique to two separate searches in the high mass (≳25 M⊙) parameter space. For a search sensitive to gravitational waves from the inspiral, merger, and ringdown of binary black holes with total mass between 25 M⊙ and 100 M⊙, we find sensitive volume improvements as high as (70 ± 13)% to (109 ± 11)% when compared to the previously used ranking statistic. For a ringdown-only search sensitive to gravitational waves from the resultant perturbed intermediate-mass black hole with mass roughly between 10 M⊙ and 600 M⊙, we find sensitive volume improvements as high as (61 ± 4)% to (241 ± 12)% when compared to the previously used ranking statistic. We also report how sensitivity improvements can differ depending on mass regime, mass ratio, and available data quality information. Finally, we describe the techniques used to tune and train the random forest classifier, which can be generalized to its use in other searches for gravitational waves.
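    As a schematic of the approach (the trigger data and features are not reproduced here), a random forest of bagged decision trees can be trained on simulated-signal and background candidates so that the predicted signal probability serves as the ranking statistic; shapes and feature counts below are placeholders, assuming scikit-learn.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Stand-in feature vectors for search candidates (e.g. signal-to-noise
# ratio and signal-based veto statistics); real trigger data is not shown.
rng = np.random.default_rng(0)
X_signal = rng.random((1000, 8))   # simulated-signal (injection) candidates
X_noise = rng.random((1000, 8))    # background-noise candidates

X = np.vstack([X_signal, X_noise])
y = np.concatenate([np.ones(1000), np.zeros(1000)])

# A random forest of bagged decision trees; the predicted signal
# probability is used as the multivariate candidate ranking statistic.
forest = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)
ranking_statistic = forest.predict_proba(X)[:, 1]
```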