
    A survey of cost-sensitive decision tree induction algorithms

    The past decade has seen significant interest in the problem of inducing decision trees that take account of the costs of misclassification and the costs of acquiring the features used for decision making. This survey identifies over 50 algorithms, including approaches that are direct adaptations of accuracy-based methods, use genetic algorithms, use anytime methods, and utilize boosting and bagging. The survey brings together these different studies and novel approaches to cost-sensitive decision tree learning, and provides a useful taxonomy, a historical timeline of how the field has developed, and a reference point for future research in this field.

    Integrating Learning from Examples into the Search for Diagnostic Policies

    This paper studies the problem of learning diagnostic policies from training examples. A diagnostic policy is a complete description of the decision-making actions of a diagnostician (i.e., tests followed by a diagnostic decision) for all possible combinations of test results. An optimal diagnostic policy is one that minimizes the expected total cost, which is the sum of measurement costs and misdiagnosis costs. In most diagnostic settings, there is a tradeoff between these two kinds of costs. This paper formalizes diagnostic decision making as a Markov Decision Process (MDP). The paper introduces a new family of systematic search algorithms based on the AO* algorithm to solve this MDP. To make AO* efficient, the paper describes an admissible heuristic that enables AO* to prune large parts of the search space. The paper also introduces several greedy algorithms, including some improvements over previously published methods. The paper then addresses the question of learning diagnostic policies from examples. When the probabilities of diseases and test results are computed from training data, there is a great danger of overfitting. To reduce overfitting, regularizers are integrated into the search algorithms. Finally, the paper compares the proposed methods on five benchmark diagnostic data sets. The studies show that in most cases the systematic search methods produce better diagnostic policies than the greedy methods. In addition, the studies show that for training sets of realistic size, the systematic search algorithms are practical on today's desktop computers.
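
    The expected total cost of a fixed diagnostic policy, as defined in this abstract, can be illustrated with a minimal Python sketch. It only evaluates a given policy tree; it is not the paper's AO* search, heuristic, or regularizers. The class names `Measure` and `Diagnose`, the function `expected_total_cost`, and all numbers are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Tuple, Union


@dataclass
class Diagnose:
    """Leaf: stop testing and commit to a diagnosis."""
    expected_misdiagnosis_cost: float


@dataclass
class Measure:
    """Internal node: pay for a test, then branch on its result."""
    cost: float
    # result -> (probability of that result, sub-policy to follow)
    branches: Dict[str, Tuple[float, "Policy"]]


Policy = Union[Diagnose, Measure]


def expected_total_cost(policy: Policy) -> float:
    """Expected measurement cost plus expected misdiagnosis cost."""
    if isinstance(policy, Diagnose):
        return policy.expected_misdiagnosis_cost
    return policy.cost + sum(
        prob * expected_total_cost(sub)
        for prob, sub in policy.branches.values()
    )


# Made-up numbers: one test costing 1.0 with two possible results.
policy = Measure(
    cost=1.0,
    branches={
        "positive": (0.3, Diagnose(expected_misdiagnosis_cost=2.0)),
        "negative": (0.7, Diagnose(expected_misdiagnosis_cost=0.5)),
    },
)
print(expected_total_cost(policy))
```

    With these assumed numbers the policy costs 1.0 + 0.3 × 2.0 + 0.7 × 0.5 = 1.95 in expectation, so it would be preferred over diagnosing immediately whenever the immediate diagnosis has an expected misdiagnosis cost above that value.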

    Cost-Sensitive Decision Tree with Multiple Resource Constraints

    Resource constraints are commonly found in classification tasks. For example, there could be a budget limit on implementation and a deadline for finishing the classification task. Applying the top-down approach for tree induction in this situation may have significant drawbacks. In particular, it is difficult, especially in an early stage of tree induction, to assess an attribute’s contribution to improving the total implementation cost and its impact on attribute selection in later stages because of the deadline constraint. To address this problem, we propose an innovative algorithm, namely, the Cost-Sensitive Associative Tree (CAT) algorithm. Essentially, the algorithm first extracts and retains association classification rules from the training data which satisfy resource constraints, and then uses the rules to construct the final decision tree. The approach has advantages over the traditional top-down approach, first because only feasible classification rules are considered in the tree induction and, second, because their costs and resource use are known. In contrast, in the top-down approach, this information is not available when selecting splitting attributes. The experimental results show that the CAT algorithm significantly outperforms the top-down approach and adapts very well to available resources. Keywords: cost-sensitive learning, mining methods and algorithms, decision trees
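
    The first step described here, retaining only rules that satisfy the resource constraints, might look roughly like the following sketch. It is not the CAT algorithm itself: the abstract does not say how a rule's cost or time is computed, so the sketch assumes both are additive over the attributes a rule tests. `Rule`, `feasible_rules`, and the example attributes and numbers are all hypothetical.

```python
from typing import Dict, FrozenSet, List, NamedTuple


class Rule(NamedTuple):
    attributes: FrozenSet[str]  # attributes the rule needs to test
    label: str                  # class predicted when the rule fires


def feasible_rules(
    rules: List[Rule],
    cost: Dict[str, float],
    time: Dict[str, float],
    budget: float,
    deadline: float,
) -> List[Rule]:
    """Keep rules whose summed cost and time fit the budget and deadline.

    Assumption: costs and times add up across a rule's attributes.
    """
    kept = []
    for rule in rules:
        total_cost = sum(cost[a] for a in rule.attributes)
        total_time = sum(time[a] for a in rule.attributes)
        if total_cost <= budget and total_time <= deadline:
            kept.append(rule)
    return kept


# Made-up attribute costs (money) and times (hours).
cost = {"blood": 20.0, "xray": 50.0, "mri": 400.0}
time = {"blood": 1.0, "xray": 2.0, "mri": 6.0}
rules = [
    Rule(frozenset({"blood"}), "healthy"),
    Rule(frozenset({"blood", "xray"}), "sick"),
    Rule(frozenset({"xray", "mri"}), "sick"),
]
kept = feasible_rules(rules, cost, time, budget=100.0, deadline=4.0)
print([sorted(r.attributes) for r in kept])
```

    Under these assumed values the rule that needs the MRI exceeds both limits and is dropped, so a tree built from the remaining rules uses only attributes that are already known to be affordable.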

    Population management of cone and seed insects in spruce seed orchards

    Seed orchards have been established in order to produce high-quality seeds for reforestation and forestation. However, seed production in spruce (Picea abies (L.) Karst.) seed orchards is severely hampered by cone- and seed-feeding insects. It is therefore of great importance to find methods to reduce damage from insects. This thesis summarizes and discusses results presented in four papers concerning various methods and chemicals (insecticides and a pheromone) for damage reduction in spruce seed orchards. Area-wide application of the biological insecticide Turex 50 WP was shown to reduce damage by two of the four most serious pest species. Concerns were then raised that feeding by insects that are not affected by this insecticide may increase following its application, in response to the consequent increases in the availability of food and space, resulting in little or no difference in overall damage. A follow-up study indicated that there would probably not be any problem with increased feeding by the surviving larvae and that spraying an insecticide that does not affect all species would probably be cost effective. However, various species-related and abiotic factors (e.g. rain and temperature) affect the efficacy of insecticide treatments, both among and within years, and thus should be taken into account. A system that would be less sensitive to weather, may affect all pest species, and at the same time avoids affecting the surrounding environment is injectable systemic insecticides. To increase cost efficiency, a study was performed in which insecticide was combined with the flower-stimulating hormone gibberellin; the combination successfully reduced damage and increased the number of flowers. Pheromone trapping is a useful tool for deciding if and when an insecticide application should be carried out, but it requires that a pheromone be available. During the spring of 2009 a pheromone for C. strobilella was identified and synthesized. The study showed that the amount of pheromone released from the female was extremely low, 1 pg, so the male antenna must be highly sensitive in order to find females. This also implies that this species can be a good candidate for mating disruption.

    Multiple costs and their combination in cost sensitive learning

    University of Technology, Sydney. Faculty of Information Technology. Cost sensitive learning was first defined as a procedure for minimizing the costs of classification errors. It has attracted much attention in the last few years. Being cost sensitive makes it possible to handle the imbalance among misclassification errors in some real-world applications. Recently, researchers have considered how to deal with two or more costs in a model, such as both misclassification costs (the costs of misclassification errors) and attribute test costs (the costs incurred in obtaining an attribute’s value) [Tur95, GGR02, LYWZ04]. Cost sensitive learning involving both attribute test costs and misclassification costs is called test cost sensitive learning, which is closer to real industry concerns such as medical research and business decisions. Current test cost sensitive learning aims to find an optimal diagnostic policy (simply, a policy) with minimal expected sum of the misclassification cost and the test cost; the policy specifies, for example, which attribute test is performed next based on the outcomes of previous attribute tests, and when the algorithm stops (by choosing to classify). A diagnostic policy takes the form of a decision tree whose nodes specify tests and whose leaves specify classification actions. A challenging issue is choosing a reasonable policy from all possible policies. This dissertation argues for considering the test cost and the misclassification cost, or even more costs, together, but questions whether the current approach of summing the two costs is the only right one. Detailed studies are needed to ensure that the ways of combining costs make sense and are “correct”, dimensionally as well as semantically. This dissertation studies fundamental properties of the costs involved and designs new models to combine the costs. Some essential properties of attribute test cost are studied. In our learning problem definition, test cost is combined into misclassification cost by choosing and performing proper tests for a better decision. Why are these tests chosen, and what about the ones that are not chosen? Very often, only some of the attribute values are needed to make a decision, and the remaining attributes are left as “unknown”. These values are defined as “absent values”, as they are left unknown purposely for some rational reason: when the information obtained is considered sufficient, when patients do not have enough money to perform further tests, and so on. This is the first work to utilize the information hidden in those “absent values” in cost sensitive learning, and the conclusion is very positive: “absent data” is useful for decision making. “Absent values” are usually treated as “missing values”, which are left unknown for unexpected reasons. This thesis studies the difference between “absent” and “missing”. An algorithm based on a lazy decision tree is proposed to distinguish absent data from missing data, and a novel strategy is proposed to help patch the “real” missing values. Two novel test cost sensitive models are designed for different real-world scenarios. The first model is a general test cost sensitive learning framework with multiple cost scales. Previous works assume that the test cost and the misclassification cost must be defined on the same cost scale, such as the dollar cost incurred in a medical diagnosis, and they aim to minimize the sum of the misclassification cost and the test cost. However, costs may be measured in very different units, and it may be difficult to define multiple costs on the same cost scale. This is not only a technological issue but also a social one. In medical diagnosis, how much money should be assigned to a misclassification cost? Sometimes a misclassification may endanger a patient’s life, and from a social point of view, life is invaluable. To tackle this issue, a target-resource budget learning framework with multiple costs is proposed. With this framework, we present a test cost sensitive decision tree model with two kinds of cost scales. The task is to minimize one cost scale, called the target cost, and keep the other within specified budgets. To the best of our knowledge, this is the first attempt to study cost sensitive learning with multiple cost scales. The second model is based on the assumption that some attributes of an unlabeled example are known before it is classified. A test cost sensitive lazy tree model is proposed to utilize the known information to reduce the overall cost. We also modify and apply this model to the batch-test problem, in which multiple tests are chosen and performed in one shot rather than sequentially as in the test-sensitive tree. This is significant in diagnosis applications that require a decision to be made as soon as possible, such as emergency treatment. Extensive experiments are conducted to evaluate the proposed approaches, and they demonstrate that the work in this dissertation is efficient and useful for many diagnostic tasks involving target cost minimization and resource utilization for obtaining missing information.
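
    The target-resource budget idea (minimize one cost scale while keeping the other within a budget, rather than summing the two) can be sketched as a simple selection among candidate policies. This is only an illustration of the objective, not the dissertation's decision tree model; `Candidate`, `best_within_budget`, and the numbers are hypothetical.

```python
from typing import Iterable, NamedTuple, Optional


class Candidate(NamedTuple):
    name: str
    target_cost: float    # the cost scale to be minimized
    resource_cost: float  # the cost scale that must stay within budget


def best_within_budget(
    candidates: Iterable[Candidate], budget: float
) -> Optional[Candidate]:
    """Return the feasible candidate with the lowest target cost.

    The two cost scales are never added together, so they may be
    measured in different units. Returns None if nothing fits.
    """
    feasible = [c for c in candidates if c.resource_cost <= budget]
    return min(feasible, key=lambda c: c.target_cost, default=None)


# Made-up candidates on two unrelated cost scales.
candidates = [
    Candidate("shallow tree", target_cost=12.0, resource_cost=5.0),
    Candidate("medium tree", target_cost=8.0, resource_cost=20.0),
    Candidate("deep tree", target_cost=6.0, resource_cost=60.0),
]
choice = best_within_budget(candidates, budget=25.0)
print(choice)
```

    With an assumed budget of 25.0, the deep tree is excluded despite its lower target cost, and the medium tree is selected; no exchange rate between the two scales is needed.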

    Multilevel Weighted Support Vector Machine for Classification on Healthcare Data with Missing Values

    This work is motivated by the needs of predictive analytics on healthcare data as represented by Electronic Medical Records. Such data is invariably problematic: noisy, with missing entries and with imbalance in the classes of interest, leading to serious bias in predictive modeling. Since standard data mining methods often produce poor performance measures, we argue for the development of specialized techniques for data preprocessing and classification. In this paper, we propose a new method to simultaneously classify large datasets and reduce the effects of missing values. It is based on a multilevel framework of the cost-sensitive SVM and the expectation maximization imputation method for missing values, which relies on iterated regression analyses. We compare classification results of multilevel SVM-based algorithms on public benchmark datasets with imbalanced classes and missing values, as well as on real data in health applications, and show that our multilevel SVM-based method produces fast, more accurate, and more robust classification results. Comment: arXiv admin note: substantial text overlap with arXiv:1503.0625