
    Multiple costs and their combination in cost sensitive learning

    University of Technology, Sydney. Faculty of Information Technology.
    Cost-sensitive learning was originally defined as the problem of minimizing the cost of classification errors, and it has attracted much attention in recent years. Cost sensitivity makes it possible to handle the imbalance among misclassification errors that arises in many real-world applications. Recently, researchers have considered models involving two or more costs, for example both misclassification costs (the cost of classification errors) and attribute test costs (the cost incurred in obtaining an attribute's value) [Tur95, GGR02, LYWZ04]. Cost-sensitive learning involving both attribute test costs and misclassification costs is called test cost sensitive learning; it is closer to real industrial concerns such as medical research and business decision making. Current test cost sensitive learning aims to find an optimal diagnostic policy (simply, a policy) with minimal expected sum of misclassification cost and test cost; a policy specifies, for example, which attribute test to perform next given the outcomes of previous tests, and when the algorithm stops (by choosing to classify). A diagnostic policy takes the form of a decision tree whose internal nodes specify tests and whose leaves specify classification actions. A challenging issue is choosing a reasonable policy from all possible ones. This dissertation argues for considering test cost and misclassification cost, or even more costs, together, but questions whether the current approach of summing the two costs is the only correct one. Detailed studies are needed to ensure that a combination makes sense and is "correct", dimensionally as well as semantically. This dissertation studies fundamental properties of the costs involved, including several essential properties of attribute test cost, and designs new models for combining the costs.
In our learning problem definition, test cost is traded against misclassification cost by choosing and performing the tests that lead to a better decision. Why are some tests chosen while others are not? Very often, only a subset of the attribute values is enough for making a decision, and the remaining attributes are left unknown. These values are defined as 'absent values', since they are left unknown deliberately for rational reasons: the information already obtained is considered sufficient, a patient cannot afford further tests, and so on. This is the first work to exploit the information hidden in such absent values in cost-sensitive learning, and the conclusion is positive: absent data is useful for decision making. Absent values are usually treated as 'missing values', which are unknown for unexpected reasons; this thesis studies the difference between 'absent' and 'missing'. An algorithm based on lazy decision trees is proposed to distinguish absent data from missing data, and a novel strategy is proposed to help impute the genuinely missing values. Two novel test cost sensitive models are designed for different real-world scenarios. The first is a general test cost sensitive learning framework with multiple cost scales. Previous work assumes that the test cost and the misclassification cost are defined on the same scale, such as the dollar cost incurred in a medical diagnosis, and aims to minimize their sum. However, costs may be measured in very different units, and it can be difficult to define the multiple costs on a common scale. This is not only a technical issue but also a social one: in medical diagnosis, how much money should be assigned to a misclassification? A misclassification may cost a patient's life, and from a social point of view, life is invaluable.
To tackle this issue, a target-resource budget learning framework with multiple costs is proposed. Within this framework, we present a test cost sensitive decision tree model with two kinds of cost scales: the task is to minimize one cost scale, called the target cost, while keeping the other within specified budgets. To the best of our knowledge, this is the first attempt to study cost-sensitive learning with multiple cost scales. The second model is based on the assumption that some attribute values of an unlabeled example are known before it is classified. A test cost sensitive lazy tree model is proposed that uses this known information to reduce the overall cost. We also adapt this model to the batch-test problem, where multiple tests are chosen and performed in one shot rather than sequentially as in the test-sensitive tree; this is significant in diagnostic applications that require a decision as soon as possible, such as emergency treatment. Extensive experiments evaluate the proposed approaches and demonstrate that the work in this dissertation is efficient and useful for many diagnostic tasks involving target cost minimization and resource utilization for obtaining missing information.
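The target-resource idea above, minimizing one cost scale while keeping another within a budget, can be sketched as a simple feasibility-then-minimize selection over candidate policies. This is a minimal illustrative sketch, not the dissertation's actual model; the policy structure, names, and cost numbers are all assumptions.

```python
# Sketch of two-scale policy selection: minimize the target cost
# (e.g. expected misclassification cost) subject to a budget on the
# resource cost (e.g. dollars spent on tests). Illustrative only.

def choose_policy(policies, budget):
    """Return the policy with the lowest target cost whose resource
    cost stays within the budget, or None if none is feasible."""
    feasible = [p for p in policies if p["resource_cost"] <= budget]
    if not feasible:
        return None
    return min(feasible, key=lambda p: p["target_cost"])

# Candidate diagnostic policies (hypothetical numbers).
policies = [
    {"name": "no-tests",  "target_cost": 40.0, "resource_cost": 0.0},
    {"name": "cheap-lab", "target_cost": 15.0, "resource_cost": 30.0},
    {"name": "full-scan", "target_cost": 5.0,  "resource_cost": 120.0},
]

best = choose_policy(policies, budget=50.0)
print(best["name"])  # cheap-lab
```

With a larger budget the cheaper-error "full-scan" policy becomes feasible and is chosen instead, which is the intended behavior of the budget constraint.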

    Cost-Sensitive Classification: Empirical Evaluation of a Hybrid Genetic Decision Tree Induction Algorithm

    This paper introduces ICET, a new algorithm for cost-sensitive classification. ICET uses a genetic algorithm to evolve a population of biases for a decision tree induction algorithm. The fitness function of the genetic algorithm is the average cost of classification when using the decision tree, including both the costs of tests (features, measurements) and the costs of classification errors. ICET is compared here with three other algorithms for cost-sensitive classification - EG2, CS-ID3, and IDX - and also with C4.5, which classifies without regard to cost. The five algorithms are evaluated empirically on five real-world medical datasets. Three sets of experiments are performed. The first set examines the baseline performance of the five algorithms on the five datasets and establishes that ICET performs significantly better than its competitors. The second set tests the robustness of ICET under a variety of conditions and shows that ICET maintains its advantage. The third set looks at ICET's search in bias space and discovers a way to improve the search.
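The fitness function described above, the average total cost of classifying a dataset with a given tree, counting both the tests the tree actually performs and any misclassification penalty, can be sketched as follows. The function and toy tree here are illustrative assumptions, not ICET's actual implementation.

```python
# Sketch of an ICET-style fitness: average per-example cost of a tree,
# summing the costs of the tests used plus a penalty for errors.

def average_cost(tree_classify, examples, test_costs, misclass_cost):
    """tree_classify(x) -> (predicted_label, list_of_tests_used)."""
    total = 0.0
    for x, true_label in examples:
        pred, tests_used = tree_classify(x)
        total += sum(test_costs[t] for t in tests_used)  # test costs paid
        if pred != true_label:
            total += misclass_cost                       # error penalty
    return total / len(examples)

# A toy one-node "tree": performs test t1, predicts from its sign.
def toy_tree(x):
    return ("pos" if x["t1"] > 0 else "neg"), ["t1"]

examples = [({"t1": 1}, "pos"), ({"t1": -1}, "pos"), ({"t1": -2}, "neg")]
fitness = average_cost(toy_tree, examples, {"t1": 2.0}, misclass_cost=10.0)
# Costs per example: 2, 2 + 10 (one error), 2  ->  average 16/3
```

In ICET this quantity would be evaluated for each individual in the genetic population, so cheaper trees (fewer or less expensive tests, fewer errors) have higher fitness.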

    A survey of cost-sensitive decision tree induction algorithms

    The past decade has seen significant interest in the problem of inducing decision trees that take account of both the costs of misclassification and the costs of acquiring the features used for decision making. This survey identifies over 50 algorithms, including approaches that are direct adaptations of accuracy-based methods, approaches that use genetic algorithms or anytime methods, and approaches that utilize boosting and bagging. The survey brings together these different studies and novel approaches to cost-sensitive decision tree learning, provides a useful taxonomy and a historical timeline of how the field has developed, and should serve as a useful reference point for future research in this field.

    Machine Learning and Integrative Analysis of Biomedical Big Data.

    Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) is analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges as well as exacerbating those associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: the curse of dimensionality, data heterogeneity, missing data, class imbalance, and scalability issues.

    Time Series Cluster Kernel for Learning Similarities between Multivariate Time Series with Missing Data

    Similarity-based approaches represent a promising direction for time series analysis. However, many such methods rely on parameter tuning, and some have shortcomings if the time series are multivariate (MTS), due to dependencies between attributes, or if the time series contain missing data. In this paper, we address these challenges within the powerful context of kernel methods by proposing the robust \emph{time series cluster kernel} (TCK). The approach leverages the missing data handling properties of Gaussian mixture models (GMM) augmented with informative prior distributions. An ensemble learning approach is exploited to ensure robustness to parameters by combining the clustering results of many GMMs to form the final kernel. We evaluate the TCK on synthetic and real data and compare it to other state-of-the-art techniques. The experimental results demonstrate that the TCK is robust to parameter choices, provides competitive results for MTS without missing data, and gives outstanding results in the presence of missing data.
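The ensemble construction behind TCK, running many clusterings with varied parameters and counting how often two series land in the same cluster, can be sketched with a deliberately simplified stand-in for each ensemble member. The real TCK fits GMMs with informative priors and handles missing data; here a toy random-threshold clustering on series means plays the role of one randomly parameterized member, so everything below is an illustrative assumption.

```python
# Sketch of the ensemble-kernel idea: K[i][j] is the fraction of ensemble
# members that place series i and j in the same cluster.
import random

def ensemble_kernel(series, n_members=30, seed=0):
    rng = random.Random(seed)
    n = len(series)
    means = [sum(s) / len(s) for s in series]
    K = [[0.0] * n for _ in range(n)]
    lo, hi = min(means), max(means)
    for _ in range(n_members):
        # Each member clusters by a randomly drawn threshold -- a toy
        # stand-in for fitting one GMM with sampled hyperparameters.
        thr = rng.uniform(lo, hi)
        labels = [int(m > thr) for m in means]
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    K[i][j] += 1.0 / n_members
    return K

series = [[0.1, 0.2], [0.0, 0.3], [5.0, 5.2], [4.9, 5.1]]
K = ensemble_kernel(series)
# Similar series get kernel values near 1, dissimilar pairs near 0.
```

The resulting matrix is symmetric and positive on the diagonal by construction; the averaging over many members is what gives the kernel its robustness to any single member's parameter choice.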