    Cost-Sensitive Decision Tree with Multiple Resource Constraints

    Resource constraints are commonly found in classification tasks. For example, there could be a budget limit on implementation and a deadline for finishing the classification task. Applying the top-down approach for tree induction in this situation may have significant drawbacks. In particular, because of the deadline constraint it is difficult, especially in an early stage of tree induction, to assess an attribute's contribution to improving the total implementation cost and its impact on attribute selection in later stages. To address this problem, we propose an innovative algorithm, namely, the Cost-Sensitive Associative Tree (CAT) algorithm. Essentially, the algorithm first extracts and retains association classification rules from the training data which satisfy the resource constraints, and then uses the rules to construct the final decision tree. The approach has advantages over the traditional top-down approach: first, only feasible classification rules are considered in the tree induction, and second, their costs and resource use are known. In contrast, in the top-down approach this information is not available when selecting splitting attributes. The experimental results show that the CAT algorithm significantly outperforms the top-down approach and adapts very well to available resources.

    Keywords: cost-sensitive learning, mining methods and algorithms, decision trees
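
    A minimal Python sketch of the rule-filtering stage described above; the Rule fields, the example rules and the budget values are hypothetical, and CAT's actual rule mining and tree-construction steps are not reproduced here:

        from dataclasses import dataclass

        @dataclass
        class Rule:
            conditions: frozenset  # attribute-value tests in the rule body
            label: str             # class the rule predicts
            cost: float            # known implementation cost of the tests
            time: float            # known time needed to evaluate the rule

        def feasible_rules(rules, cost_budget, deadline):
            # Keep only rules whose cost and resource use satisfy both
            # constraints; infeasible rules never enter tree construction.
            return [r for r in rules
                    if r.cost <= cost_budget and r.time <= deadline]

        # Hypothetical example: the second rule misses the deadline.
        rules = [
            Rule(frozenset({("income", "high")}), "approve", cost=3.0, time=1.0),
            Rule(frozenset({("history", "bad")}), "reject", cost=2.0, time=9.0),
        ]
        print(feasible_rules(rules, cost_budget=5.0, deadline=5.0))

    Because every retained rule carries a known cost and time, the tree builder can reason about resource use directly, which is exactly the information the top-down approach lacks.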

    A comparative analysis of decision trees vis-a-vis other computational data mining techniques in automotive insurance fraud detection

    The development and application of computational data mining techniques in financial fraud detection and business failure prediction has become a popular cross-disciplinary research area in recent times, involving financial economists, forensic accountants and computational modellers. Some of the computational techniques popularly used in the context of financial fraud detection and business failure prediction can also be effectively applied in the detection of fraudulent insurance claims and can therefore be of immense practical value to the insurance industry. We provide a comparative analysis of the prediction performance of a battery of data mining techniques using real-life automotive insurance fraud data. While the data we have used in our paper is US-based, the computational techniques we have tested can be adapted and generally applied to detect similar insurance frauds in other countries where an organized automotive insurance industry exists.
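
    For illustration, a generic benchmarking loop of the kind such a comparative study might use, sketched in Python with scikit-learn; the synthetic imbalanced data and the particular models stand in for the paper's real fraud data and full battery of techniques:

        from sklearn.datasets import make_classification
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.linear_model import LogisticRegression
        from sklearn.model_selection import StratifiedKFold, cross_val_score
        from sklearn.tree import DecisionTreeClassifier

        # Synthetic stand-in for a fraud table: ~10 % positive (fraud) class.
        X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
        cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
        for name, model in [("decision tree", DecisionTreeClassifier(random_state=0)),
                            ("random forest", RandomForestClassifier(random_state=0)),
                            ("logistic regression", LogisticRegression(max_iter=1000))]:
            scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
            print(f"{name}: AUC {scores.mean():.3f} +/- {scores.std():.3f}")

    AUC is used here because plain accuracy is misleading on heavily imbalanced fraud data.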

    CSNL: A cost-sensitive non-linear decision tree algorithm

    This article presents a new decision tree learning algorithm called CSNL that induces Cost-Sensitive Non-Linear decision trees. The algorithm is based on the hypothesis that nonlinear decision nodes provide a better basis than axis-parallel decision nodes, and it utilizes discriminant analysis to construct nonlinear decision trees that take account of the costs of misclassification. The performance of the algorithm is evaluated by applying it to seventeen datasets, and the results are compared with those obtained by two well-known cost-sensitive algorithms, ICET and MetaCost, which generate multiple trees to obtain some of the best results to date. The results show that CSNL performs at least as well as, if not better than, these algorithms in more than twelve of the datasets and is considerably faster. The use of bagging with CSNL further enhances its performance, showing the significant benefits of using nonlinear decision nodes.
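
    The following Python sketch illustrates only the core ingredient, not the published CSNL algorithm: a single decision node with a nonlinear (quadratic) discriminant boundary whose class priors are skewed by assumed misclassification costs. The cost-weighting heuristic and all parameter values are illustrative choices, not taken from the paper:

        import numpy as np
        from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

        def cost_sensitive_node(X, y, cost_fn, cost_fp):
            # One common heuristic: scale each class prior by the cost of
            # misclassifying that class, so the quadratic boundary shifts
            # toward the expensive-to-miss class.
            p1 = (y == 1).mean()
            w0, w1 = (1 - p1) * cost_fp, p1 * cost_fn
            node = QuadraticDiscriminantAnalysis(priors=np.array([w0, w1]) / (w0 + w1))
            node.fit(X, y)
            return node  # node.predict(...) gives the nonlinear split

        rng = np.random.default_rng(0)
        X = rng.normal(size=(200, 2))
        y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.5).astype(int)  # nonlinear concept
        node = cost_sensitive_node(X, y, cost_fn=5.0, cost_fp=1.0)
        print(node.predict(X[:5]))

    A circular concept like this one cannot be captured cheaply by axis-parallel splits, which is the motivation for nonlinear decision nodes.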

    A survey on utilization of data mining approaches for dermatological (skin) diseases prediction

    Due to recent technological advances, large volumes of medical data are obtained. These data contain valuable information, so data mining techniques can be used to extract useful patterns. This paper introduces data mining and its various techniques and surveys the available literature on medical data mining, with an emphasis on the application of data mining to skin diseases. A categorization is provided based on the different data mining techniques, and the utility of the various methodologies is highlighted. Generally, association mining is suitable for extracting rules; it has been used especially in cancer diagnosis. Classification is a robust method in medical mining, and we summarize its different uses in dermatology; it is one of the most important methods for the diagnosis of erythemato-squamous diseases, with approaches including neural networks, genetic algorithms and fuzzy classification. Clustering is a useful method in medical image mining: the purpose of clustering techniques is to find a structure in the given data by finding similarities between data points according to their characteristics, and clustering has some applications in dermatology. Besides introducing the different mining methods, we investigate some challenges that exist in mining skin data.
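
    As a small illustration of the classification work the survey covers, the Python sketch below cross-validates a classifier for erythemato-squamous disease diagnosis. It assumes a local copy of the UCI dermatology data in a hypothetical file dermatology.csv with the label in a column named class:

        import pandas as pd
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.model_selection import cross_val_score

        # Hypothetical file name; the UCI dermatology data has 34 clinical and
        # histopathological attributes and six erythemato-squamous disease classes.
        df = pd.read_csv("dermatology.csv").dropna()
        X, y = df.drop(columns=["class"]), df["class"]
        clf = RandomForestClassifier(random_state=0)
        print(cross_val_score(clf, X, y, cv=10).mean())  # mean 10-fold accuracy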

    Statistical methods for tissue array images - algorithmic scoring and co-training

    Recent advances in tissue microarray technology have allowed immunohistochemistry to become a powerful medium-to-high-throughput analysis tool, particularly for the validation of diagnostic and prognostic biomarkers. However, as study size grows, the manual evaluation of these assays becomes a prohibitive limitation; it vastly reduces throughput and greatly increases variability and expense. We propose an algorithm, Tissue Array Co-Occurrence Matrix Analysis (TACOMA), for quantifying cellular phenotypes based on textural regularity summarized by local inter-pixel relationships. The algorithm can be easily trained for any staining pattern, is free of sensitive tuning parameters and has the ability to report the salient pixels in an image that contribute to its score. Pathologists' input via informative training patches is an important aspect of the algorithm that allows training for any specific marker or cell type. With co-training, the error rate of TACOMA can be reduced substantially for a very small training sample (e.g., of size 30). We give theoretical insights into the success of co-training via thinning of the feature set in a high-dimensional setting when there is "sufficient" redundancy among the features. TACOMA is flexible, transparent and provides a scoring process that can be evaluated with clarity and confidence. In a study based on an estrogen receptor (ER) marker, we show that TACOMA is comparable to, or outperforms, pathologists' performance in terms of accuracy and repeatability.

    Comment: Published at http://dx.doi.org/10.1214/12-AOAS543 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org).
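
    TACOMA itself is not reproduced here, but the following Python sketch shows the kind of local inter-pixel summary it builds on: a gray-level co-occurrence matrix computed with scikit-image on a random stand-in patch:

        import numpy as np
        from skimage.feature import graycomatrix, graycoprops

        # Random 8-level stand-in for an image patch (a real use would take
        # pathologist-selected training patches).
        patch = np.random.default_rng(0).integers(0, 8, size=(64, 64), dtype=np.uint8)
        # Co-occurrence of gray levels at distance 1, horizontal and vertical.
        glcm = graycomatrix(patch, distances=[1], angles=[0, np.pi / 2],
                            levels=8, symmetric=True, normed=True)
        print(graycoprops(glcm, "contrast"))  # one textural-regularity summary

    Statistics derived from such matrices capture textural regularity, which is the raw material the TACOMA scoring process summarizes.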

    Prediction of delayed graft function after kidney transplantation: comparison between logistic regression and machine learning methods

    Get PDF
    Background: Predictive models for delayed graft function (DGF) after kidney transplantation are usually developed using logistic regression. We aim to evaluate the value of machine learning methods in the prediction of DGF. Methods: 497 kidney transplantations from deceased donors at the Ghent University Hospital between 2005 and 2011 are included. A feature elimination procedure is applied to determine the optimal number of features, resulting in 20 selected parameters (24 parameters after conversion to indicator parameters) out of 55 retrospectively collected parameters. Subsequently, 9 distinct types of predictive models are fitted using the reduced data set: logistic regression (LR), linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), support vector machines (SVMs; using linear, radial basis function and polynomial kernels), decision tree (DT), random forest (RF), and stochastic gradient boosting (SGB). Performance of the models is assessed by computing sensitivity, positive predictive value and the area under the receiver operating characteristic curve (AUROC) after 10-fold stratified cross-validation. AUROCs of the models are compared pairwise using the Wilcoxon signed-rank test. Results: The observed incidence of DGF is 12.5 %. DT is not able to discriminate between recipients with and without DGF (AUROC of 52.5 %) and is inferior to the other methods. SGB, RF and polynomial SVM are mainly able to identify recipients without DGF (AUROCs of 77.2 %, 73.9 % and 79.8 %, respectively) and only outperform DT. LDA, QDA, radial SVM and LR are also able to identify recipients with DGF, resulting in higher discriminative capacity (AUROCs of 82.2 %, 79.6 %, 83.3 % and 81.7 %, respectively), which outperforms DT and RF. Linear SVM has the highest discriminative capacity (AUROC of 84.3 %), outperforming every method except radial SVM, polynomial SVM and LDA; however, it is the only method superior to LR. Conclusions: The discriminative capacities of LDA, linear SVM, radial SVM and LR are the only ones above 80 %. None of the pairwise AUROC comparisons between these models is statistically significant, except for linear SVM outperforming LR. Additionally, the sensitivity of linear SVM in identifying recipients with DGF is among the three highest of all models. For both reasons, the authors believe that linear SVM is the most appropriate method to predict DGF.
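
    A hedged Python sketch of the evaluation protocol described in the Methods: fold-wise AUROC under 10-fold stratified cross-validation, followed by a Wilcoxon signed-rank test between two models' fold scores. Synthetic data with a 12.5 % positive rate stands in for the transplant cohort, and only two of the nine model types are shown:

        from scipy.stats import wilcoxon
        from sklearn.datasets import make_classification
        from sklearn.linear_model import LogisticRegression
        from sklearn.model_selection import StratifiedKFold, cross_val_score
        from sklearn.svm import SVC

        # Synthetic stand-in with roughly the reported 12.5 % DGF incidence.
        X, y = make_classification(n_samples=500, weights=[0.875, 0.125], random_state=0)
        cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
        auc_lr = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                                 cv=cv, scoring="roc_auc")
        auc_svm = cross_val_score(SVC(kernel="linear"), X, y,
                                  cv=cv, scoring="roc_auc")
        stat, p = wilcoxon(auc_lr, auc_svm)  # paired test over the 10 folds
        print(f"LR {auc_lr.mean():.3f} vs linear SVM {auc_svm.mean():.3f}, p = {p:.3f}")

    Pairing the test on the same folds controls for fold-to-fold difficulty, which is why a signed-rank test rather than an unpaired comparison is appropriate here.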