
    Ensemble of Example-Dependent Cost-Sensitive Decision Trees

    Several real-world classification problems are example-dependent cost-sensitive in nature, where the costs due to misclassification vary between examples and not only between classes. However, standard classification methods do not take these costs into account and assume a constant cost for misclassification errors. In previous work, methods that incorporate the financial costs into the training of different algorithms have been proposed, with the example-dependent cost-sensitive decision tree algorithm being the one that gives the highest savings. In this paper we propose a new framework of ensembles of example-dependent cost-sensitive decision trees. The framework consists of creating different example-dependent cost-sensitive decision trees on random subsamples of the training set and then combining them using three different combination approaches. Moreover, we propose two new cost-sensitive combination approaches: cost-sensitive weighted voting and cost-sensitive stacking, the latter being based on the cost-sensitive logistic regression method. Finally, using five different databases from four real-world applications (credit card fraud detection, churn modeling, credit scoring and direct marketing), we evaluate the proposed method against state-of-the-art example-dependent cost-sensitive techniques, namely cost-proportionate sampling, Bayes minimum risk and cost-sensitive decision trees. The results show that the proposed algorithms achieve higher savings than the alternatives on all databases. Comment: 13 pages, 6 figures, submitted for possible publication.
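
    To make the combination step concrete, here is a minimal sketch (not the authors' code) of the weighted-voting variant, assuming scikit-learn's DecisionTreeClassifier as a stand-in for the example-dependent cost-sensitive tree, a per-example cost matrix with columns [C_FP, C_FN, C_TP, C_TN], and hypothetical helper names such as fit_ensemble: each tree is trained on a random subsample and weighted by the savings it achieves on its out-of-bag examples.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def total_cost(y_true, y_pred, costs):
        # Example-dependent total cost; costs columns are [C_FP, C_FN, C_TP, C_TN].
        c_fp, c_fn, c_tp, c_tn = costs.T
        return np.sum(y_true * (y_pred * c_tp + (1 - y_pred) * c_fn)
                      + (1 - y_true) * (y_pred * c_fp + (1 - y_pred) * c_tn))

    def savings(y_true, y_pred, costs):
        # Savings relative to the cheaper of the two trivial all-0 / all-1 policies.
        base = min(total_cost(y_true, np.zeros_like(y_true), costs),
                   total_cost(y_true, np.ones_like(y_true), costs))
        return (base - total_cost(y_true, y_pred, costs)) / base

    def fit_ensemble(X, y, costs, n_estimators=10, sample_frac=0.5, seed=0):
        # Train each tree on a random subsample; weight it by out-of-bag savings.
        rng = np.random.default_rng(seed)
        members = []
        for _ in range(n_estimators):
            idx = rng.choice(len(X), size=int(sample_frac * len(X)), replace=False)
            tree = DecisionTreeClassifier(max_depth=5).fit(X[idx], y[idx])
            oob = np.setdiff1d(np.arange(len(X)), idx)
            weight = max(savings(y[oob], tree.predict(X[oob]), costs[oob]), 0.0)
            members.append((tree, weight))
        return members

    def predict_weighted_vote(members, X):
        # Cost-sensitive weighted voting over the ensemble members.
        votes = sum(w * tree.predict(X) for tree, w in members)
        total = sum(w for _, w in members) or 1.0
        return (votes / total >= 0.5).astype(int)

    Cost-sensitive stacking would instead train a cost-sensitive logistic regression on the members' outputs, as described in the abstract.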

    A survey of cost-sensitive decision tree induction algorithms

    The past decade has seen significant interest in the problem of inducing decision trees that take account of the costs of misclassification and the costs of acquiring the features used for decision making. This survey identifies over 50 algorithms, including approaches that are direct adaptations of accuracy-based methods, that use genetic algorithms, that use anytime methods, and that utilize boosting and bagging. The survey brings together these different studies and novel approaches to cost-sensitive decision tree learning, provides a useful taxonomy and a historical timeline of how the field has developed, and should provide a useful reference point for future research in this field.

    EBNO : evolution of cost-sensitive Bayesian networks

    The last decade has seen an increase in the attention paid to the development of cost-sensitive learning algorithms that aim to minimize misclassification costs while still maintaining accuracy. Most of this attention has been on cost-sensitive decision tree learning, while relatively little attention has been paid to assessing whether it is possible to develop better cost-sensitive classifiers based on Bayesian networks. Hence, this paper presents EBNO, an algorithm that utilizes genetic algorithms to learn cost-sensitive Bayesian networks, where genes represent the links between the nodes in a Bayesian network and the expected cost is used as the fitness function. An empirical comparison of the new algorithm has been carried out with respect to: (i) an algorithm that induces cost-insensitive Bayesian networks, to provide a baseline; (ii) ICET, a well-known algorithm that uses genetic algorithms to induce cost-sensitive decision trees; (iii) use of MetaCost to induce cost-sensitive Bayesian networks via bagging; (iv) use of AdaBoost to induce cost-sensitive Bayesian networks; and (v) use of XGBoost, a gradient boosting algorithm, to induce cost-sensitive decision trees. An empirical evaluation on 28 datasets reveals that EBNO performs well in comparison to the algorithms that produce single interpretable models and performs just as well as algorithms that use bagging and boosting methods.
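
    A minimal sketch of the fitness side of this idea is given below, under two stated assumptions: the genome is the upper triangle of an adjacency matrix over a fixed node ordering (which keeps candidate networks acyclic, with the class node taken as the last node), and full Bayesian-network inference is replaced by a naive Bayes stand-in over the class node's parents. EBNO itself evaluates the complete network, so this is only an illustration of expected cost used as a fitness function.

    import numpy as np
    from sklearn.naive_bayes import GaussianNB

    def decode_structure(genes, n_nodes):
        # Fill a binary upper-triangular adjacency matrix from a flat genome of
        # length n_nodes * (n_nodes - 1) / 2; a fixed node ordering keeps every
        # decoded network acyclic.
        adj = np.zeros((n_nodes, n_nodes), dtype=int)
        adj[np.triu_indices(n_nodes, k=1)] = genes
        return adj

    def expected_cost_fitness(genes, X, y, cost_matrix, class_node):
        # Fitness = mean misclassification cost on held-out data (lower is better).
        n_nodes = X.shape[1] + 1                     # feature nodes plus the class node
        adj = decode_structure(genes, n_nodes)
        parents = [i for i in range(X.shape[1]) if adj[i, class_node]]
        if not parents:                              # degenerate structure: predict the majority class
            pred = np.full_like(y, np.bincount(y).argmax())
        else:
            pred = GaussianNB().fit(X[:, parents], y).predict(X[:, parents])
        return np.mean(cost_matrix[y, pred])         # per-example cost via the class cost matrix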

    Temporal Image Forensics for Picture Dating based on Machine Learning

    Temporal image forensics involves the investigation of multimedia digital forensic material related to crime, with the goal of obtaining accurate evidence concerning activity and timing to be presented in a court of law. Because of the ever-increasing complexity of crime in the digital age, forensic investigations are increasingly dependent on timing information. The simplest way to extract such forensic information would be to use the EXIF header of picture files, as it contains most of the relevant information. However, these header data can be easily removed or manipulated and hence cannot be evidential, so estimating the acquisition time of digital photographs has become more challenging. This PhD research proposes to use image contents instead of file headers to solve this problem. In this thesis, a number of contributions are presented in the area of temporal image forensics for picture dating. Firstly, the research introduces the unique Northumbria Temporal Image Forensics (NTIF) database of pictures for temporal image forensic purposes. Using the NTIF database, the changes in Photo Response Non-Uniformity (PRNU) as digital sensors age have been highlighted, and it is concluded that PRNU is not a useful feature for picture dating. Apart from PRNU, defective pixels constitute another sensor imperfection of forensic relevance. Secondly, this thesis shows that the filter-based PRNU technique is more useful than deep convolutional neural networks for source camera identification when only limited numbers of images under investigation are available to the forensic analyst. The results indicate that, because the sensor pattern noise feature is location-sensitive, the performance of the CNN-based approach declines when sensor pattern noise image blocks from the same category are fed into the CNN at different locations. Thirdly, a deep learning technique is applied to picture dating, showing promising results with performance levels of 80% to 88% depending on the digital camera used. The key finding is that a deep learning approach can successfully learn the temporal changes in image contents, rather than the sensor pattern noise. Finally, this thesis proposes a technique to estimate the acquisition time slots of digital pictures using a set of candidate defective pixel locations in non-overlapping image blocks. The temporal behaviour of camera sensor defects in digital pictures is analyzed using a machine learning technique in which potential candidate defective pixels are determined according to the related pixel neighbourhood and two proposed local variation features. The idea of virtual timescales using halves of real time slots, together with a combination of prediction scores for image blocks, is proposed to enhance performance. When assessed using the NTIF image dataset, the proposed system achieves very promising results, with an estimated accuracy of the acquisition times of digital pictures between 88% and 93%, exhibiting clear superiority over relevant state-of-the-art systems.
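
    The defective-pixel selection step can be pictured with the rough sketch below. The exact local variation feature definitions are the thesis's own; the version here, which scores each pixel by its deviation from the local neighbourhood median averaged over an image set and keeps the highest-scoring pixels per non-overlapping block, is only an assumption for illustration, and function names such as candidate_defective_pixels are likewise hypothetical.

    import numpy as np
    from scipy.ndimage import median_filter

    def local_variation(img, size=3):
        # Absolute deviation of each pixel from its local neighbourhood median
        # (an assumed stand-in for the thesis's local variation features).
        return np.abs(img.astype(float) - median_filter(img.astype(float), size=size))

    def candidate_defective_pixels(images, block=64, top_k=5):
        # For each non-overlapping block, keep the top-k pixels whose local
        # variation is consistently high across the whole image set.
        mean_var = np.mean([local_variation(im) for im in images], axis=0)
        h, w = mean_var.shape
        candidates = []
        for r in range(0, h - h % block, block):
            for c in range(0, w - w % block, block):
                patch = mean_var[r:r + block, c:c + block]
                flat = np.argsort(patch, axis=None)[-top_k:]
                rows, cols = np.unravel_index(flat, patch.shape)
                candidates.extend(zip(rows + r, cols + c))
        return candidates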

    Utilising restricted for-loops in genetic programming

    Genetic programming is an approach that utilises the power of evolution to allow computers to evolve programs. Although loops are natural components of most programming languages and appear in every reasonably sized application, they are rarely used in genetic programming. This work investigates a number of restricted looping constructs to determine whether any significant benefits can be obtained in genetic programming. Possible benefits include solving problems that cannot be solved without loops, evolving smaller solutions that can be more easily understood by human programmers, and solving existing problems more quickly by using fewer evaluations. In this thesis, a number of explicit restricted loop formats were formulated and tested on the Santa Fe ant problem, a modified ant problem, a sorting problem, a visit-every-square problem and a difficult object classification problem. The experimental results showed that these explicit loops can be used successfully in genetic programming: the evolutionary process can decide when, where and how to use them. Runs with these loops tended to generate smaller solutions in fewer evaluations, and solutions involving loops were found for some problems that could not be solved without them. The results and analysis of this thesis establish that there are significant benefits in using loops in genetic programming. Restricted loops can avoid the difficulties of evolving consistent programs and the problem of infinite iteration. Researchers and other users of genetic programming should not be afraid of loops.
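
    One way such a restricted construct can be realised is sketched below: the loop count is supplied by the evolved program but clamped to a hard cap inside the primitive itself, so evolved programs cannot iterate indefinitely. The primitive restricted_for and the toy ant terminal are illustrative assumptions, not the exact loop formats used in the thesis.

    MAX_LOOP_ITERATIONS = 10   # hard cap enforced by the loop construct itself

    def restricted_for(n, body, state):
        # Apply `body` to `state` at most min(n, MAX_LOOP_ITERATIONS) times.
        for _ in range(max(0, min(int(n), MAX_LOOP_ITERATIONS))):
            state = body(state)
        return state

    # Toy usage: an evolved tree might compose the primitive like this,
    # moving an ant forward a bounded number of steps.
    move_forward = lambda pos: (pos[0] + 1, pos[1])
    print(restricted_for(4, move_forward, (0, 0)))   # -> (4, 0)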

    Elliptical cost-sensitive decision tree algorithm - ECSDT

    Cost-sensitive multiclass classification, in which the task is to assess the impact of the costs associated with different misclassification errors, continues to be one of the major challenges for data mining and machine learning. Literature reviews in this area show that most of the cost-sensitive algorithms developed during the last decade were designed to solve binary classification problems, where an example from the dataset is classified into only one of two available classes. Much of the research on cost-sensitive learning has focused on inducing decision trees, which are among the most common and widely used classification methods, because of the simplicity of constructing them, their transparency and their comprehensibility. A review of the literature shows that inducing non-linear multiclass cost-sensitive decision trees is still in its early stages and that further research could improve upon the current state of the art. Hence, this research addresses the following question: How can non-linear regions be identified for multiclass problems and utilized to construct decision trees so as to maximize the accuracy of classification and minimize misclassification costs? This research addresses the problem by developing a new algorithm, the Elliptical Cost-Sensitive Decision Tree algorithm (ECSDT), which induces cost-sensitive non-linear (elliptical) decision trees for multiclass classification problems using evolutionary optimization methods such as particle swarm optimization (PSO) and genetic algorithms (GAs). Ellipses are used as non-linear separators because of their simplicity and flexibility in drawing non-linear boundaries: their size, location and rotation can be modified and adjusted towards achieving optimal results. The new algorithm was developed, tested and evaluated in three different settings, each with a different objective function: the first maximizes classification accuracy only; the second minimizes misclassification costs only; and the third considers both accuracy and misclassification costs together. ECSDT was applied to fourteen binary-class and multiclass datasets, and the results were compared with those obtained by applying common algorithms from Weka, such as J48, NBTree, MetaCost and the CostSensitiveClassifier, to the same datasets. The primary contribution of this research is the development of a new algorithm that shows the benefits of utilizing elliptical boundaries for cost-sensitive decision tree learning. The new algorithm is capable of handling multiclass problems and an empirical evaluation shows good results. More specifically, when considering accuracy only, ECSDT performs better in terms of maximizing accuracy on 10 out of the 14 datasets; when considering misclassification costs only, ECSDT performs better on 10 out of the 14 datasets; and when considering both accuracy and misclassification costs, ECSDT obtains higher accuracy on 10 out of the 14 datasets and lower misclassification costs on 5 out of the 14 datasets. ECSDT was also able to produce smaller trees than J48, LADTree and ADTree.
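
    The geometric idea can be sketched briefly. The snippet below is only an illustration, not ECSDT's actual representation: it assumes an ellipse encoded by its centre, semi-axes and rotation angle, tests which points fall inside it, and scores the candidate by total misclassification cost, the kind of objective a PSO or GA run could minimise (or blend with accuracy in the combined setting).

    import numpy as np

    def inside_ellipse(X, cx, cy, a, b, theta):
        # Boolean mask: which 2-D points fall inside the rotated ellipse.
        dx, dy = X[:, 0] - cx, X[:, 1] - cy
        u = dx * np.cos(theta) + dy * np.sin(theta)     # rotate into the ellipse frame
        v = -dx * np.sin(theta) + dy * np.cos(theta)
        return (u / a) ** 2 + (v / b) ** 2 <= 1.0

    def misclassification_cost(params, X, y, cost_matrix, pos_label=1):
        # Total cost when points inside the ellipse are labelled `pos_label`
        # and points outside receive the other label (binary case for brevity).
        cx, cy, a, b, theta = params
        pred = np.where(inside_ellipse(X, cx, cy, a, b, theta), pos_label, 1 - pos_label)
        return cost_matrix[y, pred].sum()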