682 research outputs found

    New Hybrid Data Preprocessing Technique for Highly Imbalanced Dataset

    One of the most challenging problems in real-world datasets is the growing prevalence of imbalanced data. Because the majority class greatly outnumbers the minority class, conventional machine learning algorithms, which assume an equal class distribution, can produce misleading results. The purpose of this study is to build a hybrid data preprocessing approach that deals with the class imbalance issue by applying resampling approaches and cost-sensitive learning (CSL) to fraud detection on a real-world dataset. The proposed hybrid approach consists of two steps: the first compares several resampling approaches to find the technique with the highest performance on the validation set, while the second applies CSL with an optimal weight ratio to the resampled data from the first step. The hybrid technique achieved F2-measures of 0.987, 0.974, 0.847 and 0.853 for RF, DT, XGBoost and LightGBM, respectively, and, relative to the conventional methods, it obtained the highest predictive performance.
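    A minimal sketch of such a two-step hybrid (compare resampling methods on a validation set, then add cost-sensitive class weights) could look as follows. The choice of SMOTE and random undersampling, the weight grid, and the synthetic data are illustrative assumptions, not the authors' exact pipeline.

    # Sketch: step 1 picks a resampling method by validation F2-measure;
    # step 2 applies cost-sensitive weights on top of the chosen resampled data.
    # The samplers and weight ratios are assumptions for illustration.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import fbeta_score
    from sklearn.model_selection import train_test_split
    from imblearn.over_sampling import SMOTE
    from imblearn.under_sampling import RandomUnderSampler

    X, y = make_classification(n_samples=5000, weights=[0.98], random_state=0)
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, stratify=y, random_state=0)

    # Step 1: compare resampling approaches on the validation set.
    samplers = {"smote": SMOTE(random_state=0),
                "under": RandomUnderSampler(random_state=0)}
    best_name, best_f2, best_data = None, -1.0, None
    for name, sampler in samplers.items():
        X_rs, y_rs = sampler.fit_resample(X_tr, y_tr)
        clf = RandomForestClassifier(random_state=0).fit(X_rs, y_rs)
        f2 = fbeta_score(y_val, clf.predict(X_val), beta=2)
        if f2 > best_f2:
            best_name, best_f2, best_data = name, f2, (X_rs, y_rs)

    # Step 2: cost-sensitive learning with a tuned weight ratio on that data.
    X_rs, y_rs = best_data
    for ratio in (1, 2, 5, 10):
        clf = RandomForestClassifier(class_weight={0: 1, 1: ratio}, random_state=0)
        clf.fit(X_rs, y_rs)
        print(best_name, ratio, fbeta_score(y_val, clf.predict(X_val), beta=2))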

    Customer retention

    A research report submitted to the Faculty of Engineering and the Built Environment, University of the Witwatersrand, Johannesburg, in partial fulfillment of the requirements for the degree of Master of Science in Engineering. Johannesburg, May 2018. The aim of this study is to model the probability that a customer will attrite/defect from a bank where, for example, the bank is not their preferred/primary bank for salary deposits. The termination of deposit inflow serves as the outcome parameter, and the random forest modelling technique was used to predict the outcome, with new data sources (transactional data) explored to add predictive power. The conventional logistic regression modelling technique was used to benchmark the random forest's results. It was found that the random forest model slightly overfits during the training process and loses predictive power on validation and out-of-training-period data. The random forest model, however, remains predictive and performs better than logistic regression at a cut-off probability of 20%. MT 201
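    As a rough illustration of the benchmarking set-up described above (random forest versus logistic regression, scored at a 20% cut-off probability), the sketch below uses synthetic data in place of the report's confidential transactional features; it is not the report's actual model.

    # Sketch: random forest vs. logistic regression benchmark at a 20% cut-off.
    # Synthetic data stands in for the bank's transactional features.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import classification_report
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=10000, weights=[0.9], random_state=1)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)

    cutoff = 0.20  # flag attrition when the predicted probability exceeds 20%
    for model in (RandomForestClassifier(random_state=1),
                  LogisticRegression(max_iter=1000)):
        model.fit(X_tr, y_tr)
        proba = model.predict_proba(X_te)[:, 1]
        preds = (proba >= cutoff).astype(int)
        print(type(model).__name__)
        print(classification_report(y_te, preds, digits=3))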

    Learning from high-dimensional and class-imbalanced datasets using random forests

    Class imbalance and high dimensionality are two major issues in several real-life applications, e.g., in the fields of bioinformatics, text mining and image classification. However, while both issues have been extensively studied in the machine learning community, they have mostly been treated separately, and little research has thus far been conducted on which approaches are best suited to datasets that are class-imbalanced and high-dimensional at the same time (i.e., with a large number of features). This work contributes to this challenging research area by studying the effectiveness of hybrid learning strategies that integrate feature selection techniques, to reduce the data dimensionality, with methods that cope with the adverse effects of class imbalance (in particular, data balancing and cost-sensitive methods are considered). Extensive experiments have been carried out across datasets from different domains, leveraging a well-known classifier, the Random Forest, which has proven effective in high-dimensional spaces and has also been successfully applied to imbalanced tasks. Our results give evidence of the benefits of such a hybrid approach when compared to using feature selection or imbalance learning methods alone.
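    A hedged sketch of one such hybrid pipeline, feature selection to cut dimensionality followed by an imbalance-aware Random Forest, is shown below; the ANOVA filter, the number of selected features, and the balanced class weights are illustrative choices, not those used in the paper.

    # Sketch: feature selection + cost-sensitive Random Forest for a
    # high-dimensional, class-imbalanced problem. k=50 and "balanced"
    # class weights are assumptions for illustration.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import Pipeline

    X, y = make_classification(n_samples=1000, n_features=2000,
                               n_informative=20, weights=[0.95], random_state=0)

    pipe = Pipeline([
        ("select", SelectKBest(f_classif, k=50)),  # dimensionality reduction
        ("rf", RandomForestClassifier(class_weight="balanced", random_state=0)),
    ])
    # Balanced accuracy is less misleading than plain accuracy on skewed classes.
    print(cross_val_score(pipe, X, y, scoring="balanced_accuracy", cv=5).mean())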

    Diversified Ensemble Classifiers for Highly Imbalanced Data Learning and their Application in Bioinformatics

    In this dissertation, the problem of learning from highly imbalanced data is studied. Imbalanced data learning is of great importance and challenge in many real applications. Dealing with a minority class normally requires new concepts, observations and solutions in order to fully understand the underlying complicated models. We try to systematically review and solve this special learning task in this dissertation. We propose a new ensemble learning framework, Diversified Ensemble Classifiers for Imbalanced Data Learning (DECIDL), based on the advantages of existing ensemble imbalanced learning strategies. Our framework combines three learning techniques: a) ensemble learning, b) artificial example generation, and c) diversity construction by reverse data re-labeling. As a meta-learner, DECIDL utilizes general supervised learning algorithms as base learners to build an ensemble committee. We create a standard benchmark data pool, which contains 30 highly skewed sets with diverse characteristics from different domains, in order to facilitate future research on imbalanced data learning. We use this benchmark pool to evaluate and compare our DECIDL framework with several ensemble learning methods, namely under-bagging, over-bagging, SMOTE-bagging, and AdaBoost. Extensive experiments suggest that our DECIDL framework is comparable with other methods. The data sets, experiments and results provide a valuable knowledge base for future research on imbalanced learning. We also develop a simple but effective artificial example generation method for data balancing. Two new methods, DBEG-ensemble and DECIDL-DBEG, are then designed to improve the power of imbalanced learning. Experiments show that these two methods are comparable to the state-of-the-art methods, e.g., GSVM-RU and SMOTE-bagging. Furthermore, we investigate learning on imbalanced data from a new angle: active learning. By combining active learning with the DECIDL framework, we show that the newly designed Active-DECIDL method is very effective for imbalanced learning, suggesting that the DECIDL framework is very robust and flexible. Lastly, we apply the proposed learning methods to a real-world bioinformatics problem, protein methylation prediction. Extensive computational results show that the DECIDL method performs very well on this imbalanced data mining task. Importantly, the experimental results confirm our new contributions on this particular data learning problem.
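    The DECIDL framework itself is not spelled out in this abstract, but one of the baselines it is compared against, under-bagging, can be sketched directly. The snippet below uses imbalanced-learn's BalancedBaggingClassifier and synthetic data as a stand-in for that baseline; it is not an implementation of DECIDL.

    # Sketch: under-bagging baseline (each bootstrap is undersampled to balance
    # the classes, then the bagged committee votes). Illustrates the baseline
    # mentioned in the abstract, not the DECIDL framework itself.
    from imblearn.ensemble import BalancedBaggingClassifier
    from sklearn.datasets import make_classification
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=4000, weights=[0.97], random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

    # Default base learner is a decision tree; each bag undersamples the majority.
    committee = BalancedBaggingClassifier(n_estimators=50, random_state=0)
    committee.fit(X_tr, y_tr)
    print(roc_auc_score(y_te, committee.predict_proba(X_te)[:, 1]))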

    Discovering patterns in a survey of secondary injuries due to agricultural assistive technology

    The research is motivated by the need for hazard assessment in the agricultural field. A small and highly imbalanced dataset, in which negative instances heavily outnumber positive instances, is derived from a survey of secondary injuries induced by the implementation of agricultural assistive technology, which assists farmers with injuries or disabilities to continue farm-related work. Three data mining approaches are applied to the imbalanced dataset in order to discover patterns contributing to secondary injuries. All of the patterns discovered by the three approaches are compared according to three evaluation measures, support, confidence and lift, and the potentially most interesting patterns are identified. Compared to graphical exploratory analysis, which identifies causative factors by evaluating the single effects of attributes on the occurrence of secondary injuries, the decision tree algorithm and subgroup discovery algorithms are able to find combinational factors by evaluating the interactive effects of attributes on the occurrence of secondary injuries. Graphical exploratory analysis is able to find patterns with the highest support, and subgroup discovery algorithms are good at finding high-lift patterns. In addition, the experimental analysis of applying subgroup discovery to our secondary injury dataset demonstrates the subgroup discovery method's capability of dealing with imbalanced datasets. Therefore, identifying risk factors contributing to secondary injuries, as well as providing a useful alternative method (subgroup discovery) for dealing with small and highly imbalanced datasets, are important outcomes of this thesis.
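    For reference, the three measures used to rank the discovered patterns can be computed directly. The sketch below evaluates a single rule on a toy 0/1 table; the column names and values are hypothetical, not data from the survey.

    # Sketch: support, confidence and lift of a rule "antecedent -> injury"
    # on a toy binary table; column names are hypothetical.
    import pandas as pd

    df = pd.DataFrame({
        "uses_lift_assist":  [1, 1, 0, 1, 0, 1, 0, 0],
        "secondary_injury":  [1, 0, 0, 1, 0, 1, 0, 0],
    })

    antecedent = df["uses_lift_assist"] == 1
    consequent = df["secondary_injury"] == 1

    support = (antecedent & consequent).mean()                # P(A and B)
    confidence = (antecedent & consequent).sum() / antecedent.sum()  # P(B | A)
    lift = confidence / consequent.mean()                     # P(B | A) / P(B)
    print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")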

    Neural plasma

    This paper presents a novel type of artificial neural network, called neural plasma, which is tailored for classification tasks involving few observations with a large number of variables. Neural plasma learns to adapt its classification confidence by generating artificial training data as a function of its confidence in previous decisions. In contrast to multilayer perceptrons and similar techniques, which are inspired by topological and operational aspects of biological neural networks, neural plasma is motivated by aspects of high-level behavior and reasoning in the presence of uncertainty. The basic principles of the proposed model apply to other supervised learning algorithms that provide explicit classification confidence values. The empirical evaluation of this new technique is based on benchmarking experiments involving data sets from biotechnology that are characterized by the small-n-large-p problem. The presented study describes a comprehensive methodology and is seen as a first step in exploring its different aspects. IFIP International Conference on Artificial Intelligence in Theory and Practice - Neural Nets. Red de Universidades con Carreras en Informática (RedUNCI).
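    The core mechanism described above, generating artificial training examples as a function of classification confidence, could be approximated in spirit as follows. This is a loose sketch with an assumed jitter-based generator and an arbitrary confidence threshold; it is not the neural plasma model itself.

    # Loose sketch (not the neural plasma model): augment the training set with
    # jittered copies of the examples the current model is least confident about,
    # then retrain. Generator and threshold are assumptions for illustration.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=60, n_features=500, n_informative=10,
                               random_state=0)  # small-n-large-p setting

    clf = RandomForestClassifier(random_state=0).fit(X, y)
    confidence = clf.predict_proba(X).max(axis=1)   # per-example confidence
    uncertain = confidence < 0.7                    # low-confidence examples

    rng = np.random.default_rng(0)
    X_new = X[uncertain] + rng.normal(scale=0.05, size=X[uncertain].shape)
    y_new = y[uncertain]

    clf = RandomForestClassifier(random_state=0).fit(
        np.vstack([X, X_new]), np.concatenate([y, y_new]))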