
    A simple plug-in bagging ensemble based on threshold-moving for classifying binary and multiclass imbalanced data

    Class imbalance presents a major hurdle in the application of classification methods. A commonly taken approach is to learn ensembles of classifiers using rebalanced data. Examples include bootstrap averaging (bagging) combined with either undersampling or oversampling of the minority class examples. However, rebalancing methods entail asymmetric changes to the examples of different classes, which in turn can introduce their own biases. Furthermore, these methods often require specifying the performance measure of interest a priori, i.e., before learning. An alternative is to employ the threshold-moving technique, which applies a threshold to the continuous output of a model, offering the possibility to adapt to a performance measure a posteriori, i.e., a plug-in method. Surprisingly, little attention has been paid to this combination of a bagging ensemble and threshold-moving. In this paper, we study this combination and demonstrate its competitiveness. Contrary to other resampling methods, we preserve the natural class distribution of the data, resulting in well-calibrated posterior probabilities. Additionally, we extend the proposed method to handle multiclass data. We validated our method on binary and multiclass benchmark data sets using both decision trees and neural networks as base classifiers, and we perform analyses that provide insights into the proposed method. Keywords: Imbalanced data; Binary classification; Multiclass classification; Bagging ensembles; Resampling; Posterior calibration. Funding: Burroughs Wellcome Fund (Grant 103811AI)
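
    A minimal sketch of the plug-in idea described above, not the authors' code: bag unmodified decision trees on the natural class distribution, then move the decision threshold a posteriori on held-out data. The synthetic dataset, split sizes, and F1 as the target measure are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, stratify=y, random_state=0)

# Plain bagging on the natural class distribution: no resampling, so the
# averaged votes approximate well-calibrated posterior probabilities.
ensemble = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100,
                             random_state=0).fit(X_tr, y_tr)
proba = ensemble.predict_proba(X_val)[:, 1]

# Threshold-moving as the plug-in step: sweep thresholds on held-out data
# and keep the one maximizing the measure chosen a posteriori (F1 here).
thresholds = np.linspace(0.01, 0.99, 99)
best_t = max(thresholds, key=lambda t: f1_score(y_val, proba >= t))
y_pred = (proba >= best_t).astype(int)
```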

    A Descriptive Study of Variable Discretization and Cost-Sensitive Logistic Regression on Imbalanced Credit Data

    Training classification models on imbalanced data tends to result in bias towards the majority class. In this paper, we demonstrate how variable discretization and cost-sensitive logistic regression help mitigate this bias on an imbalanced credit scoring dataset, and further show the application of variable discretization to data from other domains, demonstrating its potential as a generic technique for classifying imbalanced data beyond credit scoring. The performance measurements include ROC curves, area under the ROC curve (AUC), Type I error, Type II error, accuracy, and F1 score. The results show that proper variable discretization and cost-sensitive logistic regression with the best class weights can reduce model bias and/or variance. From the perspective of the algorithm, cost-sensitive logistic regression is beneficial for extracting value from predictors even when they are not in their optimized forms, while maintaining monotonicity. From the perspective of the predictors, variable discretization performs better than cost-sensitive logistic regression, provides more reasonable coefficient estimates for predictors that have nonlinear relationships against their empirical logit, and is robust to penalty weights on misclassifications of events and non-events determined by their a priori proportions.
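
    A minimal sketch of the two techniques combined, not the paper's exact pipeline: quantile discretization of each predictor followed by logistic regression with asymmetric class weights. The synthetic dataset, bin count, and 1:10 cost ratio are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer

X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

model = make_pipeline(
    # Discretize each predictor into quantile bins (one-hot encoded), letting
    # the linear model capture nonlinear relationships against the logit.
    KBinsDiscretizer(n_bins=5, encode="onehot", strategy="quantile"),
    # class_weight applies a higher misclassification penalty to the
    # minority class: the cost-sensitive component.
    LogisticRegression(class_weight={0: 1, 1: 10}, max_iter=1000),
).fit(X_tr, y_tr)

print("AUC:", roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))
```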

    Distinct Multiple Learner-Based Ensemble SMOTEBagging (ML-ESB) Method for Classification of Binary Class Imbalance Problems

    Traditional classification algorithms often fail to learn from highly imbalanced datasets because training involves many more samples from the majority class than from the minority class. In this paper, a Multiple Learner-based Ensemble SMOTEBagging (ML-ESB) technique is proposed. ML-ESB is a modified SMOTEBagging technique in which the ensemble of multiple instances of a single learner is replaced by multiple distinct classifiers. The proposed ML-ESB is designed for handling only the binary class imbalance problem. The ensemble of distinct classifiers in ML-ESB comprises Naïve Bayes, Support Vector Machine, Logistic Regression, and Decision Tree (C4.5). The performance of ML-ESB is evaluated on six binary imbalanced benchmark datasets using measures such as specificity, sensitivity, and area under the receiver operating characteristic curve. The obtained results are compared with those of SMOTEBagging, SMOTEBoost, and cost-sensitive MCS algorithms across different imbalance ratios (IR). The ML-ESB algorithm outperformed the other existing methods on four datasets with high dimensionality and class IR, whereas it showed moderate performance on the remaining two datasets with low dimensionality and small IR values.
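
    A minimal sketch of the ML-ESB idea, not the authors' implementation: each distinct base learner is trained on a SMOTE-rebalanced bootstrap sample, and predictions are combined by majority vote. It assumes scikit-learn and imbalanced-learn; scikit-learn's CART decision tree stands in for C4.5, and the dataset is illustrative.

```python
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# The four distinct classifiers named in the abstract (CART stands in for C4.5).
learners = [GaussianNB(), SVC(), LogisticRegression(max_iter=1000),
            DecisionTreeClassifier()]

rng = np.random.default_rng(0)
fitted = []
for clf in learners:
    idx = rng.choice(len(X), size=len(X), replace=True)    # bootstrap sample
    X_bal, y_bal = SMOTE(random_state=0).fit_resample(X[idx], y[idx])  # rebalance
    fitted.append(clf.fit(X_bal, y_bal))

# Majority vote across the heterogeneous ensemble.
votes = np.stack([clf.predict(X) for clf in fitted])
y_pred = (votes.mean(axis=0) >= 0.5).astype(int)
```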

    A survey on learning from imbalanced data streams: taxonomy, challenges, empirical study, and reproducible experimental framework

    Full text link
    Class imbalance poses new challenges when it comes to classifying data streams. Many algorithms recently proposed in the literature tackle this problem using a variety of data-level, algorithm-level, and ensemble approaches. However, there is a lack of standardized and agreed-upon procedures on how to evaluate these algorithms. This work presents a taxonomy of algorithms for imbalanced data streams and proposes a standardized, exhaustive, and informative experimental testbed to evaluate algorithms in a collection of diverse and challenging imbalanced data stream scenarios. The experimental study evaluates 24 state-of-the-art data stream algorithms on 515 imbalanced data streams that combine static and dynamic class imbalance ratios, instance-level difficulties, concept drift, and real-world and semi-synthetic datasets in binary and multi-class scenarios. This constitutes the largest experimental study conducted so far in the data stream mining domain. We discuss the advantages and disadvantages of state-of-the-art classifiers in each of these scenarios and provide general recommendations to end-users for selecting the best algorithms for imbalanced data streams. Additionally, we formulate open challenges and future directions for this domain. Our experimental testbed is fully reproducible and easy to extend with new methods. In this way, we propose the first standardized approach to conducting experiments on imbalanced data streams that can be used by other researchers to create trustworthy and fair evaluations of newly proposed methods. Our experimental framework can be downloaded from https://github.com/canoalberto/imbalanced-streams

    Probabilistic XGBoost Threshold Classification with Autoencoder for Credit Card Fraud Detection

    Because legitimate transactions vastly outnumber fraudulent ones, the resulting data imbalance makes fraud detection a challenging task for which an effective solution is hard to find. In this study, an autoencoder with probabilistic threshold shifting of XGBoost (AE-XGB) for credit card fraud detection is designed. AE-XGB first employs an autoencoder, a prevalent dimensionality reduction technique, to extract data features from the latent space representation. The reconstructed lower-dimensional features are then passed to eXtreme Gradient Boosting (XGBoost), an ensemble boosting algorithm, with a probabilistic threshold to classify the data as fraudulent or legitimate. In addition to AE-XGB, other existing ensemble algorithms such as Adaptive Boosting (AdaBoost), Gradient Boosting Machine (GBM), Random Forest, Categorical Boosting (CatBoost), LightGBM, and XGBoost are compared under optimal and default thresholds. To validate the methodology, we used the IEEE-CIS fraud detection dataset for our experiment. Because the class imbalance and high dimensionality of the dataset reduce model performance, the data are preprocessed before training. To evaluate model performance, indicators such as precision, recall, F1-score, G-mean, and Matthews Correlation Coefficient (MCC) are computed. The findings revealed that the proposed AE-XGB model is effective in handling imbalanced data and is able to detect fraudulent transactions among incoming new transactions with 90.4% recall and 90.5% F1-score.
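
    A minimal sketch of the pipeline shape described above, not the paper's code: PCA stands in for the autoencoder's dimensionality reduction, and the probabilistic threshold is tuned on a validation split rather than fixed at 0.5. It assumes scikit-learn and xgboost; the synthetic dataset and F1 as the tuning target are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=5000, n_features=50,
                           weights=[0.97, 0.03], random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, stratify=y, random_state=0)

# Step 1: compress to a lower-dimensional representation (an autoencoder
# in the paper; PCA used here as a simple stand-in).
enc = PCA(n_components=10).fit(X_tr)
Z_tr, Z_val = enc.transform(X_tr), enc.transform(X_val)

# Step 2: gradient-boosted classifier on the compressed features.
clf = XGBClassifier(n_estimators=200, eval_metric="logloss").fit(Z_tr, y_tr)
proba = clf.predict_proba(Z_val)[:, 1]

# Step 3: probabilistic threshold shifting, chosen on validation data
# instead of the default 0.5 cut-off.
best_t = max(np.linspace(0.01, 0.99, 99),
             key=lambda t: f1_score(y_val, proba >= t))
print("optimal threshold:", round(best_t, 2))
```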

    On the class overlap problem in imbalanced data classification.

    Class imbalance is an active research area in the machine learning community. However, existing and recent literature has shown that class overlap has a higher negative impact on the performance of learning algorithms than class imbalance itself. This paper provides a detailed critical discussion and objective evaluation of class overlap in the context of imbalanced data and its impact on classification accuracy. First, we present a thorough experimental comparison of class overlap and class imbalance. Unlike previous work, our experiment was carried out on the full scale of class overlap and an extreme range of class imbalance degrees. Second, we provide an in-depth critical technical review of existing approaches to handling imbalanced datasets. Existing solutions from the selected literature are critically reviewed and categorised as class distribution-based and class overlap-based methods. Emerging techniques and the latest developments in this area are also discussed in detail. Experimental results in this paper are consistent with the existing literature and show clearly that the performance of the learning algorithm deteriorates across varying degrees of class overlap, whereas class imbalance does not always have an effect. The review emphasises the need for further research towards handling class overlap in imbalanced datasets to effectively improve learning algorithms’ performance.
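
    A minimal sketch, not the paper's protocol, of how overlap and imbalance can be varied independently on synthetic data to separate their effects; the class_sep values, minority proportions, and decision-tree learner are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

for class_sep in (2.0, 1.0, 0.5):          # smaller => more class overlap
    for minority in (0.5, 0.1, 0.02):      # smaller => more class imbalance
        X, y = make_classification(n_samples=4000, class_sep=class_sep,
                                   weights=[1 - minority, minority],
                                   flip_y=0, random_state=0)
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y,
                                                  random_state=0)
        clf = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
        score = balanced_accuracy_score(y_te, clf.predict(X_te))
        print(f"sep={class_sep:.1f} minority={minority:.2f} "
              f"balanced acc={score:.3f}")
```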

    Genetic algorithm based feature selection with ensemble methods for student academic performance prediction

    Student academic performance is an important factor that affects the achievement of an educational institution. Educational Data Mining (EDM) is a data mining process applied to explore educational data, producing information related to student academic performance. The knowledge produced by the data mining process is used by educational institutions to improve their teaching processes, with the aim of improving student academic performance. In this paper, a method based on a Genetic Algorithm (GA) feature selection technique combined with a classification method is proposed to predict student academic performance. Almost all previous feature selection techniques apply a local search technique throughout the process, so the optimal solution is quite difficult to achieve. Therefore, GA is applied as a feature selection technique together with an ensemble classification method to improve the classification accuracy of student academic performance prediction; the approach can also be used for datasets with high dimensionality and imbalanced classes. The data used for the experiments come from Kaggle repository datasets consisting of three main categories: behaviour, academic, and demographic. The performance measure used to evaluate the proposed method is the Area Under the Curve (AUC). The results obtained from the experiments show that the proposed method performs impressively in predicting student academic performance.
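
    A minimal sketch of GA-based feature selection with an ensemble fitness model, not the paper's implementation: feature subsets are encoded as bit masks and evolved by selection, one-point crossover, and bit-flip mutation, with cross-validated random-forest AUC as fitness. Population size, rates, generation count, and the dataset are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=30, n_informative=5,
                           weights=[0.85, 0.15], random_state=0)
rng = np.random.default_rng(0)

def fitness(mask):
    """Cross-validated AUC of a random forest on the selected features."""
    if not mask.any():
        return 0.0
    clf = RandomForestClassifier(n_estimators=50, random_state=0)
    return cross_val_score(clf, X[:, mask], y, cv=3, scoring="roc_auc").mean()

pop = rng.random((20, X.shape[1])) < 0.5           # random feature masks
for generation in range(10):
    scores = np.array([fitness(m) for m in pop])
    parents = pop[np.argsort(scores)[-10:]]        # keep the fittest half
    children = []
    while len(children) < len(pop):
        a, b = parents[rng.choice(len(parents), 2, replace=False)]
        cut = rng.integers(1, X.shape[1])          # one-point crossover
        child = np.concatenate([a[:cut], b[cut:]])
        child ^= rng.random(X.shape[1]) < 0.05     # bit-flip mutation
        children.append(child)
    pop = np.stack(children)

best = pop[np.argmax([fitness(m) for m in pop])]
print("selected features:", np.flatnonzero(best))
```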

    Assessing the predictive ability of the Suicide Crisis Inventory for near-term suicidal behavior using machine learning approaches

    OBJECTIVE: This study explores the prediction of near-term suicidal behavior using machine learning (ML) analyses of the Suicide Crisis Inventory (SCI), which measures the Suicide Crisis Syndrome, a presuicidal mental state. METHODS: SCI data were collected from high-risk psychiatric inpatients (N = 591) grouped based on their short-term suicidal behavior, that is, those who attempted suicide between intake and 1-month follow-up dates (N = 20) and those who did not (N = 571). Data were analyzed using three predictive algorithms (logistic regression, random forest, and gradient boosting) and three sampling approaches (split sample, Synthetic Minority Oversampling Technique, and enhanced bootstrap). RESULTS: The enhanced bootstrap approach considerably outperformed the other sampling approaches, with the random forest (98.0% precision; 33.9% recall; 71.0% area under the precision-recall curve [AUPRC]; and 87.8% area under the receiver operating characteristic curve [AUROC]) and gradient boosting (94.0% precision; 48.9% recall; 70.5% AUPRC; and 89.4% AUROC) algorithms performing best in predicting positive cases of near-term suicidal behavior on this dataset. CONCLUSIONS: ML can be useful in analyzing data from psychometric scales, such as the SCI, and for predicting near-term suicidal behavior. However, in cases such as the current analysis where the data are highly imbalanced, the optimal method of measuring performance must be carefully considered and selected.
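
    A minimal sketch, not the study's pipeline (the enhanced-bootstrap procedure is not reproduced here), showing how such a highly imbalanced outcome can be modeled with SMOTE oversampling and scored with AUPRC alongside AUROC, the metric-choice point the conclusions emphasize. It assumes scikit-learn and imbalanced-learn; the synthetic data only mimic the 20/571 class split.

```python
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Roughly mimic the 20 attempters / 571 non-attempters split with synthetic data.
X, y = make_classification(n_samples=591, weights=[0.966, 0.034], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Oversample the rare positive class in the training fold only.
X_bal, y_bal = SMOTE(random_state=0, k_neighbors=3).fit_resample(X_tr, y_tr)
clf = RandomForestClassifier(random_state=0).fit(X_bal, y_bal)
proba = clf.predict_proba(X_te)[:, 1]

# AUPRC (average precision) is far more informative than AUROC when
# positives are this rare.
print("AUPRC:", average_precision_score(y_te, proba))
print("AUROC:", roc_auc_score(y_te, proba))
```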