1,071 research outputs found

    Hyperparameter optimisation for improving classification under class imbalance

    Although the class-imbalance classification problem has attracted a great deal of attention, hyperparameter optimisation has not been studied in detail in this field. Both classification algorithms and resampling techniques involve hyperparameters that can be tuned. This paper sets up several experiments and concludes that, compared to using default hyperparameters, applying hyperparameter optimisation to both the classification algorithm and the resampling approach produces the best results when classifying imbalanced datasets. Moreover, the paper shows that data complexity, especially the overlap between classes, has a large impact on the improvement that can be achieved through hyperparameter optimisation. The experimental results also indicate that resampling techniques cannot improve performance for some complex datasets, which further emphasises the importance of analysing data complexity before dealing with imbalanced datasets.
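
    As a concrete illustration of tuning both components at once, the sketch below (a minimal example, not the paper's experimental setup) jointly searches the hyperparameters of a SMOTE resampler and a random-forest classifier with scikit-learn and imbalanced-learn; the synthetic dataset, parameter ranges and scoring metric are assumptions.

```python
# Minimal sketch: jointly tuning a resampler (SMOTE) and a classifier
# (random forest) on an imbalanced dataset, instead of relying on defaults.
# Requires scikit-learn and imbalanced-learn; data and ranges are illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

# Synthetic imbalanced dataset (~5% minority class).
X, y = make_classification(n_samples=5000, weights=[0.95], flip_y=0.02, random_state=0)

pipe = Pipeline([("smote", SMOTE(random_state=0)),
                 ("clf", RandomForestClassifier(random_state=0))])

# Both the resampling step and the classifier expose tunable hyperparameters.
param_space = {
    "smote__k_neighbors": [3, 5, 7],
    "smote__sampling_strategy": [0.5, 0.75, 1.0],
    "clf__n_estimators": [100, 300, 500],
    "clf__max_depth": [None, 5, 10],
    "clf__min_samples_leaf": [1, 5, 10],
}

search = RandomizedSearchCV(pipe, param_space, n_iter=20,
                            scoring="balanced_accuracy",
                            cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
                            random_state=0, n_jobs=-1)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```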

    An empirical evaluation of imbalanced data strategies from a practitioner's point of view

    This research tested the following well-known strategies for dealing with binary imbalanced data on 82 different real-life data sets (sampled to imbalance rates of 5%, 3%, 1%, and 0.1%): class weight, SMOTE, underbagging, and a baseline (just the base classifier). As base classifiers we used SVM with an RBF kernel, random forests, and gradient boosting machines, and we measured the quality of the resulting classifier using six different metrics (area under the curve, accuracy, F-measure, G-mean, Matthews correlation coefficient, and balanced accuracy). The best strategy strongly depends on the metric used to measure the quality of the classifier: for AUC and accuracy, class weight and the baseline perform better; for F-measure and MCC, SMOTE performs better; and for G-mean and balanced accuracy, underbagging performs better.
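
    For reference, the following sketch shows how such a strategy comparison can be set up with scikit-learn and imbalanced-learn; it is an illustrative approximation of the protocol described above, and the synthetic data, imbalance rate and model settings are assumptions rather than the study's configuration.

```python
# Minimal sketch: compare class weight, SMOTE and underbagging against a plain
# baseline under several metrics on a synthetic imbalanced dataset.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import (roc_auc_score, accuracy_score, f1_score,
                             matthews_corrcoef, balanced_accuracy_score)
from sklearn.svm import SVC
from imblearn.metrics import geometric_mean_score
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from imblearn.ensemble import BalancedBaggingClassifier

X, y = make_classification(n_samples=4000, weights=[0.99], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

strategies = {
    "baseline":     SVC(kernel="rbf", probability=True, random_state=0),
    "class_weight": SVC(kernel="rbf", class_weight="balanced", probability=True, random_state=0),
    "smote":        Pipeline([("smote", SMOTE(random_state=0)),
                              ("clf", SVC(kernel="rbf", probability=True, random_state=0))]),
    "underbagging": BalancedBaggingClassifier(n_estimators=50, random_state=0),
}

for name, model in strategies.items():
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    proba = model.predict_proba(X_te)[:, 1]
    # Report the six metrics used in the study.
    print(name,
          f"AUC={roc_auc_score(y_te, proba):.3f}",
          f"Acc={accuracy_score(y_te, pred):.3f}",
          f"F1={f1_score(y_te, pred):.3f}",
          f"G-mean={geometric_mean_score(y_te, pred):.3f}",
          f"MCC={matthews_corrcoef(y_te, pred):.3f}",
          f"BalAcc={balanced_accuracy_score(y_te, pred):.3f}")
```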

    Automated grading of chest x-ray images for viral pneumonia with convolutional neural networks ensemble and region of interest localization

    Following its initial identification on December 31, 2019, COVID-19 quickly spread around the world as a pandemic, claiming more than six million lives. Early diagnosis with appropriate intervention can help prevent deaths and serious illness, as the distinguishing symptoms that set COVID-19 apart from pneumonia and influenza often do not appear until the patient has already suffered significant damage. A chest X-ray (CXR), one of the most widely used imaging modalities, offers a non-invasive method of detection, and CXR image analysis can also reveal additional disorders, such as pneumonia, that appear as anomalies in the lungs. These CXRs can therefore be used for automated grading, aiding doctors in making a better diagnosis. To classify a CXR image as Negative for Pneumonia, Typical, Indeterminate, or Atypical, we used the publicly available CXR competition dataset SIIM-FISABIO-RSNA COVID-19 from Kaggle. The proposed architecture employs an ensemble of EfficientNetV2-L classifiers, trained via transfer learning from ImageNet-21k weights on various subsets of the data (code for the proposed methodology is available at: https://github.com/asadkhan1221/siim-covid19.git). To identify and localise opacities, an ensemble of YOLO detectors was combined using Weighted Boxes Fusion (WBF). Adding auxiliary classification heads to the CNN backbone enabled significant gains in generalisability, and the method was improved further by using test-time augmentation for both the classifiers and the localisers. The mean average precision results show that the proposed deep learning model achieves 0.617 and 0.609 on the public and private sets respectively, which is comparable to other techniques on this Kaggle dataset.
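
    The sketch below illustrates the two ensembling ingredients named in the abstract, a transfer-learned EfficientNetV2-L classifier (via timm) and Weighted Boxes Fusion over detector outputs (via the ensemble-boxes package); it is not the authors' code (their repository is linked above), and the checkpoint name, thresholds and box values are assumptions.

```python
# Minimal sketch of the classification and box-fusion components; not the
# authors' implementation. The timm checkpoint name is an assumption and may
# differ between timm versions.
import timm
from ensemble_boxes import weighted_boxes_fusion

# 4-way classifier: Negative for Pneumonia, Typical, Indeterminate, Atypical.
# pretrained=True downloads ImageNet-21k-pretrained weights for fine-tuning.
classifier = timm.create_model("tf_efficientnetv2_l.in21k", pretrained=True, num_classes=4)

# Hypothetical detections from two YOLO models on one CXR image,
# normalised to [0, 1] in (x1, y1, x2, y2) format.
boxes_list = [
    [[0.10, 0.20, 0.45, 0.60], [0.55, 0.30, 0.90, 0.70]],  # model 1
    [[0.12, 0.22, 0.47, 0.58]],                              # model 2
]
scores_list = [[0.83, 0.61], [0.77]]
labels_list = [[0, 0], [0]]  # single "opacity" class

# Weighted Boxes Fusion merges overlapping detections across the ensemble.
fused_boxes, fused_scores, fused_labels = weighted_boxes_fusion(
    boxes_list, scores_list, labels_list,
    weights=[1, 1], iou_thr=0.55, skip_box_thr=0.05)
print(fused_boxes, fused_scores)
```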

    Using recency, frequency and monetary variables to predict customer lifetime value with XGBoost

    Customer relationship management (CRM) will continue to gain prominence in the coming years. A commonly used CRM metric, customer lifetime value (CLV), is the value a customer will contribute while they are an active customer. This study investigated the ability of supervised machine learning models built with XGBoost to predict future CLV, as well as the likelihood that a customer will drop to a lower CLV in the future. One approach to determining CLV, the RFM method, isolates recency (R), frequency (F) and monetary (M) values. The models used these RFM variables and also assessed whether including temporal, product, and other customer transaction information helped the XGBoost classifier make better predictions. The classification models were constructed by extracting each customer's RFM values and transaction information from a fast-moving consumer goods dataset. Different variations of CLV were calculated through one- and two-dimensional K-means clustering of the M (monetary), F and M (profitability), F and R (loyalty), and R and M (burgeoning) variables. Two additional CLV variations were determined by isolating the M tercile segments and by a commonly used weighted-RFM approach. To test the effectiveness of XGBoost in predicting future timeframes, the dataset was divided into three consecutive periods, where the first period provided the features used to predict the target CLV variables in the second and third periods. Models that predicted whether CLV dropped to a lower value from the first to the second and from the first to the third period were also constructed. The XGBoost models were found to be moderately to highly effective in classifying future CLV in both the second and third periods, and they effectively predicted whether CLV would drop to a lower value in both future periods. The ability to predict future CLV and CLV drop in the second period was only slightly better than the ability to predict them in the third period. Models constructed by adding temporal, product, and customer transaction information to the RFM values did not improve on those that used only the RFM values. These findings illustrate the effectiveness of XGBoost as a predictor of future CLV and CLV drop, and affirm the efficacy of using RFM values to determine future CLV.
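
    A minimal sketch of the general recipe described above, assuming a simple transaction table with customer_id, date and amount columns: RFM features from an earlier period, a CLV class obtained by clustering frequency and monetary value in a later period, and an XGBoost classifier that predicts that class. It is not the study's implementation, and the file name, period boundaries and cluster count are assumptions.

```python
# Minimal sketch: RFM features -> K-means CLV classes -> XGBoost classifier.
import pandas as pd
from sklearn.cluster import KMeans
from xgboost import XGBClassifier

def rfm(transactions: pd.DataFrame, as_of: pd.Timestamp) -> pd.DataFrame:
    """Recency (days since last purchase), frequency and monetary value per customer."""
    return transactions.groupby("customer_id").agg(
        recency=("date", lambda d: (as_of - d.max()).days),
        frequency=("date", "count"),
        monetary=("amount", "sum"),
    )

# Hypothetical transaction file, split into two consecutive periods.
tx = pd.read_csv("transactions.csv", parse_dates=["date"])
period_1 = tx[tx["date"] < "2023-01-01"]
period_2 = tx[tx["date"] >= "2023-01-01"]

features = rfm(period_1, pd.Timestamp("2023-01-01"))
target_rfm = rfm(period_2, pd.Timestamp("2024-01-01"))

# CLV class from two-dimensional K-means over frequency and monetary value
# (analogous to the "profitability" variation); 3 clusters is an assumption.
target_rfm["clv_class"] = KMeans(n_clusters=3, n_init=10, random_state=0) \
    .fit_predict(target_rfm[["frequency", "monetary"]])

# Train on period-1 RFM features to predict the period-2 CLV class.
data = features.join(target_rfm["clv_class"], how="inner")
model = XGBClassifier(n_estimators=300, max_depth=4, learning_rate=0.1)
model.fit(data[["recency", "frequency", "monetary"]], data["clv_class"])
```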

    A comparative analysis of machine learning models for corporate default forecasting

    This study examines the potential benefits of machine learning models for default forecasting by comparing the discriminatory power of random forest and XGBoost models with that of traditional statistical models. Evaluation with out-of-time predictions shows that the machine learning models exhibit higher discriminatory power than the traditional models. Reducing the sample size of the training dataset decreases the predictive power of the machine learning models, narrowing the performance gap between the two model types. While changes in model dimensionality have a limited impact on the discriminatory power of the statistical models, the predictive power of the machine learning models increases as further predictors are added. When a clustering approach is employed, both traditional and machine learning models improve in discriminatory power in the small, medium, and large firm-size clusters compared to the non-clustered specifications, and the machine learning models show a significantly higher ability to classify micro firms. Overall, the findings indicate that the machine learning models have superior discriminatory power across the different specifications, and they can be used to forecast the potential impact of corporate defaults of non-financial micro corporations on the Portuguese labour market by estimating the number of jobs at risk.
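
    The comparison described above can be sketched as follows, assuming a firm-year panel with a binary default flag and a year column for the out-of-time split; the file name, feature handling and model settings are illustrative, not the study's specification.

```python
# Minimal sketch: out-of-time comparison of discriminatory power (AUC) between
# a traditional statistical model (logistic regression) and tree-based
# machine learning models (random forest, XGBoost).
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier

df = pd.read_csv("firm_year_panel.csv")  # hypothetical firm-year panel
features = [c for c in df.columns if c not in ("default", "year")]

# Out-of-time evaluation: train on earlier years, test on the most recent year.
train, test = df[df["year"] < 2019], df[df["year"] >= 2019]

models = {
    "logit":         make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "random_forest": RandomForestClassifier(n_estimators=500, random_state=0),
    "xgboost":       XGBClassifier(n_estimators=500, max_depth=4, learning_rate=0.05),
}

for name, model in models.items():
    model.fit(train[features], train["default"])
    scores = model.predict_proba(test[features])[:, 1]
    # AUC as the measure of discriminatory power.
    print(name, roc_auc_score(test["default"], scores))
```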
