272 research outputs found

    Predicting class-imbalanced business risk using resampling, regularization, and model ensembling algorithms

    We aim to develop and improve imbalanced business risk modeling by jointly using proper evaluation criteria, resampling, cross-validation, classifier regularization, and ensembling techniques. The Area Under the Receiver Operating Characteristic Curve (AUC of ROC) is used for model comparison based on 10-fold cross-validation. Two undersampling strategies, random undersampling (RUS) and cluster centroid undersampling (CCUS), and two oversampling methods, random oversampling (ROS) and the Synthetic Minority Oversampling Technique (SMOTE), are applied. Three highly interpretable classifiers are implemented: logistic regression without regularization (LR), L1-regularized LR (L1LR), and decision tree (DT). Two ensembling techniques, Bagging and Boosting, are applied to the DT classifier for further model improvement. The results show that Boosting on DT, trained on data oversampled via SMOTE to contain 50% positives, is the optimal model, achieving an AUC, recall, and F1 score of 0.8633, 0.9260, and 0.8907, respectively.
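The SMOTE step named in this abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `smote_sketch` and `oversample_to_ratio`, the neighbour count `k=5`, and the toy data are all assumptions made for illustration. The sketch only shows the core idea, which is to create synthetic minority points by interpolating between a minority sample and one of its nearest minority neighbours until positives reach a target share (50% here, as in the abstract).

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    if k < 1:
        raise ValueError("need at least two minority samples")
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself)
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


def oversample_to_ratio(majority, minority, positive_share=0.5, seed=0):
    """Add synthetic minority points until positives make up positive_share."""
    target = round(positive_share * len(majority) / (1.0 - positive_share))
    n_new = max(0, target - len(minority))
    return minority + smote_sketch(minority, n_new, seed=seed)


# Hypothetical toy data: 8 majority points, 3 minority points.
majority = [[float(i), float(i % 3)] for i in range(8)]
minority = [[10.0, 10.0], [11.0, 10.5], [10.5, 11.0]]
balanced_minority = oversample_to_ratio(majority, minority)
print(len(majority), len(balanced_minority))
```

When oversampling is combined with cross-validation, it is usually applied to the training folds only, so that synthetic points do not leak into the held-out fold used to compute the AUC.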

    Leveraging augmentation techniques for tasks with unbalancedness within the financial domain: a two-level ensemble approach

    Modern financial markets produce massive datasets that need to be analysed using new modelling techniques such as those from (deep) Machine Learning and Artificial Intelligence. The common goal of these techniques is to forecast the behaviour of the market, which can be translated into various classification tasks, for instance predicting the likelihood of companies’ bankruptcy or detecting fraud. However, real-world financial data are often unbalanced, meaning that the classes are not equally represented in such datasets. This is the main issue, since a Machine Learning model is then trained mainly on the majority class, leading to inaccurate predictions. In this paper, we explore different data augmentation techniques to deal with very unbalanced financial data. We consider a number of publicly available datasets, apply state-of-the-art augmentation strategies to them, and evaluate the results for several Machine Learning models trained on the sampled data. The performance of the various approaches is evaluated according to their accuracy and their micro and macro F1 scores, and finally by analyzing the precision and recall over the minority class. We show that a consistent improvement is achieved when data augmentation is employed. The obtained classification results look promising and indicate the efficiency of augmentation strategies on financial tasks. On the basis of these results, we present an approach focused on classification tasks within the financial domain that takes a dataset as input, identifies what kind of augmentation technique to use, and then applies an ensemble of all the augmentation techniques of the identified type to the input dataset, along with an ensemble of different methods to tackle the underlying classification.
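The evaluation metrics listed in this abstract behave very differently on unbalanced data, which a small sketch can make concrete. The helper `f1_scores` and the ten-sample example below are hypothetical and not taken from the paper. They show a classifier that always predicts the majority class: accuracy and micro F1 look strong, while macro F1 and the minority-class F1 expose the failure.

```python
def f1_scores(y_true, y_pred):
    """Return (micro F1, macro F1, per-class F1) for single-label predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    per_class = {}
    tp_sum = fp_sum = fn_sum = 0
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears
        per_class[label] = 2 * tp / denom if denom else 0.0
        tp_sum, fp_sum, fn_sum = tp_sum + tp, fp_sum + fp, fn_sum + fn
    macro = sum(per_class.values()) / len(per_class)  # unweighted class mean
    micro_denom = 2 * tp_sum + fp_sum + fn_sum
    micro = 2 * tp_sum / micro_denom if micro_denom else 0.0  # pooled counts
    return micro, macro, per_class


# Hypothetical example: 9 negatives, 1 positive, model always predicts 0.
y_true = [0] * 9 + [1]
y_pred = [0] * 10
micro, macro, per_class = f1_scores(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy, micro, macro, per_class)
```

In the single-label setting micro F1 coincides with accuracy, so macro F1 and minority-class precision and recall are the more informative figures for unbalanced tasks.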

    A comparative analysis of machine learning models for corporate default forecasting

    This study examines the potential benefits of utilizing machine learning models for default forecasting by comparing the discriminatory power of the random forest and XGBoost models with traditional statistical models. The results of the evaluation with out-of-time predictions show that the machine learning models exhibit a higher discriminatory power compared to the traditional models. The reduction in the sample size of the training dataset leads to a decrease in predictive power of the machine learning models, reducing the difference in performance between the two model types. While modifications in model dimensionality have a limited impact on the discriminatory power of the statistical models, the predictive power of machine learning models increases with the addition of further predictors. When employing a clustering approach, both traditional and machine learning models exhibit an improvement in discriminatory power in the small, medium, and large firm size clusters compared to the previous non-clustering specifications. Machine learning models exhibit a significantly higher ability to classify micro firms. The findings of this research indicate that the machine learning models exhibit superior discriminatory power compared to the traditional models across the different specifications. Machine learning models can be used to forecast the potential impact of corporate default of non-financial micro corporations on the Portuguese labour market by estimating the number of jobs at risk.

    New Hybrid Data Preprocessing Technique for Highly Imbalanced Dataset

    One of the most challenging problems in real-world datasets is the growing prevalence of imbalanced data. Because the majority class far outnumbers the minority class, results can be misleading, as conventional machine learning algorithms were designed on the assumption of equal class distribution. The purpose of this study is to build a hybrid data preprocessing approach that deals with the class imbalance issue by applying resampling approaches and cost-sensitive learning (CSL) for fraud detection using a real-world dataset. The proposed hybrid approach consists of two steps. The first step compares several resampling approaches to find the technique with the highest performance on the validation set. The second step applies CSL with an optimal weight ratio to the resampled data from the first step. The hybrid technique was found to have a positive impact, with F2-measures of 0.987, 0.974, 0.847, and 0.853 for RF, DT, XGBOOST, and LGBM, respectively. Additionally, relative to the conventional methods, it obtained the highest prediction performance.
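The F2-measure reported in this abstract is the F-beta score with beta = 2, which weights recall more heavily than precision. A minimal sketch of the standard formula follows. The function name `f_beta` and the confusion-matrix counts are illustrative assumptions and are not taken from the paper.

```python
def f_beta(tp, fp, fn, beta=2.0):
    """F-beta from confusion counts: (1 + b^2) TP / ((1 + b^2) TP + b^2 FN + FP)."""
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0


# Hypothetical counts: precision = 80/120, recall = 80/100.
tp, fp, fn = 80, 40, 20
f1 = f_beta(tp, fp, fn, beta=1.0)
f2 = f_beta(tp, fp, fn, beta=2.0)
print(round(f1, 4), round(f2, 4))
```

Because recall exceeds precision in this example, F2 comes out higher than F1. In fraud detection a missed fraud case (false negative) is typically costlier than a false alarm, which is one common reason to prefer F2.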

    Indicadores financeiros como poderoso instrumento para prever insolvência. Um estudo usando o algoritmo boosting em empresas colombianas

    This study is motivated by the importance of accurately predicting insolvency before it happens. The paper aims to develop an insolvency prediction model for Colombian firms with one, two and three years of anticipation through financial ratios, preserving the original sample structure and taking into account insolvency-related regulation. This research contributes to the literature because, unlike many studies, it takes legislation into account, explains the different types of financial ratios, and uses boosting algorithms without biasing the sample. Data from 11,812 Colombian companies covering the period 2012-2016 were used. The results show accuracy above 70% for insolvency prediction with one, two and three years of anticipation.

    A review of ensemble learning and data augmentation models for class imbalanced problems: combination, implementation and evaluation

    Class imbalance (CI) in classification problems arises when the number of observations belonging to one class is lower than in the other. Ensemble learning combines multiple models to obtain a robust model and has been prominently used with data augmentation methods to address class imbalance problems. In the last decade, a number of strategies have been added to enhance ensemble learning and data augmentation methods, along with new methods such as generative adversarial networks (GANs). A combination of these has been applied in many studies, and evaluating different combinations would enable better understanding and guidance for different application domains. In this paper, we present a computational study to evaluate data augmentation and ensemble learning methods used to address prominent benchmark CI problems. We present a general framework that evaluates 9 data augmentation and 9 ensemble learning methods for CI problems. Our objective is to identify the most effective combination for improving classification performance on imbalanced datasets. The results indicate that combinations of data augmentation methods with ensemble learning can significantly improve classification performance on imbalanced datasets. We find that traditional data augmentation methods such as the synthetic minority oversampling technique (SMOTE) and random oversampling (ROS) not only perform better on selected CI problems, but are also computationally less expensive than GANs. Our study is vital for the development of novel models for handling imbalanced datasets.