272 research outputs found
Predicting class-imbalanced business risk using resampling, regularization, and model ensembling algorithms
We aim to develop and improve imbalanced business risk modeling by jointly using proper evaluation criteria, resampling, cross-validation, classifier regularization, and ensembling techniques. The area under the receiver operating characteristic curve (ROC AUC) is used for model comparison based on 10-fold cross-validation. Two undersampling strategies, random undersampling (RUS) and cluster centroid undersampling (CCUS), and two oversampling methods, random oversampling (ROS) and the Synthetic Minority Oversampling Technique (SMOTE), are applied. Three highly interpretable classifiers are implemented: logistic regression without regularization (LR), L1-regularized LR (L1LR), and decision tree (DT). Two ensembling techniques, Bagging and Boosting, are applied to the DT classifier for further model improvement. The results show that Boosting on DT using SMOTE-oversampled data containing 50% positives is the best-performing model, achieving an AUC, recall, and F1 score of 0.8633, 0.9260, and 0.8907, respectively.
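The SMOTE step named in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `smote_like_oversample`, the neighbour count `k`, and the toy data are assumptions made for illustration, and only the 50% positive target comes from the abstract.

```python
import random

def smote_like_oversample(minority, n_majority, k=3, target_pos_frac=0.5, seed=0):
    """Create synthetic minority points by interpolating between a minority
    point and one of its k nearest minority neighbours (SMOTE-style sketch)."""
    if len(minority) < 2:
        raise ValueError("need at least two minority points to interpolate")
    rng = random.Random(seed)
    # Positives required so that positives / (positives + majority) == target_pos_frac.
    n_needed = round(target_pos_frac * n_majority / (1 - target_pos_frac))
    synthetic = []
    while len(minority) + len(synthetic) < n_needed:
        base = rng.choice(minority)
        # Nearest minority neighbours of `base` by squared Euclidean distance.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic

# Hypothetical toy data: 4 positives against 10 negatives.
positives = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_like_oversample(positives, n_majority=10)
print(len(positives) + len(new_points))  # positives after oversampling
```

In practice, resampling of this kind is normally applied only to the training folds inside cross-validation, so that synthetic points never leak into the validation data.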
Leveraging augmentation techniques for tasks with unbalancedness within the financial domain: a two-level ensemble approach
Modern financial markets produce massive datasets that need to be analysed using new modelling techniques such as those from (deep) Machine Learning and Artificial Intelligence. The common goal of these techniques is to forecast the behaviour of the market, which can be translated into various classification tasks, such as predicting the likelihood of companies' bankruptcy or detecting fraud. However, real-world financial data are often unbalanced, meaning that the classes are not equally represented in such datasets. This is the main issue: a Machine Learning model trained on such data is driven mainly by the majority class, which leads to inaccurate predictions. In this paper, we explore different data augmentation techniques to deal with very unbalanced financial data. We consider a number of publicly available datasets, apply state-of-the-art augmentation strategies to them, and evaluate the results for several Machine Learning models trained on the sampled data. The performance of the various approaches is evaluated according to their accuracy and micro and macro F1 scores, and by analyzing the precision and recall over the minority class. We show that a consistent and accurate improvement is achieved when data augmentation is employed. The obtained classification results look promising and indicate the efficiency of augmentation strategies on financial tasks. On the basis of these results, we present an approach focused on classification tasks within the financial domain that takes a dataset as input, identifies what kind of augmentation technique to use, and then applies an ensemble of all the augmentation techniques of the identified type to the input dataset along with an ensemble of different methods to tackle the underlying classification
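The difference between the micro and macro F1 scores mentioned in this abstract can be seen in a small sketch. The function `f1_scores` and the 95/5 toy labels are assumptions for illustration and are not taken from the paper.

```python
def f1_scores(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label classification."""
    labels = sorted(set(y_true) | set(y_pred))
    per_class = []
    tp_sum = fp_sum = fn_sum = 0
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # Per-class F1; defined as 0 when the class is never predicted or present.
        per_class.append(2 * tp / denom if denom else 0.0)
        tp_sum, fp_sum, fn_sum = tp_sum + tp, fp_sum + fp, fn_sum + fn
    macro = sum(per_class) / len(per_class) if per_class else 0.0
    micro_denom = 2 * tp_sum + fp_sum + fn_sum
    micro = 2 * tp_sum / micro_denom if micro_denom else 0.0
    return micro, macro

# Hypothetical unbalanced case: a model that always predicts the majority class.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100
micro, macro = f1_scores(y_true, y_pred)
print(round(micro, 3), round(macro, 3))  # 0.95 0.487
```

Micro F1 pools all decisions and here equals plain accuracy, so it stays high, whereas macro F1 averages per-class scores and exposes the ignored minority class.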
A comparative analysis of machine learning models for corporate default forecasting
This study examines the potential benefits of utilizing machine learning models for
default forecasting by comparing the discriminatory power of the random forest and XGBoost
models with traditional statistical models. The results of the evaluation with out-of-time
predictions show that the machine learning models exhibit a higher discriminatory power
compared to the traditional models. The reduction in the sample size of the training dataset
leads to a decrease in predictive power of the machine learning models, reducing the difference
in performance between the two model types. While modifications in model dimensionality
have a limited impact on the discriminatory power of the statistical models, the predictive power
of machine learning models increases with the addition of further predictors. When employing
a clustering approach, both traditional and machine learning models exhibit an improvement in
discriminatory power in the small, medium, and large firm size clusters compared to the
previous non-clustering specifications. Machine learning models exhibit a significantly higher
ability to classify micro firms. The findings of this research indicate that the machine learning
models exhibit superior discriminatory power compared to the traditional models across the
different specifications. Machine learning models can be used to forecast the potential impact
of corporate default of non-financial micro corporations on the Portuguese labour market by
estimating the number of jobs at risk.
New Hybrid Data Preprocessing Technique for Highly Imbalanced Dataset
One of the most challenging problems in real-world datasets is the growing prevalence of imbalanced data. When the majority class heavily outnumbers the minority class, results can be misleading, because conventional machine learning algorithms were designed on the assumption of an equal class distribution. The purpose of this study is to build a hybrid data preprocessing approach that deals with the class imbalance issue by combining resampling approaches and cost-sensitive learning (CSL) for fraud detection on a real-world dataset. The proposed hybrid approach consists of two steps. The first step compares several resampling approaches to find the technique with the highest performance on the validation set. The second step applies CSL with an optimal weight ratio to the resampled data from the first step. The hybrid technique was found to have a positive impact, with F2-measures of 0.987, 0.974, 0.847, and 0.853 for RF, DT, XGBoost, and LGBM, respectively. It also obtained the highest prediction performance relative to the conventional methods.
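The F2-measure reported in this abstract is the F-beta score with beta = 2, which weights recall more heavily than precision. A minimal sketch follows; the function name `fbeta` and the confusion counts are hypothetical and are not results from the study.

```python
def fbeta(tp, fp, fn, beta=2.0):
    """F-beta from confusion counts: (1 + b^2) * TP / ((1 + b^2) * TP + b^2 * FN + FP)."""
    b2 = beta ** 2
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0

# Hypothetical fraud detector: 80 frauds caught, 40 false alarms, 20 frauds missed.
print(round(fbeta(80, 40, 20, beta=1.0), 4))  # 0.7273 (F1)
print(round(fbeta(80, 40, 20, beta=2.0), 4))  # 0.7692 (F2)
```

With recall (0.80) above precision (about 0.67) in this example, F2 exceeds F1, which is why F2 suits fraud detection settings where missed positives cost more than false alarms.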
Financial ratios as a powerful instrument to predict insolvency: a study using the boosting algorithm in Colombian firms
This study is motivated by the importance of accurately predicting insolvency before it happens. The paper aims to develop an insolvency prediction model for Colombian firms with one, two and three years of anticipation through financial ratios, keeping sample structures and taking into account insolvency-related regulation. This research contributes to the literature because, unlike many studies, it takes legislation into account, explains the different types of financial ratios, and uses boosting algorithms without biasing the sample. Data from 11,812 Colombian companies covering the period 2012-2016 were used. The results show accuracy above 70% for insolvency prediction with one, two and three years of anticipation.
A review of ensemble learning and data augmentation models for class imbalanced problems: combination, implementation and evaluation
Class imbalance (CI) in classification problems arises when the number of
observations belonging to one class is lower than in the other. Ensemble learning
combines multiple models to obtain a robust model and has been prominently used
with data augmentation methods to address class imbalance problems. In the last
decade, a number of strategies have been added to enhance ensemble learning and
data augmentation methods, along with new methods such as generative
adversarial networks (GANs). A combination of these has been applied in many
studies, and the evaluation of different combinations would enable a better
understanding and guidance for different application domains. In this paper, we
present a computational study to evaluate data augmentation and ensemble
learning methods used to address prominent benchmark CI problems. We present a
general framework that evaluates 9 data augmentation and 9 ensemble learning
methods for CI problems. Our objective is to identify the most effective
combination for improving classification performance on imbalanced datasets.
The results indicate that combinations of data augmentation methods with
ensemble learning can significantly improve classification performance on
imbalanced datasets. We find that traditional data augmentation methods such as
the synthetic minority oversampling technique (SMOTE) and random oversampling
(ROS) are not only better in performance for selected CI problems, but also
computationally less expensive than GANs. Our study is vital for the
development of novel models for handling imbalanced datasets.