Predictive Modelling for Loan Defaults
In this paper we explore how predictive modelling can be applied to loan default prediction. Predicting whether a loan will be fully paid or will default is a binary classification problem. We explore the use of different machine learning models and their performance, namely logistic regression, random forest, neural network, extreme gradient boosting (XGBoost) and an ensemble. Additionally, as is the case with much industry data, class imbalance is an issue, and the data cannot be used as is in a model; otherwise the model will suffer from bias. To address this issue, we explore sampling techniques, such as SMOTE and ADASYN, and cost-sensitive learning techniques, such as class weights. Finally, using precision, recall, G-mean and F-measure, as well as the area under the precision-recall curve (AUC), to examine the results of each model, we found that no balancing method is consistently superior. While all models performed well after applying a balancing method, XGBoost with class weights performed best. With a robust model, there are potential opportunities to leverage it in optimizing profits to produce a greater return on investment. Using the best model, return on investment was improved by 83%
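The evaluation measures named in this abstract can all be derived from a confusion matrix. The following is a minimal sketch, not the authors' code: the function name `imbalance_metrics` and the loan counts are hypothetical, and the default class is assumed to be the positive class. It illustrates why G-mean and F-measure are preferred over plain accuracy when defaults are rare.

```python
import math


def imbalance_metrics(tp, fp, tn, fn):
    """Compute common imbalance-aware metrics from confusion-matrix counts.

    The positive class is assumed to be the minority class (loan default).
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0       # sensitivity
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    g_mean = math.sqrt(recall * specificity)            # geometric mean of class-wise recalls
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "g_mean": g_mean,
    }


# Hypothetical portfolio: 1,000 loans, 50 of which default.
# A trivial model that predicts "fully paid" for every loan:
always_paid = imbalance_metrics(tp=0, fp=0, tn=950, fn=50)
# A hypothetical balanced model that catches 40 of the 50 defaults:
balanced_model = imbalance_metrics(tp=40, fp=95, tn=855, fn=10)

print(always_paid)
print(balanced_model)
```

In this made-up example the trivial model has the higher accuracy but a G-mean and F-measure of zero, which is the kind of bias the abstract warns about.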
A critical assessment of imbalanced class distribution problem: the case of predicting freshmen student attrition
Predicting student attrition is an intriguing yet challenging problem for any academic institution. Class-imbalanced data is common in the field of student retention, mainly because many students register but comparatively few drop out. Classification techniques applied to imbalanced data sets can yield deceptively high prediction accuracy, because overall accuracy is usually driven by the majority class at the expense of very poor performance on the crucial minority class. In this study, we compared different data-balancing techniques to improve predictive accuracy on the minority class while maintaining satisfactory overall classification performance. Specifically, we tested three balancing techniques (oversampling, under-sampling and synthetic minority over-sampling, SMOTE) along with four popular classification methods (logistic regression, decision trees, neural networks and support vector machines). We used a large and feature-rich institutional student data set (covering the years 2005 to 2011) to assess the efficacy of both the balancing techniques and the prediction methods. The results indicated that the support vector machine combined with the SMOTE data-balancing technique achieved the best classification performance, with a 90.24% overall accuracy on the 10-fold holdout sample. All three data-balancing techniques improved the prediction accuracy for the minority class. Applying sensitivity analyses to the developed models, we also identified the most important variables for accurate prediction of student attrition. Application of these models has the potential to accurately predict at-risk students and help reduce student dropout rates
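The core idea of SMOTE, generating synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours, can be sketched in a few lines. This is an illustrative simplification under stated assumptions, not the procedure used in the study: the function name `smote`, the toy feature vectors and the choice to pick base points at random are all hypothetical, and only numeric features are handled.

```python
import math
import random


def smote(minority, n_new, k=2, seed=0):
    """Return n_new synthetic points interpolated between minority samples.

    Each synthetic point lies on the line segment between a randomly chosen
    minority sample and one of its k nearest minority-class neighbours.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base_idx = rng.randrange(len(minority))
        base = minority[base_idx]
        # Candidate neighbours: every other minority sample, nearest first.
        others = [p for i, p in enumerate(minority) if i != base_idx]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Hypothetical two-feature minority class (e.g. students who dropped out).
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote(minority, n_new=6, k=2, seed=42)
print(new_points)
```

Random oversampling, by contrast, would simply duplicate existing minority rows, and under-sampling would discard majority rows; SMOTE differs in that it creates new points inside the region already occupied by the minority class.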
Basel II compliant credit risk modelling: model development for imbalanced credit scoring data sets, loss given default (LGD) and exposure at default (EAD)
The purpose of this thesis is to determine, and to better inform industry practitioners about, the most appropriate classification and regression techniques for modelling the three key credit risk components of the Basel II minimum capital requirement: probability of default (PD), loss given default (LGD), and exposure at default (EAD). The Basel II accord regulates risk and capital management requirements to ensure that a bank holds enough capital proportional to the exposed risk of its lending practices. Under the advanced internal ratings-based (IRB) approach, Basel II allows banks to develop their own empirical models based on historical data for each of PD, LGD and EAD. In this thesis, first the issue of imbalanced credit scoring data sets, a special case of PD modelling where the number of defaulting observations in a data set is much lower than the number of observations that do not default, is identified, and the suitability of various classification techniques is analysed and presented. As well as using traditional classification techniques, this thesis also explores the suitability of gradient boosting, least squares support vector machines and random forests as forms of classification. The second part of this thesis focuses on the prediction of LGD, which measures the economic loss, expressed as a percentage of the exposure, in case of default. In this thesis, various state-of-the-art regression techniques to model LGD are considered. In the final part of this thesis we investigate models for predicting the exposure at default (EAD). For off-balance-sheet items (for example credit cards), calculating the EAD requires the committed but unused loan amount multiplied by a credit conversion factor (CCF). Ordinary least squares (OLS), logistic and cumulative logistic regression models are analysed, as well as an OLS model with Beta transformation, with the main aim of finding the most robust and comprehensible model for the prediction of the CCF. A direct estimation of EAD, using an OLS model, is also analysed. All the models built and presented in this thesis have been applied to real-life data sets from major global banking institutions
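The relationship between EAD and the CCF described above can be illustrated with a short worked example. This is a sketch under an assumption: it uses the common convention that EAD equals the drawn balance plus the CCF times the undrawn committed amount, which the abstract does not spell out. The function names and the figures are hypothetical.

```python
def exposure_at_default(drawn, limit, ccf):
    """EAD for a revolving facility: drawn balance plus the CCF-weighted
    share of the committed but unused amount (assumed convention)."""
    undrawn = limit - drawn
    return drawn + ccf * undrawn


def realised_ccf(drawn_at_reference, limit, ead):
    """Back out the CCF observed for a defaulted account: the fraction of
    the undrawn amount at the reference date that was drawn by default."""
    undrawn = limit - drawn_at_reference
    if undrawn == 0:
        raise ValueError("CCF is undefined for a fully drawn facility")
    return (ead - drawn_at_reference) / undrawn


# Hypothetical credit card: limit 5,000, balance 2,000, assumed CCF of 0.4.
ead = exposure_at_default(drawn=2000.0, limit=5000.0, ccf=0.4)
print(ead)
print(realised_ccf(drawn_at_reference=2000.0, limit=5000.0, ead=ead))
```

Modelling the CCF, as opposed to modelling EAD directly, means regressing a quantity like the output of `realised_ccf` on account characteristics and then converting the prediction back to a monetary exposure.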
Feature selection in credit risk modeling: an international evidence
This paper aims to discover a suitable combination of contemporary feature selection techniques and robust prediction classifiers. To examine the impact of the feature selection method on classifier performance, we use two Chinese and three other real-world credit scoring datasets. The feature selection methods used are the least absolute shrinkage and selection operator (LASSO) and multivariate adaptive regression splines (MARS). The examined classifiers are classification and regression trees (CART), logistic regression (LR), artificial neural network (ANN), and support vector machines (SVM). Empirical findings confirm that LASSO feature selection followed by the robust SVM classifier demonstrates remarkable improvement and outperforms the other competing classifiers. Moreover, ANN also offers improved accuracy with feature selection methods, whereas LR improves classification efficiency only when feature selection is performed via LASSO. CART, however, shows no improvement in any combination. The proposed credit scoring modeling strategy may be used to develop policy, progressive ideas, and operational guidelines for effective credit risk management at lending and other financial institutions. The findings of this study have practical value because, to date, there is no consensus on the best combination of feature selection method and prediction classifier
Credit Risk Management Using Automatic Machine Learning
The article presents the basic techniques of data mining implemented in typical commercial software. They were used to assess the risk of credit card debt repayment. The article assesses the quality of classification models derived from data mining techniques and compares their results with the traditional approach using a logit model to assess credit risk. It turns out that data mining models provide similar accuracy of classification compared to the logit model, but they require much less work and facilitate the automation of the process of building scoring models
Low-Default Portfolio/One-Class Classification: A Literature Review
Consider a bank which wishes to decide whether a credit applicant will obtain credit or not. The bank has to assess whether the applicant will be able to repay the credit. This is done by estimating the probability that the applicant will default prior to the maturity of the credit. To estimate this probability of default it is first necessary to identify criteria which separate the good from the bad creditors, such as loan amount and age or factors concerning the income of the applicant. The question then arises of how a bank identifies a sufficient number of selective criteria that possess the necessary discriminatory power. As a solution, many traditional binary classification methods have been proposed with varying degrees of success. However, a particular problem with credit scoring is that defaults are only observed for a small subsample of applicants. An imbalance exists between the numbers of non-defaulters and defaulters. This has an adverse effect on the aforementioned binary classification methods. Recently, one-class classification approaches have been proposed to address the imbalance problem. The purpose of this literature review is threefold: (i) present the reader with an overview of credit scoring; (ii) review existing binary classification approaches; and (iii) introduce and examine one-class classification approaches
Improving Risk Predictions by Preprocessing Imbalanced Credit Data
Imbalanced credit data sets refer to databases in which the class of defaulters is heavily under-represented in comparison to the class of non-defaulters. This is a very common situation in real-life credit scoring applications, but it has still received little attention. This paper investigates whether data resampling can be used to improve the performance of learners built from imbalanced credit data sets, and whether the effectiveness of resampling is related to the type of classifier. Experimental results demonstrate that learning with the resampled sets consistently outperforms the use of the original imbalanced credit data, independently of the classifier used