
    Credit scoring: comparison of non-parametric techniques against logistic regression

    Dissertation presented as the partial requirement for obtaining a Master's degree in Information Management, specialization in Knowledge Management and Business Intelligence. Over the past decades, financial institutions have given increasing importance to credit risk management as a critical tool for controlling their profitability. More than ever, it has become crucial for these institutions to discriminate well between good and bad clients so that they only accept credit applications that are unlikely to default. To calculate the probability of default of a particular client, most financial institutions rely on credit scoring models based on parametric techniques. Logistic regression is the current industry-standard technique in credit scoring models, and it is one of the techniques under study in this dissertation. Although it is regarded as a robust and intuitive technique, it is not free from criticism of the model assumptions it relies on, which can compromise its predictions. This dissertation evaluates the gains in performance obtained by using more modern non-parametric techniques instead of logistic regression, performing a model comparison over four different real-life credit datasets. Specifically, the techniques compared against logistic regression in this study consist of two single classifiers (decision tree and SVM with RBF kernel) and two ensemble methods (random forest and stacking with cross-validation). The literature review shows that heterogeneous ensemble approaches have a weaker presence in credit scoring studies, which is why stacking with cross-validation was considered in this study. The results demonstrate that logistic regression outperforms the decision tree classifier, performs similarly to the SVM, and slightly underperforms both ensemble approaches to a similar extent.
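    The comparison above can be reproduced in outline with scikit-learn. The sketch below is an illustrative assumption rather than the dissertation's exact setup: it scores each model (logistic regression, decision tree, RBF SVM, random forest, and stacking with cross-validation) by cross-validated AUC on a synthetic stand-in for the credit datasets.

```python
# Illustrative sketch only: synthetic stand-in data and default-ish settings,
# not the dissertation's four real-life credit datasets or tuned models.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, weights=[0.9, 0.1],
                           random_state=0)  # placeholder for a credit dataset

models = {
    "logistic_regression": make_pipeline(StandardScaler(),
                                         LogisticRegression(max_iter=1000)),
    "decision_tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "svm_rbf": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
    "random_forest": RandomForestClassifier(n_estimators=300, random_state=0),
    # Stacking with cross-validation: the meta-learner is trained on
    # out-of-fold predictions of the base learners (cv=5).
    "stacking_cv": StackingClassifier(
        estimators=[
            ("dt", DecisionTreeClassifier(max_depth=5, random_state=0)),
            ("rf", RandomForestClassifier(n_estimators=300, random_state=0)),
            ("svm", make_pipeline(StandardScaler(), SVC(kernel="rbf"))),
        ],
        final_estimator=LogisticRegression(max_iter=1000),
        cv=5,
    ),
}

for name, model in models.items():
    auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    print(f"{name}: AUC = {auc.mean():.3f} +/- {auc.std():.3f}")
```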

    Low-Default Portfolio/One-Class Classification: A Literature Review

    Consider a bank which wishes to decide whether a credit applicant will obtain credit or not. The bank has to assess whether the applicant will be able to redeem the credit. This is done by estimating the probability that the applicant will default prior to the maturity of the credit. To estimate this probability of default, it is first necessary to identify criteria which separate the good from the bad creditors, such as loan amount, age, or factors concerning the income of the applicant. The question then arises of how a bank identifies a sufficient number of selective criteria that possess the necessary discriminatory power. As a solution, many traditional binary classification methods have been proposed, with varying degrees of success. However, a particular problem with credit scoring is that defaults are only observed for a small subsample of applicants: the ratio of non-defaulters to defaulters is highly imbalanced. This has an adverse effect on the aforementioned binary classification methods. Recently, one-class classification approaches have been proposed to address the imbalance problem. The purpose of this literature review is threefold: (i) present the reader with an overview of credit scoring; (ii) review existing binary classification approaches; and (iii) introduce and examine one-class classification approaches.
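    A minimal sketch of the one-class idea, assuming a generic tabular setup rather than anything from the review itself: the classifier is fitted only on (simulated) non-defaulters and then flags applicants that look atypical as potential defaulters, which sidesteps the need for a large defaulter subsample.

```python
# Illustrative sketch, not taken from the review: train only on the majority
# (non-default) class and flag atypical applicants as potential defaulters.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
good = rng.normal(0.0, 1.0, size=(950, 5))  # placeholder non-defaulter features
bad = rng.normal(2.5, 1.0, size=(50, 5))    # rare defaulters, unseen during training

scaler = StandardScaler().fit(good)
ocsvm = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale")
ocsvm.fit(scaler.transform(good))

# predict() returns +1 for "looks like the training class" and -1 for outliers
print("non-defaulters flagged:", np.mean(ocsvm.predict(scaler.transform(good)) == -1))
print("defaulters flagged:   ", np.mean(ocsvm.predict(scaler.transform(bad)) == -1))
```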

    Using multiple classifiers for predicting the risk of endovascular aortic aneurysm repair re-intervention through hybrid feature selection.

    Feature selection is essential in the medical domain; however, the process becomes complicated by the presence of censoring, which is the defining characteristic of survival analysis. Most survival feature selection methods are based on Cox's proportional hazards model, even though machine learning classifiers are often preferred; they are less commonly employed in survival analysis because censoring prevents them from being applied directly to survival data. Among the few works that employ machine learning classifiers, the partial logistic artificial neural network with automatic relevance determination is a well-known method that handles censoring and performs feature selection for survival data. However, it depends on data replication to handle censoring, which leads to imbalanced and biased prediction results, especially with highly censored data. Other methods cannot deal with high censoring. Therefore, in this article, a new hybrid feature selection method is proposed which offers a solution to high levels of censoring. It combines support vector machine, neural network, and K-nearest neighbor classifiers using simple majority voting and a new weighted majority voting method based on a survival metric to construct a multiple classifier system. The new hybrid feature selection process uses the multiple classifier system as a wrapper method and merges it with an iterated feature ranking filter method to further reduce features. Two endovascular aortic repair datasets containing 91% censored patients, collected from two centers, were used to construct a multicenter study to evaluate the performance of the proposed approach. The results showed that the proposed technique outperformed individual classifiers and variable selection methods based on Cox's model, such as the Akaike and Bayesian information criteria and the least absolute shrinkage and selection operator, in terms of log-rank test p values, sensitivity, and concordance index. This indicates that the proposed classifier is more powerful in correctly predicting the risk of re-intervention, enabling doctors to select patients' future follow-up plans.
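    The multiple classifier system at the heart of the method can be sketched with scikit-learn's VotingClassifier, combining SVM, neural network, and k-NN by simple majority voting. This is only a rough, hedged illustration: the censoring handling, the survival-metric-based weighted voting, and the wrapper feature-selection loop described in the abstract are not reproduced, and the data are synthetic placeholders.

```python
# Simplified illustration of the multiple classifier system only (SVM, neural
# network and k-NN combined by majority voting); the censoring handling,
# survival-metric weighting, and wrapper feature-selection loop from the
# article are omitted, and the data are synthetic placeholders.
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=15, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

mcs = VotingClassifier(
    estimators=[
        ("svm", make_pipeline(StandardScaler(), SVC(kernel="rbf"))),
        ("nn", make_pipeline(StandardScaler(),
                             MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000,
                                           random_state=1))),
        ("knn", make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))),
    ],
    voting="hard",  # simple majority voting over the three class predictions
    # weights=[w_svm, w_nn, w_knn]  # weighted voting, e.g. weights from a survival metric
)
mcs.fit(X_tr, y_tr)
print("multiple-classifier accuracy:", mcs.score(X_te, y_te))
```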

    Improved credit scoring model using XGBoost with Bayesian hyper-parameter optimization

    Several credit scoring models have been developed using ensemble classifiers in order to improve the accuracy of assessment. However, among the ensemble models, little attention has been paid to tuning the hyper-parameters of the base learners, even though these are crucial to constructing ensemble models. This study proposes an improved credit scoring model based on the extreme gradient boosting (XGB) classifier with Bayesian hyper-parameter optimization (XGB-BO). The model comprises two steps. First, data pre-processing is used to handle missing values and scale the data. Second, Bayesian hyper-parameter optimization is applied to tune the hyper-parameters of the XGB classifier, which is then used to train the model. The model is evaluated on four widely used public datasets, i.e., the German, Australian, Lending Club, and Polish datasets. Several state-of-the-art classification algorithms are implemented for predictive comparison with the proposed method. The proposed model showed promising results, with improvements in accuracy of 4.10%, 3.03%, and 2.76% on the German, Lending Club, and Australian datasets, respectively. According to the evaluation results, the proposed model outperformed commonly used techniques, e.g., decision tree, support vector machine, neural network, logistic regression, random forest, and bagging. The experimental results confirm that the XGB-BO model is suitable for assessing the creditworthiness of applicants.
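    A hedged sketch of the XGB-BO idea, using Optuna's Tree-structured Parzen Estimator as a stand-in Bayesian optimizer; the search space, trial budget, and synthetic data are assumptions, not the paper's configuration.

```python
# Illustrative sketch using Optuna's TPE sampler as a stand-in Bayesian
# optimizer; the search space, data, and settings are assumptions, not the
# paper's exact XGB-BO configuration.
import optuna
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

X, y = make_classification(n_samples=1000, n_features=20, weights=[0.7, 0.3],
                           random_state=0)  # placeholder for a credit dataset

def objective(trial):
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 100, 600),
        "max_depth": trial.suggest_int("max_depth", 2, 8),
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
        "subsample": trial.suggest_float("subsample", 0.5, 1.0),
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.5, 1.0),
    }
    model = XGBClassifier(eval_metric="logloss", **params)
    # Cross-validated AUC is the objective the Bayesian optimizer maximizes
    return cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()

study = optuna.create_study(direction="maximize",
                            sampler=optuna.samplers.TPESampler(seed=0))
study.optimize(objective, n_trials=30)
print("best AUC:", study.best_value)
print("best hyper-parameters:", study.best_params)
```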

    A critical assessment of imbalanced class distribution problem: the case of predicting freshmen student attrition

    Predicting student attrition is an intriguing yet challenging problem for any academic institution. Class-imbalanced data is common in the field of student retention, mainly because many students register while comparatively few drop out. Classification techniques applied to imbalanced datasets can yield deceptively high prediction accuracy, where the overall predictive accuracy is driven by the majority class at the expense of very poor performance on the crucial minority class. In this study, we compared different data balancing techniques to improve the predictive accuracy for the minority class while maintaining satisfactory overall classification performance. Specifically, we tested three balancing techniques (oversampling, under-sampling, and synthetic minority over-sampling (SMOTE)) along with four popular classification methods (logistic regression, decision trees, neural networks, and support vector machines). We used a large, feature-rich institutional student dataset (covering the years 2005 to 2011) to assess the efficacy of the balancing techniques as well as the prediction methods. The results indicated that the support vector machine combined with the SMOTE data-balancing technique achieved the best classification performance, with a 90.24% overall accuracy on the 10-fold holdout sample. All three data-balancing techniques improved the prediction accuracy for the minority class. Applying sensitivity analyses to the developed models, we also identified the most important variables for accurate prediction of student attrition. Application of these models has the potential to accurately predict at-risk students and help reduce student dropout rates.
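    A short sketch of the winning combination (SMOTE followed by an SVM), assuming synthetic imbalanced data in place of the institutional student dataset. Placing SMOTE inside an imbalanced-learn pipeline ensures the oversampling is applied only to the training folds during cross-validation.

```python
# Illustrative sketch of the best-performing combination reported above
# (SMOTE + SVM), on synthetic data; the institutional student dataset and the
# study's exact settings are not reproduced here.
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=3000, n_features=25, weights=[0.85, 0.15],
                           random_state=0)  # minority class = students who drop out

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("smote", SMOTE(random_state=0)),  # oversampling happens inside each training fold only
    ("svm", SVC(kernel="rbf")),
])

scores = cross_validate(pipe, X, y, cv=10, scoring=["accuracy", "recall", "roc_auc"])
print("overall accuracy :", scores["test_accuracy"].mean())
print("minority recall  :", scores["test_recall"].mean())
print("AUC              :", scores["test_roc_auc"].mean())
```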

    Bankruptcy prediction model using cost-sensitive extreme gradient boosting in the context of imbalanced datasets

    In building bankruptcy prediction models, a class imbalance problem arises that limits the performance of the models. Most prior research has addressed the problem by applying resampling methods such as the synthetic minority oversampling technique (SMOTE). However, resampling methods lead to other issues, e.g., increased noisy data and training time during the process. To improve the bankruptcy prediction model, we propose cost-sensitive extreme gradient boosting (CS-XGB) to address the class imbalance problem without requiring any resampling method. The proposed method's effectiveness is evaluated on six real-world datasets, i.e., the LendingClub dataset and five Polish company bankruptcy datasets. This research compares the performance of CS-XGB with other ensemble methods, including SMOTE-XGB, which applies SMOTE to the training set before the learning process. The experimental results show that i) based on LendingClub, CS-XGB improves the performance of XGBoost and SMOTE-XGB by more than 50% and 33% on bankruptcy detection rate (BDR) and geometric mean (GM), respectively, and ii) the CS-XGB model outperforms random forest (RF), bagging, AdaBoost, XGBoost, and SMOTE-XGB in terms of BDR, GM, and the area under the receiver operating characteristic curve (AUC) on the five Polish datasets. In addition, the CS-XGB model achieves good overall prediction results.
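    A minimal sketch of the cost-sensitive alternative to resampling, assuming xgboost's scale_pos_weight parameter as the cost mechanism and synthetic data in place of the LendingClub and Polish datasets; the paper's actual CS-XGB settings are not reproduced.

```python
# Illustrative sketch of the cost-sensitive idea: instead of resampling,
# XGBoost's scale_pos_weight raises the cost of misclassifying the rare
# bankrupt class. This is not the paper's exact CS-XGB configuration.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score, roc_auc_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=5000, n_features=30, weights=[0.95, 0.05],
                           random_state=0)  # class 1 = bankrupt (rare)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, test_size=0.3,
                                          random_state=0)

# Common heuristic for the cost ratio: (# majority) / (# minority) in training data
ratio = np.sum(y_tr == 0) / np.sum(y_tr == 1)

cs_xgb = XGBClassifier(n_estimators=400, max_depth=4, learning_rate=0.05,
                       scale_pos_weight=ratio, eval_metric="logloss")
cs_xgb.fit(X_tr, y_tr)

print("AUC:", roc_auc_score(y_te, cs_xgb.predict_proba(X_te)[:, 1]))
print("bankruptcy detection rate (recall):", recall_score(y_te, cs_xgb.predict(X_te)))
```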

    Credit Risk Scoring: A Stacking Generalization Approach

    Dissertation presented as the partial requirement for obtaining a Master's degree in Statistics and Information Management, specialization in Risk Analysis and Management. Credit risk regulation has been receiving tremendous attention as a result of the effects of the latest global financial crisis. According to the developments made in the Internal Ratings-Based approach under the Basel guidelines, banks are allowed to use internal risk measures as key drivers to assess whether to grant a loan to an applicant. Credit scoring is a statistical approach used for evaluating potential loan applications in both financial and banking institutions. When applying for a loan, an applicant must fill out an application form detailing their characteristics (e.g., income, marital status, and loan purpose), which serve as inputs to a credit scoring model that produces a score used to determine whether a loan should be granted. This enables faster, more consistent credit approvals and a reduction in bad debt. Currently, many machine learning and statistical approaches, such as logistic regression and tree-based algorithms, have been used individually for credit scoring models. Newer machine learning techniques can outperform classic methods simply by combining models. This dissertation is an empirical study on a publicly available bank loan dataset to study bank loan default, using ensemble-based techniques to increase model robustness and predictive power. The proposed ensemble method is based on stacking generalization, extending various preceding studies that used different techniques to further enhance the model's predictive capabilities. The results show that combining different models provides a great deal of flexibility to credit scoring models.
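    The stacked generalization procedure can be sketched by hand: base learners produce out-of-fold predictions that become meta-features for a logistic regression meta-learner. The example below is an assumption-laden illustration (synthetic data, arbitrary base learners), not the dissertation's proposed model.

```python
# Illustrative sketch of stacked generalization built by hand: out-of-fold
# predictions from the base learners become meta-features for a logistic
# regression meta-learner. The data and base learners are placeholders, not
# the dissertation's bank-loan setup.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict, train_test_split

X, y = make_classification(n_samples=3000, n_features=20, weights=[0.8, 0.2],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, test_size=0.3,
                                          random_state=0)

base_learners = [RandomForestClassifier(n_estimators=300, random_state=0),
                 GradientBoostingClassifier(random_state=0)]

# Level 0: out-of-fold probabilities, so the meta-learner never sees leaked fits
meta_train = np.column_stack([
    cross_val_predict(m, X_tr, y_tr, cv=5, method="predict_proba")[:, 1]
    for m in base_learners
])

# Refit each base learner on the full training set for test-time meta-features
meta_test = np.column_stack([
    m.fit(X_tr, y_tr).predict_proba(X_te)[:, 1] for m in base_learners
])

# Level 1: logistic regression combines the base-learner scores
meta = LogisticRegression().fit(meta_train, y_tr)
print("stacked AUC:", roc_auc_score(y_te, meta.predict_proba(meta_test)[:, 1]))
```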