
    Improved credit scoring model using XGBoost with Bayesian hyper-parameter optimization

    Several credit-scoring models have been developed using ensemble classifiers to improve the accuracy of assessment. However, among ensemble models, little attention has been paid to tuning the hyper-parameters of the base learners, although these are crucial to constructing ensemble models. This study proposes an improved credit scoring model based on the extreme gradient boosting (XGB) classifier with Bayesian hyper-parameter optimization (XGB-BO). The model comprises two steps. First, data pre-processing handles missing values and scales the data. Second, Bayesian hyper-parameter optimization tunes the hyper-parameters of the XGB classifier, which is then used to train the model. The model is evaluated on four widely used public datasets: the German, Australian, Lending Club, and Polish datasets. Several state-of-the-art classification algorithms are implemented for predictive comparison with the proposed method. The proposed model showed promising results, improving accuracy by 4.10%, 3.03%, and 2.76% on the German, Lending Club, and Australian datasets, respectively. According to the evaluation results, it outperformed commonly used techniques such as decision trees, support vector machines, neural networks, logistic regression, random forests, and bagging. The experimental results confirm that the XGB-BO model is suitable for assessing the creditworthiness of applicants.
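    As a rough illustration of the Bayesian tuning step, the sketch below optimises a single hypothetical hyper-parameter (the learning rate) against a synthetic stand-in for the cross-validated XGBoost accuracy. A crude nearest-neighbour surrogate and an upper-confidence-bound acquisition stand in for the full Gaussian-process machinery; the objective function, grid, and constants are all invented for the example.

```python
import math
import random

random.seed(0)

def cv_accuracy(lr):
    # Hypothetical stand-in for cross-validated XGBoost accuracy:
    # peaks near learning_rate = 0.1, with small evaluation noise.
    return 0.85 - 0.05 * (math.log10(lr) + 1.0) ** 2 + random.gauss(0, 0.002)

def ucb(x, observed, kappa=0.05):
    # Crude surrogate: nearest observed score as the predicted mean,
    # log-distance to it as the uncertainty (stands in for a GP posterior).
    lx = math.log10(x)
    mu, d = min(((s, abs(math.log10(p) - lx)) for p, s in observed),
                key=lambda t: t[1])
    return mu + kappa * d  # upper-confidence-bound acquisition

candidates = [10 ** (-3 + 3 * i / 60) for i in range(61)]
observed = [(lr, cv_accuracy(lr)) for lr in (0.001, 1.0)]  # initial design

for _ in range(15):
    # Evaluate the candidate with the highest acquisition value next.
    nxt = max(candidates, key=lambda x: ucb(x, observed))
    observed.append((nxt, cv_accuracy(nxt)))

best_lr, best_acc = max(observed, key=lambda t: t[1])
print(f"best learning rate ~= {best_lr:.3g}, CV accuracy ~= {best_acc:.3f}")
```

    In practice one would tune several hyper-parameters jointly with a library such as hyperopt or scikit-optimize, using the real cross-validation score of the XGBoost classifier as the objective.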

    Credit Fraud Risk Detection Based on XGBoost-LR Hybrid Model

    The credit business has long been the core business of banks and financial institutions. With the rapid growth of business scale, how to use models to detect fraud risk quickly and automatically is an active research direction. Logistic regression (LR) has become the most widely used risk assessment model in the industry thanks to its robustness and strong interpretability, but it relies on well-differentiated features and feature combinations. XGBoost is a powerful and convenient algorithm for feature transformation. This paper therefore exploits XGBoost's strength at feature combination and constructs an XGBoost-LR hybrid model. First, an XGBoost model is trained on the data; then each training sample is passed through the trained XGBoost model to obtain the leaf nodes it falls into, and the one-hot-encoded leaf indices are used as features to train an LR model. The model is validated on the German credit dataset published by UCI, and its AUC is compared with other models. The results show that this hybrid model can effectively improve prediction accuracy and has good application value.
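    The leaf-encoding idea can be sketched in a few lines. Below, two hand-written stumps stand in for the trees of a trained XGBoost model (the real model would supply the leaf indices via its prediction API, e.g. pred_leaf in the XGBoost Python package); each sample's leaf indices are one-hot encoded and fed to a logistic regression trained by plain gradient descent. All data, thresholds, and rules are invented.

```python
import math
import random

random.seed(1)

# Toy data: one numeric feature (e.g. a credit attribute), binary label.
X = [random.uniform(0, 10) for _ in range(200)]
y = [1 if x > 6 or x < 2 else 0 for x in X]  # hypothetical default rule

# Two hand-written stumps stand in for trees of a trained XGBoost model;
# each maps a sample to a leaf index.
stumps = [lambda x: 0 if x < 5 else 1,
          lambda x: 0 if x < 2 else (1 if x < 6 else 2)]
n_leaves = [2, 3]

def one_hot_leaves(x):
    # Concatenate one one-hot vector per tree, over that tree's leaves.
    vec = []
    for stump, n in zip(stumps, n_leaves):
        row = [0.0] * n
        row[stump(x)] = 1.0
        vec.extend(row)
    return vec

# Logistic regression on the leaf features, trained by gradient descent.
feats = [one_hot_leaves(x) for x in X]
w = [0.0] * len(feats[0])
b = 0.0
for _ in range(500):
    gw = [0.0] * len(w)
    gb = 0.0
    for f, t in zip(feats, y):
        p = 1 / (1 + math.exp(-(sum(wi * fi for wi, fi in zip(w, f)) + b)))
        for j, fj in enumerate(f):
            gw[j] += (p - t) * fj
        gb += p - t
    w = [wi - 0.1 * g / len(X) for wi, g in zip(w, gw)]
    b -= 0.1 * gb / len(X)

preds = [1 if sum(wi * fi for wi, fi in zip(w, f)) + b > 0 else 0 for f in feats]
acc = sum(p == t for p, t in zip(preds, y)) / len(y)
print(f"training accuracy of LR on leaf features: {acc:.2f}")
```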

    Automatic Selection of MapReduce Machine Learning Algorithms: A Model Building Approach

    As the amount of information available for data mining grows larger, the time needed to train models on those huge volumes of data also grows longer. Techniques such as sub-sampling and parallel algorithms have been employed to deal with this growth. Some studies have shown that sub-sampling can adversely affect the quality of the models produced, and the degree to which it affects different types of learning algorithms varies. Parallel algorithms perform well when enough computing resources (e.g. cores, memory) are available; however, for a cluster of limited size the growth in data will still cause an unacceptable growth in model training time. In addition to the data-size problem, picking which algorithms are well suited to a particular dataset can be a challenge. While some studies have looked at criteria for selecting a learning algorithm based on the properties of the dataset, the additional complexity of parallel learners and possible run-time limitations have not been considered. This study explores the run time and model quality of various techniques for dealing with large datasets, including using different numbers of compute cores, sub-sampling the datasets, and exploiting the iterative anytime nature of the training algorithms. The techniques were studied using MapReduce implementations of four supervised learning algorithms for binary classification with probabilistic models: logistic regression, tree induction, bagged trees, and boosted stumps. Evaluation was done using a modified form of learning curves with a temporal component. Finally, the data collected was used to train a set of models to predict which type of parallel learner best suits a particular dataset, given run-time limitations and the number of compute cores to be used. The predictions of those models were then compared to the actual results of running the algorithms on the datasets they were attempting to predict.
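    The map/reduce training pattern these implementations rely on can be sketched as follows: mappers compute partial gradients over their data partitions, a reducer sums them, and the driver takes one update step per round. The sketch runs the mappers serially on invented toy data; in a real MapReduce job each partition would go to a separate core or machine.

```python
import math
from functools import reduce

# Toy data: one feature, label 1 when x > 0.
data = [(x / 10.0, 1 if x > 0 else 0) for x in range(-50, 50)]

def mapper(partition, w, b):
    # Each mapper emits the logistic-loss gradient over its partition.
    gw = gb = 0.0
    for x, t in partition:
        p = 1 / (1 + math.exp(-(w * x + b)))
        gw += (p - t) * x
        gb += p - t
    return gw, gb, len(partition)

def reducer(a, c):
    # The reducer sums the partial gradients from all mappers.
    return (a[0] + c[0], a[1] + c[1], a[2] + c[2])

# Split the dataset into 4 partitions (one per hypothetical compute core).
parts = [data[i::4] for i in range(4)]

w = b = 0.0
for _ in range(300):  # each epoch is one map/reduce round
    gw, gb, n = reduce(reducer, (mapper(p, w, b) for p in parts))
    w -= 2.0 * gw / n
    b -= 2.0 * gb / n

acc = sum((w * x + b > 0) == (t == 1) for x, t in data) / len(data)
print(f"accuracy after map/reduce training: {acc:.2f}")
```

    Because the reduce step is a plain sum, the result is identical to single-machine batch gradient descent; the anytime property comes from being able to stop after any round and keep the current model.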

    Design an Optimal Decision Tree based Algorithm to Improve Model Prediction Performance

    The performance of decision trees is assessed by prediction accuracy on unobserved instances. To generate optimized decision trees that are smaller yet highly accurate, this study pre-processes the data and addresses and enhances several decision tree components, so that the algorithms produce precise and compact trees with improved prediction performance. The aim is a decision tree algorithm with a small footprint and excellent predictive accuracy. The standard decision tree-based technique was created for classification purposes and is applied to various kinds of uncertain data. Before the dataset is prepared for classification, the uncertain dataset is first processed through missing-data treatment and other uncertainty-handling procedures to produce a balanced dataset. The proposed algorithm has been tested on three real-world datasets: the Titanic dataset, the PIMA Indian Diabetes dataset, and a heart disease dataset. Its performance has been assessed in terms of precision, recall, F-measure, and accuracy, and its results have been contrasted with those of the standard decision tree. On all three datasets, the decision tree with Gini impurity optimization performed remarkably well.
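    The Gini-impurity split selection at the heart of the optimized tree can be illustrated as follows: each candidate threshold is scored by the weighted impurity of the two child nodes, and the lowest-scoring threshold wins. The feature values (suggestive of a glucose-style attribute) and the clean cut-point are invented for the example.

```python
def gini(labels):
    # Gini impurity: 1 minus the sum of squared class proportions.
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(xs, ys):
    # Pick the threshold minimising the weighted Gini impurity of the
    # two child nodes -- the criterion the optimised tree uses.
    best = (None, float("inf"))
    for thr in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= thr]
        right = [y for x, y in zip(xs, ys) if x > thr]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if score < best[1]:
            best = (thr, score)
    return best

# Invented feature with a clean class boundary between 115 and 125.
xs = [90, 100, 110, 115, 125, 140, 150, 160]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
thr, score = best_split(xs, ys)
print(f"best threshold: {thr}, weighted Gini: {score:.3f}")
# -> best threshold: 115, weighted Gini: 0.000 (a pure split)
```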

    Improving the Validity of Decision Trees as Explanations

    In classification and forecasting with tabular data, one often utilizes tree-based models. These can be competitive with deep neural networks on tabular data [cf. Grinsztajn et al., NeurIPS 2022, arXiv:2207.08815] and, under some conditions, explainable. The explainability depends on the depth of the tree and the accuracy in each leaf of the tree. Here, we train a low-depth tree with the objective of minimising the maximum misclassification error across the leaf nodes, and then ``suspend'' further tree-based models (e.g., trees of unlimited depth) from each leaf of the low-depth tree. The low-depth tree is easily explainable, while the overall statistical performance of the combined low-depth and suspended tree-based models improves upon decision trees of unlimited depth trained using classical methods (e.g., CART) and is comparable to state-of-the-art methods (e.g., well-tuned XGBoost).
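    A minimal sketch of the min-max objective, assuming a single numeric feature and a depth-1 tree: instead of minimising the weighted average of the leaf errors (as classical impurity criteria effectively do), the split minimises the worst leaf's misclassification rate, so every explanation of the form "x <= t, hence class c" carries a bounded error. The data are invented.

```python
def leaf_error(labels):
    # Misclassification rate if the leaf predicts its majority class.
    if not labels:
        return 0.0
    maj = max(set(labels), key=labels.count)
    return sum(l != maj for l in labels) / len(labels)

def minmax_split(xs, ys):
    # Depth-1 split chosen to minimise the *maximum* leaf error.
    best = (None, float("inf"))
    for thr in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= thr]
        right = [y for x, y in zip(xs, ys) if x > thr]
        worst = max(leaf_error(left), leaf_error(right))
        if worst < best[1]:
            best = (thr, worst)
    return best

xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [0, 0, 0, 1, 0, 0, 1, 1, 1, 1]
thr, worst = minmax_split(xs, ys)
print(f"split at x <= {thr}; worst leaf error = {worst:.2f}")
# A deeper "suspended" model can then be trained inside each leaf to
# clean up the residual errors without hurting the top-level explanation.
```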

    A dynamic credit scoring model based on survival gradient boosting decision tree approach

    Credit scoring, which is typically cast as a classification problem, is a powerful tool for managing credit risk, since it forecasts the probability of default (PD) of a loan application. However, there is a growing trend of integrating survival analysis into credit scoring to provide a dynamic prediction of PD over time and a clear treatment of censoring. A novel dynamic credit scoring model (SurvXGBoost) is proposed based on a survival gradient boosting decision tree (GBDT) approach. Our proposal, which combines survival analysis with the GBDT approach, is expected to enhance predictability relative to statistical survival models. The proposed method is compared with several common benchmark models on a real-world consumer loan dataset. The results of out-of-sample and out-of-time validation indicate that SurvXGBoost outperforms the benchmarks in terms of predictability and misclassification cost. Incorporating macroeconomic variables can further enhance the performance of the survival models. SurvXGBoost meanwhile maintains some interpretability, since it provides information on feature importance. First published online 14 December 202
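    One common way a survival model yields a dynamic PD, shown here as an illustrative discrete-time simplification rather than the paper's exact formulation, is via per-period default hazards h_t: the PD by horizon T is 1 minus the product of the survival factors (1 - h_t). The hazard values below are invented monthly scores for a single loan.

```python
def cumulative_pd(hazards):
    # hazards[t] = P(default in period t | survived up to t).
    surv = 1.0
    pds = []
    for h in hazards:
        surv *= 1.0 - h          # probability of surviving through period t
        pds.append(1.0 - surv)   # PD by the end of period t
    return pds

# Hypothetical monthly hazards (e.g. scores from a survival GBDT model).
hazards = [0.01, 0.02, 0.03, 0.02, 0.01]
pds = cumulative_pd(hazards)
print([round(p, 4) for p in pds])
# -> [0.01, 0.0298, 0.0589, 0.0777, 0.087]
```

    The curve of PD against horizon is the "dynamic" prediction: it is monotonically non-decreasing, and censored loans simply contribute survival factors up to their last observed period.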