
    Ensemble of Example-Dependent Cost-Sensitive Decision Trees

    Several real-world classification problems are example-dependent cost-sensitive in nature, where the costs due to misclassification vary between examples and not only within classes. However, standard classification methods do not take these costs into account and assume a constant cost of misclassification errors. Previous work has proposed methods that incorporate financial costs into the training of different algorithms, with the example-dependent cost-sensitive decision tree algorithm yielding the highest savings. In this paper we propose a new framework of ensembles of example-dependent cost-sensitive decision trees. The framework consists of creating different example-dependent cost-sensitive decision trees on random subsamples of the training set and then combining them using three different combination approaches. Moreover, we propose two new cost-sensitive combination approaches: cost-sensitive weighted voting and cost-sensitive stacking, the latter being based on the cost-sensitive logistic regression method. Finally, using five databases from four real-world applications (credit card fraud detection, churn modeling, credit scoring and direct marketing), we evaluate the proposed method against state-of-the-art example-dependent cost-sensitive techniques, namely cost-proportionate sampling, Bayes minimum risk and cost-sensitive decision trees. The results show that the proposed algorithms achieve better results for all databases, in the sense of higher savings. Comment: 13 pages, 6 figures, submitted for possible publication
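    The cost-sensitive weighted voting idea described above can be sketched in a few lines: each ensemble member is weighted by the savings it achieves on a validation sample, where savings compare a model's total example-dependent cost against the cost of the all-negative baseline. The function names and toy costs below are illustrative assumptions, not the paper's implementation.

    ```python
    def total_cost(y_true, y_pred, fp_cost, fn_cost):
        """Sum example-dependent misclassification costs."""
        cost = 0.0
        for yt, yp, cfp, cfn in zip(y_true, y_pred, fp_cost, fn_cost):
            if yp == 1 and yt == 0:
                cost += cfp          # false positive: per-example cost
            elif yp == 0 and yt == 1:
                cost += cfn          # false negative: e.g. the lost amount
        return cost

    def savings(y_true, y_pred, fp_cost, fn_cost):
        """Savings relative to the costless all-negative baseline."""
        base = total_cost(y_true, [0] * len(y_true), fp_cost, fn_cost)
        return (base - total_cost(y_true, y_pred, fp_cost, fn_cost)) / base

    def cs_weighted_vote(member_preds, weights):
        """Combine member predictions by savings-weighted majority vote."""
        combined = []
        for preds in zip(*member_preds):
            score = sum(w * p for w, p in zip(weights, preds)) / sum(weights)
            combined.append(1 if score >= 0.5 else 0)
        return combined
    ```

    In use, each tree trained on a random subsample would be scored with `savings` on held-out data, and those savings values passed as `weights` to `cs_weighted_vote`.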

    Enhancing the predictive performance of ensemble models through novel multi-objective strategies: evidence from credit risk and business model innovation survey data

    This paper proposes novel multi-objective optimization strategies to develop a weighted ensemble model. A comparison of the proposed strategies on simulated data suggests that the multi-objective strategy based on joint entropy is superior to the other proposed strategies. To demonstrate the application, generalization, and practical implications of the proposed approaches, we implemented the model on two real datasets related to the prediction of credit default risk and the adoption of innovative business models by firms. The scope of this paper can be extended by ordering the solutions of the proposed multi-objective strategies and can be generalized to other similar predictive tasks.

    An insight into the experimental design for credit risk and corporate bankruptcy prediction systems

    In recent years, the finance and business communities have shown increasing interest in tools for predicting credit and bankruptcy risk, probably due to the need for more robust decision-making systems capable of managing and analyzing complex data. As a result, plentiful techniques have been developed with the aim of producing accurate prediction models able to tackle these issues. However, the design of experiments to assess and compare these models has attracted little attention so far, even though it plays an important role in validating and supporting the theoretical evidence of performance. The experimental design should be done carefully for the results to hold significance; otherwise, it may become a source of misleading and contradictory conclusions about the benefits of using a particular prediction system. In this work, we review more than 140 papers published in refereed journals within the period 2000–2013, putting the emphasis on the bases of the experimental design in credit scoring and bankruptcy prediction applications. We provide caveats and guidelines for the usage of databases, data splitting methods, performance evaluation metrics and hypothesis testing procedures in order to converge on a systematic, consistent validation standard. This work has been partially supported by the Mexican Science and Technology Council (CONACYT-Mexico) through a Postdoctoral Fellowship [223351], the Spanish Ministry of Economy under grant TIN2013-46522-P and the Generalitat Valenciana under grant PROMETEOII/2014/062.

    Three-stage ensemble model: reinforce predictive capacity without compromising interpretability

    Thesis proposal presented as partial requirement for obtaining the Master's degree in Statistics and Information Management, with specialization in Risk Analysis and Management. Over the last decade, several banks have developed models to quantify credit risk. In addition to monitoring the credit portfolio, these models also help decide on the acceptance of new contracts, assess customers' profitability and define pricing strategy. The objective of this paper is to improve the approach to credit risk modeling, namely in scoring models used to predict default events. To this end, we propose the development of a three-stage ensemble model that combines the interpretability of the Scorecard with the predictive power of machine learning algorithms. The results show that the ROC index improves by 0.5%–0.7% and accuracy by 0%–1%, taking the Scorecard as baseline.

    Loan Default Prediction: A Complete Revision of LendingClub

    The study aims to determine a credit default prediction model using data from LendingClub. The methodology estimates the variables that influence the prediction of paid and unpaid loans using the Random Forest algorithm. The algorithm identifies the factors with the greatest influence on payment or default, yielding a model reduced to nine predictors related to the borrower's credit history and payment history within the platform. The model achieves an F1 macro score above 90% on the evaluation sample. The contributions of this study include using the complete dataset of LendingClub's entire available operation to obtain key variables for the classification and prediction task, which can be helpful for estimating default in the person-to-person loan market. We draw two main conclusions: first, we confirm the Random Forest algorithm's capacity to handle binary classification problems, based on the performance metrics obtained; second, we highlight the influence of traditional credit scoring variables on default prediction problems.
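    The F1 macro score reported above averages per-class F1 values with equal weight, which keeps the minority default class from being swamped by the majority paid class. A minimal sketch (names and toy labels are illustrative, not the study's code):

    ```python
    def f1_for_class(y_true, y_pred, cls):
        """F1 score treating `cls` as the positive class."""
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
        if tp == 0:
            return 0.0
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        return 2 * precision * recall / (precision + recall)

    def f1_macro(y_true, y_pred):
        """Unweighted mean of per-class F1 over the observed classes."""
        classes = sorted(set(y_true))
        return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)
    ```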

    An Overview of the Use of Neural Networks for Data Mining Tasks

    In recent years the area of data mining has experienced a considerable demand for technologies that extract knowledge from large and complex data sources. There is substantial commercial interest, as well as research investigation, in developing new and improved approaches for extracting information, relationships, and patterns from datasets. Artificial Neural Networks (NN) are popular biologically inspired intelligent methodologies whose classification, prediction and pattern recognition capabilities have been utilised successfully in many areas, including science, engineering, medicine, business, banking and telecommunications. This paper highlights, from a data mining perspective, the implementation of NN using supervised and unsupervised learning for pattern recognition, classification, prediction and cluster analysis, and focuses the discussion on their usage in bioinformatics and financial data analysis tasks.

    Data Science for Finance: Targeted Learning from (Big) Data to Economic Stability and Financial Risk Management

    A thesis submitted in partial fulfillment of the requirements for the degree of Doctor in Information Management, specialization in Statistics and Econometrics. The modelling, measurement, and management of systemic financial stability remains a critical issue in most countries. Policymakers, regulators, and managers depend on complex models for financial stability and risk management. These models must be robust, realistic, and consistent with all relevant available data, which requires extensive data disclosure meeting the highest quality standards. However, stressed situations, financial crises, and pandemics are the source of many new risks with new requirements, such as new data sources and different models. This dissertation aims to show the data quality challenges of high-risk situations such as pandemics or economic crises, and to develop new machine learning models for predictive and longitudinal time series modelling. In the first study (Chapter Two) we analyzed and compared the quality of official datasets available for COVID-19, as a best-practice case of a recent high-risk situation with dramatic effects on financial stability. We used comparative statistical analysis to evaluate the accuracy of data collection by a national (Chinese Center for Disease Control and Prevention) and two international (World Health Organization; European Centre for Disease Prevention and Control) organizations, based on the value of systematic measurement errors. We combined Excel files, text mining techniques, and manual data entry to extract the COVID-19 data from official reports and to generate an accurate profile for comparisons. The findings show noticeable and increasing measurement errors in the three datasets as the pandemic outbreak expanded and more countries contributed data to the official repositories, raising data comparability concerns and pointing to the need for better coordination and harmonized statistical methods.
    The study offers a combined COVID-19 dataset and dashboard with minimal systematic measurement errors, and valuable insights into the potential problems of using databanks without carefully examining the metadata and additional documentation that describe the overall context of the data. In the second study (Chapter Three) we discussed credit risk, the most significant source of risk in banking, one of the most important sectors among financial institutions. We proposed a new machine learning approach for online credit scoring that is conservative and robust enough for unstable, high-risk situations. This chapter addresses credit scoring in risk management and presents a novel method for default prediction of high-risk branches or customers. The study uses the Kruskal-Wallis non-parametric statistic to form a conservative credit-scoring model and studies its impact on modeling performance, to the benefit of the credit provider. The findings show that the new credit scoring methodology achieves a reasonable coefficient of determination and a very low false-negative rate. It is computationally less expensive while retaining high accuracy, with around an 18% improvement in recall/sensitivity. Given the recent perspective of continued credit/behavior scoring, our study suggests applying this credit score to non-traditional data sources, allowing online loan providers to study and reveal changes in client behavior over time and to choose reliable unbanked customers based on their application data. This is the first study to develop an online non-parametric credit scoring system that can automatically reselect effective features for continued credit evaluation and weight them by their level of contribution, with good diagnostic ability.
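    The Kruskal-Wallis statistic mentioned above measures whether a feature's values are distributed differently across groups (e.g. defaulters vs. non-defaulters), using ranks rather than raw values. Below is a minimal pure-Python sketch of the H statistic, omitting the tie correction and p-value lookup; it is purely illustrative and does not reproduce the dissertation's scoring pipeline.

    ```python
    def ranks(values):
        """Average ranks (1-based), with ties sharing their mean rank."""
        order = sorted(range(len(values)), key=lambda i: values[i])
        r = [0.0] * len(values)
        i = 0
        while i < len(order):
            j = i
            while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
                j += 1
            avg = (i + j) / 2 + 1          # mean of positions i+1 .. j+1
            for k in range(i, j + 1):
                r[order[k]] = avg
            i = j + 1
        return r

    def kruskal_h(*groups):
        """Kruskal-Wallis H statistic over two or more samples."""
        pooled = [v for g in groups for v in g]
        n = len(pooled)
        rk = ranks(pooled)
        h, start = 0.0, 0
        for g in groups:
            rsum = sum(rk[start:start + len(g)])
            h += rsum ** 2 / len(g)        # R_i^2 / n_i term
            start += len(g)
        return 12.0 / (n * (n + 1)) * h - 3 * (n + 1)
    ```

    A feature-screening step would compute `kruskal_h` per feature between the defaulting and non-defaulting groups and keep the features with the largest statistics.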
    In the third study (Chapter Four) we focus on the financial stability challenges faced by insurance companies and pension schemes when managing systematic (undiversifiable) mortality and longevity risk. For this purpose, we first developed a new ensemble learning strategy for panel time-series forecasting and studied its application to tracking respiratory disease excess mortality during the COVID-19 pandemic. Layered learning is an ensemble-learning approach that addresses a predictive task with different predictive models when a direct mapping from inputs to outputs is not accurate. We adopt a layered learning approach within an ensemble learning strategy to solve predictive tasks with improved performance, taking advantage of multiple learning processes in a single ensemble model. In the proposed strategy, the appropriate holdout for each model is specified individually. Additionally, the models in the ensemble are selected by a proposed selection approach and combined dynamically based on their predictive performance, providing a high-performance ensemble model that automatically copes with the different kinds of time series of each panel member. For the experimental section, we studied more than twelve thousand observations in a portfolio of 61 time series (countries) of reported respiratory disease deaths at monthly sampling frequency to show the improvement in predictive performance. We then compare each country's forecasts of respiratory disease deaths generated by our model with the corresponding COVID-19 deaths in 2020. The results of this large set of experiments show that the accuracy of the ensemble model improves noticeably when different holdouts are used for the different contributing time series methods, chosen by the proposed model selection method.
    These improved time series models provide proper forecasts of respiratory disease deaths for each country, exhibiting a high correlation (0.94) with COVID-19 deaths in 2020. In the fourth study (Chapter Five) we used the new ensemble learning approach for time series modeling discussed in the previous chapter, accompanied by K-means clustering, to forecast life tables in COVID-19 times. Stochastic mortality modeling plays a critical role in public pension design, population and public health projections, and in the design, pricing, and risk management of life insurance contracts and longevity-linked securities. There is no general method for forecasting mortality rates applicable to all situations, especially for unusual years such as the COVID-19 pandemic. In this chapter, we investigate the feasibility of using an ensemble of traditional and machine learning time series methods to empower forecasts of age-specific mortality rates for groups of countries that share common longevity trends. We use Generalized Age-Period-Cohort stochastic mortality models to capture age and period effects, apply K-means clustering to the time series to group countries following common longevity trends, and use ensemble learning to forecast life expectancy and annuity prices by age and sex. To calibrate the models, we use data for 14 European countries from 1960 to 2018. The results show that the ensemble method yields the most robust results overall, with minimum RMSE in the presence of structural changes in the shape of the time series at the time of COVID-19. In the dissertation's conclusions (Chapter Six), we provide more detailed insights into the overall contributions of this dissertation to financial stability and risk management through data science, as well as opportunities, limitations, and avenues for future research on the application of data science in finance and the economy.

    Machine Learning applied to credit risk assessment: Prediction of loan defaults

    Dissertation presented as the partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics, specialization in Data Science. Due to the recent financial crisis and the regulatory concerns of Basel II, credit risk assessment has become a very important topic in the field of financial risk management. Financial institutions need to take great care when dealing with consumer loans in order to avoid losses and opportunity costs. To this end, credit scoring systems have been used to make informed decisions on whether or not to grant credit to the clients who apply. Several credit scoring models have been proposed so far, from statistical models to more complex artificial intelligence techniques. However, most previous work focuses on single classifiers. Ensemble learning is a powerful machine learning paradigm which has proven to be of great value in solving a variety of problems. This study compares the performance of the industry standard, logistic regression, to four ensemble methods, i.e. AdaBoost, Gradient Boosting, Random Forest and Stacking, in identifying potential loan defaults. All the models were built on a real-world dataset with over one million customers from Lending Club, a financial institution based in the United States. The models were compared using the hold-out method as the evaluation design, and accuracy, AUC, type I error and type II error as evaluation metrics. Experimental results reveal that the ensemble classifiers outperformed logistic regression on three key indicators: accuracy, type I error and type II error. AdaBoost performed best among the classifiers considering a trade-off between all the metrics evaluated. The main contribution of this thesis is an experimental addition to the literature on the preferred models for predicting potential loan defaulters.
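    The hold-out evaluation design and error metrics named above can be sketched as follows. Conventions for which class counts as "positive" vary by author; here default = 1, type I error = good loans flagged as defaults, and type II error = defaults missed. The split ratio, seed, and function names are assumptions, not taken from the thesis.

    ```python
    import random

    def holdout_split(rows, labels, test_frac=0.3, seed=0):
        """Shuffle once and split into train/test partitions (hold-out)."""
        idx = list(range(len(rows)))
        random.Random(seed).shuffle(idx)
        cut = int(len(idx) * (1 - test_frac))
        train, test = idx[:cut], idx[cut:]
        return ([rows[i] for i in train], [labels[i] for i in train],
                [rows[i] for i in test], [labels[i] for i in test])

    def error_rates(y_true, y_pred):
        """Accuracy plus type I / type II error rates (default = class 1)."""
        fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
        neg = y_true.count(0) or 1
        pos = y_true.count(1) or 1
        acc = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
        return {"accuracy": acc,
                "type_I": fp / neg,    # false-positive rate
                "type_II": fn / pos}   # false-negative rate
    ```

    Each candidate model would be fitted on the train partition and scored with `error_rates` on the test partition, so all classifiers are compared on the same held-out customers.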

    A literature review on the application of evolutionary computing to credit scoring

    Recent years have seen the development of many credit scoring models for assessing the creditworthiness of loan applicants. Traditional credit scoring methodology has involved the use of statistical and mathematical programming techniques such as discriminant analysis, linear and logistic regression, linear and quadratic programming, or decision trees. However, the importance of credit-granting decisions for financial institutions has caused growing interest in using a variety of computational intelligence techniques. This paper concentrates on evolutionary computing, which is viewed as one of the most promising paradigms of computational intelligence. Taking into account the synergistic relationship between the communities of Economics and Computer Science, the aim of this paper is to summarize the most recent developments in the application of evolutionary algorithms to credit scoring by means of a thorough review of scientific articles published during the period 2000–2012. This work has been partially supported by the Spanish Ministry of Education and Science under grant TIN2009-14205 and the Generalitat Valenciana under grant PROMETEO/2010/028.

    How Secure Are Good Loans: Validating Loan-Granting Decisions And Predicting Default Rates On Consumer Loans

    The failure or success of the banking industry depends largely on the industry's ability to properly evaluate credit risk. In the consumer-lending context, the bank's goal is to maximize income by issuing as many good loans to consumers as possible while avoiding losses associated with bad loans. Mistakes can severely affect profits because the losses associated with one bad loan may undermine the income earned on many good loans. Therefore, banks carefully evaluate the financial status of each customer as well as their creditworthiness, and weigh them against the bank's internal loan-granting policies. Recognizing that even a small improvement in credit scoring accuracy translates into significant future savings, the banking industry and the scientific community have been employing various machine learning and traditional statistical techniques to improve credit risk prediction accuracy. This paper examines historical data from consumer loans issued by a financial institution to individuals the institution deemed to be qualified customers. The data consist of the financial attributes of each customer and include a mixture of loans that the customers paid off and defaulted upon. The paper uses three different data mining techniques (decision trees, neural networks, logit regression) and an ensemble model, which combines the three techniques, to predict whether a particular customer defaulted on or paid off his/her loan. The paper then compares the effectiveness of each technique and analyzes the risk of default inherent in each loan and group of loans. These data mining classification techniques can enable banks to classify consumers into credit risk groups more precisely. Knowing which risk group a consumer falls into would allow a bank to fine-tune its lending policies by recognizing high-risk groups of consumers to whom loans should not be issued, and by identifying safer loans that should be issued on terms commensurate with the risk of default.
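    The combination step described above (three base models merged into an ensemble, with consumers then bucketed into risk groups) can be sketched as follows. The paper does not state its exact combination rule or group cutoffs, so the majority vote and probability thresholds below are assumptions for illustration.

    ```python
    def majority_ensemble(models, x):
        """Class predicted by most base models (tree, NN, logit stubs)."""
        votes = [m(x) for m in models]
        return max(set(votes), key=votes.count)

    def risk_group(p_default, cutoffs=(0.05, 0.20)):
        """Bucket a consumer by predicted default probability (cutoffs assumed)."""
        low, high = cutoffs
        if p_default < low:
            return "low risk"
        if p_default < high:
            return "medium risk"
        return "high risk"
    ```

    Here `models` would be the three fitted classifiers exposed as callables; a lending policy could then issue, reprice, or decline applications according to the returned risk group.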