240 research outputs found

    Machine learning approach for credit score analysis : a case study of predicting mortgage loan defaults

    Get PDF
    Dissertation submitted in partial fulfilment of the requirements for the degree of Master in Statistics and Information Management, specialization in Risk Analysis and Management. To effectively manage credit score analysis, financial institutions have developed techniques and models designed primarily to improve the assessment of creditworthiness during the credit evaluation process. The foremost objective is to classify their clients (borrowers) into either the non-defaulter group, which is more likely to meet its financial obligations, or the defaulter group, which has a higher probability of failing to repay its debts. In this paper, we use machine learning models to predict mortgage defaults. The study employs several single-classifier machine learning methods, including Logistic Regression, Classification and Regression Trees, Random Forest, K-Nearest Neighbors, and Support Vector Machines. To further improve predictive power, a meta-algorithm ensemble approach, stacking, is introduced to combine the output probabilities of the aforementioned methods. The sample for this study is based solely on the dataset publicly provided by Freddie Mac. With this approach, we achieve an improvement in the models' predictive performance. We then compare the performance of each model, and of the meta-learner, by plotting the ROC curve and computing the AUC. This study extends various preceding studies that used different techniques to enhance predictive performance. Finally, our results are compared with the work of other authors.
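
    A minimal sketch of the stacking scheme this abstract describes: out-of-fold probabilities from the five named base classifiers are fed to a logistic-regression meta-learner and compared on AUC. The data below is a synthetic placeholder for the Freddie Mac loan-level dataset, and all hyperparameters are illustrative assumptions.

```python
# Hedged sketch of the stacking approach described above. X and y are
# placeholders; the thesis uses the Freddie Mac single-family dataset.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.metrics import roc_auc_score

X, y = np.random.rand(1000, 10), np.random.randint(0, 2, 1000)  # placeholder data
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

base_learners = [
    LogisticRegression(max_iter=1000),
    DecisionTreeClassifier(),                  # CART
    RandomForestClassifier(n_estimators=200),
    KNeighborsClassifier(),
    SVC(probability=True),
]

# Out-of-fold probabilities avoid leaking training labels into the meta-learner.
meta_train = np.column_stack([
    cross_val_predict(clf, X_tr, y_tr, cv=5, method="predict_proba")[:, 1]
    for clf in base_learners
])
meta_test = np.column_stack([
    clf.fit(X_tr, y_tr).predict_proba(X_te)[:, 1] for clf in base_learners
])

meta_learner = LogisticRegression().fit(meta_train, y_tr)
print("stacked AUC:", roc_auc_score(y_te, meta_learner.predict_proba(meta_test)[:, 1]))
```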

    Predicting class-imbalanced business risk using resampling, regularization, and model ensembling algorithms

    Get PDF
    We aim to develop and improve imbalanced business-risk modeling by jointly using proper evaluation criteria, resampling, cross-validation, classifier regularization, and ensembling techniques. The Area Under the Receiver Operating Characteristic Curve (AUC) is used for model comparison based on 10-fold cross-validation. Two undersampling strategies, random undersampling (RUS) and cluster centroid undersampling (CCUS), as well as two oversampling methods, random oversampling (ROS) and the Synthetic Minority Oversampling Technique (SMOTE), are applied. Three highly interpretable classifiers are implemented: logistic regression without regularization (LR), L1-regularized LR (L1LR), and decision tree (DT). Two ensembling techniques, Bagging and Boosting, are applied to the DT classifier for further model improvement. The results show that Boosting on DT using data oversampled to 50% positives via SMOTE is the optimal model, achieving an AUC, recall, and F1 score of 0.8633, 0.9260, and 0.8907, respectively.
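
    A hedged sketch of the reported optimal configuration: SMOTE oversampling to a 50/50 class balance feeding boosted decision trees, scored with 10-fold cross-validated AUC. The synthetic data, the imbalanced-learn pipeline, and the boosting hyperparameters are assumptions, not values taken from the paper.

```python
# Hedged sketch: SMOTE to 50% positives, then boosted decision trees,
# evaluated with 10-fold cross-validated AUC. X and y are placeholders.
import numpy as np
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))
y = (rng.random(1000) < 0.1).astype(int)  # ~10% positives, i.e. imbalanced

# Putting SMOTE inside the pipeline resamples only the training folds,
# so the AUC estimate is not biased by synthetic test examples.
model = Pipeline([
    ("smote", SMOTE(sampling_strategy=1.0, random_state=0)),  # 50% positives
    ("boost", AdaBoostClassifier(n_estimators=200)),          # boosting on trees
])
auc = cross_val_score(model, X, y, cv=10, scoring="roc_auc")
print("mean 10-fold AUC:", auc.mean())
```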

    Credit Risk Scoring: A Stacking Generalization Approach

    Get PDF
    Dissertation presented as the partial requirement for obtaining a Master's degree in Statistics and Information Management, specialization in Risk Analysis and Management. Credit risk regulation has been receiving tremendous attention as a result of the latest global financial crisis. Under the Internal Ratings-Based approach of the Basel guidelines, banks are allowed to use internal risk measures as key drivers in deciding whether to grant a loan to an applicant. Credit scoring is a statistical approach used by financial and banking institutions to evaluate loan applications. When applying for a loan, an applicant fills out an application form detailing their characteristics (e.g., income, marital status, and loan purpose), which serve as inputs to a credit scoring model that produces a score used to determine whether the loan should be granted. This enables faster, more consistent credit approvals and reduces bad debt. Currently, many machine learning and statistical approaches, such as logistic regression and tree-based algorithms, are used individually in credit scoring models, yet contemporary machine learning techniques can outperform classic methods simply by combining models. This dissertation is an empirical study of banking loan default on a publicly available bank loan dataset, using ensemble-based techniques to increase model robustness and predictive power. The proposed ensemble method is based on stacking generalization, an extension of various preceding studies that used different techniques to further enhance model predictive capabilities. The results show that combining different models provides a great deal of flexibility to credit scoring models.
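
    As an illustration only, stacking generalization of the kind described can be sketched with scikit-learn's built-in StackingClassifier; the synthetic loan data and the particular base learners chosen here are assumptions, not the dissertation's exact configuration.

```python
# Hedged sketch of stacking generalization for credit scoring using the
# built-in StackingClassifier. Data and base learners are illustrative.
import numpy as np
from sklearn.ensemble import (StackingClassifier, RandomForestClassifier,
                              GradientBoostingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X = np.random.rand(500, 12)        # applicant features (income, loan purpose, ...)
y = np.random.randint(0, 2, 500)   # 1 = default, 0 = repaid

stack = StackingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=200)),
        ("gbm", GradientBoostingClassifier()),
    ],
    final_estimator=LogisticRegression(),  # meta-learner over base outputs
    stack_method="predict_proba",          # feed class probabilities upward
    cv=5,                                  # out-of-fold predictions for the meta-learner
)
print("stacked AUC:", cross_val_score(stack, X, y, cv=5, scoring="roc_auc").mean())
```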

    A comparative study of tree-based models for churn prediction : a case study in the telecommunication sector

    Get PDF
    Dissertation presented as the partial requirement for obtaining a Master's degree in Statistics and Information Management, specialization in Marketing Research and CRM. In recent years, customer churn, the phenomenon of customers abandoning one company for another, has gained increasing importance. Churn plays an especially important role in saturated industries such as telecommunications, since existing customers are very valuable and the cost of acquiring new customers is now very high. Companies want to know which of their customers are going to churn to another provider, and when, so that measures can be taken to retain those at risk. Such measures can take the form of incentives to likely churners, but the downside is that misclassifying a customer is costly, especially when incentives are given to customers who would not have churned. The common challenges in predicting customer churn are how to pre-process the data and which algorithm to choose, especially when the dataset is heterogeneous, as is very common for telecommunication companies' datasets. This thesis aims to predict customer churn in the telecommunication sector using different decision tree algorithms and their ensemble models.
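
    The comparison described could look roughly like the sketch below: a single decision tree against two of its ensemble variants, scored with cross-validated F1. The churn features, labels, and hyperparameters are synthetic placeholders, not the thesis's dataset.

```python
# Hedged sketch: compare a decision tree with two tree ensembles on a
# churn-labelled dataset. All data and hyperparameters are placeholders.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X = np.random.rand(1000, 15)        # placeholder usage/contract features
y = np.random.randint(0, 2, 1000)   # 1 = churned

models = {
    "decision tree": DecisionTreeClassifier(max_depth=5),
    "random forest": RandomForestClassifier(n_estimators=200),
    "gradient boosting": GradientBoostingClassifier(),
}
for name, model in models.items():
    f1 = cross_val_score(model, X, y, cv=10, scoring="f1").mean()
    print(f"{name}: mean 10-fold F1 = {f1:.3f}")
```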

    Fine-Grained Static Detection of Obfuscation Transforms Using Ensemble-Learning and Semantic Reasoning

    Get PDF
    The ability to efficiently detect the software protections in use is paramount to selecting and applying adequate deobfuscation techniques. We present a novel approach that combines semantic reasoning techniques with ensemble-learning classification to provide a static detection framework for obfuscation transformations. In contrast to existing work, we provide a methodology that can detect multiple layers of obfuscation without depending on knowledge of the underlying functionality of the training set used. We also extend our work to detect the constructions of obfuscation transformations, thus providing a fine-grained methodology. To that end, we provide several studies of best practices for applying machine learning techniques in a scalable and efficient model. According to our experimental results and evaluations on obfuscators such as Tigress and OLLVM, our models achieve up to 91% accuracy on state-of-the-art obfuscation transformations, and our overall accuracies for their constructions are up to 100%.
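
    The paper pairs semantic reasoning with ensemble-learning classification; the sketch below illustrates only the latter half, under the assumption that each binary has already been reduced to a static feature vector (e.g., an opcode histogram). The feature extraction, transform labels, and classifier choice are all hypothetical.

```python
# Hedged sketch of the ensemble-classification half only: a tree ensemble
# over hypothetical static feature vectors. Labels and features are
# placeholders, not the paper's actual pipeline.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.random((600, 64))        # one opcode-histogram row per binary (assumed)
y = rng.integers(0, 4, 600)      # 0=none, 1=flattening, 2=opaque predicates, 3=virtualization

clf = RandomForestClassifier(n_estimators=300)  # tree ensemble over static features
print("accuracy:", cross_val_score(clf, X, y, cv=5, scoring="accuracy").mean())
```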

    Interpretable Machine Learning Model for Clinical Decision Making

    Get PDF
    Despite machine learning models being increasingly used in medical decision-making and meeting classification accuracy standards, they remain untrusted black boxes due to decision-makers' lack of insight into their complex logic. It is therefore necessary to develop interpretable machine learning models that engender trust in the knowledge they generate and encourage clinical decision-makers to adopt them in the field. The goal of this dissertation was to systematically investigate the applicability of interpretable, model-agnostic methods for explaining the predictions of black-box machine learning models in medical decision-making. As a proof of concept, this study addressed the problem of predicting the risk of emergency readmission within 30 days of discharge for heart failure patients. Using a benchmark dataset, supervised classification models of differing complexity were trained to perform the prediction task. Specifically, Logistic Regression (LR), Random Forest (RF), Decision Tree (DT), and Gradient Boosting Machine (GBM) models were constructed using the Healthcare Cost and Utilization Project (HCUP) Nationwide Readmissions Database (NRD). The precision, recall, and area under the ROC curve of each model were used to measure predictive accuracy. Local Interpretable Model-Agnostic Explanations (LIME) was used to generate explanations from the underlying trained models, and these explanations were empirically evaluated using explanation stability and local fit (R²). The results demonstrate that the local explanations generated by LIME produced better estimates for Decision Tree (DT) classifiers.
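
    A minimal sketch of the LIME workflow evaluated in the dissertation: train a black-box classifier, then explain a single prediction locally and read off the local fit (R²). The synthetic features stand in for the HCUP NRD readmission data, and the model and parameters are illustrative assumptions.

```python
# Hedged sketch of the LIME workflow. Synthetic stand-ins replace the
# HCUP NRD readmission features used in the dissertation.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from lime.lime_tabular import LimeTabularExplainer

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 6))
y = (X[:, 0] + rng.normal(scale=0.5, size=1000) > 0).astype(int)  # placeholder readmission label
model = GradientBoostingClassifier().fit(X, y)  # the black box to explain

explainer = LimeTabularExplainer(
    X, feature_names=[f"f{i}" for i in range(6)],
    class_names=["no readmit", "readmit"], mode="classification",
)
exp = explainer.explain_instance(X[0], model.predict_proba, num_features=4)
print(exp.as_list())                 # per-feature contributions for this one prediction
print("local fit R^2:", exp.score)   # R^2 of LIME's local linear surrogate
```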

    Classification of One-Year Student Persistence: A Machine Learning Approach

    Get PDF
    Multilayer perceptron neural networks, Gaussian naïve Bayes, and logistic regression classifiers were compared when used to make early predictions of one-year college student persistence. Two iterations of each model were built, using a grid search within 10-fold cross-validation to tune model parameters for optimal performance on the F-Beta and F-1 classification metrics. The results of logistic regression, the historically favored approach in the domain, were compared to the alternative multilayer perceptron and naïve Bayes approaches based primarily on F-Beta and F-1 score performance on a hold-out dataset. A single logistic regression model was found to perform optimally on both F-1 and F-Beta, outperforming all four of the individual alternative models on the evaluation criteria of concern. A majority-voting ensemble and two additional ensembles with empirically derived weights were also applied to the hold-out set; the logistic regression model likewise outperformed all three ensemble models on the scoring metrics of concern. A visualization technique for comparing and summarizing case-level classifier performance was introduced. The features used in the modeling process comprised traditional and non-traditional elements.
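
    The tuning protocol described, a grid search within 10-fold cross-validation scored on F-Beta, might be sketched as follows; the beta value, parameter grid, and synthetic persistence data are all assumptions.

```python
# Hedged sketch: grid search inside 10-fold CV, scored on F-Beta.
# The beta value and parameter grid are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, fbeta_score

X = np.random.rand(800, 10)       # placeholder student features
y = np.random.randint(0, 2, 800)  # 1 = persisted one year

fbeta = make_scorer(fbeta_score, beta=2)  # beta > 1 weights recall more (assumed)
search = GridSearchCV(
    LogisticRegression(max_iter=1000, solver="liblinear"),
    param_grid={"C": [0.01, 0.1, 1, 10], "penalty": ["l1", "l2"]},
    scoring=fbeta, cv=10,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```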

    Multi-tier framework for the inferential measurement and data-driven modeling

    Get PDF
    A framework for inferential measurement and data-driven modeling has been proposed and assessed in several real-world application domains. The architecture of the framework is structured in multiple tiers to facilitate extensibility and the integration of new components. Each of the four proposed tiers has been assessed independently to verify its suitability. The first tier, dealing with exploratory data analysis, was assessed through the characterization of the chemical space related to the biodegradation of organic chemicals; this analysis established relationships between physicochemical variables and biodegradation rates that were used for model development. At the preprocessing level, a novel method for feature selection based on dissimilarity measures between Self-Organizing Maps (SOM) has been developed and assessed. The proposed method selects more features than others published in the literature but leads to models with improved predictive power. Single and multiple data imputation techniques based on the SOM have also been used to recover missing data in a Wastewater Treatment Plant benchmark. A new dynamic method to adjust the centers and widths in Radial Basis Function networks has been proposed to predict water quality; it outperformed other neural networks. The proposed modeling components have also been assessed in the development of prediction and classification models for biodegradation rates in different media. The results proved the suitability of this approach for developing data-driven models when the complex dynamics of the process prevent the formulation of mechanistic models. The use of rule-generation algorithms and Bayesian dependency models has been preliminarily screened to provide the framework with interpretation capabilities; preliminary results from the classification of Modes of Toxic Action (MOA) indicate that using MOAs as proxy indicators of the human health effects of chemicals could be a promising approach. Finally, the complete framework has been applied to three different modeling scenarios. First, a virtual sensor system capable of inferring product-quality indices from primary process variables has been developed and assessed; integrated with the control system of a real chemical plant, it outperformed the multilinear correlation models usually adopted by chemical manufacturers. Second, a model to predict carcinogenicity from molecular structure for a set of aromatic compounds has been developed and tested; applying the SOM-dissimilarity feature selection method yielded better results than models published in the literature. Third, the framework has been used to support a new approach to environmental modeling and risk management within geographical information systems (GIS): the SOM has been successfully used to characterize exposure scenarios and to estimate missing data through geographic interpolation, and the combination of SOM and Gaussian Mixture models facilitated the formulation of a new probabilistic risk assessment approach.
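
    As one concrete illustration of the SOM-based imputation component mentioned above, the sketch below fits a SOM on complete rows (using the third-party minisom package) and fills missing entries from the best-matching codebook vector, matching only on observed dimensions. The synthetic matrix stands in for the Wastewater Treatment Plant benchmark, and this particular imputation rule is an assumption about the general technique, not the thesis's exact method.

```python
# Hedged sketch of SOM-based single imputation using the minisom package.
# The data matrix is a synthetic placeholder.
import numpy as np
from minisom import MiniSom

rng = np.random.default_rng(0)
data = rng.normal(size=(300, 5))  # complete rows used to fit the SOM
som = MiniSom(8, 8, 5, sigma=1.5, learning_rate=0.5, random_seed=0)
som.train_random(data, 2000)
codebook = som.get_weights().reshape(-1, 5)  # 64 prototype vectors

def som_impute(row):
    """Fill NaNs in `row` from the codebook vector closest on observed dims."""
    observed = ~np.isnan(row)
    d = np.linalg.norm(codebook[:, observed] - row[observed], axis=1)
    bmu = codebook[np.argmin(d)]   # best-matching unit on observed dimensions
    row = row.copy()
    row[~observed] = bmu[~observed]
    return row

incomplete = np.array([0.1, np.nan, -0.3, np.nan, 0.8])
print(som_impute(incomplete))
```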

    Euclid preparation: XXII. Selection of Quiescent Galaxies from Mock Photometry using Machine Learning

    Full text link
    The Euclid Space Telescope will provide deep imaging at optical and near-infrared wavelengths, along with slitless near-infrared spectroscopy, across ~15,000 sq deg of the sky. Euclid is expected to detect ~12 billion astronomical sources, facilitating new insights into cosmology, galaxy evolution, and various other topics. To optimally exploit the expected very large data set, appropriate methods and software need to be developed. Here we present a novel machine-learning-based methodology for the selection of quiescent galaxies using broad-band Euclid I_E, Y_E, J_E, H_E photometry, in combination with multiwavelength photometry from other surveys. The ARIADNE pipeline uses meta-learning to fuse decision-tree ensembles, nearest-neighbours, and deep-learning methods into a single classifier that yields significantly higher accuracy than any of the individual learning methods separately. The pipeline has 'sparsity-awareness', so that missing photometry values remain informative for the classification. Our pipeline derives photometric redshifts for galaxies selected as quiescent, aided by the 'pseudo-labelling' semi-supervised method. After application of the outlier filter, our pipeline achieves a normalized mean absolute deviation of ≲ 0.03 and a fraction of catastrophic outliers of ≲ 0.02 when measured against the COSMOS2015 photometric redshifts. We apply our classification pipeline to mock galaxy photometry catalogues corresponding to three main scenarios: (i) Euclid Deep Survey with ancillary ugriz, WISE, and radio data; (ii) Euclid Wide Survey with ancillary ugriz, WISE, and radio data; (iii) Euclid Wide Survey only. Our classification pipeline outperforms UVJ selection, in addition to the Euclid I_E-Y_E, J_E-H_E and u-I_E, I_E-J_E colour-colour methods, with improvements in completeness and the F1-score of up to a factor of 2. (Abridged)
    Comment: 37 pages (including appendices), 26 figures; accepted for publication in Astronomy & Astrophysics
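
    For reference, the two photometric-redshift quality metrics quoted above can be computed as in the sketch below, using the median-based NMAD convention common in the photo-z literature; the 0.15 catastrophic-outlier threshold and the synthetic redshifts are assumptions, not values from the paper.

```python
# Hedged sketch: NMAD and catastrophic-outlier fraction on the scaled
# residuals dz = (z_phot - z_ref) / (1 + z_ref). The 0.15 threshold is
# a common convention, assumed here rather than taken from the paper.
import numpy as np

def photo_z_metrics(z_phot, z_ref, outlier_thresh=0.15):
    dz = (z_phot - z_ref) / (1.0 + z_ref)
    nmad = 1.4826 * np.median(np.abs(dz - np.median(dz)))  # median-based NMAD
    f_outlier = np.mean(np.abs(dz) > outlier_thresh)       # catastrophic outliers
    return nmad, f_outlier

# Synthetic example: photometric redshifts scattered around reference values.
rng = np.random.default_rng(0)
z_ref = rng.uniform(0.1, 2.0, 1000)
z_phot = z_ref + rng.normal(scale=0.02, size=1000) * (1 + z_ref)
nmad, f_out = photo_z_metrics(z_phot, z_ref)
print(f"NMAD = {nmad:.4f}, outlier fraction = {f_out:.4f}")
```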