
    Recent methods from statistics and machine learning for credit scoring

    Credit scoring models are fundamental to financial institutions such as retail and consumer credit banks. Their purpose is to evaluate the likelihood that credit applicants will default, in order to decide whether to grant them credit. The area under the receiver operating characteristic (ROC) curve (AUC) is one of the most commonly used measures of predictive performance in credit scoring. The aim of this thesis is to benchmark different methods for building scoring models with respect to maximizing the AUC. While this measure is used to evaluate the predictive accuracy of the presented algorithms, the AUC is also introduced as a direct optimization criterion. Logistic regression is the most widely used method for creating credit scorecards and classifying applicants into risk classes. Since this development process based on the logit model is standard practice in retail banking, its predictive accuracy serves as the benchmark throughout this thesis. The AUC approach is a central contribution of this work: instead of using maximum likelihood estimation, the AUC itself is taken as the objective function and optimized directly. The coefficients are estimated by computing the AUC with the Wilcoxon-Mann-Whitney statistic and optimizing it with the Nelder-Mead algorithm. This AUC optimization is a distribution-free approach, which is analyzed in a simulation study to investigate the theoretical considerations. It can be shown that the approach still works even if the underlying distribution is not logistic. In addition to the AUC approach and classical, well-known methods such as generalized additive models, new methods from statistics and machine learning are evaluated for the credit scoring case. Conditional inference trees, model-based recursive partitioning methods and random forests are presented as recursive partitioning algorithms. Boosting algorithms are also explored, additionally using the AUC as a loss function. The empirical evaluation is based on data from a German bank; 26 attributes from the application scoring are included in the analysis. Besides the AUC, different performance measures are used to evaluate the predictive performance of scoring models. While classification trees cannot improve predictive accuracy for the credit scoring case at hand, the AUC approach and specific boosting methods outperform the robust classical scoring models in terms of the AUC measure.
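    The core of the AUC approach lends itself to a compact illustration. The sketch below (Python, with synthetic placeholder data; the thesis does not prescribe an implementation) computes the AUC of a linear score via the Wilcoxon-Mann-Whitney rank statistic and hands the negated AUC to Nelder-Mead as the objective:

```python
import numpy as np
from scipy.stats import rankdata
from scipy.optimize import minimize

def wmw_auc(scores, y):
    """AUC via the Wilcoxon-Mann-Whitney rank statistic."""
    ranks = rankdata(scores)            # mid-ranks handle ties
    n_pos = int(y.sum())
    n_neg = len(y) - n_pos
    rank_sum_pos = ranks[y == 1].sum()
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def neg_auc(beta, X, y):
    # Negated because scipy minimizes; the score is a linear index X @ beta.
    return -wmw_auc(X @ beta, y)

# Synthetic example: two predictors, binary default indicator.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = (X @ np.array([1.5, -1.0]) + rng.normal(size=500) > 0).astype(int)

# Distribution-free fit: maximize the AUC directly with Nelder-Mead.
res = minimize(neg_auc, x0=np.ones(2), args=(X, y), method="Nelder-Mead")
print("coefficients:", res.x, "in-sample AUC:", -res.fun)
```

    Since the AUC depends only on the ranking of the scores, it is invariant to positive rescaling, so the coefficients are identified only up to a scale factor; this rank-based objective is exactly what makes the approach distribution-free.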

    Quantifying the relationship between food sharing practices and socio-ecological variables in small-scale societies: A cross-cultural multi-methodological approach

    This article presents a cross-cultural study of the relationship among the subsistence strategies, the environmental setting and the food sharing practices of 22 modern small-scale societies located in America (n = 18) and Siberia (n = 4). Ecological, geographical and economic variables of these societies were extracted from the specialized literature and the publicly available D-PLACE database. The proposed approach comprises a variety of quantitative methods, ranging from exploratory techniques aimed at capturing relationships of any type between variables, to network theory and supervised-learning predictive modelling. Results provided by all techniques consistently show that the differences observed in food sharing practices across the sampled populations cannot be explained solely by the differential distribution of ecological, geographical and economic variables. Food sharing has to be interpreted as a more complex cultural phenomenon, whose variation over time and space cannot be ascribed only to local adaptation. Funding: Spanish Ministry of Science, Innovation and Universities: SimulPast Project (CSD2010-00034 CONSOLIDER-INGENIO 2010) (VA, JC, EB, DZ, MM, JMG), Consolider Excellence Network (HAR2017-90883-REDC) (VA, JC, EB, DZ, MM, JMG), and CULM Project (HAR2016-77672-P) (DZ, JC, MM).
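    As a rough illustration of the supervised-learning step, the sketch below (Python; the data, variable names and classifier choice are placeholders, not the study's actual pipeline) checks whether a food-sharing category can be predicted from socio-ecological variables better than chance under cross-validation, which is the kind of evidence the conclusion rests on:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical feature matrix: one row per society, columns are
# ecological/geographical/economic variables (D-PLACE-style codes).
rng = np.random.default_rng(1)
X = rng.normal(size=(22, 6))             # 22 societies, 6 variables
sharing = rng.integers(0, 2, size=22)    # placeholder binary sharing practice

# If socio-ecological variables alone drove sharing practices,
# cross-validated accuracy should clearly beat the chance level.
clf = RandomForestClassifier(n_estimators=200, random_state=0)
scores = cross_val_score(clf, X, sharing, cv=3)
print("mean CV accuracy:", scores.mean())
```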

    Bias-reduced doubly robust estimation


    A guided analytics tool for feature selection in steel manufacturing with an application to blast furnace top gas efficiency

    In knowledge-intensive industries such as steel manufacturing, applying data analytics to optimise process performance requires effective knowledge transfer between domain experts and data scientists. This is often an inefficient path to follow, requiring much iteration while being suboptimal for long-term organisational knowledge capture. With the ‘initial Guided Analytics for parameter Testing and controlband Extraction (iGATE)’ tool we created a feature selection framework that finds influential process parameters and their optimal control bands, and which can easily be made available to process operators in the form of a guided analytics tool while allowing them to modify the analysis according to their expertise. The method is embedded in a workflow whereby the extracted parameters and control bands are verified by the domain expert and a report of the analysis is generated automatically. The approach allows us to combine the power of suitable statistical analysis with process expertise, while dramatically reducing the time needed for feature selection. We regard this application as a stepping stone to gain user confidence before more autonomous analytics approaches are introduced. We present the statistical foundations of iGATE and illustrate its effectiveness in a case study of Tata Steel blast furnace data. We have made the iGATE core functionality freely available in the igate package for the R programming language.
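    The statistical core of such a guided feature selection can be sketched as follows (Python; the function name, the Mann-Whitney test and the quantile-based band are illustrative stand-ins, not the igate package's actual API, which is documented with the R package). The idea: compare each parameter's distribution in the best versus worst batches by the target KPI, keep the parameters that differ significantly, and read a control band off the good batches:

```python
import numpy as np
import pandas as pd
from scipy.stats import mannwhitneyu

def select_parameters(df, target, frac=0.1, alpha=0.05):
    """For each process parameter, compare its distribution in the best
    vs. worst batches (assuming a higher target KPI, e.g. top gas
    efficiency, is better) and, where it differs significantly,
    extract a control band from the best batches."""
    n = max(1, int(frac * len(df)))
    ordered = df.sort_values(target)
    worst, best = ordered.head(n), ordered.tail(n)
    bands = {}
    for col in df.columns.drop(target):
        stat, p = mannwhitneyu(best[col], worst[col])
        if p < alpha:
            # Control band: central range of the parameter among good batches.
            bands[col] = (best[col].quantile(0.05), best[col].quantile(0.95))
    return bands
```

    A domain expert would then review the returned bands before they are handed to operators, which is exactly the verification step the workflow above describes.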

    Probabilistic index models


    Novel Computationally Intelligent Machine Learning Algorithms for Data Mining and Knowledge Discovery

    This thesis addresses three major issues in data mining: feature subset selection in high-dimensional domains, plausible reconstruction of incomplete data in cross-sectional applications, and forecasting of univariate time series. For the automated selection of an optimal subset of features in real time, we present an improved hybrid algorithm, SAGA. SAGA combines the ability of Simulated Annealing to avoid being trapped in local minima with the very high convergence rate of the Genetic Algorithm crossover operator, the strong local search ability of greedy algorithms, and the high computational efficiency of generalized regression neural networks (GRNNs). For imputing missing values and forecasting univariate time series, we propose a homogeneous neural network ensemble: a committee of GRNNs trained on different subsets of features generated by SAGA, whose base-classifier predictions are combined by a fusion rule. This approach makes it possible to discover all important interrelations between the values of the target variable and the input features. The proposed ensemble scheme has two innovative features that make it stand out among ensemble learning algorithms: (1) the ensemble makeup is optimized automatically by SAGA; and (2) a GRNN is used both for the base classifiers and for the top-level combiner classifier. Because of the GRNN, the proposed ensemble is a dynamic weighting scheme, in contrast to existing ensemble approaches that rely on simple voting or static weighting. The basic idea of the dynamic weighting procedure is to give a higher reliability weight to those scenarios that are similar to the new one. The simulation results demonstrate the validity of the proposed ensemble model.
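    A GRNN is essentially a Gaussian-kernel-weighted average of the training targets, which is what makes the ensemble a dynamic weighting scheme: training cases similar to the query receive higher weight. A minimal sketch (Python; names and data are illustrative, not the thesis implementation):

```python
import numpy as np

def grnn_predict(X_train, y_train, X_new, sigma=0.5):
    """Generalized regression neural network (Specht, 1991): predictions
    are kernel-weighted averages of training targets, so each query is
    dynamically weighted by its similarity to the training cases."""
    # Squared Euclidean distances between query and training points.
    d2 = ((X_new[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=-1)
    w = np.exp(-d2 / (2 * sigma**2))       # pattern-layer activations
    return (w @ y_train) / w.sum(axis=1)   # summation/output layers

# Usage: each ensemble member would be a GRNN trained on a different
# SAGA-selected feature subset; the top-level combiner is itself a GRNN.
rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, size=(100, 3))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=100)
print(grnn_predict(X, y, X[:5], sigma=0.3))
```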