Recent methods from statistics and machine learning for credit scoring
Credit scoring models are fundamental tools for financial institutions such as retail and consumer credit banks. Their purpose is to evaluate the likelihood that credit applicants will default, in order to decide whether to grant them credit. The area under the receiver operating characteristic (ROC) curve (AUC) is one of the most commonly used measures of predictive performance in credit scoring. The aim of this thesis is to benchmark different methods for building scoring models with respect to maximizing the AUC. While this measure is used to evaluate the predictive accuracy of the presented algorithms, the AUC is also introduced as a direct optimization criterion.
The logistic regression model is the most widely used method for creating credit scorecards and classifying applicants into risk classes. Since this development process, based on the logit model, is standard practice in retail banking, the predictive accuracy of this procedure serves as the benchmark throughout this thesis.
The AUC approach is a central contribution of this work. Instead of using maximum likelihood estimation, the AUC is taken as the objective function and optimized directly. The coefficients are estimated by computing the AUC via the Wilcoxon-Mann-Whitney statistic and maximizing it with the Nelder-Mead algorithm. AUC optimization is a distribution-free approach, which is analyzed in a simulation study to investigate the theoretical considerations. The approach is shown to work even when the underlying distribution is not logistic.
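The estimation scheme described above can be sketched in a few lines of Python: the AUC of a linear score is computed as the Wilcoxon-Mann-Whitney statistic and maximized with the derivative-free Nelder-Mead simplex method, which suits the non-smooth AUC objective. The toy data, variable names, and starting values are illustrative assumptions, not taken from the thesis.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Synthetic credit data: two applicant features, y = 1 marks a defaulter.
n = 400
X = rng.normal(size=(n, 2))
y = (X @ np.array([1.5, -1.0]) + rng.normal(size=n) > 0).astype(int)

def wmw_auc(scores, labels):
    """AUC as the Wilcoxon-Mann-Whitney statistic: the fraction of
    (defaulter, non-defaulter) pairs that the score ranks correctly,
    with ties counted as half."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    diff = pos[:, None] - neg[None, :]
    return (diff > 0).mean() + 0.5 * (diff == 0).mean()

def neg_auc(beta):
    # Negated because scipy minimizes; beta are the scorecard coefficients.
    return -wmw_auc(X @ beta, y)

# Nelder-Mead is derivative-free, matching the distribution-free AUC objective.
res = minimize(neg_auc, x0=np.array([1.0, 0.0]), method="Nelder-Mead")
print("optimized AUC:", -res.fun)
```

Because the AUC is invariant to monotone rescaling of the score, the estimated coefficients are only identified up to a positive scale factor, unlike maximum likelihood logit coefficients.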
In addition to the AUC approach and classical, well-known methods such as generalized additive models, new methods from statistics and machine learning are evaluated for the credit scoring case. Conditional inference trees, model-based recursive partitioning methods and random forests are presented as recursive partitioning algorithms. Boosting algorithms are also explored, additionally using the AUC as a loss function.
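Benchmarking such methods against the logit model by their out-of-sample AUC, as the thesis does, can be sketched as follows. The use of scikit-learn, the synthetic data, and the chosen hyperparameters are assumptions for illustration; they do not reproduce the thesis experiments.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for application scoring data.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "logit (benchmark)": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(n_estimators=200, random_state=0),
}
aucs = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    # AUC is computed on held-out data from predicted default probabilities.
    aucs[name] = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    print(f"{name}: AUC = {aucs[name]:.3f}")
```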
The empirical evaluation is based on data from a German bank. From the application scoring, 26 attributes are included in the analysis. Besides the AUC, different performance measures are used to evaluate the predictive performance of the scoring models. While classification trees cannot improve predictive accuracy for the present credit scoring case, the AUC approach and special boosting methods outperform the robust classical scoring models in terms of the AUC measure.
Quantifying the relationship between food sharing practices and socio-ecological variables in small-scale societies: A cross-cultural multi-methodological approach
This article presents a cross-cultural study of the relationship among the subsistence strategies, the environmental setting and the food sharing practices of 22 modern small-scale societies located in America (n = 18) and Siberia (n = 4). Ecological, geographical and economic variables of these societies were extracted from specialized literature and the publicly available D-PLACE database. The approach proposed comprises a variety of quantitative methods, ranging from exploratory techniques aimed at capturing relationships of any type between variables, to network theory and supervised-learning predictive modelling. Results provided by all techniques consistently show that the differences observed in food sharing practices across the sampled populations cannot be explained just by the differential distribution of ecological, geographical and economic variables. Food sharing has to be interpreted as a more complex cultural phenomenon, whose variation over time and space cannot be ascribed only to local adaptation.
Funding: Spanish Ministry of Science, Innovation and Universities: SimulPast Project (CSD2010-00034 CONSOLIDER-INGENIO 2010) (VA, JC, EB, DZ, MM, JMG); Consolider Excellence Network (HAR2017-90883-REDC) (VA, JC, EB, DZ, MM, JMG); CULM Project (HAR2016-77672-P) (DZ, JC, MM).
A guided analytics tool for feature selection in steel manufacturing with an application to blast furnace top gas efficiency
In knowledge-intensive industries such as steel manufacturing, applying data analytics to optimise process performance requires effective knowledge transfer between domain experts and data scientists. This is often an inefficient path to follow, requiring much iteration whilst being suboptimal with regard to long-term organisational knowledge capture. With the ‘initial Guided Analytics for parameter Testing and controlband Extraction (iGATE)’ tool we created a feature selection framework that finds influential process parameters and their optimal control bands, and which can easily be made available to process operators in the form of a guided analytics tool, while allowing them to modify the analysis according to their expertise. The method is embedded in a workflow whereby the extracted parameters and control bands are verified by the domain expert and a report of the analysis is generated automatically. The approach allows us to combine the power of suitable statistical analysis with process expertise, whilst dramatically reducing the time needed for feature selection. We regard this application as a stepping stone to gain user confidence in advance of the introduction of more autonomous analytics approaches. We present the statistical foundations of iGATE and illustrate its effectiveness in a case study of Tata Steel blast furnace data. We have made the iGATE core functionality freely available in the igate package for the R programming language.
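The abstract describes iGATE only at a high level; the actual tool is the igate R package. A minimal Python analogue of the general idea, under assumed statistical details (ranking parameters by how strongly 'good' and 'bad' production periods differ, then reading a control band off the good periods), might look like this. All names and data here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic process data: 3 candidate parameters and one target quantity
# (e.g. top gas efficiency). Only param 0 actually drives the target here.
n = 500
params = rng.normal(size=(n, 3))
target = 2.0 * params[:, 0] + rng.normal(scale=0.5, size=n)

# Label observations as 'good' or 'bad' periods by target quantile.
good = target >= np.quantile(target, 0.8)
bad = target <= np.quantile(target, 0.2)

def rank_parameters(params, good, bad):
    """Score each parameter by the standardised gap between its mean in
    good and bad periods (a crude stand-in for iGATE's statistical tests)."""
    gap = params[good].mean(axis=0) - params[bad].mean(axis=0)
    return np.abs(gap) / params.std(axis=0)

scores = rank_parameters(params, good, bad)
best = int(np.argmax(scores))

# Control band: the central 5th-95th percentile range of the most
# influential parameter during good periods.
band = np.quantile(params[good, best], [0.05, 0.95])
print("influential parameter:", best, "control band:", band)
```

In the real workflow, the extracted parameter and band would then be shown to the domain expert for verification rather than applied automatically.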
Momentum Effects: Essays on Trading Rule Returns in G10 Currency Pairs
Chapter 1: Momentum Effects: G10 Currency Return Survivals
The chapter analyses momentum effects in G10 currencies. For each of the currency crosses within the G10 universe, the chapter models the “survival” probabilities of trading signals obtained from a wide set of dual crossover moving average combinations. The application of statistical tools from survival time analysis sheds light on market efficiency within the currency market. Empirical momentum signals from shorter-term trading rules outlive their respective benchmark signals, while longer-term moving average crossover signals have a lower life expectancy than theory would suggest. Furthermore, a trading strategy constructed from a subset of short-term moving average signals clearly outperforms a trading strategy generically composed from all moving average crossover signals. This outperformance persists over time.
Chapter 2: Momentum Effects: G10 Currency Return Survivals, Implications for Trading Rules
The chapter models survival probabilities of positive and negative momentum signals that are obtained from a wide set of dual crossover moving average combinations for all G10 cross currency pairs. The results of this survival analysis are used to create trading rule enhancements that aim to outperform generic dual crossover moving average trading signals. The trading rule enhancements are assessed by applying White’s (1999) “data snooper”. The results suggest that there is scope for trading rule enhancements to outperform generic trading rules. Moreover, the results present strong evidence for Lo’s (2004) Adaptive Market Hypothesis.
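White's "data snooper" (Reality Check) guards against the best of many tested rules looking profitable by chance. A simplified version of the idea, using an iid bootstrap instead of White's stationary bootstrap and entirely synthetic returns, can be sketched as follows.

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative daily excess returns of k candidate rules over a benchmark
# rule (columns = rules); in the chapter these would be enhanced-vs-generic
# signal returns.
n, k = 1000, 5
rel_returns = rng.normal(loc=0.0, scale=0.01, size=(n, k))

def reality_check_pvalue(rel, n_boot=2000, rng=rng):
    """Simplified (iid-bootstrap) Reality Check: p-value for H0 that no
    rule beats the benchmark on average."""
    n = len(rel)
    stat = np.sqrt(n) * rel.mean(axis=0).max()
    boot_stats = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)
        # Centre each rule at its sample mean, imposing H0 on the bootstrap.
        boot = rel[idx] - rel.mean(axis=0)
        boot_stats[b] = np.sqrt(n) * boot.mean(axis=0).max()
    return (boot_stats >= stat).mean()

p = reality_check_pvalue(rel_returns)
print("Reality Check p-value:", p)
```

A large p-value, as expected for the pure-noise returns above, means the best rule's performance is consistent with data snooping rather than genuine outperformance.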
Chapter 3: Momentum effects: Dissecting Generic G10 Trading Rule Returns
The chapter builds on the work of Pojarliev and Levich (2008, 2010), who dissect the returns of active currency managers by applying a multiple ordinary least squares (OLS) regression to currency fund returns. Where the chapter differs is in the specification of the dependent variable, which in the context of the present chapter is a set of trading rule parameterisations applied to a broad range of currency pairs. The results of this chapter suggest that there is some alpha embedded in the returns of technical trading rules. The chapter also establishes a comparatively strong positive, statistically significant link among the risk factors Trend, Momentum and Risk Aversion. The results clearly indicate that shorter-term moving averages exhibit less systematic exposure than longer-term moving averages. Other factors such as Carry, Value and Volatility show a considerably less pronounced relationship; only a few factor sensitivities are statistically significant. Moreover, the results also indicate that the systematic risk exposures of trend-following trading strategies change with small adjustments in the design of trading rules.
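The Pojarliev-Levich style decomposition described above amounts to regressing rule returns on risk factors, with the intercept read as alpha. A minimal sketch under assumed factor names and synthetic data:

```python
import numpy as np

rng = np.random.default_rng(4)

# Illustrative monthly returns of one trading-rule parameterisation,
# generated from three factors (Trend, Momentum, Risk Aversion) plus noise.
n = 120
factors = rng.normal(scale=0.02, size=(n, 3))
alpha_true = 0.001
betas_true = np.array([0.6, 0.4, -0.3])
rule_returns = alpha_true + factors @ betas_true + rng.normal(scale=0.005, size=n)

# OLS with an intercept column: the intercept estimate is the rule's 'alpha',
# the slopes are its systematic factor exposures.
X = np.column_stack([np.ones(n), factors])
coef, *_ = np.linalg.lstsq(X, rule_returns, rcond=None)
alpha_hat, betas_hat = coef[0], coef[1:]
print("alpha:", alpha_hat, "factor loadings:", betas_hat)
```

Comparing the estimated loadings across short- and long-window parameterisations is what lets the chapter conclude that shorter-term rules carry less systematic exposure.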
Novel Computationally Intelligent Machine Learning Algorithms for Data Mining and Knowledge Discovery
This thesis addresses three major issues in data mining: feature subset selection in high-dimensional domains, plausible reconstruction of incomplete data in cross-sectional applications, and forecasting univariate time series. For the automated selection of an optimal subset of features in real time, we present an improved hybrid algorithm: SAGA. SAGA combines the ability of Simulated Annealing to avoid being trapped in local minima with the very high convergence rate of the crossover operator of Genetic Algorithms, the strong local search ability of greedy algorithms, and the high computational efficiency of generalized regression neural networks (GRNNs). For imputing missing values and forecasting univariate time series, we propose a homogeneous neural network ensemble. The proposed ensemble consists of a committee of GRNNs trained on different subsets of features generated by SAGA, with the predictions of the base classifiers combined by a fusion rule. This approach makes it possible to discover all important interrelations between the values of the target variable and the input features. The proposed ensemble scheme has two innovative features which make it stand out amongst ensemble learning algorithms: (1) the ensemble makeup is optimized automatically by SAGA; and (2) GRNNs are used both as base classifiers and as the top-level combiner classifier. Because of the GRNN, the proposed ensemble is a dynamic weighting scheme. This contrasts with existing ensemble approaches, which rely on simple voting or static weighting strategies. The basic idea of the dynamic weighting procedure is to give a higher reliability weight to those scenarios that are similar to the new ones. The simulation results demonstrate the validity of the proposed ensemble model.
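The GRNN at the core of this ensemble is essentially a kernel-weighted average of training targets (Specht's formulation), which is what makes the weighting "dynamic": patterns closer to the new query get exponentially larger weights. A minimal sketch, with toy data and a hand-picked bandwidth as assumptions:

```python
import numpy as np

rng = np.random.default_rng(5)

def grnn_predict(X_train, y_train, X_query, sigma=0.5):
    """Generalized regression neural network: a Gaussian-kernel-weighted
    average of training targets, so training patterns similar to the
    query dominate the prediction."""
    # Squared Euclidean distances between queries and training patterns.
    d2 = ((X_query[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    w = np.exp(-d2 / (2 * sigma ** 2))
    # Weighted average of targets, one prediction per query row.
    return (w @ y_train) / w.sum(axis=1)

# Toy regression: recover y = sin(x) from noisy samples.
X_train = rng.uniform(0, np.pi, size=(200, 1))
y_train = np.sin(X_train[:, 0]) + rng.normal(scale=0.05, size=200)
pred = grnn_predict(X_train, y_train, np.array([[np.pi / 2]]), sigma=0.2)
print("prediction at pi/2:", pred[0])
```

In the ensemble described above, each committee member is such a predictor trained on a SAGA-selected feature subset, and a second GRNN combines their outputs.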