17 research outputs found

    Learning algorithm selection for comprehensible regression analysis using datasetoids

    No full text
    Data mining tools often include a workbench of algorithms to model a given dataset, but lack sufficient guidance to select the most accurate algorithm for a certain dataset. The best algorithm is not known in advance and no single model format is superior for all datasets. Evaluating a number of candidate algorithms on large datasets to determine the most accurate model is, however, a computational burden. An alternative and more time-efficient way is to select the optimal algorithm based on the nature of the dataset. This meta-learning study explores to what degree dataset characteristics can help identify which regression/estimation algorithm will best fit a given dataset. We chose to focus on comprehensible 'white-box' techniques in particular (i.e. linear, spline, tree, linear tree or spline tree), as those are of particular interest in many real-life estimation settings. A large-scale experiment with more than a thousand so-called datasetoids representing various real-life dependencies is conducted to discover possible relations. It is found that algorithm-based characteristics such as sampling landmarks are major drivers for successfully selecting the most accurate algorithm. Further, it is found that data-based characteristics, such as the length, dimensionality and composition of the independent variables, or the asymmetry and dispersion of the dependent variable, appear to contribute little once landmarks are included in the meta-model.
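
    A minimal sketch of what such sampling landmarks could look like, assuming a scikit-learn setting: cheap candidate learners are fitted on a small subsample and their cross-validated errors become meta-features. The candidate learners, subsample fraction and error measure are illustrative assumptions, not the study's exact configuration.

```python
# Illustrative sketch of sampling landmarks as meta-features (assumed setup,
# not the exact configuration used in the study).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

def sampling_landmarks(X, y, frac=0.1, seed=0):
    """Fit cheap candidate regressors on a small subsample and return their
    cross-validated errors; these scores act as landmark meta-features."""
    rng = np.random.default_rng(seed)
    n = min(len(X), max(50, int(frac * len(X))))
    idx = rng.choice(len(X), size=n, replace=False)
    Xs, ys = X[idx], y[idx]
    candidates = {
        "linear": LinearRegression(),
        "tree": DecisionTreeRegressor(max_depth=5, random_state=seed),
    }
    landmarks = {}
    for name, model in candidates.items():
        scores = cross_val_score(model, Xs, ys,
                                 scoring="neg_mean_squared_error", cv=3)
        landmarks[f"landmark_mse_{name}"] = -scores.mean()
    return landmarks
```

    These landmark scores, possibly combined with simple data-based characteristics, would then feed a meta-model trained to predict which candidate algorithm is most accurate on the full dataset.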

    Benchmarking regression algorithms for loss given default modeling

    No full text
    The introduction of the Basel II Accord has had a huge impact on financial institutions, allowing them to build credit risk models for three key risk parameters: PD (probability of default), LGD (loss given default) and EAD (exposure at default). Until recently, credit risk research has focused largely on the estimation and validation of the PD parameter, and much less on LGD modeling. In this first large-scale LGD benchmarking study, various regression techniques for modeling and predicting LGD are investigated. These include one-stage models, such as those built by ordinary least squares regression, beta regression, robust regression, ridge regression, regression splines, neural networks, support vector machines and regression trees, as well as two-stage models which combine multiple techniques. A total of 24 techniques are compared using six real-life loss datasets from major international banks. It is found that much of the variance in LGD remains unexplained, as the average prediction performance of the models in terms of R² ranges from 4% to 43%. Nonetheless, there is a clear trend that non-linear techniques, and in particular support vector machines and neural networks, perform significantly better than more traditional linear techniques. Also, two-stage models built by a combination of linear and non-linear techniques are shown to have a similarly good predictive power, with the added advantage of having a comprehensible linear model component.
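
    To illustrate the two-stage idea, the sketch below combines an ordinary least squares stage with a non-linear stage fitted to its residuals, keeping the interpretable linear component while capturing non-linear structure. The choice of support vector regression for the second stage and the clipping of predictions to [0, 1] are assumptions made for the example, not the paper's specific setup.

```python
# Sketch of a two-stage LGD model: interpretable linear stage plus a
# non-linear residual correction (illustrative configuration).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR

class TwoStageLGD:
    def __init__(self):
        self.linear = LinearRegression()           # comprehensible component
        self.nonlinear = SVR(kernel="rbf", C=1.0)  # residual correction

    def fit(self, X, y):
        self.linear.fit(X, y)
        residuals = y - self.linear.predict(X)
        self.nonlinear.fit(X, residuals)
        return self

    def predict(self, X):
        # Linear estimate plus non-linear residual correction, clipped to the
        # [0, 1] range in which LGD is typically expressed.
        pred = self.linear.predict(X) + self.nonlinear.predict(X)
        return np.clip(pred, 0.0, 1.0)
```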

    A proposed framework for backtesting loss given default models

    Get PDF
    The Basel Accords require financial institutions to regularly validate their loss given default (LGD) models. This is crucial to ensure that banks do not misestimate the minimum capital required to protect them against the risks they face through their lending activities. The validation of an LGD model typically includes backtesting, which is the process of evaluating to what degree the internal model estimates still correspond with the realized observations. Reported backtesting examples have typically been limited to simply measuring the similarity between model predictions and realized observations. It is, however, not straightforward to determine acceptable performance based on these measurements alone. Although recent research has led to advanced backtesting methods for PD models, the literature on similar backtesting methods for LGD models is much scarcer. This study addresses this gap by proposing a backtesting framework that uses statistical hypothesis tests to support the validation of LGD models. The proposed statistical hypothesis tests implicitly define reliable reference values to determine acceptable performance and take into account the number of LGD observations, as a small sample may affect the quality of the backtesting procedure. This workbench of tests is applied to an LGD model fitted to real-life data and evaluated through a statistical power analysis.
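
    One of the simplest tests such a framework could contain is sketched below: a one-sample t-test of whether the mean difference between realized and predicted LGD is zero. The actual framework comprises a workbench of several tests; this single test and the 5% significance level are illustrative assumptions only.

```python
# Illustrative backtest: is the mean difference between realized and
# predicted LGD significantly different from zero? (assumed example, not the
# paper's full workbench of tests)
import numpy as np
from scipy import stats

def backtest_mean_error(lgd_realized, lgd_predicted, alpha=0.05):
    """One-sample t-test for H0: E[realized - predicted] = 0. Small samples
    widen the test's uncertainty, so the verdict reflects the number of
    available LGD observations."""
    errors = np.asarray(lgd_realized) - np.asarray(lgd_predicted)
    t_stat, p_value = stats.ttest_1samp(errors, popmean=0.0)
    return {"t": float(t_stat), "p": float(p_value),
            "acceptable": p_value >= alpha}
```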

    Modelling loss given default in P2P lending using random forests

    No full text
    Modelling credit risk in peer-to-peer (P2P) lending is increasingly important due to the rapid growth of P2P platforms’ user bases. To support decision making on granting P2P loans, diverse machine learning methods have been used in P2P credit risk models. However, such models have been limited to loan default prediction, without considering the financial impact of the loans. Loss given default (LGD) is used in modelling consumer credit risk to address this issue. Earlier approaches to modelling LGD in P2P lending tended to use multivariate linear regression methods in order to identify the determinants of P2P loans’ credit risk. Here, we show that these methods are not effective enough to process the complex features present in P2P lending data. We propose a novel decision support system for LGD modelling in P2P lending. To reduce the problem of overfitting, the system uses random forest (RF) learning in two stages. First, extremely risky loans with LGD = 1 are identified using a classification RF. Second, the LGD of the remaining P2P loans is predicted using a regression RF. Thus, the non-normal distribution of the LGD values can be effectively modelled. We demonstrate that the proposed system is effective on the benchmark dataset from the Lending Club P2P platform, outperforming other methods currently used in LGD modelling.
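
    A minimal sketch of the described two-stage random forest approach, assuming a scikit-learn implementation: a classification RF flags loans with total loss (LGD = 1), and a regression RF predicts LGD for the remaining loans. The hyperparameters and the hard classification threshold are illustrative assumptions, not the paper's exact settings.

```python
# Sketch of two-stage random forest LGD modelling for P2P loans
# (illustrative hyperparameters, not the paper's exact settings).
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

class TwoStageRandomForestLGD:
    def __init__(self, n_estimators=500, random_state=0):
        self.clf = RandomForestClassifier(n_estimators=n_estimators,
                                          random_state=random_state)
        self.reg = RandomForestRegressor(n_estimators=n_estimators,
                                         random_state=random_state)

    def fit(self, X, lgd):
        total_loss = (lgd >= 1.0).astype(int)   # stage 1 target: LGD = 1 loans
        self.clf.fit(X, total_loss)
        mask = total_loss == 0                  # stage 2: remaining loans only
        self.reg.fit(X[mask], lgd[mask])
        return self

    def predict(self, X):
        # Loans flagged as total loss get LGD = 1; the rest get the
        # regression forest's prediction, clipped to [0, 1].
        flagged = self.clf.predict(X) == 1
        pred = np.where(flagged, 1.0, self.reg.predict(X))
        return np.clip(pred, 0.0, 1.0)
```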