75 research outputs found

    Basel II compliant credit risk modelling: model development for imbalanced credit scoring data sets, loss given default (LGD) and exposure at default (EAD)

The purpose of this thesis is to determine, and to better inform industry practitioners about, the most appropriate classification and regression techniques for modelling the three key credit risk components of the Basel II minimum capital requirement: probability of default (PD), loss given default (LGD), and exposure at default (EAD). The Basel II accord regulates risk and capital management requirements to ensure that a bank holds enough capital in proportion to the risk exposure of its lending practices. Under the advanced internal ratings-based (IRB) approach, Basel II allows banks to develop their own empirical models, based on historical data, for each of PD, LGD and EAD. The first part of this thesis identifies the issue of imbalanced credit scoring data sets, a special case of PD modelling in which the number of defaulting observations is much lower than the number of non-defaulting observations, and analyses the suitability of various classification techniques for this setting. In addition to traditional classification techniques, the thesis explores the suitability of gradient boosting, least squares support vector machines and random forests as classifiers. The second part focuses on the prediction of LGD, which measures the economic loss, expressed as a percentage of the exposure, in case of default; various state-of-the-art regression techniques for modelling LGD are considered. The final part investigates models for predicting EAD. For off-balance-sheet items (for example, credit cards), calculating the EAD requires the committed but unused loan amount multiplied by a credit conversion factor (CCF). Ordinary least squares (OLS), logistic and cumulative logistic regression models are analysed, as well as an OLS model with Beta transformation, with the main aim of finding the most robust and comprehensible model for predicting the CCF. A direct estimation of EAD using an OLS model is also analysed. All the models built and presented in this thesis have been applied to real-life data sets from major global banking institutions.
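    A minimal sketch of the CCF-based EAD calculation mentioned in the abstract, assuming the common formulation in which the drawn balance is increased by the CCF times the undrawn commitment; the function name and the figures are illustrative, not taken from the thesis.

```python
# Illustrative CCF-based EAD estimate for an off-balance-sheet exposure
# such as a credit card line. Names and numbers are hypothetical.

def estimate_ead(drawn_amount: float, credit_limit: float, ccf: float) -> float:
    """EAD = drawn amount plus CCF times the committed but unused amount."""
    undrawn = max(credit_limit - drawn_amount, 0.0)
    return drawn_amount + ccf * undrawn

# Example: a 10,000 limit, 4,000 currently drawn, predicted CCF of 0.6.
print(estimate_ead(drawn_amount=4_000, credit_limit=10_000, ccf=0.6))  # 7600.0
```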

    Examining the Transitional Impact of ICD-10 on Healthcare Fraud Detection

On October 1st, 2015, the tenth revision of the International Classification of Diseases (ICD-10) will become mandatory in the United States. Although this medical classification system will allow healthcare professionals to code with greater accuracy, specificity, and detail, these codes will have a significant impact on the flavor of healthcare insurance claims. While the overall benefit of ICD-10 throughout the healthcare industry is unquestionable, some experts believe healthcare fraud detection and prevention could experience an initial drop in performance due to the implementation of ICD-10. We aim to quantitatively test the validity of this concern regarding an adverse transitional impact. This project explores how predictive fraud detection systems developed using ICD-9 claims data will initially react to the introduction of ICD-10. We have developed a basic fraud detection system incorporating both unsupervised and supervised learning methods in order to examine the potential fraudulence of both ICD-9 and ICD-10 claims in a predictive environment. Using this system, we are able to analyze the ability and performance of statistical methods trained using ICD-9 data to properly identify fraudulent ICD-10 claims. This research makes contributions to the domains of medical coding, healthcare informatics, and fraud detection.
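    To make the transition experiment concrete, here is a hypothetical sketch (not the paper's actual system): a supervised detector trained on bag-of-codes features from ICD-9 claims is evaluated on the same conditions expressed in ICD-10, where the unseen vocabulary degrades performance. The codes, labels and choice of scikit-learn components are all assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.pipeline import make_pipeline

# Toy ICD-9-coded claims (space-separated diagnosis codes) with fraud labels.
train_claims = ["25000 4019", "V5869 78079", "25000 78079", "4019 V5869"]
train_labels = [1, 0, 1, 0]

# Roughly the same conditions expressed in ICD-10: unseen vocabulary at test time.
test_claims = ["E119 I10", "Z7901 R5383", "E119 R5383", "I10 Z7901"]
test_labels = [1, 0, 1, 0]

model = make_pipeline(CountVectorizer(token_pattern=r"\S+"), LogisticRegression())
model.fit(train_claims, train_labels)

# Because the ICD-10 vocabulary was never seen in training, the bag-of-codes
# features collapse and performance degrades -- the transitional effect studied above.
print(roc_auc_score(test_labels, model.predict_proba(test_claims)[:, 1]))
```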

    Realising advanced risk-based port state control inspection using data-driven Bayesian networks

Over the past decades, maritime transportation has not only contributed to economic prosperity but has also posed many threats to the industry, causing severe casualties and losses. As a result, various maritime safety measures have been developed, including Port State Control (PSC) inspections. In this paper, we propose a data-driven Bayesian Network (BN) based approach to analyse the risk factors influencing PSC inspections and to predict the probability of vessel detention. To do so, inspection data on bulk carriers in seven major European countries in the Paris MoU from 2005 to 2008 are collected to identify the relevant risk factors. The network structure is constructed via TAN (Tree-Augmented Naive Bayes) learning and subsequently validated by sensitivity analysis. The results reveal two conclusions. First, the key risk factors influencing PSC inspections include the number of deficiencies, the type of inspection, the Recognised Organisation (RO) and vessel age. Second, the model provides a novel way to predict detention probabilities under different situations, which effectively helps port authorities rationalise their inspection regulations as well as the allocation of their resources. Further effort will be made to conduct a contrastive analysis between the ‘Pre-NIR’ and ‘Post-NIR’ periods to test the impact of the NIR, which started in 2008. © 2018 Elsevier Ltd.
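    As a rough illustration of how inspection records can be turned into detention probabilities, the following pandas sketch computes empirical conditional probabilities over discretised risk factors; it is only a crude stand-in for querying the paper's learned TAN Bayesian network, and the columns, bins and values are assumptions.

```python
import pandas as pd

# Toy inspection records; the columns stand in for the risk factors named above
# (e.g. number of deficiencies and vessel age), with 'detained' as the outcome.
records = pd.DataFrame({
    "deficiencies": ["0-5", "6+", "6+", "0-5", "6+", "0-5"],
    "vessel_age":   ["<15", ">=15", ">=15", "<15", "<15", ">=15"],
    "detained":     [0, 1, 1, 0, 1, 0],
})

# Empirical detention probability conditioned on the discretised risk factors.
cpt = records.groupby(["deficiencies", "vessel_age"])["detained"].mean()
print(cpt)
print("P(detention | 6+ deficiencies, age >= 15):", cpt.loc[("6+", ">=15")])
```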

    Profiling patterns of interhelical associations in membrane proteins.

A novel set of methods has been developed to characterize polytopic membrane proteins at the topological, organellar and functional level, in order to reduce the existing functional gap in the membrane proteome. Firstly, a novel clustering tool named PROCLASS was implemented to facilitate the manual curation of large sets of proteins in readiness for feature extraction. TMLOOP and TMLOOP writer were implemented to refine current topological models by predicting membrane-dipping loops. TMLOOP applies weighted predictive rules in a collective motif method to overcome the inherent limitations of single-motif methods. The approach achieved 92.4% accuracy (sensitivity) and 100% reliability (specificity), and 1,392 topological models described in the Swiss-Prot database were refined. The subcellular location (TMLOCATE) and molecular function (TMFUN) prediction methods rely on the TMDEPTH feature extraction method along with data mining techniques. TMDEPTH uses refined topological models and amino acid sequences to calculate pairs of residues located at a similar depth in the membrane. Evaluation of TMLOCATE showed a normalized accuracy of 75% in discriminating between proteins belonging to the main organelles. At a sequence similarity threshold of 40%, TMFUN predicted the main functional classes with a sensitivity of 64.1-71.4%, and 70% of the olfactory GPCRs were correctly predicted. At a sequence similarity threshold of 90%, the main functional classes were predicted with a sensitivity of 75.6-92.8%, and class A GPCRs were sub-classified with a sensitivity of 84.5-92.9%. These results reflect a direct association between the spatial arrangement of residues in the transmembrane regions and the capacity of polytopic membrane proteins to carry out their functions. The developed methods have for the first time categorically shown that the transmembrane regions hold essential information associated with a wide range of functional properties, such as filtering and gating processes, subcellular location and molecular function.
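    The depth-pairing idea behind TMDEPTH can be illustrated with a small sketch (not the published implementation): residues inside annotated transmembrane segments are assigned a relative depth, and residues from different helices at similar depths are paired. The segment boundaries and the 0.1 depth tolerance are illustrative assumptions.

```python
from itertools import combinations

tm_segments = [(3, 10), (13, 18)]  # toy 1-based (start, end) of two TM helices

def residue_depths(segments):
    """Map residue position -> relative depth in [0, 1] within its TM segment."""
    depths = {}
    for start, end in segments:
        span = end - start
        for pos in range(start, end + 1):
            depths[pos] = (pos - start) / span
    return depths

def same_segment(i, j):
    """True if residues i and j fall within the same TM segment."""
    return any(s <= i <= e and s <= j <= e for s, e in tm_segments)

depths = residue_depths(tm_segments)

# Pair residues from different helices that sit at a similar membrane depth.
pairs = [(i, j) for i, j in combinations(sorted(depths), 2)
         if not same_segment(i, j) and abs(depths[i] - depths[j]) < 0.1]
print(pairs[:5])
```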

    A review of clustering techniques and developments

© 2017 Elsevier B.V. This paper presents a comprehensive study of clustering: existing methods and the developments made in them over time. Clustering is defined as unsupervised learning in which objects are grouped on the basis of some similarity inherent among them. There are different methods for clustering objects, such as hierarchical, partitional, grid-based, density-based and model-based methods. The approaches used in these methods are discussed along with their respective states of the art and applicability. The measures of similarity, as well as the evaluation criteria, which are the central components of clustering, are also presented in the paper. The applications of clustering in fields such as image segmentation, object and character recognition and data mining are highlighted.
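    The method families surveyed in the review can be contrasted on toy data; the sketch below uses scikit-learn implementations as stand-ins (the review itself is not tied to any library), and the parameter values are assumptions.

```python
from sklearn.cluster import AgglomerativeClustering, DBSCAN, KMeans
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Toy data with three well-separated groups.
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

labels = {
    "partitional (k-means)":        KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X),
    "hierarchical (agglomerative)": AgglomerativeClustering(n_clusters=3).fit_predict(X),
    "density-based (DBSCAN)":       DBSCAN(eps=0.8, min_samples=5).fit_predict(X),
    "model-based (GMM)":            GaussianMixture(n_components=3, random_state=0).fit_predict(X),
}
for name, lab in labels.items():
    # DBSCAN marks noise points with -1, so exclude that label when counting clusters.
    print(name, "->", len(set(lab) - {-1}), "clusters")
```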

    On the Use of Speech and Face Information for Identity Verification

This report first provides a review of important concepts in the field of information fusion, followed by a review of important milestones in audio-visual person identification and verification. Several recent adaptive and non-adaptive techniques for reaching the verification decision (i.e., to accept or reject the claimant), based on speech and face information, are then evaluated in clean and noisy audio conditions on a common database; it is shown that in clean conditions most of the non-adaptive approaches provide similar performance and in noisy conditions most exhibit a severe deterioration in performance; it is also shown that current adaptive approaches are either inadequate or utilize restrictive assumptions. A new category of classifiers is then introduced, where the decision boundary is fixed but constructed to take into account how the distributions of opinions are likely to change due to noisy conditions; compared to a previously proposed adaptive approach, the proposed classifiers do not make a direct assumption about the type of noise that causes the mismatch between training and testing conditions. This report is an extended and revised version of IDIAP-RR 02-33.

    Identity Verification Using Speech and Face Information

This article first provides a review of important concepts in the field of information fusion, followed by a review of important milestones in audio–visual person identification and verification. Several recent adaptive and nonadaptive techniques for reaching the verification decision (i.e., to accept or reject the claimant), based on speech and face information, are then evaluated in clean and noisy audio conditions on a common database; it is shown that in clean conditions most of the nonadaptive approaches provide similar performance and in noisy conditions most exhibit a severe deterioration in performance; it is also shown that current adaptive approaches are either inadequate or utilize restrictive assumptions. A new category of classifiers is then introduced, where the decision boundary is fixed but constructed to take into account how the distributions of opinions are likely to change due to noisy conditions; compared to a previously proposed adaptive approach, the proposed classifiers do not make a direct assumption about the type of noise that causes the mismatch between training and testing conditions.
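    A minimal sketch of the kind of score-level fusion discussed in the two preceding abstracts, assuming a simple weighted-sum rule with a fixed decision boundary; the weights, threshold and scores are illustrative and not taken from the article.

```python
# Each expert (speech, face) emits an opinion in [0, 1]; a fixed-boundary fusion
# rule combines them and compares against a threshold to accept or reject the claimant.

def fuse_and_verify(speech_score: float, face_score: float,
                    w_speech: float = 0.5, w_face: float = 0.5,
                    threshold: float = 0.6) -> bool:
    """Weighted-sum fusion with a fixed decision boundary."""
    fused = w_speech * speech_score + w_face * face_score
    return fused >= threshold

# Under clean audio both experts agree; under noisy audio the speech opinion drops,
# which is the mismatch the proposed classifiers are designed to absorb.
print(fuse_and_verify(speech_score=0.9, face_score=0.8))  # True  (accept)
print(fuse_and_verify(speech_score=0.3, face_score=0.8))  # False (reject)
```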

    Context-Based classification of objects in topographic data

Large-scale topographic databases model real-world features as vector data objects. These can be point, line or area features. Each of these map objects is assigned to a descriptive class; for example, an area feature might be classed as a building, a garden or a road. Topographic data is subject to continual updates from cartographic surveys and ongoing quality improvement. One of the most important aspects of this is the assignment and verification of a class description for each area feature. These attributes can be added manually, but, due to the vast volume of data involved, automated techniques are desirable to classify these polygons. Analogy is a key thought process that underpins learning and has been the subject of much research in the field of artificial intelligence (AI). An analogy identifies structural similarity between a well-known source domain and a less familiar target domain. In many cases, information present in the source can then be mapped to the target, yielding a better understanding of the latter. The solution of geometric analogy problems has been a fruitful area of AI research. We observe that there is a correlation between objects in geometric analogy problem domains and map features in topographic data. We describe two topographic area feature classification tools that use descriptions of neighbouring features to identify analogies between polygons: content vector matching (CVM) and context structure matching (CSM). CVM and CSM classify an area feature by matching its neighbourhood context against those of analogous polygons whose class is known. Both classifiers were implemented and then tested on high-quality topographic polygon data supplied by Ordnance Survey (Great Britain). Area features were found to exhibit a high degree of variation in their neighbourhoods. CVM correctly classified 85.38% of the 79.03% of features it attempted to classify. CSM achieved 85.96% accuracy on the 62.96% of features it attempted to classify. Thus, CVM can classify 25.53% more features than CSM, but is slightly less accurate. Both techniques excelled at identifying the feature classes that predominate in suburban data. Our structure-based classification approach may also benefit other types of spatial data, such as topographic line data, small-scale topographic data, raster data, architectural plans and circuit diagrams.
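    A rough sketch of the content-vector-matching idea, assuming a polygon's neighbourhood is summarised by counts of neighbouring feature classes and matched to known polygons by cosine similarity; the classes, contexts and similarity threshold are illustrative and not Ordnance Survey data.

```python
import numpy as np

CLASSES = ["building", "garden", "road"]

def context_vector(neighbour_classes):
    """Count how many neighbours fall into each class."""
    return np.array([neighbour_classes.count(c) for c in CLASSES], dtype=float)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

# Known (already classified) polygons: neighbourhood context -> class of the polygon.
known = [
    (context_vector(["garden", "garden", "road"]), "building"),
    (context_vector(["building", "building", "road", "road"]), "garden"),
    (context_vector(["building", "building", "building"]), "road"),
]

def classify(neighbour_classes, min_similarity=0.8):
    """Return the class of the best-matching known context, or None if no good match."""
    target = context_vector(neighbour_classes)
    best_class, best_sim = None, 0.0
    for vec, cls in known:
        sim = cosine(target, vec)
        if sim > best_sim:
            best_class, best_sim = cls, sim
    return best_class if best_sim >= min_similarity else None

print(classify(["garden", "road", "garden"]))  # matches the "building" context
```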

    Classifier Ensemble Feature Selection for Automatic Fault Diagnosis

    "An efficient ensemble feature selection scheme applied for fault diagnosis is proposed, based on three hypothesis: a. A fault diagnosis system does not need to be restricted to a single feature extraction model, on the contrary, it should use as many feature models as possible, since the extracted features are potentially discriminative and the feature pooling is subsequently reduced with feature selection; b. The feature selection process can be accelerated, without loss of classification performance, combining feature selection methods, in a way that faster and weaker methods reduce the number of potentially non-discriminative features, sending to slower and stronger methods a filtered smaller feature set; c. The optimal feature set for a multi-class problem might be different for each pair of classes. Therefore, the feature selection should be done using an one versus one scheme, even when multi-class classifiers are used. However, since the number of classifiers grows exponentially to the number of the classes, expensive techniques like Error-Correcting Output Codes (ECOC) might have a prohibitive computational cost for large datasets. Thus, a fast one versus one approach must be used to alleviate such a computational demand. These three hypothesis are corroborated by experiments. The main hypothesis of this work is that using these three approaches together is possible to improve significantly the classification performance of a classifier to identify conditions in industrial processes. Experiments have shown such an improvement for the 1-NN classifier in industrial processes used as case study.