4 research outputs found

    Análise automática de crédito: desempenho da mineração de dados de uma árvore de decisão (Automatic credit analysis: data mining performance of a decision tree)

    This study presents a data mining analysis of a decision tree used for automatic credit approval or denial at a financial institution. It sets out to validate the effectiveness of automatic credit analysis by checking the accuracy of the variables used in the statistical model underlying that analysis, with the aim of securing efficiency gains and lower delinquency for the institution. To that end, the study examines the historical performance of clients according to the variables defined in the current model, in order to assess the model's effectiveness and identify the default cases. It first presents the automatic decision methodology and the context of the financial institution that uses it, discussing the main financial products and the funding sources of its credit operations. The institution's database is then fitted with a generalized linear model with a probit link function to find the variables most significant for the occurrence of default. Based on the regression results, the study proposes adjustments to the decision tree, in order to confirm the efficiency gains of automatic analysis over manual analysis.
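
As a rough illustration of the probit link mentioned in this abstract, the sketch below maps a linear predictor to a default probability through the standard normal CDF. The function name, the two predictors, and the coefficient values are hypothetical and chosen only for illustration; none of them come from the study.

```python
import math


def probit_default_probability(intercept, coefficients, features):
    """Return Phi(intercept + sum(coef_i * x_i)), where Phi is the standard normal CDF."""
    linear_predictor = intercept + sum(
        b * x for b, x in zip(coefficients, features)
    )
    # Standard normal CDF expressed through the error function.
    return 0.5 * (1.0 + math.erf(linear_predictor / math.sqrt(2.0)))


# Hypothetical coefficients for two made-up predictors (not from the study).
example_intercept = -1.0
example_coefficients = [0.8, -0.5]

low_risk = probit_default_probability(example_intercept, example_coefficients, [0.0, 1.0])
high_risk = probit_default_probability(example_intercept, example_coefficients, [2.0, 0.0])
```

In an actual analysis the coefficients would be estimated from the institution's data, and their significance levels would indicate which variables matter most for default.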

    The Impact of Overfitting and Overgeneralization on the Classification Accuracy in Data Mining

    Current classification approaches usually do not try to achieve a balance between fitting and generalization when they infer models from training data. Such approaches ignore the possibility of different penalty costs for the false-positive, false-negative, and unclassifiable types. Thus, their performance may not be optimal or may even be coincidental. This dissertation analyzes these issues in depth. It also proposes two new approaches, the Homogeneity-Based Algorithm (HBA) and the Convexity-Based Algorithm (CBA), to address them. These approaches aim at optimally balancing the data fitting and generalization behaviors of models when traditional classification approaches are used. The approaches first define the total misclassification cost (TC) as a weighted function of the three penalty costs and their corresponding error rates. They then partition the training data into regions. In the HBA, the partitioning is done according to homogeneity properties derivable from the training data, while the CBA uses convexity properties to derive regions. A traditional classification method is then used in conjunction with the HBA and CBA. Finally, the approaches apply a genetic approach to determine the optimal levels of fitting and generalization, with the TC serving as the fitness function. Real-life datasets from a wide spectrum of domains were used to assess the effectiveness of the HBA and CBA. The computational results indicate that both the HBA and CBA might fill a critical gap in the implementation of current or future classification approaches. The results also show that when the penalty cost of an error type was changed, the corresponding error rate followed stepwise patterns. This finding can assist researchers in determining applicable penalties for classification errors. The dissertation therefore also proposes a binary search approach (BSA) to produce those patterns, and real-life datasets were used to demonstrate the BSA.
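
The total misclassification cost described in this abstract can be sketched as a weighted sum of the three error rates. The linear form, the function names, the use of `None` to mark an unclassifiable prediction, and the normalization of each rate by the total number of samples are all assumptions made for illustration; the abstract only states that TC is a weighted function of the penalty costs and error rates.

```python
def error_rates(y_true, y_pred):
    """Return (false-positive, false-negative, unclassifiable) rates.

    y_pred entries are 1, 0, or None; None marks an unclassifiable sample
    (an illustrative convention, not taken from the dissertation).
    """
    n = len(y_true)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == 1 and t == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if p == 0 and t == 1)
    uc = sum(1 for p in y_pred if p is None)
    return fp / n, fn / n, uc / n


def total_misclassification_cost(costs, rates):
    """Weighted sum of error rates; costs = (c_fp, c_fn, c_uc)."""
    return sum(c * r for c, r in zip(costs, rates))


# Made-up labels and penalty costs, for illustration only.
rates = error_rates([1, 0, 1, 0, 1], [1, 1, 0, None, 1])
tc = total_misclassification_cost((1.0, 5.0, 2.0), rates)
```

A function of this shape could serve as the fitness function of a genetic search, with lower TC meaning a better balance of fitting and generalization.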

    Automated Machine Learning: Intelligent Binning Data Preparation and Regularized Regression Classifier

    Automated machine learning (AutoML) is a growing trend: the automation of the complete pipeline from raw dataset to trained machine learning model. It not only relieves data scientists' workload but also allows non-experts to complete modeling tasks without a solid understanding of statistical inference and machine learning. One limitation of AutoML frameworks is that data quality can differ significantly from batch to batch. Consequently, the fitted model can be very poor for some batches because of distribution shift in some numerical predictors. In this dissertation, we develop an intelligent binning method to resolve this problem. In addition, various regularized regression classifiers (RRCs), including Ridge, Lasso, and Elastic Net regression, were tested to further enhance model performance after binning. We focus on the binary classification problem and have developed an AutoML framework in Python that handles the entire data preparation process, including data partitioning and intelligent binning. The system has been tested extensively on simulations and real datasets, and the results show that (1) all models perform better with intelligent binning on both balanced and imbalanced binary classification problems; (2) regression-based methods are more sensitive to intelligent binning than tree-based methods, and RRCs can outperform tree methods when intelligent binning is used; (3) weighted RRCs obtain the best results among the methods compared; and (4) the framework is an effective and reliable tool for conducting AutoML.
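
The abstract does not describe how its intelligent binning works. As a generic stand-in, the sketch below shows plain equal-frequency (quantile) binning, one common way to make a numerical predictor less sensitive to batch-to-batch shift: cut points are learned once on training data and reused on later batches, so out-of-range values fall into the end bins. The function names and data are hypothetical, and this is not the dissertation's method.

```python
import bisect


def fit_quantile_edges(values, n_bins):
    """Learn interior cut points that split the values into roughly equal-frequency bins."""
    ordered = sorted(values)
    n = len(ordered)
    edges = []
    for k in range(1, n_bins):
        idx = min(n - 1, (k * n) // n_bins)
        edges.append(ordered[idx])
    # Remove duplicate cut points that arise when many values are tied.
    return sorted(set(edges))


def assign_bin(value, edges):
    """Map a value to a bin index in 0..len(edges); extremes land in the end bins."""
    return bisect.bisect_right(edges, value)


# Illustrative training batch and a shifted value from a later batch.
train_values = list(range(1, 101))
edges = fit_quantile_edges(train_values, 4)
shifted_bin = assign_bin(1000, edges)
```

The resulting bin indices could then be one-hot encoded and passed to a regularized regression classifier in place of the raw numerical predictor.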