82 research outputs found

    Logistic Ensemble Models

    Predictive models developed in a regulated industry or for a regulated application, such as determining creditworthiness, must be interpretable and “rational” (e.g., improvements in basic credit behavior must result in improved creditworthiness scores). Machine learning technologies provide very good performance with minimal analyst intervention, so they are well suited to a high-volume analytic environment, but the majority are “black box” tools that provide very limited insight into the key drivers of model performance or predicted output values. This paper presents a methodology that blends one of the most popular predictive statistical modeling methods with a core model enhancement strategy found in machine learning. The resulting prediction methodology provides solid performance with minimal analyst effort while providing the interpretability and rationality required in regulated industries.

    Binary Classification on Past Due of Service Accounts using Logistic Regression and Decision Tree

    This paper aims to predict businesses’ past-due status on service accounts and to determine the variables that affect the likelihood of repayment. Two binary classification approaches, logistic regression and a decision tree, were fitted and compared. Both perform very well in terms of accuracy. However, the decision tree uses only 10 predictors and reaches an accuracy of 96.69% on the validation set, while logistic regression includes 14 predictors and reaches an accuracy of 94.58%. Because false negatives are a major concern in the financial industry, the decision tree is the better option on the given dataset owing to its relatively lower false negative rate. Accuracy, false positives, and false negatives are all important criteria in model selection and evaluation. Decision making should rely more on the research purpose than on the exact values of these criteria.
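The point that accuracy alone can hide differences in false negatives can be illustrated with a minimal sketch. The labels and predictions below are made up for illustration and are not the paper's data; the function names are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = past due)."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1
        elif actual == 0 and predicted == 1:
            fp += 1
        elif actual == 0 and predicted == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def classification_rates(y_true, y_pred):
    """Return accuracy, false negative rate, and false positive rate."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        # Share of actual events the model missed.
        "false_negative_rate": fn / (fn + tp) if (fn + tp) else 0.0,
        # Share of actual non-events flagged as events.
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else 0.0,
    }


# Hypothetical validation labels and two hypothetical models' predictions.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
model_a = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
model_b = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

rates_a = classification_rates(y_true, model_a)
rates_b = classification_rates(y_true, model_b)
```

In this toy case both models have the same accuracy, yet the second misses twice as many actual events, which is why the false negative rate can decide the comparison when missed events are costly.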

    An Analysis of Accuracy using Logistic Regression and Time Series

    This paper analyzes the accuracy rates of logistic regression and time series models. It also examines a relatively new performance index that takes into consideration the business assumptions of credit markets. Although prior research has focused on evaluation metrics such as AUC and the Gini index, this new measure has a more intuitive interpretation for managers and decision makers and can be applied to both logistic and time series models.
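For reference, the two conventional metrics named in the abstract are closely related: for a binary classifier, the Gini index is commonly computed as 2 × AUC − 1. The sketch below computes AUC by pairwise comparison of event and non-event scores. The scores are invented for illustration, and the paper's new performance index is not reproduced here because the abstract does not define it.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen event (label 1) receives
    a higher score than a randomly chosen non-event (label 0); ties count half.
    """
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("need at least one event and one non-event")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def gini(labels, scores):
    """Gini index derived from AUC: 0 for random ranking, 1 for perfect ranking."""
    return 2.0 * auc(labels, scores) - 1.0


# Hypothetical labels and model scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]
example_auc = auc(labels, scores)    # 8 of 9 event/non-event pairs ranked correctly
example_gini = gini(labels, scores)
```

The pairwise form is quadratic in the number of observations, so it suits small illustrations only; a rank-based computation would be preferable on large portfolios.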

    A Comparison of Machine Learning Techniques and Logistic Regression Method for the Prediction of Past-Due Amount

    The aim of this paper is to predict a past-due amount using traditional and machine learning techniques: logistic regression, k-nearest neighbor, and random forest. The dataset, provided by Equifax, contains 305 categories of financial information on more than 11,787,287 unique businesses from 2006 to 2014. The main challenge is handling large, noisy real-world datasets. Among the three techniques, the results show that logistic regression is the best in terms of predictive accuracy and Type I errors.

    Counting the Impossible: Sampling and Modeling to Achieve a Large State Homeless Count

    Objective: Using inferential statistics, we develop estimates of the homeless population of a geographically large and economically diverse state, Georgia. Methods: Multiple independent data sources (2000 U.S. Census, the 2006 Georgia County Guide, Georgia Chamber of Commerce) were used to develop clusters of the 150 Georgia counties. These clusters were then used as strata for stratified sampling. Homeless counts were conducted within the sample counties, allowing multiple regression models to be developed to generate predictions of homeless persons by county. Results: In response to a mandate from the US Department of Housing and Urban Development (HUD), the State of Georgia provided an estimate of its unsheltered homeless population of 12,058 using mathematically validated estimation techniques. Conclusions: Statistical estimation techniques allowed the State of Georgia to meet the HUD mandate while saving the taxpayers of Georgia millions of dollars compared with a complete state homeless census.
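The idea of using strata to scale sample counts up to a statewide figure can be sketched with a basic stratified expansion estimator. This is a simpler design-based stand-in, not the regression approach the abstract describes, and the strata names, county numbers, and counts below are all hypothetical.

```python
def stratified_total(strata):
    """Estimate a population total from per-stratum samples.

    `strata` maps a stratum name to (number of units in the stratum,
    list of counts observed in the sampled units). Each stratum's sample
    mean is scaled by the stratum size and the results are summed.
    """
    total = 0.0
    for name, (population_size, sample_counts) in strata.items():
        if not sample_counts:
            raise ValueError(f"stratum {name!r} has no sampled units")
        sample_mean = sum(sample_counts) / len(sample_counts)
        total += population_size * sample_mean
    return total


# Hypothetical strata: (counties in stratum, counts from sampled counties).
example_strata = {
    "urban": (10, [400, 600]),
    "rural": (100, [10, 20, 30]),
}
estimate = stratified_total(example_strata)  # 10 * 500 + 100 * 20
```

Grouping similar counties into strata keeps the within-stratum variation small, so a few sampled counties per stratum can represent the rest.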

    The Evolution of Data Science: A New Mode of Knowledge Production

    Is data science a new field of study or simply an extension or specialization of a discipline that already exists, such as statistics, computer science, or mathematics? This article explores the evolution of data science as a potentially new academic discipline, one that has emerged in response to new problem sets that established disciplines have been ill-prepared to address. The authors find that this newly evolved discipline can be viewed through the lens of a new mode of knowledge production and is characterized by transdisciplinarity, collaboration with the private sector, and increased accountability. Lessons from this evolution can inform knowledge production in other traditional academic disciplines as well as established knowledge management practices grappling with the emerging challenges of Big Data.

    Influence of the Event Rate on Discrimination Abilities of Bankruptcy Prediction Models

    In bankruptcy prediction, the proportion of events is very low, so events are often oversampled to eliminate this bias. In this paper, we study the influence of the event rate on the discrimination abilities of bankruptcy prediction models. First, the statistical association and significance of public records and firmographics indicators with bankruptcy were explored. Then the event rate was oversampled from 0.12% to 10%, 20%, 30%, 40%, and 50%, respectively. Seven models were developed: Logistic Regression, Decision Tree, Random Forest, Gradient Boosting, Support Vector Machine, Bayesian Network, and Neural Network. Under the different event rates, the models were comprehensively evaluated and compared on the hold-out dataset, at their best probability cut-offs, using the Kolmogorov-Smirnov statistic, accuracy, F1 score, Type I error, Type II error, and ROC curve. Results show that Bayesian Network is the least sensitive to the event rate, while Support Vector Machine is the most sensitive.
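A minimal sketch of raising the event rate to a chosen target is given below. It assumes oversampling means replicating event rows with replacement while keeping every non-event row; the abstract does not say which scheme the authors used, and the data here are synthetic.

```python
import random


def event_rate(rows):
    """Fraction of rows whose label (second tuple element) is 1."""
    return sum(label for _, label in rows) / len(rows)


def oversample_events(rows, target_rate, seed=0):
    """Replicate event rows (with replacement) until they make up `target_rate`.

    Assumption: non-events are kept as-is and only events are duplicated.
    With N non-events, the number of events needed is r * N / (1 - r).
    """
    if not 0 < target_rate < 1:
        raise ValueError("target_rate must be strictly between 0 and 1")
    events = [row for row in rows if row[1] == 1]
    non_events = [row for row in rows if row[1] == 0]
    if not events or not non_events:
        raise ValueError("need at least one event and one non-event")
    needed = round(target_rate * len(non_events) / (1 - target_rate))
    needed = max(needed, len(events))  # never drop original events
    rng = random.Random(seed)
    extra = [rng.choice(events) for _ in range(needed - len(events))]
    resampled = events + extra + non_events
    rng.shuffle(resampled)
    return resampled


# Synthetic training set: 12 events in 10,000 rows (0.12% event rate).
training = [((i,), 1 if i < 12 else 0) for i in range(10_000)]
resampled = oversample_events(training, target_rate=0.10)
```

Oversampling would normally be applied to the training data only, so that a hold-out set like the one mentioned in the abstract still reflects the original event rate.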

    Comparison of Bankruptcy Prediction Models with Public Records and Firmographics

    Many business operations and strategies rely on bankruptcy prediction. In this paper, we study the impact of public records and firmographics and predict bankruptcy over a 12-month-ahead period using different classification models, adding value to the traditionally used financial ratios. Univariate analysis shows the statistical association and significance of public records and firmographics indicators with bankruptcy. Further, seven statistical models and machine learning methods were developed: Logistic Regression, Decision Tree, Random Forest, Gradient Boosting, Support Vector Machine, Bayesian Network, and Neural Network. The performance of the models was evaluated and compared based on classification accuracy, Type I error, Type II error, and ROC curves on the hold-out dataset. Moreover, an experiment was set up to show the importance of oversampling for rare event prediction. The results also show that Bayesian Network is comparatively more robust than the other models without oversampling.

    Application of Isotonic Regression in Predicting Business Risk Scores

    An isotonic regression model fits an isotonic function of the explanatory variables to estimate the expectation of the response variable. In other words, as the function increases, the estimated expectation of the response must be non-decreasing. With this characteristic, isotonic regression could be a suitable option for analyzing and predicting business risk scores. A current challenge of isotonic regression is the decrease in performance when the model is fitted to a large data set, e.g., one with more than four or five dimensions. This paper attempts to apply isotonic regression models to the prediction of business risk scores using a large data set of approximately 50 numeric variables and 24 million observations. Evaluations are based on comparing the new models with a traditional logistic regression model built for the same data set. The primary finding is that isotonic regression using distance aggregate functions does not outperform logistic regression. The performance gap is narrow, however, suggesting that isotonic regression may still be used if necessary, since it may achieve better convergence speed in massive data sets.
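The non-decreasing constraint can be illustrated in one dimension with the pool-adjacent-violators algorithm, a standard way to compute a least-squares isotonic fit. This sketch covers only a single ordered predictor; it is not the multi-dimensional approach with distance aggregate functions that the abstract refers to, and the input values are invented.

```python
def isotonic_fit(y, weights=None):
    """Least-squares non-decreasing fit to `y` via pool adjacent violators.

    `y` must already be ordered by the explanatory variable. Whenever a
    value is lower than the block before it, the two are merged into one
    block holding their weighted mean, which restores monotonicity.
    """
    if weights is None:
        weights = [1.0] * len(y)
    blocks = []  # each block is [weighted mean, total weight, number of points]
    for value, weight in zip(y, weights):
        blocks.append([float(value), float(weight), 1])
        # Merge backwards while the last two blocks violate the ordering.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            mean2, weight2, count2 = blocks.pop()
            mean1, weight1, count1 = blocks.pop()
            total = weight1 + weight2
            merged_mean = (mean1 * weight1 + mean2 * weight2) / total
            blocks.append([merged_mean, total, count1 + count2])
    fitted = []
    for mean, _, count in blocks:
        fitted.extend([mean] * count)
    return fitted


# Hypothetical responses ordered by a single risk driver.
observed = [1, 3, 2, 4, 3, 5]
fitted = isotonic_fit(observed)
# fitted == [1.0, 2.5, 2.5, 3.5, 3.5, 5.0]
```

Each out-of-order pair is replaced by its average, so the fitted values never decrease as the predictor increases.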

    The Validity of Online Patient Ratings of Physicians

    Background: Information from ratings sites is increasingly informing patient decisions related to health care and the selection of physicians. Objective: The current study sought to determine the validity of online patient ratings of physicians through comparison with physician peer review. Methods: We extracted 223,715 reviews of 41,104 physicians from 10 of the largest cities in the United States, including 1142 physicians listed as “America’s Top Doctors” through physician peer review. Differences in mean online patient ratings were tested between physicians who were listed and those who were not. Results: Overall, no differences in online patient ratings were found based on physician peer review status. However, statistically significant differences were found for four specialties (family medicine, allergists, internal medicine, and pediatrics), with online patient ratings significantly higher for physicians listed as a peer-reviewed “Top Doctor” than for those who were not. Conclusions: The results of this large-scale study indicate that while online patient ratings are consistent with physician peer review for four nonsurgical, primarily in-office specializations, patient ratings were not consistent with physician peer review for specializations like anesthesiology. This result indicates that the validity of patient ratings varies by medical specialization.