Failure prediction models: performance, disagreements, and internal rating systems. NBB Working Papers. No. 123, 13 December 2007
We address a number of comparative issues relating to the performance of failure prediction models for small, private firms. We use two models provided by vendors, a model developed by the National Bank of Belgium, and the Altman Z-score model to investigate model power, the extent of disagreement between models in the ranking of firms, and the design of internal rating systems. We also examine the potential gains from combining the output of multiple models. We find that the power of all four models in predicting bankruptcies is very good at the one-year horizon, even though not all of the models were developed using bankruptcy data and the models use different statistical methodologies. Disagreements in firm rankings are nevertheless significant across models, and model choice will have an impact on loan pricing and origination decisions. We find that it is possible to realize important gains from combining models with similar power. In addition, we show that it can also be beneficial to combine a weaker model with a stronger one if disagreements across models with respect to failing firms are high enough. Finally, the number of classes in an internal rating system appears to be more important than the distribution of borrowers across classes.
Robust Classification for Imprecise Environments
In real-world environments it usually is difficult to specify target
operating conditions precisely, for example, target misclassification costs.
This uncertainty makes building robust classification systems problematic. We
show that it is possible to build a hybrid classifier that will perform at
least as well as the best available classifier for any target conditions. In
some cases, the performance of the hybrid actually can surpass that of the best
known classifier. This robust performance extends across a wide variety of
comparison frameworks, including the optimization of metrics such as accuracy,
expected cost, lift, precision, recall, and workforce utilization. The hybrid
also is efficient to build, to store, and to update. The hybrid is based on a
method for the comparison of classifier performance that is robust to imprecise
class distributions and misclassification costs. The ROC convex hull (ROCCH)
method combines techniques from ROC analysis, decision analysis and
computational geometry, and adapts them to the particulars of analyzing learned
classifiers. The method is efficient and incremental, minimizes the management
of classifier performance data, and allows for clear visual comparisons and
sensitivity analyses. Finally, we point to empirical evidence that a robust
hybrid classifier indeed is needed for many real-world problems. Comment: 24 pages, 12 figures. To be published in Machine Learning Journal.
For related papers, see http://www.hpl.hp.com/personal/Tom_Fawcett/ROCCH
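As a rough illustration of the ROC convex hull idea described in this abstract, the sketch below computes the upper convex hull of a set of (false positive rate, true positive rate) points and then picks the hull vertex with the lowest expected cost for one set of operating conditions. The function names, the sample points, and the cost formula are assumptions made for this example; they are not taken from the paper.

```python
def _cross(o, a, b):
    """Z-component of the cross product of vectors o->a and o->b."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def roc_convex_hull(points):
    """Upper convex hull of ROC points (fpr, tpr), with (0,0) and (1,1) added."""
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    hull = []
    for p in pts:
        # Drop the last hull point while it lies on or below the segment
        # joining its predecessor to the new point p.
        while len(hull) >= 2 and _cross(hull[-2], hull[-1], p) >= 0:
            hull.pop()
        hull.append(p)
    return hull


def best_operating_point(hull, p_pos, cost_fn, cost_fp):
    """Hull vertex minimizing expected cost (illustrative cost model)."""
    def expected_cost(pt):
        fpr, tpr = pt
        return p_pos * (1.0 - tpr) * cost_fn + (1.0 - p_pos) * fpr * cost_fp
    return min(hull, key=expected_cost)


# Hypothetical classifiers, each summarized by one (fpr, tpr) point.
classifiers = [(0.2, 0.6), (0.5, 0.7), (0.6, 0.6)]
hull = roc_convex_hull(classifiers)
best = best_operating_point(hull, p_pos=0.5, cost_fn=1.0, cost_fp=1.0)
```

Points that fall below the hull, such as (0.5, 0.7) here, are never optimal under this cost model for any class distribution or cost ratio, which is why only hull vertices need to be kept.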
How to Evaluate the Quality of Unsupervised Anomaly Detection Algorithms?
When sufficient labeled data are available, classical criteria based on
Receiver Operating Characteristic (ROC) or Precision-Recall (PR) curves can be
used to compare the performance of unsupervised anomaly detection algorithms.
However, in many situations, few or no data are labeled. This calls for
alternative criteria one can compute on non-labeled data. In this paper, two
criteria that do not require labels are empirically shown to discriminate
accurately (w.r.t. ROC or PR based criteria) between algorithms. These criteria
are based on existing Excess-Mass (EM) and Mass-Volume (MV) curves, which
generally cannot be well estimated in large dimension. A methodology based on
feature sub-sampling and aggregating is also described and tested, extending
the use of these criteria to high-dimensional datasets and solving major
drawbacks inherent to standard EM and MV curves.
Threshold Choice Methods: the Missing Link
Many performance metrics have been introduced for the evaluation of
classification performance, with different origins and niches of application:
accuracy, macro-accuracy, area under the ROC curve, the ROC convex hull, the
absolute error, and the Brier score (with its decomposition into refinement and
calibration). One way of understanding the relation among some of these metrics
is the use of variable operating conditions (either in the form of
misclassification costs or class proportions). Thus, a metric may correspond to
some expected loss over a range of operating conditions. One dimension for the
analysis has been precisely the distribution we take for this range of
operating conditions, leading to some important connections in the area of
proper scoring rules. However, we show that there is another dimension which
has not received attention in the analysis of performance metrics. This new
dimension is given by the decision rule, which is typically implemented as a
threshold choice method when using scoring models. In this paper, we explore
many old and new threshold choice methods: fixed, score-uniform, score-driven,
rate-driven and optimal, among others. By calculating the loss of these methods
for a uniform range of operating conditions we get the 0-1 loss, the absolute
error, the Brier score (mean squared error), the AUC and the refinement loss
respectively. This provides a comprehensive view of performance metrics as well
as a systematic approach to loss minimisation, namely: take a model, apply
several threshold choice methods consistent with the information which is (and
will be) available about the operating condition, and compare their expected
losses. In order to assist in this procedure we also derive several connections
between the aforementioned performance metrics, and we highlight the role of
calibration in choosing the threshold choice method.
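The correspondence between the score-driven threshold choice method and the Brier score can be checked numerically. The sketch below is a minimal illustration under stated assumptions: scores are estimated probabilities of class 1, the cost proportion c is uniform on [0, 1], the threshold is set equal to c, a false positive costs 2c and a false negative costs 2(1 - c). This normalization is one under which the identity holds exactly; the abstract does not spell out its conventions, so the details here are assumed. The function names and the sample data are hypothetical.

```python
def brier_score(scores, labels):
    """Mean squared error between probability scores and 0/1 labels."""
    return sum((s - y) ** 2 for s, y in zip(scores, labels)) / len(scores)


def score_driven_expected_loss(scores, labels, grid=5000):
    """Expected loss when the threshold equals the cost proportion c,
    averaged over c uniform on [0, 1] with a midpoint rule."""
    total = 0.0
    for k in range(grid):
        c = (k + 0.5) / grid      # cost proportion at the cell midpoint
        threshold = c             # score-driven choice: threshold = c
        loss = 0.0
        for s, y in zip(scores, labels):
            predicted = 1 if s > threshold else 0
            if predicted == 1 and y == 0:
                loss += 2.0 * c           # assumed false positive cost
            elif predicted == 0 and y == 1:
                loss += 2.0 * (1.0 - c)   # assumed false negative cost
        total += loss / len(scores)
    return total / grid


# Hypothetical scores (estimated probability of class 1) and true labels.
scores = [0.1, 0.4, 0.8, 0.9]
labels = [0, 1, 1, 0]
bs = brier_score(scores, labels)
sd = score_driven_expected_loss(scores, labels)
```

For a class 0 example with score s, the loss integrates to s squared, and for a class 1 example to (1 - s) squared, so the two quantities agree up to the integration error.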
ROC-Based Model Estimation for Forecasting Large Changes in Demand
Forecasting large changes in demand should benefit from estimation methods different from those used for estimating mean behavior. We develop a multivariate forecasting model designed for detecting the largest changes across many time series. The model is fit based upon a penalty function that maximizes true positive rates along a relevant false positive rate range and can be used by managers wishing to take action on a small percentage of products likely to change the most in the next time period. We apply the model to a crime dataset and compare results against OLS as the baseline, as well as against models that are promising for exceptional demand forecasting, such as quantile regression, synthetic data from a Bayesian model, and a power loss model. Using the Partial Area Under the Curve (PAUC) metric, our results show statistical significance, a 35 percent improvement over OLS, and at least a 20 percent improvement over competing methods. We suggest that managers with an increasing number of products use our method for forecasting large changes in conjunction with typical magnitude-based methods for forecasting expected demand.
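To make the PAUC metric concrete, the sketch below computes the area under an empirical ROC curve restricted to false positive rates between 0 and a chosen cutoff, using the trapezoidal rule. It is a generic, unnormalized partial area; the abstract does not give the paper's exact definition or its penalty function, so the function names, the cutoff values, and the sample data are assumptions for illustration only.

```python
def roc_points(scores, labels):
    """Empirical ROC curve as (fpr, tpr) points, treating tied scores as one step."""
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    tp = fp = 0
    pts = [(0.0, 0.0)]
    i = 0
    while i < len(pairs):
        j = i
        # Consume every example sharing the current score before emitting a point.
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            if pairs[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        pts.append((fp / n_neg, tp / n_pos))
        i = j
    return pts


def partial_auc(pts, max_fpr):
    """Trapezoidal area under the ROC curve for fpr in [0, max_fpr]."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        if x0 >= max_fpr:
            break
        if x1 > max_fpr:
            # Cut the segment at max_fpr by linear interpolation.
            y1 = y0 + (y1 - y0) * (max_fpr - x0) / (x1 - x0)
            x1 = max_fpr
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical scores for "large change" (1) versus "no large change" (0).
scores = [0.9, 0.8, 0.7, 0.6]
labels = [1, 0, 1, 0]
curve = roc_points(scores, labels)
full_auc = partial_auc(curve, max_fpr=1.0)
pauc = partial_auc(curve, max_fpr=0.25)
```

Restricting the range to low false positive rates reflects the setting in the abstract, where action is taken on only a small percentage of products.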