Probabilistic performance estimators for computational chemistry methods: Systematic Improvement Probability and Ranking Probability Matrix. I. Theory
The comparison of benchmark error sets is an essential tool for the
evaluation of theories in computational chemistry. The standard ranking of
methods by their Mean Unsigned Error is unsatisfactory for several reasons
linked to the non-normality of the error distributions and the presence of
underlying trends. Complementary statistics have recently been proposed to
palliate such deficiencies, such as quantiles of the absolute errors
distribution or the mean prediction uncertainty. We introduce here a new score,
the systematic improvement probability (SIP), based on the direct system-wise
comparison of absolute errors. Independently of the chosen scoring rule, the
uncertainty of the statistics due to the incompleteness of the benchmark data
sets is also generally overlooked. However, this uncertainty is essential to
appreciate the robustness of rankings. In the present article, we develop two
indicators based on robust statistics to address this problem: P_{inv}, the
inversion probability between two values of a statistic, and P_{r},
the ranking probability matrix. We also demonstrate the essential contribution
of the correlations between error sets to these score comparisons.
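The SIP score described above can be sketched in a few lines. This is an illustrative estimate, not the authors' code: SIP(B over A) is taken here as the fraction of systems for which method B's absolute error is smaller than method A's, on synthetic error sets.

```python
import numpy as np

# Synthetic error sets for two hypothetical methods A and B
# (illustrative only; real benchmarks would supply these).
rng = np.random.default_rng(0)
err_A = rng.normal(0.5, 1.0, size=1000)
err_B = rng.normal(0.2, 1.0, size=1000)

# SIP: fraction of systems where B's absolute error beats A's,
# i.e. a direct system-wise comparison of absolute errors.
sip_B_over_A = np.mean(np.abs(err_B) < np.abs(err_A))
print(f"SIP(B over A) = {sip_B_over_A:.2f}")
```

Because the comparison is made system by system, correlations between the two error sets directly shape this score, as noted above.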
Probabilistic performance estimators for computational chemistry methods: the empirical cumulative distribution function of absolute errors
Benchmarking studies in computational chemistry use reference datasets to
assess the accuracy of a method through error statistics. The commonly used
error statistics, such as the mean signed and mean unsigned errors, do not
inform end-users on the expected amplitude of prediction errors attached to
these methods. We show that, because the distributions of model errors are
neither normal nor zero-centered, these error statistics cannot be used to
infer prediction error probabilities. To overcome this limitation, we advocate for
the use of more informative statistics, based on the empirical cumulative
distribution function of unsigned errors, namely (1) the probability for a new
calculation to have an absolute error below a chosen threshold, and (2) the
maximal amplitude of errors one can expect with a chosen high confidence level.
Those statistics are also shown to be well suited for benchmarking and ranking
studies. Moreover, the standard error on all benchmarking statistics depends on
the size of the reference dataset. Systematic publication of these standard
errors would be very helpful to assess the statistical reliability of
benchmarking conclusions.
Comment: Supplementary material: https://github.com/ppernot/ECDF
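The two ECDF-based statistics advocated above are simple to compute. A minimal sketch on a synthetic, skewed, non-zero-centered error set (names and thresholds are illustrative, not from the article):

```python
import numpy as np

# Synthetic, non-normal, non-zero-centered model errors.
rng = np.random.default_rng(1)
errors = rng.gamma(2.0, 0.5, size=500) - 0.3

abs_err = np.abs(errors)
eta = 1.0                          # chosen accuracy threshold (illustrative)

# (1) Probability for a new calculation to have |error| below eta.
p_below = np.mean(abs_err < eta)

# (2) Maximal error amplitude expected at a 95% confidence level,
#     i.e. the 95th percentile of the absolute-error ECDF.
q95 = np.quantile(abs_err, 0.95)

print(f"P(|error| < {eta}) = {p_below:.2f}, Q95 = {q95:.2f}")
```

Both quantities read directly off the empirical cumulative distribution function of unsigned errors, which is why they are well suited to ranking studies.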
Can bin-wise scaling improve consistency and adaptivity of prediction uncertainty for machine learning regression?
Binwise Variance Scaling (BVS) has recently been proposed as a post hoc
recalibration method for prediction uncertainties of machine learning
regression problems that is capable of more efficient corrections than uniform
variance (or temperature) scaling. The original version of BVS uses
uncertainty-based binning, which aims to improve calibration conditionally on
uncertainty, i.e. consistency. I explore here several adaptations of BVS, in
particular with alternative loss functions and a binning scheme based on an
input feature (X), in order to improve adaptivity, i.e. calibration conditional
on X. The performances of BVS and its proposed variants are tested on a
benchmark dataset for the prediction of atomization energies and compared to
the results of isotonic regression.
Comment: This version corrects an error in the estimation of the Sx scores for
the test set, affecting Fig. 2 and Tables I-III of the initial version. The
main points of the discussion and the conclusions are unchanged.
Stratification of uncertainties recalibrated by isotonic regression and its impact on calibration error statistics
Post hoc recalibration of prediction uncertainties of machine
learning regression problems by isotonic regression might present a problem for
bin-based calibration error statistics (e.g. ENCE). Isotonic regression often
produces stratified uncertainties, i.e. subsets of uncertainties with identical
numerical values. Partitioning of the resulting data into equal-sized bins
introduces an aleatoric component to the estimation of bin-based calibration
statistics. The partitioning of stratified data into bins depends on the order
of the data, which is typically an uncontrolled property of calibration
test/validation sets. The tie-breaking method of the ordering algorithm used for
binning might also introduce an aleatoric component. I show with an example how
this might significantly affect the calibration diagnostics.
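The order dependence described above is easy to reproduce in a toy example (not from the article): when many uncertainties are tied, a stable sort assigns tied points to equal-sized bins in dataset order, so shuffling the data changes a bin-based statistic such as the ENCE even though the (uncertainty, error) pairs themselves are unchanged.

```python
import numpy as np

def ence(uq, err, n_bins=4):
    """Bin-based calibration statistic: mean relative gap between
    root-mean predicted variance (RMV) and RMSE over equal-sized bins."""
    idx = np.argsort(uq, kind="stable")     # stable sort: ties keep input order
    out = 0.0
    for b in np.array_split(idx, n_bins):
        rmv = np.sqrt(np.mean(uq[b] ** 2))
        rmse = np.sqrt(np.mean(err[b] ** 2))
        out += abs(rmv - rmse) / rmv
    return out / n_bins

# Stratified uncertainties: two large groups of identical values,
# mimicking the output of isotonic regression (illustrative data).
rng = np.random.default_rng(3)
uq = np.repeat([0.5, 1.0], 50)
err = rng.normal(0.0, 1.0, size=100)

e1 = ence(uq, err)
perm = rng.permutation(100)
e2 = ence(uq[perm], err[perm])              # same data, different order
print(f"ENCE = {e1:.3f} vs {e2:.3f} after reordering")
```

Which error values share a bin with which tied uncertainties depends entirely on the (uncontrolled) order of the validation set, hence the aleatoric component in the diagnostic.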
Investigating the performance of shear wave elastography for cardiac stiffness assessment through finite element simulations
- …