Principled forecast evaluation and model diagnostics are vital in fitting probabilistic models and forecasting outcomes of interest. A common principle is that fitted or predicted distributions ought to be calibrated, ideally in the sense that the outcome is indistinguishable from a random draw from the posited distribution. Much of this thesis is centered on calibration properties of various types of forecasts.
In the first part of the thesis, a simple algorithm for exact multinomial goodness-of-fit tests is proposed. The algorithm computes exact p-values based on various test statistics, such as the log-likelihood ratio and Pearson\u27s chi-square. A thorough analysis shows improvement on extant methods. However, the runtime of the algorithm grows exponentially in the number of categories and hence its use is limited.
In the second part, a framework rooted in probability theory is developed, which gives rise to hierarchies of calibration, and applies to both predictive distributions and stand-alone point forecasts. Based on a general notion of conditional T-calibration, the thesis introduces population versions of T-reliability diagrams and revisits a score decomposition into measures of miscalibration, discrimination, and uncertainty. Stable and efficient estimators of T-reliability diagrams and score components arise via nonparametric isotonic regression and the pool-adjacent-violators algorithm. For in-sample model diagnostics, a universal coefficient of determination is introduced that nests and reinterprets the classical R2 in least squares regression.
In the third part, probabilistic top lists are proposed as a novel type of prediction in classification, which bridges the gap between single-class predictions and predictive distributions. The probabilistic top list functional is elicited by strictly consistent evaluation metrics, based on symmetric proper scoring rules, which admit comparison of various types of predictions