An extended note on the multibin logarithmic score used in the FluSight competitions
In recent years the Centers for Disease Control and Prevention (CDC) have
organized FluSight influenza forecasting competitions. To evaluate the
participants' forecasts, a multibin logarithmic score has been created, which is
a non-standard variant of the established logarithmic score. Unlike the
original log score, the multibin version is not proper and may thus encourage
dishonest forecasting. We explore the practical consequences this may have,
using forecasts from the 2016/17 FluSight competition for illustration.Comment: This note elaborates on a letter published in PNAS: Bracher, J: On
the multibin logarithmic score used in the FluSight competitions. PNAS first
published September 26, 2019 https://doi.org/10.1073/pnas.191214711
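
To make the difference concrete, the following is a minimal sketch contrasting the standard log score with the multibin variant for a binned forecast. The tolerance of five bins is only an illustrative assumption; the actual window widths are defined per target in the FluSight rules.

```python
import numpy as np

def log_score(probs, obs_bin):
    """Standard logarithmic score: log of the mass assigned to the observed bin."""
    return np.log(probs[obs_bin])

def multibin_log_score(probs, obs_bin, tolerance=5):
    """Multibin variant: log of the total mass within `tolerance` bins of the
    observed bin (a tolerance of 5 bins is an assumption for this example)."""
    lo = max(obs_bin - tolerance, 0)
    hi = min(obs_bin + tolerance + 1, len(probs))
    return np.log(probs[lo:hi].sum())

# Toy example: a forecast spread over 11 bins vs. one concentrated on a single bin.
spread = np.full(11, 1 / 11)
point = np.zeros(11); point[5] = 1.0
for name, p in [("spread", spread), ("point", point)]:
    print(name, log_score(p, obs_bin=5), multibin_log_score(p, obs_bin=5))
```

In this toy case, any forecast that places all of its mass inside the accepted window attains the maximal multibin score of zero regardless of how the mass is distributed, which gives an intuition for why the score may reward forecasts that differ from the forecaster's true belief.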
Evaluating epidemic forecasts in an interval format
For practical reasons, many forecasts of case, hospitalization and death
counts in the context of the current COVID-19 pandemic are issued in the form
of central predictive intervals at various levels. This is also the case for
the forecasts collected in the COVID-19 Forecast Hub
(https://covid19forecasthub.org/). Forecast evaluation metrics like the
logarithmic score, which has been applied in several infectious disease
forecasting challenges, are then not available as they require full predictive
distributions. This article provides an overview of how established methods for
the evaluation of quantile and interval forecasts can be applied to epidemic
forecasts in this format. Specifically, we discuss the computation and
interpretation of the weighted interval score, which is a proper score that
approximates the continuous ranked probability score. It can be interpreted as
a generalization of the absolute error to probabilistic forecasts and allows
for a decomposition into a measure of sharpness and penalties for over- and
underprediction.
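
As a rough illustration of this computation, the sketch below (a minimal Python version under the standard weighting w_k = alpha_k / 2, not the Forecast Hub's reference implementation) scores a single central prediction interval and then combines several intervals plus the predictive median into the weighted interval score:

```python
import numpy as np

def interval_score(lower, upper, y, alpha):
    """Interval score for a central (1 - alpha) prediction interval [lower, upper]:
    interval width (sharpness) plus penalties when the observation falls outside."""
    sharpness = upper - lower
    overprediction = (2 / alpha) * max(lower - y, 0)   # observation below the interval
    underprediction = (2 / alpha) * max(y - upper, 0)  # observation above the interval
    return sharpness + overprediction + underprediction

def weighted_interval_score(median, lowers, uppers, alphas, y):
    """Weighted interval score over K central intervals plus the median,
    with weights w_k = alpha_k / 2 and w_0 = 1/2."""
    K = len(alphas)
    total = 0.5 * abs(y - median)
    for l, u, a in zip(lowers, uppers, alphas):
        total += (a / 2) * interval_score(l, u, y, a)
    return total / (K + 0.5)

# Hypothetical forecast with 50% and 90% central intervals and a median of 100.
print(weighted_interval_score(100, lowers=[90, 70], uppers=[115, 140],
                              alphas=[0.5, 0.1], y=130))
```

The three terms in `interval_score` correspond directly to the decomposition mentioned above: interval width as a measure of sharpness, and separate penalties for over- and underprediction.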
Learning to forecast: The probabilistic time series forecasting challenge
We report on a course project in which students submit weekly probabilistic
forecasts of two weather variables and one financial variable. This real-time
format allows students to engage in practical forecasting, which requires a
diverse set of skills in data science and applied statistics. We describe the
context and aims of the course, and discuss design parameters like the
selection of target variables, the forecast submission process, the evaluation
of forecast performance, and the feedback provided to students. Furthermore, we
describe empirical properties of students' probabilistic forecasts, as well as
some lessons learned on our part.
Scoring epidemiological forecasts on transformed scales
Forecast evaluation is essential for the development of predictive epidemic models and can inform their use for public health decision-making. Common scores to evaluate epidemiological forecasts are the Continuous Ranked Probability Score (CRPS) and the Weighted Interval Score (WIS), which can be seen as measures of the absolute distance between the forecast distribution and the observation. However, applying these scores directly to predicted and observed incidence counts may not be the most appropriate choice, due to the exponential nature of epidemic processes and the varying magnitudes of observed values across space and time. In this paper, we argue that transforming counts before applying scores such as the CRPS or WIS can effectively mitigate these difficulties and yield epidemiologically meaningful and easily interpretable results. Using the CRPS on log-transformed values as an example, we list three attractive properties: Firstly, it can be interpreted as a probabilistic version of a relative error. Secondly, it reflects how well models predicted the time-varying epidemic growth rate. And lastly, using arguments on variance-stabilizing transformations, it can be shown that under the assumption of a quadratic mean-variance relationship, the logarithmic transformation leads to expected CRPS values which are independent of the order of magnitude of the predicted quantity. Applying a transformation of log(x + 1) to data and forecasts from the European COVID-19 Forecast Hub, we find that it changes model rankings regardless of stratification by forecast date, location or target type. Situations in which models missed the beginning of upward swings are more strongly emphasised, while failing to predict a downturn following a peak is less severely penalised when scoring transformed forecasts as opposed to untransformed ones. We conclude that appropriate transformations, of which the natural logarithm is only one particularly attractive option, should be considered when assessing the performance of different models in the context of infectious disease incidence.
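
As a hedged illustration of scoring on a transformed scale, the sketch below estimates the CRPS from predictive samples (using the common sample-based estimator E|X − y| − ½ E|X − X′|) before and after applying log(x + 1). The negative binomial forecast and the observed count are made up for the example and do not come from the Forecast Hub data.

```python
import numpy as np

def crps_samples(samples, y):
    """Sample-based CRPS estimate: E|X - y| - 0.5 * E|X - X'|."""
    samples = np.asarray(samples, dtype=float)
    term1 = np.mean(np.abs(samples - y))
    term2 = 0.5 * np.mean(np.abs(samples[:, None] - samples[None, :]))
    return term1 - term2

def crps_log_scale(samples, y):
    """Score on the log(x + 1) scale, so that zero counts are handled."""
    return crps_samples(np.log1p(samples), np.log1p(y))

# Hypothetical predictive samples for a weekly incidence count (mean around 2000).
rng = np.random.default_rng(1)
forecast = rng.negative_binomial(n=10, p=10 / (10 + 2000), size=2000)
obs = 3200
print(crps_samples(forecast, obs), crps_log_scale(forecast, obs))
```

The untransformed score scales with the absolute size of the miss, while the log-scale score reflects the relative (multiplicative) error, in line with the interpretation given above.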
Human judgement forecasting of COVID-19 in the UK
Background
In the past, two studies found ensembles of human judgement forecasts of COVID-19 to show predictive performance comparable to ensembles of computational models, at least when predicting case incidences. We present a follow-up to a study conducted in Germany and Poland and investigate a novel joint approach to combine human judgement and epidemiological modelling.
Methods
From May 24th to August 16th 2021, we elicited weekly one- to four-week-ahead forecasts of cases and deaths from COVID-19 in the UK from a crowd of human forecasters. A median ensemble of all forecasts was submitted to the European Forecast Hub. Participants could use two distinct interfaces: in one, forecasters submitted a predictive distribution directly; in the other, they instead submitted a forecast of the effective reproduction number Rt, which was then used to forecast cases and deaths using simulation methods from the EpiNow2 R package. Forecasts were scored using the weighted interval score on the original forecasts, as well as after applying the natural logarithm to both forecasts and observations.
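
A median ensemble can be formed quantile by quantile; the sketch below illustrates this construction on made-up quantile forecasts from three hypothetical forecasters. The quantile-wise combination is an assumption here, since the text above only states that a median ensemble of all forecasts was submitted.

```python
import numpy as np

# Hypothetical quantile forecasts from three forecasters for a single target:
# rows are forecasters, columns correspond to the quantile levels below.
quantile_levels = [0.05, 0.25, 0.5, 0.75, 0.95]
forecasts = np.array([
    [ 800, 1000, 1200, 1500, 2100],
    [ 600,  900, 1100, 1300, 1800],
    [1000, 1300, 1600, 2000, 2800],
])

# Median ensemble: take the median across forecasters separately at each quantile level.
ensemble = np.median(forecasts, axis=0)
print(dict(zip(quantile_levels, ensemble)))
```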
Results
The ensemble of human forecasters overall performed comparably to the official European Forecast Hub ensemble on both cases and deaths, although results were sensitive to changes in details of the evaluation. Rt forecasts performed comparably to direct forecasts on cases, but worse on deaths. Self-identified “experts” tended to be better calibrated than “non-experts” for cases, but not for deaths.
Conclusions
Human judgement forecasts and computational models can produce forecasts of similar quality for infectious diseases such as COVID-19. The results of forecast evaluations can change depending on which metrics are chosen, and judgement on what does or doesn't constitute a "good" forecast depends on the forecast consumer. Combinations of human and computational forecasts hold potential but present real-world challenges that need to be solved.