Plot of influenza data.
<p>The full data include observations aggregated to the national level and for ten smaller regions. Here we plot only the data at the national level and in two of the smaller regions; data for the other regions are qualitatively similar. Missing data are indicated with vertical blue bars. The vertical red dashed lines indicate the cutoff time between the training and testing phases; five seasons of data were held out for testing.</p>
Permutation test results for pairwise comparisons of the mean log scores for each method.
<p>For each combination of 3 prediction targets, 11 regions, and 5 test phase seasons, we calculated the mean log score for all predictions made by each method in weeks before the event being predicted occurred. Panel A presents the overall mean of these values for each method; higher mean log scores indicate better performance. Panel B displays the difference in mean log scores for each pair of models. Positive values indicate that the model on the vertical axis outperformed the model on the horizontal axis on average. A permutation test was used to obtain approximate p-values for these differences (see <a href="http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1005910#pcbi.1005910.s001" target="_blank">S1 Text</a> for details). For reference, a Bonferroni correction at a familywise significance level of 0.05 for all pairwise comparisons leads to a significance cutoff of approximately 0.0014.</p>
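The pairwise comparison described above can be sketched as a paired permutation test: under the null hypothesis the two methods are exchangeable, so the signs of the paired score differences are randomly flipped and the mean difference is recomputed. This is an illustrative sketch only (the paper's exact procedure is detailed in S1 Text); the function name and score arrays are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def paired_permutation_pvalue(scores_a, scores_b, n_perm=10_000):
    """Approximate two-sided p-value for the difference in mean log
    scores between two methods evaluated on the same set of
    (target, region, season) combinations.

    scores_a, scores_b: paired mean log scores, one entry per
    combination. Under the null, each paired difference is equally
    likely to have either sign, so we flip signs at random and count
    how often the permuted mean difference is at least as extreme as
    the observed one.
    """
    diffs = np.asarray(scores_a, float) - np.asarray(scores_b, float)
    observed = abs(diffs.mean())
    count = 0
    for _ in range(n_perm):
        signs = rng.choice([-1.0, 1.0], size=diffs.size)
        if abs((signs * diffs).mean()) >= observed:
            count += 1
    return count / n_perm
```

With 165 combinations per comparison, the number of distinct sign patterns is astronomically large, so Monte Carlo sampling of permutations (rather than full enumeration) is the practical choice.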
Density plots summarizing differences in mean log scores for each method relative to the median model.
<p>We calculate the difference in log scores for a given method and the method with median performance for each combination of prediction target, region, and test phase season; each density curve summarizes results across all 165 combinations of 3 prediction targets, 11 regions, and 5 test phase seasons. Positive values indicate better performance than the median model. For legibility, we only show results for the two component models with best mean performance (KCDE and SARIMA) and for the two ensemble models with best mean performance (CW and FW-reg-w).</p>
Summary of ensemble methods and what the model weights depend on.
<p>Summary of ensemble methods and what the model weights depend on.</p>
Example of component model weights from the CW and FW-reg-w models.
<p>Model weights are shown for predictions of onset timing (panel A), peak timing (panel B), and peak incidence (panel C) at the national level. The upper plot within each panel shows mean, minimum, and maximum log scores achieved by each component model for predictions of the given prediction target at the national level in each week of the season, summarizing across all seasons in the training phase when all three component models produced predictions. The lower plot within each panel shows model weights from the CW and FW-reg-w ensemble methods at each week in the season.</p>
Permutation test results for comparisons of the minimum difference in mean log scores relative to the median for each pair of methods.
<p>For each combination of 3 prediction targets, 11 regions, and 5 test phase seasons, we calculated the difference in mean log scores between each method and the method with median performance for that target, region, and season. Panel A presents the minimum difference from the median model for each method across all combinations of target, region, and season. Larger values of this quantity indicate that the given model has better worst-case performance. Panel B displays the difference in this measure of worst-case performance for each pair of models. Positive values indicate that the model on the vertical axis had better worst-case performance than the model on the horizontal axis. A permutation test was used to obtain approximate p-values for these differences (see <a href="http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1005910#pcbi.1005910.s001" target="_blank">S1 Text</a> for details). For reference, a Bonferroni correction at a familywise significance level of 0.05 for all pairwise comparisons leads to a significance cutoff of approximately 0.0014.</p>
Conceptual diagram of how the stacking models operate on probabilistic predictive distributions.
<p>The distributions illustrated here have density bins of 1 wILI unit, which differs from those used in the manuscript, for illustrative purposes only. Panel A shows the predictive distributions from three component models. Panel B shows scaled versions of the distributions from A, after being multiplied by model weights. In Panel C, the scaled distributions are literally stacked to create the final ensemble predictive distribution.</p>
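The scale-and-stack operation shown in Panels B and C amounts to a weighted sum of binned predictive densities. A minimal sketch, using made-up bin probabilities and weights (not values from the paper):

```python
import numpy as np

# Hypothetical binned predictive distributions from three component
# models (Panel A); each row sums to 1 over the wILI bins.
component_probs = np.array([
    [0.1, 0.3, 0.4, 0.2],   # component model 1
    [0.2, 0.2, 0.3, 0.3],   # component model 2
    [0.4, 0.3, 0.2, 0.1],   # component model 3
])

# Hypothetical model weights, summing to 1.
weights = np.array([0.5, 0.3, 0.2])

# Scale each distribution by its weight (Panel B) and stack the
# scaled distributions bin by bin (Panel C). Because the weights sum
# to 1, the ensemble distribution also sums to 1.
ensemble = weights @ component_probs
print(ensemble)  # a valid probability distribution over the bins
```

The result is a mixture distribution; since each component distribution and the weight vector both sum to 1, the stacked ensemble is automatically a valid probability distribution.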