Infectious disease modelling and forecasting have garnered broad interest throughout the COVID-19 pandemic. Accurate forecasts of the trajectory of the pandemic can inform public policy and public health interventions. Forecast evaluation plays a crucial role in this: forecasts are only useful if they are accurate, and evaluating the performance of different forecasting approaches provides information about their trustworthiness as well as about how to improve them. This thesis makes contributions in two areas related to forecasting and forecast evaluation in an epidemiological context. Firstly, it advances the tools available for evaluating forecasts of infectious diseases, as well as our theoretical understanding of how to do so. Secondly, it investigates the relative performance and interplay of human judgement and mathematical modelling in the context of short-term forecasts of COVID-19.

With respect to forecast evaluation, the first contribution of this thesis is scoringutils, an R package that facilitates the evaluation process. The package provides a coherent framework for forecast evaluation in R and implements a selection of scoring rules, helper functions, and visualisations. In particular, it supports forecasts in the quantile-based format that has recently been used by the COVID-19 Forecast Hubs in the US, in Europe, and in Germany and Poland.

The second contribution to the field of forecast evaluation is a novel approach to evaluating forecasts in an epidemiological context. Scores such as the continuous ranked probability score (CRPS) and the weighted interval score (WIS), which are common in epidemiology, generalise the absolute error to predictive distributions. However, judging predictive performance by the absolute distance between forecast and observation neglects the exponential nature of infectious disease processes. Transforming forecasts and observations with the natural logarithm before applying the CRPS or WIS may be more appropriate in an epidemiological context. The resulting score can be understood as a probabilistic version of the relative error: it measures predictive performance in terms of the exponential growth rate, and the log transformation acts as a variance-stabilising transformation if the underlying disease process has a quadratic mean-variance relationship. This thesis motivates the idea of transforming forecasts before evaluating them and illustrates the behaviour of the resulting scores using data from the European COVID-19 Forecast Hub. Log-transforming forecasts before scoring changed the rankings of forecasters and yielded scores that were more evenly distributed across time and space.

With respect to the role of human judgement in infectious disease forecasting, this thesis contributes two studies that analyse and compare the predictive performance of human judgement forecasts and model-based predictions. It starts from the understanding that computational modelling, which has been the predominant way of producing infectious disease forecasts, itself represents a synthesis of epidemiological and mathematical assumptions with the expertise and judgement of the researchers who fine-tune the models. A better understanding of the interplay between human judgement and mathematical modelling, and of the trade-offs between the two, may help make future forecasting efforts more efficient and improve predictive accuracy.
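As a minimal illustration of the transformation idea described above, the following R sketch computes the WIS of a single quantile forecast on the natural scale and after a log(x + 1) transformation. All numbers are purely illustrative, and the WIS is written here as the average quantile (pinball) score across the submitted quantile levels, which is equivalent to the interval-based definition with the standard weights; the scoringutils package provides this kind of scoring, and in recent versions the transformations, for full sets of forecasts.

# Minimal R sketch: WIS of one quantile forecast, on the natural scale and
# after a log(x + 1) transformation. All numbers are purely illustrative.

# Pinball (quantile) loss for quantile level tau
quantile_score <- function(observed, predicted, tau) {
  2 * ((observed <= predicted) - tau) * (predicted - observed)
}

# WIS written as the average quantile score across the submitted quantile levels
wis <- function(observed, predicted, tau) {
  mean(quantile_score(observed, predicted, tau))
}

tau       <- c(0.25, 0.5, 0.75)     # illustrative quantile levels
predicted <- c(8000, 10000, 13000)  # illustrative quantile forecast of weekly cases
observed  <- 15000                  # illustrative observed value

wis(observed, predicted, tau)                    # natural scale: absolute-error-like, in cases
wis(log(observed + 1), log(predicted + 1), tau)  # log scale: relative-error-like

On the natural scale the score is expressed in case counts and is therefore dominated by targets with large counts; after the log transformation the same forecast is judged in multiplicative terms.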
This thesis uses the newly developed forecast evaluation tools to investigate the interplay between human judgement and mathematical modelling in the context of infectious disease forecasting, specifically of COVID-19. In a first study, forecasts were elicited from researchers and laypersons and compared against predictions from a minimally tuned mathematical model, as well as against an ensemble of computational models submitted to the German and Polish COVID-19 Forecast Hub. Human judgement forecasts generally performed on par with the ensemble of computational models, doing slightly better when predicting cases and slightly worse when predicting deaths. Adding more forecasts to the ensemble was generally advantageous, even when the added model performed worse than the existing ensemble. A second study replicated the basic set-up and compared human judgement forecasts of COVID-19 in the UK, elicited as part of a public “UK Crowd Forecasting Challenge”, against the ensemble of all forecasts submitted to the European COVID-19 Forecast Hub. Again, human judgement forecasts performed broadly on par with the ensemble. We did not find a strong difference in predictive performance between self-selected “experts” and “non-experts”. Results should generally be interpreted with caution, given small sample sizes and their sensitivity to choices made in the evaluation process. We also explored a novel way of combining human judgement and mathematical modelling by asking forecasters to predict the effective reproduction number Rt, which was then mapped to case and death numbers using an epidemiological model. Due to various limitations, the initial performance of this approach was worse than that of direct human forecasts. Nevertheless, approaches that combine human judgement and mathematical modelling are promising, as they could reduce the cognitive burden on forecasters and increase accuracy.
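The mapping from predicted Rt values to case numbers in that study relied on a semi-mechanistic epidemiological model; the toy R sketch below only illustrates the underlying renewal-equation idea, with an assumed generation-interval distribution and made-up seeding cases, and is not the model used in the study.

# Toy sketch: turning a forecast Rt trajectory into expected case counts via the
# renewal equation I_t = R_t * sum_s w_s * I_{t-s}. The generation-interval
# weights and seeding cases are assumptions for illustration only.
map_rt_to_cases <- function(rt, seed_cases, gen_interval) {
  n_seed <- length(seed_cases)
  cases <- c(seed_cases, numeric(length(rt)))
  for (t in seq_along(rt)) {
    i <- n_seed + t
    infectiousness <- sum(gen_interval * cases[i - seq_along(gen_interval)])
    cases[i] <- rt[t] * infectiousness
  }
  cases[-seq_len(n_seed)]
}

gen_interval <- c(0.4, 0.35, 0.25)    # assumed weights summing to 1
seed_cases   <- c(9000, 9500, 10000)  # assumed recently observed cases
rt_forecast  <- c(1.1, 1.1, 1.05, 1.0)
map_rt_to_cases(rt_forecast, seed_cases, gen_interval)

In a set-up like this, forecasters only need to reason about a single, slowly changing quantity, while the model translates it into counts, which is what is hoped to reduce the cognitive burden.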