
    In Memoriam: Elart von Collani


    Evaluating Probabilistic Classifiers: The Triptych

    Probability forecasts for binary outcomes, often referred to as probabilistic classifiers or confidence scores, are ubiquitous in science and society, and methods for evaluating and comparing them are in great demand. We propose and study a triptych of diagnostic graphics that focus on distinct and complementary aspects of forecast performance: the reliability diagram addresses calibration, the receiver operating characteristic (ROC) curve diagnoses discrimination ability, and the Murphy diagram visualizes overall predictive performance and value. A Murphy curve shows a forecast's mean elementary scores, including the widely used misclassification rate, and the area under a Murphy curve equals the mean Brier score. For a calibrated forecast, the reliability curve lies on the diagonal, and for competing calibrated forecasts, the ROC and Murphy curves share the same number of crossing points. We invoke the recently developed CORP (Consistent, Optimally binned, Reproducible, and Pool-Adjacent-Violators (PAV) algorithm based) approach to craft reliability diagrams and decompose a mean score into miscalibration (MCB), discrimination (DSC), and uncertainty (UNC) components. Plots of the DSC measure of discrimination ability versus the calibration metric MCB visualize classifier performance across multiple competitors. The proposed tools are illustrated in empirical examples from astrophysics, economics, and social science.
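
    The score decomposition at the heart of CORP is easy to sketch: recalibrate the forecasts with the pool-adjacent-violators algorithm and compare mean scores before and after. Below is a minimal illustration for the Brier score, using scikit-learn's IsotonicRegression as the PAV step; this is a sketch of the idea, not the authors' reference implementation.

```python
# A minimal sketch of the CORP score decomposition for the Brier score,
# using scikit-learn's IsotonicRegression as the PAV recalibration step
# (an assumption; this is not the authors' reference implementation).
import numpy as np
from sklearn.isotonic import IsotonicRegression

def corp_decomposition(p, y):
    """Split the mean Brier score of forecasts p for outcomes y into
    miscalibration (MCB), discrimination (DSC) and uncertainty (UNC)."""
    p, y = np.asarray(p, float), np.asarray(y, float)
    s = np.mean((p - y) ** 2)                   # mean Brier score
    # PAV recalibration: isotonic fit of outcomes on forecast values;
    # plotting p against p_hat gives the CORP reliability diagram
    p_hat = IsotonicRegression(y_min=0.0, y_max=1.0).fit(p, y).predict(p)
    s_cal = np.mean((p_hat - y) ** 2)           # score after recalibration
    s_ref = np.mean((y.mean() - y) ** 2)        # constant reference forecast
    mcb, dsc, unc = s - s_cal, s_ref - s_cal, s_ref
    assert np.isclose(s, mcb - dsc + unc)       # S = MCB - DSC + UNC
    return mcb, dsc, unc

rng = np.random.default_rng(1)
p = rng.uniform(size=1000)
y = (rng.uniform(size=1000) < p ** 2).astype(float)  # miscalibrated forecasts
print(corp_decomposition(p, y))
```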

    Estimating the Rate-Distortion Function by Wasserstein Gradient Descent

    In the theory of lossy compression, the rate-distortion (R-D) function R(D) describes how much a data source can be compressed (in bit-rate) at any given level of fidelity (distortion). Obtaining R(D) for a given data source establishes the fundamental performance limit for all compression algorithms. We propose a new method to estimate R(D) from the perspective of optimal transport. Unlike the classic Blahut–Arimoto algorithm, which fixes the support of the reproduction distribution in advance, our Wasserstein gradient descent algorithm learns the support of the optimal reproduction distribution by moving particles. We prove its local convergence and analyze the sample complexity of our R-D estimator based on a connection to entropic optimal transport. Experimentally, we obtain comparable or tighter bounds than state-of-the-art neural network methods on low-rate sources while requiring considerably less tuning and computational effort. We also highlight a connection to maximum-likelihood deconvolution and introduce a new class of sources that can be used as test cases with known solutions to the R-D problem. (Accepted as a conference paper at NeurIPS 2023.)
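
    The contrast with Blahut–Arimoto is concrete enough to sketch: instead of fixing candidate reproduction points in advance, the particles themselves follow the (Wasserstein) gradient of the rate functional. The toy sketch below does this for a 1-D Gaussian source with squared-error distortion, where the true curve R(D) = 0.5 log(sigma^2/D) is known; the step size, particle count, and Lagrange multiplier lam are arbitrary choices, and this illustrates the particle idea rather than the paper's implementation.

```python
# A toy sketch of the particle idea, not the paper's implementation:
# estimate a point on the rate-distortion curve of a 1-D Gaussian source
# (true curve R(D) = 0.5*log(sigma^2/D) nats) by moving reproduction
# particles along the Wasserstein gradient of the Lagrangian
# F(nu) = E_x[-log E_{y~nu} exp(-lam*d(x,y))].
import numpy as np

rng = np.random.default_rng(0)
m, n, lam, eta, steps = 2000, 64, 4.0, 0.02, 1000
x = rng.normal(size=m)                 # source samples, sigma = 1
y = rng.normal(size=n)                 # reproduction particles, uniform weights

for _ in range(steps):
    d = (x[:, None] - y[None, :]) ** 2          # squared-error distortion
    w = np.exp(-lam * d)
    w /= w.sum(axis=1, keepdims=True)           # posterior q(y_j | x_i)
    # each particle moves against the gradient of the first variation:
    # v_j = -lam * E_x[ n * q(y_j|x) * d/dy d(x, y_j) ]
    grad = lam * (w * 2.0 * (y[None, :] - x[:, None])).mean(axis=0)
    y -= eta * n * grad

d = (x[:, None] - y[None, :]) ** 2
z = np.exp(-lam * d).mean(axis=1)               # Z(x) = E_{y~nu} exp(-lam*d)
q = np.exp(-lam * d) / (n * z[:, None])
D = (q * d).sum(axis=1).mean()                  # expected distortion
R = -np.log(z).mean() - lam * D                 # rate bound: F(nu) - lam*D
print(f"D = {D:.3f}, R = {R:.3f} nats, true R(D) = {0.5 * np.log(1 / D):.3f}")
```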

    Statistical and Computational Aspects of Learning with Complex Structure

    The recent explosion of routinely collected data has led scientists to contemplate more and more sophisticated structural assumptions. Understanding how to harness and exploit such structure is key to improving the prediction accuracy of various statistical procedures. The ultimate goal of this line of research is to develop a set of tools that leverage underlying complex structures to pool information across observations and ultimately improve both the statistical accuracy and the computational efficiency of the deployed methods. The workshop focused on recent developments in regression and matrix estimation under various complex constraints, such as physical, computational, privacy, sparsity, or robustness constraints. Optimal-transport-based techniques for geometric data analysis were also a main topic of the workshop.

    The Impact of an Instructional Intervention Designed to Support Development of Stochastic Understanding of Probability Distribution

    Stochastic understanding of probability distribution undergirds development of conceptual connections between probability and statistics and supports development of a principled understanding of statistical inference. This study investigated the impact of an instructional course intervention designed to support development of stochastic understanding of probability distribution. Instructional supports consisted of supplemental lab assignments comprising anticipatory tasks designed to engage students in coordinating thinking about complementary probabilistic and statistical notions. These tasks utilized dynamic software simulations to elicit stochastic conceptions and to support development of conceptual connections between empirical distributions and theoretical probability distribution models along a hypothetical learning trajectory undergirding stochastic understanding of probability distribution. The study employed a treatment-control design, using a mix of quantitative and qualitative research methods to examine students' understanding after a one-semester course. Participants were 184 undergraduate students enrolled in a lecture/recitation, calculus-based, introductory probability and statistics course who completed lab assignments addressing either calculus review (control) or stochastic conceptions of probability distribution (treatment). Data sources consisted of a student background survey, a conceptual assessment, ARTIST assessment items, and final course examinations. Student interviews provided insight into the nature of students' reasoning and facilitated examination of the validity of the stochastic conceptual assessment. Logistic regression analysis revealed that completion of the supplemental assignments designed to undergird development of stochastic conceptions had a statistically significant impact on students' understanding of probability distribution. Students who held stochastic conceptions demonstrated integrated reasoning related to probability, variability, and distribution and presented images that support a principled understanding of statistical inference.
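
    For concreteness, the kind of analysis described, a logistic regression of a binary conceptual outcome on a treatment indicator, can be sketched as follows on simulated data; the covariate, coefficients, and outcome coding are hypothetical, not the study's.

```python
# A minimal sketch of the logistic regression described above, on simulated
# data; the covariate, coefficients, and outcome coding are hypothetical,
# not the study's. The binary outcome is whether a student demonstrates a
# stochastic conception of probability distribution.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 184                                         # the study's sample size
treatment = rng.integers(0, 2, size=n)          # 1 = stochastic-conception labs
prior_gpa = rng.normal(3.0, 0.4, size=n)        # hypothetical covariate
logit = -2.0 + 1.2 * treatment + 0.5 * (prior_gpa - 3.0)
outcome = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))

X = sm.add_constant(np.column_stack([treatment, prior_gpa]))
fit = sm.Logit(outcome, X).fit(disp=0)
print(fit.summary(xname=["const", "treatment", "prior_gpa"]))
```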

    CWI Self-evaluation 1999-2004


    Modelling of Viral Disease Risk

    Covid-19 has had a significant impact on daily life since the initial outbreak of the global pandemic in late 2019. Countries have been affected to varying degrees, depending on government actions and country characteristics such as infrastructure and demographics. Using Norway and Germany as a case study, this thesis aims to determine which factors influence the risk of infection in each country, using Bayesian modelling and a non-Bayesian machine learning approach. Specifically, the relationship between infection rates and demographic and infrastructural characteristics in a municipality at a fixed point in time is investigated, and the effectiveness of a Bayesian model in this context is compared with that of a machine learning algorithm. In addition, temporal modelling is used to assess the usefulness of government interventions, the impact of changes in mobility behaviour, and the prevalence of different strains of Covid-19 in relation to infection numbers. The results show that a spatial model is more useful than a machine learning model in this context. For Germany, the log-transformed trade tax in a municipality, the share of the vote for the right-wing AfD party, and the population density are found to have a positive influence on infection numbers. For Norway, the number of immigrants in a municipality, the number of unemployed immigrants in a municipality, and population density are found to have a positive association with infection rates, while the proportion of women in a municipality is negatively associated with infection rates. The temporal models identify higher workplace mobility as a factor significantly influencing the risk of infection in both Germany and Norway.
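
    As a toy illustration of the Bayesian-versus-machine-learning comparison described above, one can contrast a Bayesian linear regression with a random forest on simulated municipality-level covariates; this sketch omits the thesis's spatial structure, and all covariate names and effects are hypothetical.

```python
# A toy contrast of a Bayesian regression with a non-Bayesian machine-learning
# model, in the spirit of the comparison above. This omits the thesis's
# spatial structure, and all covariate names and effects are hypothetical.
import numpy as np
from sklearn.linear_model import BayesianRidge
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 400                                         # hypothetical municipalities
X = np.column_stack([
    rng.normal(size=n),                         # log population density
    rng.normal(size=n),                         # share of immigrants
    rng.normal(size=n),                         # workplace mobility index
])
log_rate = 0.8 * X[:, 0] + 0.3 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(0, 0.5, n)

for model in (BayesianRidge(), RandomForestRegressor(random_state=0)):
    r2 = cross_val_score(model, X, log_rate, cv=5).mean()
    print(f"{type(model).__name__}: mean CV R^2 = {r2:.2f}")

# the Bayesian model also yields coefficient estimates, which is what makes
# covariate effects like those reported above interpretable in sign and size
print("coefficients:", np.round(BayesianRidge().fit(X, log_rate).coef_, 2))
```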