    Fun, Not Competition: The Story of My Math Club

    For almost three years, I have spent most of my Sunday afternoons doing math with my daughters and a group of their school friends. Below I detail why and how the math club is run. Unlike my day job, which is full of (statistical) learning objectives for my college students, my math club has only the objective that the kids I work with learn to associate mathematics with having fun. My math club has its challenges, but the motivation comes from a love of mathematics, which makes it fun and worth every minute.

    A method for generating realistic correlation matrices

    Simulating sample correlation matrices is important in many areas of statistics. Approaches such as generating Gaussian data and finding their sample correlation matrix, or generating random uniform [-1,1] deviates as pairwise correlations, both have drawbacks. We develop an algorithm for adding noise, in a highly controlled manner, to general correlation matrices. In many instances, our method yields results which are superior to those obtained by simply simulating Gaussian data. Moreover, we demonstrate how our general algorithm can be tailored to a number of different correlation models. Using our results with a few different applications, we show that simulating correlation matrices can help assess statistical methodology. Comment: Published in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org), http://dx.doi.org/10.1214/13-AOAS638.
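
    For context, the baseline approach named in the abstract, generating Gaussian data and taking its sample correlation matrix, might look like the following minimal sketch (Python with numpy assumed; this is not the noise-injection algorithm developed in the paper):

```python
import numpy as np

def sample_correlation_from_gaussian(true_corr, n_obs, seed=None):
    """Draw n_obs Gaussian vectors with population correlation true_corr
    and return their sample correlation matrix."""
    rng = np.random.default_rng(seed)
    p = true_corr.shape[0]
    data = rng.multivariate_normal(mean=np.zeros(p), cov=true_corr, size=n_obs)
    return np.corrcoef(data, rowvar=False)

# Hypothetical target: a 4x4 compound-symmetry correlation matrix
target = 0.5 * np.ones((4, 4)) + 0.5 * np.eye(4)
print(sample_correlation_from_gaussian(target, n_obs=50, seed=1))
```

    One limitation of this baseline is that the spread around the target matrix is governed by the sample size rather than controlled directly, which is presumably part of what motivates a more controlled noise-injection scheme.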

    Network Analysis with the Enron Email Corpus

    We use the Enron email corpus to study relationships in a network by applying six different measures of centrality. Our results came out of an in-semester undergraduate research seminar. The Enron corpus is well suited to statistical analyses at all levels of undergraduate education. Through this note's focus on centrality, students can explore the dependence of statistical models on initial assumptions and the interplay between centrality measures and hierarchical ranking, and they can use completed studies as springboards for future research. The Enron corpus also presents opportunities for research into many other areas of analysis, including social networks, clustering, and natural language processing. Comment: In Journal of Statistics Education, Volume 23, Number 2, 2015.
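
    As a rough illustration of this kind of centrality analysis, one could build a directed graph from sender-recipient pairs and compute several standard centrality measures with networkx (the library and the toy edge list below are assumptions; the note's six measures and data preparation may differ):

```python
import networkx as nx

# Hypothetical (sender, recipient) email pairs standing in for the Enron corpus
edges = [("alice", "bob"), ("alice", "carol"), ("bob", "carol"),
         ("dave", "alice"), ("carol", "dave")]
G = nx.DiGraph(edges)

# A few standard centrality measures (illustrative, not necessarily the six used in the note)
centralities = {
    "degree": nx.degree_centrality(G),
    "in-degree": nx.in_degree_centrality(G),
    "betweenness": nx.betweenness_centrality(G),
    "closeness": nx.closeness_centrality(G),
}

for name, scores in centralities.items():
    top = max(scores, key=scores.get)
    print(f"{name:>11}: most central node is {top} ({scores[top]:.3f})")
```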

    Microarray Data from a Statistician’s Point of View

    Changes Across 25 Years of Statistics in Medicine

    This piece is a series of interviews with giants in the field of medicine on their views of how statistics is changing medicine. I interviewed the editor of the New England Journal of Medicine, a preeminent doctor/researcher of lung cancer, the director of the LA County Department of Public Health, and a Harvard statistician who sits on the editorial board of the New England Journal of Medicine.

    Prediction Error Estimation in Random Forests

    In this paper, error estimates for classification Random Forests are quantitatively assessed. Building on the initial theoretical framework of Bates et al. (2023), the true error rate and expected error rate are investigated theoretically and empirically across a variety of error estimation methods common to Random Forests. We show that in the classification case, Random Forests' estimate of prediction error is, on average, closer to the true error rate than to the average prediction error. This is the opposite of the findings of Bates et al. (2023), which were given for logistic regression. We further show that this result holds across different error estimation strategies such as cross-validation, bagging, and data splitting. Comment: arXiv admin note: text overlap with arXiv:2104.00673 by other authors.
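
    To make the error estimation strategies concrete, the following sketch compares out-of-bag (bagging-based), cross-validated, and data-splitting error estimates for a classification random forest (scikit-learn and a synthetic data set are assumptions; this illustrates the strategies, not the paper's experimental design):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Out-of-bag error: each tree is scored on the samples left out of its bootstrap
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0).fit(X, y)
oob_error = 1 - rf.oob_score_

# Cross-validation error: average held-out misclassification rate over 5 folds
cv_error = 1 - cross_val_score(
    RandomForestClassifier(n_estimators=200, random_state=0), X, y, cv=5
).mean()

# Data-splitting error: train on 70% of the data, score on the remaining 30%
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
split_error = 1 - RandomForestClassifier(
    n_estimators=200, random_state=0
).fit(X_tr, y_tr).score(X_te, y_te)

print(f"OOB: {oob_error:.3f}  CV: {cv_error:.3f}  split: {split_error:.3f}")
```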

    Differential expression analysis for multiple conditions

    As high-throughput sequencing has become common practice, the cost of sequencing large amounts of genetic data has been drastically reduced, leading to much larger data sets for analysis. One important task is to identify biological conditions that lead to unusually high or low expression of a particular gene. Packages such as DESeq implement a simple method for testing differential signal when exactly two biological conditions are possible. For more than two conditions, pairwise testing is typically used. Here the DESeq method is extended so that three or more biological conditions can be assessed simultaneously. Because the computation time grows exponentially in the number of conditions, a Monte Carlo approach provides a fast way to approximate the p-values for the new test. The approach is studied on both simulated data and a data set of C. jejuni, the bacterium responsible for most food poisoning in the United States.
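
    The Monte Carlo idea can be sketched generically: approximate a p-value by simulating the test statistic under the null hypothesis and counting how often a simulated value is at least as extreme as the observed one. The sketch below is a generic Monte Carlo p-value routine, not the extended DESeq test itself (numpy and the toy null distribution are assumptions):

```python
import numpy as np

def monte_carlo_pvalue(observed_stat, simulate_null_stat, n_sim=10_000, seed=None):
    """Approximate P(T >= observed_stat) under the null by simulation.
    The +1 terms give the standard, slightly conservative Monte Carlo estimate."""
    rng = np.random.default_rng(seed)
    sims = np.array([simulate_null_stat(rng) for _ in range(n_sim)])
    return (1 + np.sum(sims >= observed_stat)) / (n_sim + 1)

# Toy null: the statistic is the largest of three independent chi-square(1) draws,
# loosely mimicking a "most extreme condition" statistic across three conditions
simulate = lambda rng: rng.chisquare(df=1, size=3).max()
print(monte_carlo_pvalue(observed_stat=6.0, simulate_null_stat=simulate, seed=1))
```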

    Integrating computing in the statistics and data science curriculum: Creative structures, novel skills and habits, and ways to teach computational thinking

    Nolan and Temple Lang (2010) argued for the fundamental role of computing in the statistics curriculum. In the intervening decade, the statistics education community has acknowledged that computational skills are as important to statistics and data science practice as mathematics. There remains a notable gap, however, between our intentions and our actions. In this special issue of the Journal of Statistics and Data Science Education we have assembled a collection of papers that (1) suggest creative structures to integrate computing, (2) describe novel data science skills and habits, and (3) propose ways to teach computational thinking. We believe that it is critical for the community to redouble our efforts to embrace sophisticated computing in the statistics and data science curriculum. We hope that these papers provide useful guidance for the community to move these efforts forward. Comment: In press, Journal of Statistics and Data Science Education.