Fun, Not Competition: The Story of My Math Club
For almost three years, I have spent most of my Sunday afternoons doing math with my daughters and a group of their school friends. Below I detail why and how the math club is run. Unlike my day job, which is full of (statistical) learning objectives for my college students, my math club has only one objective: that the kids I work with learn to associate mathematics with having fun. My math club has its challenges, but the motivation comes from love of mathematics, which makes it fun and worth every minute.
A method for generating realistic correlation matrices
Simulating sample correlation matrices is important in many areas of
statistics. Approaches such as generating Gaussian data and finding their
sample correlation matrix or generating random uniform deviates as
pairwise correlations both have drawbacks. We develop an algorithm for adding
noise, in a highly controlled manner, to general correlation matrices. In many
instances, our method yields results which are superior to those obtained by
simply simulating Gaussian data. Moreover, we demonstrate how our general
algorithm can be tailored to a number of different correlation models. Using
our results with a few different applications, we show that simulating
correlation matrices can help assess statistical methodology.
Comment: Published at http://dx.doi.org/10.1214/13-AOAS638 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org
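The abstract above names a baseline approach: generate Gaussian data and take its sample correlation matrix. The following is a minimal sketch of that baseline only, not of the authors' noise-adding algorithm. The target matrix, the sample size, and the helper names `cholesky` and `sample_correlation` are all hypothetical choices made for illustration.

```python
import math
import random

def cholesky(a):
    """Return lower-triangular L with L L^T = a (a must be symmetric positive definite)."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low

def sample_correlation(target, n_obs, rng):
    """Draw n_obs Gaussian vectors whose population correlation is `target`,
    then return the sample correlation matrix of the draws."""
    p = len(target)
    low = cholesky(target)
    data = []
    for _ in range(n_obs):
        z = [rng.gauss(0.0, 1.0) for _ in range(p)]
        # Multiply the independent normals by L to induce the target correlation.
        data.append([sum(low[i][k] * z[k] for k in range(i + 1)) for i in range(p)])
    means = [sum(row[i] for row in data) / n_obs for i in range(p)]
    centered = [[row[i] - means[i] for i in range(p)] for row in data]
    cov = [[sum(r[i] * r[j] for r in centered) for j in range(p)] for i in range(p)]
    return [[cov[i][j] / math.sqrt(cov[i][i] * cov[j][j]) for j in range(p)]
            for i in range(p)]

# Hypothetical 3x3 target correlation matrix (assumed for illustration).
target = [[1.0, 0.6, 0.3],
          [0.6, 1.0, 0.5],
          [0.3, 0.5, 1.0]]
noisy = sample_correlation(target, n_obs=50, rng=random.Random(0))
```

With a small `n_obs`, the entries of `noisy` scatter around the target values. In this baseline the amount of scatter is governed only by the sample size, which is presumably one of the drawbacks the abstract alludes to.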
Network Analysis with the Enron Email Corpus
We use the Enron email corpus to study relationships in a network by applying
six different measures of centrality. Our results came out of an in-semester
undergraduate research seminar. The Enron corpus is well suited to statistical
analyses at all levels of undergraduate education. Through this note's focus on
centrality, students can explore the dependence of statistical models on
initial assumptions and the interplay between centrality measures and
hierarchical ranking, and they can use completed studies as springboards for
future research. The Enron corpus also presents opportunities for research into
many other areas of analysis, including social networks, clustering, and
natural language processing.
Comment: in Journal of Statistics Education, Volume 23, Number 2, 201
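To make the idea of a centrality measure concrete, here is a minimal sketch of two common ones, degree and closeness, on a tiny undirected network. The abstract does not say which six measures the paper uses, so these two are an assumption. The names and edges are made up and are not taken from the Enron corpus.

```python
from collections import deque

# Hypothetical "who emailed whom" pairs, treated as undirected edges.
emails = [("ann", "bob"), ("ann", "cat"), ("ann", "dan"),
          ("bob", "cat"), ("dan", "eve")]

def build_graph(edges):
    """Build an adjacency map {node: set of neighbours} from undirected edges."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph

def degree_centrality(graph):
    """Fraction of the other nodes that each node is directly connected to."""
    n = len(graph)
    return {v: len(nbrs) / (n - 1) for v, nbrs in graph.items()}

def closeness_centrality(graph):
    """Reachable-node count divided by the sum of shortest-path lengths.
    Distances come from breadth-first search; assumes a connected graph."""
    result = {}
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            for w in graph[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        total = sum(dist.values())
        result[source] = (len(dist) - 1) / total if total else 0.0
    return result

graph = build_graph(emails)
degree = degree_centrality(graph)
closeness = closeness_centrality(graph)
```

In this toy network both measures rank `ann` first. They can disagree on the remaining nodes, which is the kind of dependence on the chosen measure the abstract describes.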
Changes Across 25 Years of Statistics in Medicine
This piece is a series of interviews with giants in the field of medicine on their views of how statistics is changing medicine. I interviewed the editor of the New England Journal of Medicine, a preeminent doctor/researcher of lung cancer, the director of the LA County Department of Public Health, and a Harvard statistician who sits on the editorial board of the New England Journal of Medicine.
Prediction Error Estimation in Random Forests
In this paper, error estimates of classification Random Forests are
quantitatively assessed. Based on the initial theoretical framework built by
Bates et al. (2023), the true error rate and expected error rate are
theoretically and empirically investigated in the context of a variety of error
estimation methods common to Random Forests. We show that in the classification
case, Random Forests' estimates of prediction error are closer on average to the
true error rate than to the average prediction error. This is the opposite of the
findings of Bates et al. (2023), which were given for logistic regression. We
further show that this result holds across different error estimation
strategies such as cross-validation, bagging, and data splitting.
Comment: arXiv admin note: text overlap with arXiv:2104.00673 by other author
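The two targets contrasted in the abstract can be written out as follows. The notation is assumed for illustration and is not taken from the paper: $D_n$ is the training set, $\hat{f}_{D_n}$ the forest fitted to it, and $(X_0, Y_0)$ an independent test point.

```latex
% Assumed notation: D_n = training set, \hat{f}_{D_n} = classifier fitted on D_n,
% (X_0, Y_0) = a new observation drawn independently of D_n.
\[
  \mathrm{Err}_{D_n}
    = \mathbb{E}\bigl[\mathbf{1}\{\hat{f}_{D_n}(X_0) \neq Y_0\} \,\big|\, D_n\bigr],
  \qquad
  \mathrm{Err}
    = \mathbb{E}_{D_n}\bigl[\mathrm{Err}_{D_n}\bigr].
\]
```

Under this reading, $\mathrm{Err}_{D_n}$ is the "true error rate" of the one model actually fitted, and $\mathrm{Err}$ is the average prediction error over all training sets of the same size.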
Differential expression analysis for multiple conditions
As high-throughput sequencing has become common practice, the cost of
sequencing large amounts of genetic data has been drastically reduced, leading
to much larger data sets for analysis. One important task is to identify
biological conditions that lead to unusually high or low expression of a
particular gene. Packages such as DESeq implement a simple method for testing
differential signal when exactly two biological conditions are possible. For
more than two conditions, pairwise testing is typically used. Here the DESeq
method is extended so that three or more biological conditions can be assessed
simultaneously. Because the computation time grows exponentially in the number
of conditions, a Monte Carlo approach provides a fast way to approximate the
p-values for the new test. The approach is studied on both simulated data and
a data set of C. jejuni, the bacterium responsible for most food poisoning
in the United States.
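The general idea of approximating a p-value by Monte Carlo can be sketched as below. This is a generic permutation version with a simple between-group statistic, not the DESeq-based test from the paper. The counts, the condition labels, and both function names are hypothetical.

```python
import random

def between_group_stat(values, labels):
    """Sum over groups of n_g * (group mean - grand mean)^2."""
    grand = sum(values) / len(values)
    groups = {}
    for v, g in zip(values, labels):
        groups.setdefault(g, []).append(v)
    return sum(len(vs) * (sum(vs) / len(vs) - grand) ** 2 for vs in groups.values())

def monte_carlo_p_value(values, labels, n_draws=2000, seed=0):
    """Approximate the permutation p-value by random relabelling
    instead of enumerating every possible assignment of conditions."""
    rng = random.Random(seed)
    observed = between_group_stat(values, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_draws):
        rng.shuffle(shuffled)
        if between_group_stat(values, shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_draws + 1)

# Hypothetical expression counts for one gene under three conditions.
counts = [12, 15, 11, 14, 13, 16, 48, 52, 45]
conditions = ["A", "A", "A", "B", "B", "B", "C", "C", "C"]
p = monte_carlo_p_value(counts, conditions)
```

The cost is set by `n_draws` rather than by the number of possible label assignments, which is why sampling stays feasible when exact enumeration does not.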
Integrating computing in the statistics and data science curriculum: Creative structures, novel skills and habits, and ways to teach computational thinking
Nolan and Temple Lang (2010) argued for the fundamental role of computing in
the statistics curriculum. In the intervening decade the statistics education
community has acknowledged that computational skills are as important to
statistics and data science practice as mathematics. There remains a notable
gap, however, between our intentions and our actions. In this special issue of
the *Journal of Statistics and Data Science Education* we have assembled a
collection of papers that (1) suggest creative structures to integrate
computing, (2) describe novel data science skills and habits, and (3) propose
ways to teach computational thinking. We believe that it is critical for the
community to redouble our efforts to embrace sophisticated computing in the
statistics and data science curriculum. We hope that these papers provide
useful guidance for the community to move these efforts forward.
Comment: In press, Journal of Statistics and Data Science Educatio