Student evaluations of teaching are not only unreliable, they are significantly biased against female instructors
A series of studies across countries and disciplines in higher education confirms that student evaluations of teaching (SET) are significantly correlated with instructor gender, with students regularly rating female instructors lower than their male peers. Anne Boring, Kellie Ottoboni and Philip B. Stark argue the findings warrant serious attention in light of increasing pressure on universities to measure teaching effectiveness. Given the unreliability of the metric and the harmful impact these evaluations can have, universities should think carefully about the role of such evaluations in decision-making.
Classical Nonparametric Hypothesis Tests with Applications in Social Good
Hypothesis testing has come under fire in the past decade as misuses have become increasingly visible. It is common to use tests whose assumptions don't reflect how the data were collected, and editorial policies of many journals reward "p-hacking" by setting the arbitrary threshold of 0.05 to determine whether a result merits publication. In fact, properly designed hypothesis tests are an invaluable tool for inference and decision-making. Classical nonparametric tests, once reserved for problems that could be worked out with pencil and paper or approximated asymptotically, can now be applied to complex datasets with the help of modern computing power. This dissertation tailors some nonparametric tests to modern applications for social good.
Permutation tests are a class of hypothesis tests for data that involve random (or plausibly random) assignment. The parametric assumptions for common tests, like the t-test and linear regression, may not hold for randomized experiments; in contrast, the assumptions of permutation tests are implied by the experimental design. But off-the-shelf permutation tests are not a panacea: tests must be tailored to fit the experimental design, and there are subtle numerical issues with implementing the tests in software. We construct permutation tests and software to address particular questions in randomized and natural experiments, including identifying what, if anything, student evaluations of teaching measure, and whether voting machines malfunctioned in Georgia's November 2018 election.
Risk-limiting post-election audits (RLAs) have existed for a decade, but have not been adopted widely, in part due to logistical hurdles. This thesis uses classical nonparametric techniques, including Fisher's combination method and Wald's sequential probability ratio test, to build new RLA methods that accommodate the idiosyncratic logistics of statewide elections. A new, more flexible method for using stratified samples in RLAs makes it easier and more efficient to audit elections conducted on heterogeneous voting equipment. This thesis also develops an RLA method based on Bernoulli sampling, which allows ballots to be audited "in parallel" across precincts on Election Day. The RLA method for stratified samples of ballots was piloted in Michigan to study its performance in the face of real-world constraints.
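The core idea behind the permutation tests described above can be sketched in a few lines. This is a minimal two-sample illustration, not the dissertation's actual software: the difference-in-means statistic, the function name, and the simulated data are all assumptions chosen for clarity.

```python
import numpy as np

def permutation_test(group_a, group_b, n_perm=10_000, seed=0):
    """Two-sample permutation test for a difference in means.

    Under random assignment, the null hypothesis of "no treatment
    effect" implies the group labels are exchangeable, so the null
    distribution of the statistic is obtained by recomputing it over
    random relabelings of the pooled data.
    """
    rng = np.random.default_rng(seed)
    observed = np.mean(group_a) - np.mean(group_b)
    pooled = np.concatenate([group_a, group_b])
    n_a = len(group_a)
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)
        stat = np.mean(perm[:n_a]) - np.mean(perm[n_a:])
        if abs(stat) >= abs(observed):
            count += 1
    # +1 in numerator and denominator keeps the p-value valid (never zero)
    return (count + 1) / (n_perm + 1)
```

Note that, as the abstract emphasizes, the validity of this test comes from the experimental design itself: no distributional assumptions about the outcomes are needed, only that assignment to groups was (plausibly) random.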
Student Evaluations of Teaching (Mostly) Do Not Measure Teaching Effectiveness
Student evaluations of teaching (SET) are widely used in academic personnel decisions as a measure of teaching effectiveness. We show:
- SET are biased against female instructors by an amount that is large and statistically significant
- the bias affects how students rate even putatively objective aspects of teaching, such as how promptly assignments are graded
- the bias varies by discipline and by student gender, among other things
- it is not possible to adjust for the bias, because it depends on so many factors
- SET are more sensitive to students' gender bias and grade expectations than they are to teaching effectiveness
- gender biases can be large enough to cause more effective instructors to get lower SET than less effective instructors.
These findings are based on nonparametric statistical tests applied to two datasets: 23,001 SET of 379 instructors by 4,423 students in six mandatory first-year courses in a five-year natural experiment at a French university, and 43 SET for four sections of an online course in a randomized, controlled, blind experiment at a US university.
Estimating population average treatment effects from experiments with noncompliance
Randomized controlled trials (RCTs) are the gold standard for estimating causal effects, but often use samples that are non-representative of the actual population of interest. We propose a reweighting method for estimating population average treatment effects in settings with noncompliance. Simulations show the proposed compliance-adjusted population estimator outperforms its unadjusted counterpart when compliance is relatively low and can be predicted by observed covariates. We apply the method to evaluate the effect of Medicaid coverage on health care use for a target population of adults who may benefit from expansions to the Medicaid program. We draw RCT data from the Oregon Health Insurance Experiment, where less than one-third of those randomly selected to receive Medicaid benefits actually enrolled.
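The general shape of a reweighted estimator under noncompliance can be sketched as a weighted Wald (instrumental-variable) ratio. This is only an illustration of the idea, not the authors' proposed estimator: the function name is hypothetical, and the weights `w` stand in for whatever mapping from trial sample to target population is estimated from shared covariates.

```python
import numpy as np

def compliance_adjusted_pate(y, z, d, w):
    """Weighted Wald / IV-style estimator under noncompliance.

    y : observed outcomes
    z : random assignment indicator (1 = assigned to treatment)
    d : treatment actually received (noncompliance means d != z)
    w : weights mapping the trial sample to the target population
        (e.g. from a model of trial participation on covariates)
    """
    y, z, d, w = (np.asarray(a, dtype=float) for a in (y, z, d, w))
    # weighted intent-to-treat effect of assignment on outcomes
    itt_y = (np.average(y[z == 1], weights=w[z == 1])
             - np.average(y[z == 0], weights=w[z == 0]))
    # weighted first stage: effect of assignment on treatment take-up
    itt_d = (np.average(d[z == 1], weights=w[z == 1])
             - np.average(d[z == 0], weights=w[z == 0]))
    # scaling the ITT by the compliance rate recovers the effect
    # among compliers, reweighted toward the target population
    return itt_y / itt_d
```

With uniform weights this reduces to the classic Wald estimator; the abstract's point is that when compliance is low but predictable from covariates, informative weights make the population-level estimate markedly better than the unadjusted one.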