Lasso adjustments of treatment effect estimates in randomized experiments
We provide a principled way for investigators to analyze randomized
experiments when the number of covariates is large. Investigators often use
linear multivariate regression to analyze randomized experiments instead of
simply reporting the difference of means between treatment and control groups.
Their aim is to reduce the variance of the estimated treatment effect by
adjusting for covariates. If there are a large number of covariates relative to
the number of observations, regression may perform poorly because of
overfitting. In such cases, the Lasso may be helpful. We study the resulting
Lasso-based treatment effect estimator under the Neyman-Rubin model of
randomized experiments. We present theoretical conditions that guarantee that
the estimator is more efficient than the simple difference-of-means estimator,
and we provide a conservative estimator of the asymptotic variance, which can
yield tighter confidence intervals than those based on the difference-of-means estimator.
Simulation and data examples show that Lasso-based adjustment can be
advantageous even when the number of covariates is less than the number of
observations. Specifically, a variant using Lasso for selection and OLS for
estimation performs particularly well, and it chooses a smoothing parameter
based on the combined performance of Lasso and OLS.
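The "Lasso for selection, OLS for estimation" idea can be sketched numerically. The following is a minimal illustration on simulated data, not the paper's exact procedure: `LassoCV`'s cross-validated penalty stands in for the combined Lasso/OLS tuning rule, and the data-generating process is assumed for the example.

```python
import numpy as np
from sklearn.linear_model import LassoCV, LinearRegression

rng = np.random.default_rng(0)

# Simulated randomized experiment: n units, p covariates, binary treatment T.
n, p = 200, 50
X = rng.normal(size=(n, p))
T = rng.binomial(1, 0.5, size=n)
y = 2.0 * T + X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=n)  # true effect = 2

# Simple difference-of-means estimator.
tau_dm = y[T == 1].mean() - y[T == 0].mean()

# Lasso for selection: keep the covariates with nonzero coefficients ...
lasso = LassoCV(cv=5).fit(X, y)
selected = np.flatnonzero(lasso.coef_)

# ... then OLS for estimation: regress y on treatment plus the selected
# covariates; the coefficient on T is the adjusted effect estimate.
Z = np.column_stack([T, X[:, selected]]) if selected.size else T[:, None]
tau_adj = LinearRegression().fit(Z, y).coef_[0]
```

When the selected covariates explain outcome variation, the adjusted estimate typically has a smaller standard error than the raw difference of means.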
Seasonal changes in the concentrations of dissolved oxygen in the lakes of the “Bory Tucholskie” National Park
The article presents the results of examinations of the vertical distribution of dissolved oxygen (DO) at the deepest points of the lakes, conducted at various times in the years 2003-2005 and earlier. The authors draw particular attention to severe
oxygen deficits at the deepest points of both the deep and the shallow lakes, even though the lakes are not strongly exposed to anthropogenic pressure. They also point out the similarity of the oxygen profiles in the same lakes and seasons in consecutive years, as well as the differences between particular lakes. They also found a correlation between the mean DO concentration in
the vertical profile and the duration of the ice-cover period (R² = 0.78).
DP-TBART: A Transformer-based Autoregressive Model for Differentially Private Tabular Data Generation
The generation of synthetic tabular data that preserves differential privacy
is a problem of growing importance. While traditional marginal-based methods
have achieved impressive results, recent work has shown that deep
learning-based approaches tend to lag behind. In this work, we present
Differentially-Private TaBular AutoRegressive Transformer (DP-TBART), a
transformer-based autoregressive model that maintains differential privacy and
achieves performance competitive with marginal-based methods on a wide variety
of datasets, capable of even outperforming state-of-the-art methods in certain
settings. We also provide a theoretical framework for understanding the
limitations of marginal-based approaches and where deep learning-based
approaches stand to contribute most. These results suggest that deep
learning-based techniques should be considered as a viable alternative to
marginal-based methods in the generation of differentially private synthetic
tabular data.
Data reliability in citizen science: learning curve and the effects of training method, volunteer background and experience on identification accuracy of insects visiting ivy flowers
• Citizen science, the involvement of volunteers in the collection of scientific data, can be a useful research tool. However, data collected by volunteers are often of lower quality than those collected by professional scientists.
• We studied the accuracy with which volunteers identified insects visiting ivy (Hedera) flowers in Sussex, England. In the first experiment, we examined the effects of training method, volunteer background and prior experience. Fifty-three participants were trained for the same duration using one of three different methods (pamphlet, pamphlet + slide show, pamphlet + direct training). Almost immediately following training, we tested the ability of participants to identify live insects on ivy flowers to one of 10 taxonomic categories and recorded whether their identifications were correct or incorrect, without providing feedback.
• The results showed that the type of training method had a significant effect on identification accuracy (P = 0.008). Participants identified 79.1% of insects correctly after using a one-page colour pamphlet, 85.6% correctly after using the pamphlet and viewing a slide show, and 94.3% correctly after using the pamphlet in combination with direct training in the field.
• As direct training cannot be delivered remotely, in the following year we conducted a second experiment, in which a different sample of 26 volunteers received the pamphlet plus slide show training repeatedly three times. Moreover, in this experiment participants received c. 2 minutes of additional training material, either videos of insects or stills taken from the videos. Testing showed that identification accuracy increased from 88.6% to 91.3% to 97.5% across the three successive tests. We also found a borderline significant interaction between the type of additional material and the test number (P = 0.053), such that the video gave fewer errors than stills in the first two tests only.
• The most common errors made by volunteers were confusing honey bees and social wasps with their hover fly mimics. We also tested six experts, who achieved nearly perfect accuracy (99.8%), which shows what is possible in practice.
• Overall, our study shows that two or three sessions of remote training can be as good as one session of direct training, even for relatively challenging taxonomic discriminations that include distinguishing models and mimics.
On mitigating the analytical limitations of finely stratified experiments
Although attractive from a theoretical perspective, finely stratified experiments such as paired designs suffer from certain analytical limitations that are not present in block-randomized experiments with multiple treated and control individuals in each block. In short, when using a weighted difference in means to estimate the sample average treatment effect, the traditional variance estimator in a paired experiment is conservative unless the pairwise average treatment effects are constant across pairs; however, in more coarsely stratified experiments, the corresponding variance estimator is unbiased if treatment effects are constant within blocks, even if they vary across blocks. Using insights from classical least squares theory, we present an improved variance estimator that is appropriate in finely stratified experiments. The variance estimator remains conservative in expectation but is asymptotically no more conservative than the classical estimator and can be considerably less conservative. The magnitude of the improvement depends on the extent to which effect heterogeneity can be explained by observed covariates. Aided by this estimator, a new test for the null hypothesis of a constant treatment effect is proposed. These findings extend to some, but not all, superpopulation models, depending on whether the covariates are viewed as fixed across samples.
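The contrast between the classical pair-difference variance estimator and a covariate-adjusted one can be sketched numerically. This is an illustrative stand-in on simulated data, not the paper's exact estimator: the adjustment simply regresses the pair differences on a pair-level covariate and uses the residual variance, showing how explained effect heterogeneity shrinks the variance estimate.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative paired experiment: K pairs, with a pair-level covariate x
# driving treatment-effect heterogeneity (assumed data-generating process).
K = 500
x = rng.normal(size=K)
tau_k = 1.0 + 0.8 * x                       # pair-level effects vary with x
D = tau_k + rng.normal(scale=0.5, size=K)   # observed within-pair differences

# Point estimate: mean of the pair differences.
tau_hat = D.mean()

# Classical variance estimator: sample variance of pair differences over K.
# Conservative unless tau_k is constant across pairs.
v_classical = D.var(ddof=1) / K

# Covariate-adjusted sketch: regress D on x and use the residual variance,
# removing the share of effect heterogeneity explained by x.
X = np.column_stack([np.ones(K), x])
beta, *_ = np.linalg.lstsq(X, D, rcond=None)
resid = D - X @ beta
v_adjusted = resid @ resid / (K - 2) / K
```

Here `v_adjusted` is markedly smaller than `v_classical` because most of the variation in the pair differences is explained by `x`; both remain conservative for the sample average treatment effect.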
Rerandomization and Regression Adjustment
Randomization is a basis for the statistical inference of treatment effects
without strong assumptions on the outcome-generating process. Appropriately
using covariates further yields more precise estimators in randomized
experiments. R. A. Fisher suggested blocking on discrete covariates in the
design stage or conducting analysis of covariance (ANCOVA) in the analysis
stage. We can embed blocking into a wider class of experimental design called
rerandomization, and extend the classical ANCOVA to more general regression
adjustment. Rerandomization trumps complete randomization in the design stage,
and regression adjustment trumps the simple difference-in-means estimator in
the analysis stage. It is then intuitive to use both rerandomization and
regression adjustment. Under the randomization-inference framework, we
establish a unified theory allowing the designer and analyzer to have access to
different sets of covariates. We find that asymptotically (a) for any given
estimator with or without regression adjustment, rerandomization never hurts
either the sampling precision or the estimated precision, and (b) for any given
design with or without rerandomization, our regression-adjusted estimator never
hurts the estimated precision. Therefore, combining rerandomization and
regression adjustment yields better coverage properties and thus improves
statistical inference. To theoretically quantify these statements, we discuss
optimal regression-adjusted estimators in terms of the sampling precision and
the estimated precision, and then measure the additional gains of the designer
and the analyzer. We finally suggest using rerandomization in the design and
regression adjustment in the analysis followed by the Huber--White robust
standard error.
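The suggested pipeline of rerandomization in the design stage followed by regression adjustment in the analysis stage can be sketched as follows. This is a minimal simulation under assumed data: the Mahalanobis acceptance threshold is arbitrary, and the analysis uses a Lin-style adjustment (treatment, centred covariates, and their interactions) as one concrete regression-adjusted estimator.

```python
import numpy as np

rng = np.random.default_rng(2)

# Covariates for n units (simulated for illustration).
n, p = 100, 3
X = rng.normal(size=(n, p))

def mahalanobis_balance(T, X):
    """Mahalanobis distance between the covariate means of the two arms."""
    diff = X[T == 1].mean(axis=0) - X[T == 0].mean(axis=0)
    cov = np.cov(X, rowvar=False) * (1 / (T == 1).sum() + 1 / (T == 0).sum())
    return diff @ np.linalg.solve(cov, diff)

# Design stage, rerandomization: redraw the balanced assignment until the
# covariate-balance criterion passes an (arbitrary) acceptance threshold.
threshold = 1.0
base = np.r_[np.ones(n // 2, int), np.zeros(n - n // 2, int)]
while True:
    T = rng.permutation(base)
    if mahalanobis_balance(T, X) <= threshold:
        break

# Outcomes with a true effect of 1.5 (assumed data-generating process).
y = 1.5 * T + X @ np.array([1.0, -0.5, 0.2]) + rng.normal(size=n)

# Analysis stage, regression adjustment: regress y on T, centred covariates,
# and their interactions; the coefficient on T is the adjusted estimate.
Xc = X - X.mean(axis=0)
Z = np.column_stack([np.ones(n), T, Xc, T[:, None] * Xc])
beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
tau_hat = beta[1]
```

In practice the standard error for `tau_hat` would then be computed with a Huber--White robust variance estimator, as the abstract suggests.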