
    Lasso adjustments of treatment effect estimates in randomized experiments

    We provide a principled way for investigators to analyze randomized experiments when the number of covariates is large. Investigators often use linear multivariate regression to analyze randomized experiments instead of simply reporting the difference of means between treatment and control groups. Their aim is to reduce the variance of the estimated treatment effect by adjusting for covariates. If the number of covariates is large relative to the number of observations, regression may perform poorly because of overfitting. In such cases, the Lasso may be helpful. We study the resulting Lasso-based treatment effect estimator under the Neyman-Rubin model of randomized experiments. We present theoretical conditions that guarantee that the estimator is more efficient than the simple difference-of-means estimator, and we provide a conservative estimator of the asymptotic variance, which can yield tighter confidence intervals than the difference-of-means estimator. Simulation and data examples show that Lasso-based adjustment can be advantageous even when the number of covariates is less than the number of observations. In particular, a variant that uses the Lasso for covariate selection and OLS for estimation, with the smoothing parameter chosen by the combined performance of the Lasso and OLS stages, performs particularly well.
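    The following is a minimal sketch of the Lasso-then-OLS idea described above, not the authors' exact estimator: a cross-validated Lasso (here scikit-learn's LassoCV) selects a sparse set of covariates, and an OLS regression of the outcome on the treatment indicator plus the selected covariates supplies the adjusted treatment-effect estimate. The function name and the simulated data are illustrative.

```python
import numpy as np
from sklearn.linear_model import LassoCV, LinearRegression

def lasso_ols_ate(y, t, X):
    """y: outcomes, t: 0/1 treatment indicator, X: covariate matrix."""
    # Step 1: cross-validated Lasso on the covariates picks a sparse
    # adjustment set (the indices with nonzero coefficients).
    selected = np.flatnonzero(LassoCV(cv=5).fit(X, y).coef_)
    # Step 2: OLS of y on the treatment indicator plus the selected
    # covariates; the coefficient on t is the adjusted effect estimate.
    Z = np.column_stack([t, X[:, selected]]) if selected.size else t.reshape(-1, 1)
    return LinearRegression().fit(Z, y).coef_[0]

# Illustration on simulated data with more covariates than observations.
rng = np.random.default_rng(0)
n, p = 200, 500
X = rng.normal(size=(n, p))
t = rng.integers(0, 2, size=n).astype(float)
y = 2.0 * t + X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=n)
print(lasso_ols_ate(y, t, X))  # should land near the true effect of 2.0
```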

    Seasonal changes in the concentrations of dissolved oxygen in the lakes of the “Bory Tucholskie” National Park

    The article presents the results of examinations of the vertical distribution of dissolved oxygen (DO) at the deepest points of the lakes, conducted at different times in the years 2003-2005 and earlier. The authors draw particular attention to severe oxygen deficits at the deepest points of both the deep and the shallow lakes, despite the fact that these lakes are not strongly exposed to anthropopressure. They also point out the similarity of the oxygen curves in the same lakes and seasons in consecutive years, as well as the differences between particular lakes. Finally, they found a correlation between the mean DO concentration in the vertical profile and the duration of the ice-cover period (R² = 0.78).

    DP-TBART: A Transformer-based Autoregressive Model for Differentially Private Tabular Data Generation

    The generation of synthetic tabular data that preserves differential privacy is a problem of growing importance. While traditional marginal-based methods have achieved impressive results, recent work has shown that deep learning-based approaches tend to lag behind. In this work, we present the Differentially-Private TaBular AutoRegressive Transformer (DP-TBART), a transformer-based autoregressive model that maintains differential privacy and achieves performance competitive with marginal-based methods on a wide variety of datasets, even outperforming state-of-the-art methods in certain settings. We also provide a theoretical framework for understanding the limitations of marginal-based approaches and where deep learning-based approaches stand to contribute most. These results suggest that deep learning-based techniques should be considered a viable alternative to marginal-based methods in the generation of differentially private synthetic tabular data.
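    As a rough illustration of the autoregressive view of tabular data that a model like DP-TBART builds on (the class and function names below are hypothetical, and the differentially private training loop, e.g. DP-SGD via a library such as Opacus, is omitted): each row is encoded as one token per column, a causal transformer predicts each column's token from the preceding columns, and synthetic rows are sampled column by column.

```python
import torch
import torch.nn as nn

class TabularAR(nn.Module):
    """Causal transformer over one-token-per-column row encodings."""
    def __init__(self, vocab_size, n_cols, d_model=64):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)  # cell-value embedding
        self.pos = nn.Embedding(n_cols, d_model)      # column-position embedding
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, vocab_size)
        self.n_cols = n_cols

    def forward(self, tokens):  # tokens: (batch, seq_len) integer cell codes
        pos = torch.arange(tokens.size(1), device=tokens.device)
        h = self.tok(tokens) + self.pos(pos)
        # Causal mask: the prediction for column i may only see columns < i.
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        return self.head(self.encoder(h, mask=mask))  # per-column logits

@torch.no_grad()
def sample_row(model, bos_token=0):
    """Sample one synthetic row, one column at a time."""
    row = torch.tensor([[bos_token]])
    for _ in range(model.n_cols - 1):
        logits = model(row)[:, -1, :]
        nxt = torch.multinomial(torch.softmax(logits, dim=-1), num_samples=1)
        row = torch.cat([row, nxt], dim=1)
    return row[0, 1:]  # drop the leading BOS token
```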

    Data reliability in citizen science: learning curve and the effects of training method, volunteer background and experience on identification accuracy of insects visiting ivy flowers

    • Citizen science, the involvement of volunteers in the collection of scientific data, can be a useful research tool. However, data collected by volunteers are often of lower quality than those collected by professional scientists.
    • We studied the accuracy with which volunteers identified insects visiting ivy (Hedera) flowers in Sussex, England. In the first experiment, we examined the effects of training method, volunteer background and prior experience. Fifty-three participants were trained for the same duration using one of three methods (pamphlet; pamphlet + slide show; pamphlet + direct training). Almost immediately following training, we tested the ability of participants to identify live insects on ivy flowers to one of 10 taxonomic categories and recorded whether each identification was correct or incorrect, without providing feedback.
    • The training method had a significant effect on identification accuracy (P = 0.008; see the sketch after this abstract). Participants identified 79.1% of insects correctly after using a one-page colour pamphlet, 85.6% after using the pamphlet and viewing a slide show, and 94.3% after using the pamphlet in combination with direct training in the field.
    • As direct training cannot be delivered remotely, in the following year we conducted a second experiment in which a different sample of 26 volunteers received the pamphlet-plus-slide-show training three times. In this experiment, participants also received c. 2 minutes of additional training material, either videos of insects or stills taken from the videos. Identification accuracy increased from 88.6% to 91.3% to 97.5% across the three successive tests. We also found a borderline-significant interaction between the type of additional material and the test number (P = 0.053), such that the video gave fewer errors than the stills in the first two tests only.
    • The most common errors made by volunteers were confusions of honey bees and social wasps with their hover fly mimics. We also tested six experts, who achieved nearly perfect accuracy (99.8%), which shows what is possible in practice.
    • Overall, our study shows that two or three sessions of remote training can be as good as one session of direct training, even for relatively challenging taxonomic discriminations such as distinguishing models and mimics.
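    The method effect reported above could, for instance, be assessed with a contingency-table test like the one sketched below. The counts are hypothetical, chosen only to roughly match the reported accuracies (79.1%, 85.6%, 94.3%, out of an assumed 200 identifications per group); the paper's actual analysis may differ, e.g. a model accounting for repeated identifications per participant.

```python
from scipy.stats import chi2_contingency

# Rows: training method; columns: [correct, incorrect] identifications.
# These counts are hypothetical, for illustration only.
counts = [
    [158, 42],   # pamphlet only                     (~79% correct)
    [171, 29],   # pamphlet + slide show             (~86% correct)
    [189, 11],   # pamphlet + direct field training  (~94% correct)
]
chi2, p, dof, expected = chi2_contingency(counts)
print(f"chi2 = {chi2:.2f}, df = {dof}, p = {p:.4f}")
```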

    On mitigating the analytical limitations of finely stratified experiments

    Although attractive from a theoretical perspective, finely stratified experiments such as paired designs suffer from certain analytical limitations that are not present in block-randomized experiments with multiple treated and control individuals in each block. In short, when using a weighted difference in means to estimate the sample average treatment effect, the traditional variance estimator in a paired experiment is conservative unless the pairwise average treatment effects are constant across pairs; however, in more coarsely stratified experiments, the corresponding variance estimator is unbiased if treatment effects are constant within blocks, even if they vary across blocks. Using insights from classical least squares theory, we present an improved variance estimator that is appropriate in finely stratified experiments. The variance estimator remains conservative in expectation but is asymptotically no more conservative than the classical estimator and can be considerably less conservative. The magnitude of the improvement depends on the extent to which effect heterogeneity can be explained by observed covariates. Aided by this estimator, a new test for the null hypothesis of a constant treatment effect is proposed. These findings extend to some, but not all, superpopulation models, depending on whether the covariates are viewed as fixed across samples.
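    A minimal sketch of the general idea, not the paper's exact estimator: with I pairs, the classical variance estimate for the average within-pair difference is the sample variance of the pair differences divided by I; the improvement regresses those differences on pair-level covariates and keeps only the residual variation, so effect heterogeneity explained by covariates no longer inflates the estimate. Degrees-of-freedom and conservativeness details follow the paper.

```python
import numpy as np

def paired_variance_estimates(D, X):
    """D: length-I vector of within-pair (treated minus control) differences.
    X: (I, k) matrix of pair-level covariates."""
    I = len(D)
    # Classical conservative estimator: sample variance of differences over I.
    classical = np.var(D, ddof=1) / I
    # Covariate-adjusted estimator: regress D on an intercept plus covariates
    # and use the residual sum of squares in place of the total variation.
    Z = np.column_stack([np.ones(I), X])
    beta, *_ = np.linalg.lstsq(Z, D, rcond=None)
    resid = D - Z @ beta
    adjusted = resid @ resid / (I * (I - Z.shape[1]))
    return classical, adjusted
```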

    Rerandomization and Regression Adjustment

    Randomization is a basis for the statistical inference of treatment effects without strong assumptions on the outcome-generating process. Appropriately using covariates further yields more precise estimators in randomized experiments. R. A. Fisher suggested blocking on discrete covariates in the design stage or conducting analysis of covariance (ANCOVA) in the analysis stage. We can embed blocking into a wider class of experimental design called rerandomization, and extend the classical ANCOVA to more general regression adjustment. Rerandomization trumps complete randomization in the design stage, and regression adjustment trumps the simple difference-in-means estimator in the analysis stage. It is then intuitive to use both rerandomization and regression adjustment. Under the randomization-inference framework, we establish a unified theory allowing the designer and analyzer to have access to different sets of covariates. We find that asymptotically (a) for any given estimator with or without regression adjustment, rerandomization never hurts either the sampling precision or the estimated precision, and (b) for any given design with or without rerandomization, our regression-adjusted estimator never hurts the estimated precision. Therefore, combining rerandomization and regression adjustment yields better coverage properties and thus improves statistical inference. To theoretically quantify these statements, we discuss optimal regression-adjusted estimators in terms of the sampling precision and the estimated precision, and then measure the additional gains of the designer and the analyzer. We finally suggest using rerandomization in the design and regression adjustment in the analysis, followed by the Huber-White robust standard error.
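    A minimal sketch of the two stages under simple assumptions (complete randomization of n/2 units to treatment; a Mahalanobis-distance acceptance rule in the style of Morgan and Rubin for rerandomization; treatment-by-covariate interactions with an HC2 Huber-White robust standard error via statsmodels for the analysis). The threshold and function names are illustrative.

```python
import numpy as np
import statsmodels.api as sm

def rerandomize(X, threshold, rng):
    """Redraw a balanced assignment until covariate means are close enough."""
    n = X.shape[0]
    cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
    while True:
        t = np.zeros(n)
        t[rng.choice(n, n // 2, replace=False)] = 1.0
        diff = X[t == 1].mean(axis=0) - X[t == 0].mean(axis=0)
        m = (n / 4) * diff @ cov_inv @ diff   # Mahalanobis balance measure
        if m <= threshold:
            return t

def adjusted_estimate(y, t, X):
    """Regression adjustment with interactions and a robust standard error."""
    Xc = X - X.mean(axis=0)                   # center the covariates
    design = sm.add_constant(np.column_stack([t, Xc, t[:, None] * Xc]))
    fit = sm.OLS(y, design).fit(cov_type="HC2")
    return fit.params[1], fit.bse[1]          # estimate and robust SE for t
```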