Evaluating Synthetically Generated Data from Small Sample Sizes: An Experimental Study
In this paper, we propose a method for measuring the similarity of low-sample
tabular data to synthetically generated data with a larger number of samples
than the original. Generating such data is also known as data augmentation. However,
significance levels obtained from non-parametric tests are suspect when the sample
size is small. Our method uses a combination of geometry, topology, and robust
statistics for hypothesis testing in order to assess the validity of the generated
data. We also compare the results with common global metric methods available
in the literature for large-sample-size data.
Priv Stat Databases
In this paper, we propose a method for statistical disclosure limitation of categorical variables that we call Conditional Group Swapping. This approach is suitable for design and strata-defining variables, the cross-classification of which leads to the formation of important groups or subpopulations. These groups are considered important because, from the point of view of data analysis, it is desirable to preserve analytical characteristics within them. In general, data swapping can be quite distorting ([12, 18, 15]), especially for the relationships between the variables, not only within the subpopulations but also in the overall data. To reduce the damage incurred by swapping, we propose to choose the records for swapping using conditional probabilities that depend on the characteristics of the exchanged records. In particular, our approach exploits the results of propensity score methodology for the computation of swapping probabilities. The experimental results presented in the paper show good utility properties of the method.
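The idea of choosing swap partners with probabilities that depend on record characteristics can be illustrated with a minimal sketch. It is not the paper's algorithm: the function name `swap_groups`, the `swap_rate` parameter, the record fields `group` and `score`, and the inverse-distance weighting are all assumptions made for illustration. Each record is assumed to carry a precomputed propensity score, and a record's group label is exchanged with that of a record from a different group, with partners of similar score being more likely.

```python
import random

def swap_groups(records, swap_rate=0.1, seed=0):
    """Swap the 'group' label between pairs of records from different groups.

    records: list of dicts with a 'group' label and a 'score' field, where
    'score' is a hypothetical, precomputed propensity score.
    Partners are drawn with probability proportional to the inverse distance
    between scores (an illustrative choice, not taken from the paper).
    """
    rng = random.Random(seed)
    out = [dict(r) for r in records]          # copy so the input is untouched
    n_swaps = int(swap_rate * len(out) / 2)   # each swap alters two records
    available = set(range(len(out)))
    for _ in range(n_swaps):
        if len(available) < 2:
            break
        i = rng.choice(sorted(available))
        candidates = sorted(
            j for j in available if out[j]["group"] != out[i]["group"]
        )
        if not candidates:
            available.discard(i)
            continue
        # Closer propensity scores give a higher chance of being chosen.
        weights = [
            1.0 / (1e-6 + abs(out[i]["score"] - out[j]["score"]))
            for j in candidates
        ]
        j = rng.choices(candidates, weights=weights, k=1)[0]
        out[i]["group"], out[j]["group"] = out[j]["group"], out[i]["group"]
        available -= {i, j}                   # swap each record at most once
    return out
```

Because labels are only exchanged between records, the marginal group counts are unchanged; what the weighting tries to limit is the distortion of relationships between the group variable and the other variables.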
Stop or Continue Data Collection: A Nonignorable Missing Data Approach for Continuous Variables
We present an approach to inform decisions about nonresponse follow-up
sampling. The basic idea is (i) to create completed samples by imputing
nonrespondents' data under various assumptions about the nonresponse
mechanisms, (ii) to take hypothetical samples of varying sizes from the completed
samples, and (iii) to compute and compare measures of accuracy and cost for
different proposed sample sizes. As part of the methodology, we present a new
approach for generating imputations for multivariate continuous data with
nonignorable unit nonresponse. We fit mixtures of multivariate normal
distributions to the respondents' data, and adjust the probabilities of the
mixture components to generate nonrespondents' distributions with desired
features. We illustrate the approaches using data from the 2007 U.S. Census of
Manufactures.
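The re-weighting step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the mixture is taken as already fitted to the respondents' data, each component is simplified to independent (diagonal-covariance) normals, and the names `adjust_weights`, `sample_mixture`, and the multiplicative `tilt` factors are hypothetical.

```python
import random

def adjust_weights(weights, tilt):
    """Multiply each component probability by a tilt factor and renormalize.

    'tilt' is a hypothetical device for expressing an assumption about the
    nonresponse mechanism, e.g. that nonrespondents fall more often in one
    mixture component than respondents do.
    """
    raw = [w * t for w, t in zip(weights, tilt)]
    total = sum(raw)
    return [r / total for r in raw]

def sample_mixture(weights, means, sds, n, seed=0):
    """Draw n vectors from a mixture of normals.

    Simplifying assumption: variables are independent within a component,
    so each component is described by per-variable means and standard
    deviations rather than a full covariance matrix.
    """
    rng = random.Random(seed)
    draws = []
    for _ in range(n):
        # Pick a component, then draw each variable from that component.
        k = rng.choices(range(len(weights)), weights=weights, k=1)[0]
        draws.append([rng.gauss(m, s) for m, s in zip(means[k], sds[k])])
    return draws

# Respondent mixture (illustrative values only).
resp_weights = [0.7, 0.3]
means = [[10.0, 5.0], [20.0, 8.0]]
sds = [[1.0, 0.5], [2.0, 1.0]]

# Assume nonrespondents are three times as likely to come from component 2.
nonresp_weights = adjust_weights(resp_weights, tilt=[1.0, 3.0])
imputations = sample_mixture(nonresp_weights, means, sds, n=100, seed=42)
```

Varying the tilt factors yields imputations under different assumed nonresponse mechanisms, which can then feed the comparison of hypothetical follow-up sample sizes.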