Expanding the Role of Synthetic Data at the U.S. Census Bureau
National Statistical Offices (NSOs) create official statistics from data collected directly from survey respondents, from government administrative records, and from other third-party sources. The raw source data, regardless of origin, are usually considered confidential. In the case of the U.S. Census Bureau, confidentiality of survey and administrative-records microdata is mandated by statute, and this mandate to protect confidentiality is often at odds with the needs of data users to extract as much information as possible from rich microdata. Traditional disclosure protection techniques applied to resolve this tension have resulted in official data products that come nowhere close to fully utilizing the information content of the underlying microdata. Typically, these products take the form of basic, aggregate tabulations. In a few cases anonymized public-use micro samples are made available, but these are increasingly at risk of re-identification because of the ever larger amounts of information about individuals and firms available in the public domain. One potential approach for overcoming these risks is to release products based on synthetic or partially synthetic data, in which values are simulated from statistical models designed to mimic the (joint) distributions of the underlying microdata rather than releasing the actual microdata. We discuss recent Census Bureau work to develop and deploy such products. We also discuss the benefits and challenges involved with extending the scope of synthetic data products in official statistics.
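A minimal sketch of the partially synthetic data idea described above: a sensitive variable in the microdata is replaced by draws from a model fit to the confidential records, while non-sensitive variables are released as-is. The column names, the linear-Gaussian synthesis model, and the single synthesized variable are assumptions for illustration only, not the Census Bureau's production methodology.

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy "confidential" microdata: two non-sensitive predictors and one sensitive variable.
    n = 1000
    age = rng.uniform(20, 65, n)
    hours = rng.uniform(10, 60, n)
    income = 500 * age + 300 * hours + rng.normal(0, 5000, n)   # sensitive variable

    # Fit a simple synthesis model (here: ordinary least squares) on the confidential data.
    X = np.column_stack([np.ones(n), age, hours])
    beta, *_ = np.linalg.lstsq(X, income, rcond=None)
    resid_sd = np.std(income - X @ beta)

    # Release partially synthetic records: keep the non-sensitive columns,
    # replace the sensitive column with draws from the fitted predictive distribution.
    income_synth = X @ beta + rng.normal(0, resid_sd, n)
    synthetic_release = np.column_stack([age, hours, income_synth])
    print(synthetic_release[:5])

The released values preserve (approximately) the joint distribution captured by the fitted model while no record contains an actual observed income.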
Brand Capital and Incumbent Firms' Positions in Evolving Markets
In many advertising-intensive industries one observes market share persistence, i.e., firms maintaining lead market shares over long periods of time. I hypothesize that firms that have the largest stock of well-established brands, a stock that I term brand capital, are most likely to introduce new products in response to new market information about consumer preferences. Firms with less brand capital delay their introductions until the uncertainty concerning the market size is reduced. I present empirical support in a study of new product introductions in the U.S. beverage industry.
Force-induced rupture of a DNA duplex
The rupture of double-stranded DNA under stress is a key process in biophysics and nanotechnology. In this article we consider the shear-induced rupture of short DNA duplexes, a system that has been given new importance by recently designed force sensors and nanotechnological devices. We argue that rupture must be understood as an activated process, where the duplex state is metastable and the strands will separate in a finite time that depends on the duplex length and the force applied. Thus, the critical shearing force required to rupture a duplex within a given experiment depends strongly on the time scale of observation. We use simple models of DNA to demonstrate that this approach naturally captures the experimentally observed dependence of the critical force on duplex length for a given observation time. In particular, the critical force is zero for the shortest duplexes, before rising sharply and then plateauing in the long length limit. The prevailing approach, based on identifying when the presence of each additional base pair within the duplex is thermodynamically unfavorable rather than allowing for metastability, does not predict a time-scale-dependent critical force and does not naturally incorporate a critical force of zero for the shortest duplexes. Additionally, motivated by a recently proposed force sensor, we investigate application of stress to a duplex in a mixed mode that interpolates between shearing and unzipping. As with pure shearing, the critical force depends on the time scale of observation; at a fixed time scale and duplex length, the critical force exhibits a sigmoidal dependence on the fraction of the duplex that is subject to shearing. Comment: 10 pages, 6 figures
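The abstract's central argument, that a time-scale-dependent critical force follows from treating rupture as escape from a metastable duplex state, can be illustrated with a generic Arrhenius/Bell-type rate picture. This is a schematic sketch for orientation, not the specific coarse-grained DNA model used in the paper; the barrier form $\Delta G(N)$ and the transition distance $x^{\ddagger}$ are assumptions for illustration.

\[
\tau(F, N) \;\approx\; \tau_0 \exp\!\left[\frac{\Delta G(N) - F x^{\ddagger}}{k_B T}\right],
\qquad
\tau(F_c, N) = t_{\mathrm{obs}}
\;\Rightarrow\;
F_c(N, t_{\mathrm{obs}}) \;=\; \max\!\left\{0,\;
\frac{\Delta G(N) - k_B T \ln\!\left(t_{\mathrm{obs}}/\tau_0\right)}{x^{\ddagger}}\right\}.
\]

In this picture the critical force is zero whenever the barrier $\Delta G(N)$ of the shortest duplexes is small enough that thermal fluctuations alone separate the strands within the observation time $t_{\mathrm{obs}}$, and $F_c$ grows with duplex length while decreasing (logarithmically) as the observation time is lengthened, consistent with the length and time-scale dependence described above.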
REGRESSION ADJUSTMENT AND STRATIFICATION BY PROPENSITY SCORE IN TREATMENT EFFECT ESTIMATION
Propensity score adjustment of effect estimates in observational studies of treatment is a common technique used to control for bias in treatment assignment. In situations where matching on propensity score is not possible or desirable, regression adjustment and stratification are two options. Regression adjustment is used most often and can be highly efficient, but it can lead to biased results when model assumptions are violated. The validity of the stratification approach depends on fewer model assumptions, but it is less efficient than regression adjustment when the regression assumptions hold. To investigate these issues, we use simulation to compare stratification and regression adjustment. We consider two stratification approaches: equal-frequency classes and an approach that attempts to minimize the mean squared error (MSE) of the treatment effect estimate. The regression approach we consider is a generalized additive model (GAM) that flexibly estimates the relations among propensity score, treatment assignment, and outcome. We find that, under a wide range of plausible data-generating distributions, the GAM approach outperforms stratification in treatment effect estimation with respect to bias, variance, and thereby MSE. We illustrate the approaches via an analysis of data on insurance plan choice and its relation to satisfaction with asthma care.
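A minimal sketch, on simulated data, of the two kinds of estimator compared above: an equal-frequency propensity-score stratification estimate and a regression adjustment with a smooth term in the estimated propensity score (a B-spline term via a statsmodels/patsy formula, used here as a stand-in for the paper's GAM). The data-generating process, sample size, and number of strata are assumptions for illustration.

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(1)
    n = 5000

    # Simulated observational data: confounder x affects both treatment and outcome.
    x = rng.normal(size=n)
    p_treat = 1 / (1 + np.exp(-0.8 * x))                # true assignment probability
    t = rng.binomial(1, p_treat)
    y = 2.0 * t + 1.5 * x + rng.normal(size=n)          # true treatment effect = 2.0

    # Estimate propensity scores.
    ps = LogisticRegression().fit(x.reshape(-1, 1), t).predict_proba(x.reshape(-1, 1))[:, 1]
    df = pd.DataFrame({"y": y, "t": t, "ps": ps})

    # (1) Equal-frequency stratification on the estimated propensity score.
    df["stratum"] = pd.qcut(df["ps"], q=5, labels=False)
    per_stratum = df.groupby("stratum").apply(
        lambda g: g.loc[g.t == 1, "y"].mean() - g.loc[g.t == 0, "y"].mean()
    )
    weights = df["stratum"].value_counts(normalize=True).sort_index()
    ate_strat = float((per_stratum * weights).sum())

    # (2) Regression adjustment with a flexible (spline) term in the propensity score.
    ate_reg = smf.ols("y ~ t + bs(ps, df=4)", data=df).fit().params["t"]

    print(f"stratification: {ate_strat:.3f}, regression adjustment: {ate_reg:.3f}")

Both estimates should land near the true effect of 2.0 here; the paper's simulations probe how bias, variance, and MSE of such estimators behave across a range of data-generating distributions.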
OPTIMAL PROPENSITY SCORE STRATIFICATION
Stratifying on propensity score in observational studies of treatment is a common technique used to control for bias in treatment assignment; however, there have been few studies of the relative efficiency of the various ways of forming those strata. The standard method is to use the quintiles of the propensity score to create subclasses, but this choice is not based on any measure of performance, either observed or theoretical. In this paper, we investigate the optimal subclassification of propensity scores for estimating treatment effect with respect to mean squared error (MSE) of the estimate. We consider the optimal formation of subclasses within formation schemes that require either equal frequency of observations within each subclass or equal variance of the effect estimate within each subclass. Under these restrictions, choosing the partition reduces to choosing the number of subclasses. We also consider an overall optimal partition that produces an effect estimate with minimum MSE among all partitions considered. To create this stratification, the investigator must choose both the number of subclasses and their placement. Finally, we present a stratified propensity score analysis of data concerning insurance plan choice and its relation to satisfaction with asthma care.
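Under the equal-frequency restriction described above, choosing the partition reduces to choosing the number of subclasses K. A minimal sketch of that ingredient on simulated data, where the true effect is known so the MSE of the stratified estimator can be evaluated over repeated replications for each K; the data-generating process and the candidate values of K are assumptions for illustration, and the paper's optimality criterion is not reproduced here.

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(2)
    TRUE_EFFECT = 2.0

    def stratified_estimate(k, n=2000):
        """One simulated data set; equal-frequency stratification on the propensity score."""
        x = rng.normal(size=n)
        ps = 1 / (1 + np.exp(-x))                  # propensity score (known here for simplicity)
        t = rng.binomial(1, ps)
        y = TRUE_EFFECT * t + 1.5 * x + rng.normal(size=n)
        df = pd.DataFrame({"y": y, "t": t, "stratum": pd.qcut(ps, q=k, labels=False)})
        per_stratum = df.groupby("stratum").apply(
            lambda g: g.loc[g.t == 1, "y"].mean() - g.loc[g.t == 0, "y"].mean()
        )
        weights = df["stratum"].value_counts(normalize=True).sort_index()
        return float((per_stratum * weights).sum())

    # Evaluate MSE of the estimator for each candidate number of equal-frequency subclasses.
    for k in (2, 5, 10, 20):
        estimates = np.array([stratified_estimate(k) for _ in range(200)])
        mse = np.mean((estimates - TRUE_EFFECT) ** 2)
        print(f"K = {k:2d}: MSE = {mse:.4f}")

Too few subclasses leave residual confounding (bias); too many inflate within-stratum variance. The number of subclasses that balances the two is exactly what the paper's MSE-based criterion is designed to choose.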
EFFICIENT EVALUATION OF RANKING PROCEDURES WHEN THE NUMBER OF UNITS IS LARGE WITH APPLICATION TO SNP IDENTIFICATION
Simulation-based assessment is a popular and frequently necessary approach to the evaluation of statistical procedures. The ability to take advantage of underlying mathematical relations is sometimes overlooked, and we focus on this aspect. We show how to take advantage of large-sample theory when conducting a simulation, using the analysis of genomic data as a motivating example. The approach uses convergence results to provide an approximation to smaller-sample results that are otherwise available only by simulation. We consider evaluating and comparing a variety of ranking-based methods for identifying the most highly associated SNPs in a genome-wide association study, derive integral-equation representations of the pre-posterior distribution of percentiles produced by three ranking methods, and provide examples comparing performance. These results are of interest in their own right and set the framework for a more extensive set of comparisons.
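A minimal sketch of the brute-force, simulation-only side of the evaluation described above: simulate association statistics for many SNPs, rank them, and record the percentile attained by a truly associated SNP over repeated data sets. The numbers of SNPs and replications, the Gaussian statistic model, and the single non-null SNP are assumptions for illustration; the paper's contribution is to replace much of this Monte Carlo work with large-sample (integral-equation) approximations.

    import numpy as np

    rng = np.random.default_rng(3)

    n_snps = 10_000          # number of SNPs tested per simulated scan
    effect = 3.0             # standardized effect of the one truly associated SNP
    n_reps = 500             # number of simulated genome-wide scans

    percentiles = np.empty(n_reps)
    for r in range(n_reps):
        # Z statistics: null SNPs ~ N(0, 1); the associated SNP (index 0) ~ N(effect, 1).
        z = rng.normal(size=n_snps)
        z[0] += effect
        # Percentile (by rank of |Z|) attained by the truly associated SNP in this scan.
        rank = np.sum(np.abs(z) <= np.abs(z[0]))
        percentiles[r] = rank / n_snps

    print(f"median percentile of the associated SNP: {np.median(percentiles):.4f}")
    print(f"P(associated SNP in top 0.1%): {np.mean(percentiles >= 0.999):.3f}")

Each evaluation of a ranking rule at a new sample size or effect size requires rerunning the whole Monte Carlo loop, which is the computational burden the large-sample approximations are meant to avoid.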