Synthetic Establishment Microdata Around the World
In contrast to the many public-use microdata samples of individual and household data available from statistical agencies around the world, there are virtually no establishment or firm microdata available. In large part, this difficulty in providing access to business microdata is due to the skewed and sparse distributions that characterize business data. Synthetic data are simulated data generated from statistical models. We organized sessions at the 2015 World Statistical Congress and the 2015 Joint Statistical Meetings, highlighting work on synthetic establishment microdata. This overview situates those papers, published in this issue, within the broader literature.
Avoiding disclosure of individually identifiable health information: a literature review
Achieving data and information dissemination without harming anyone is a central task of any entity in charge of collecting data. In this article, the authors examine the literature on data and statistical confidentiality. Rather than comparing the theoretical properties of specific methods, they emphasize the main themes that emerge from the ongoing discussion among scientists regarding how best to achieve the appropriate balance between data protection, data utility, and data dissemination. They cover the literature on de-identification and reidentification methods with emphasis on health care data. The authors also discuss the benefits and limitations of the most common access methods. Although there is abundant theoretical and empirical research, their review reveals a lack of consensus on fundamental questions for empirical practice: how to assess disclosure risk, how to choose among disclosure methods, how to assess reidentification risk, and how to measure utility loss. Keywords: public use files, disclosure avoidance, reidentification, de-identification, data utility
Multiple imputation for sharing precise geographies in public use data
When releasing data to the public, data stewards are ethically and often
legally obligated to protect the confidentiality of data subjects' identities
and sensitive attributes. They also strive to release data that are informative
for a wide range of secondary analyses. Achieving both objectives is
particularly challenging when data stewards seek to release highly resolved
geographical information. We present an approach for protecting the
confidentiality of data with geographic identifiers based on multiple
imputation. The basic idea is to convert geography to latitude and longitude,
estimate a bivariate response model conditional on attributes, and simulate new
latitude and longitude values from these models. We illustrate the proposed
methods using data describing causes of death in Durham, North Carolina. In the
context of the application, we present a straightforward tool for generating
simulated geographies and attributes based on regression trees, and we present
methods for assessing disclosure risks with such simulated data. Comment: Published at http://dx.doi.org/10.1214/11-AOAS506 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org
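The simulation idea in this abstract (model coordinates conditional on attributes, then draw new coordinates, several times over) can be illustrated with a deliberately simplified sketch. It is not the authors' procedure: exact attribute combinations stand in for regression-tree leaves, latitude and longitude are drawn independently within a cell rather than from a bivariate model, and the function name `synthesize_locations`, the record layout, and the toy values are all hypothetical.

```python
import random
from collections import defaultdict

def synthesize_locations(records, m=3, seed=0):
    """Return m synthetic copies of records with simulated (lat, lon).

    records: list of (attrs, lat, lon), where attrs is a hashable tuple.
    Simplification (assumption): each distinct attrs tuple acts as one
    "leaf", and new coordinates are resampled from that leaf's donors.
    """
    rng = random.Random(seed)
    # Group observed coordinates by attribute cell (stand-in for tree leaves).
    cells = defaultdict(list)
    for attrs, lat, lon in records:
        cells[attrs].append((lat, lon))

    copies = []
    for _ in range(m):
        synthetic = []
        for attrs, _lat, _lon in records:
            donors = cells[attrs]
            # Draw latitude and longitude separately from donors in the cell.
            new_lat = rng.choice(donors)[0]
            new_lon = rng.choice(donors)[1]
            synthetic.append((attrs, new_lat, new_lon))
        copies.append(synthetic)
    return copies

# Hypothetical toy records: ((age group, cause of death), latitude, longitude).
records = [
    (("65+", "cardiac"), 35.99, -78.90),
    (("65+", "cardiac"), 36.01, -78.92),
    (("65+", "cardiac"), 36.03, -78.88),
    (("<65", "injury"), 35.95, -78.95),
    (("<65", "injury"), 35.97, -78.93),
]
copies = synthesize_locations(records, m=3, seed=42)
```

Each of the `m` copies keeps the attributes fixed and replaces only the coordinates, which mirrors the multiple-imputation framing. A real application would replace the exact-match cells with fitted trees and would need a separate disclosure-risk assessment, since resampled coordinates can still coincide with observed ones.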
Creation of public use files: lessons learned from the comparative effectiveness research public use files data pilot project
In this paper we describe lessons learned from the creation of Basic Stand Alone (BSA) Public Use Files (PUFs) for the Comparative Effectiveness Research Public Use Files Data Pilot Project (CER-PUF). CER-PUF is aimed at increasing access to the Centers for Medicare and Medicaid Services (CMS) Medicare claims datasets through PUFs that: do not require user fees and data use agreements, have been de-identified to assure the confidentiality of the beneficiaries and providers, and still provide substantial analytic utility to researchers. For this paper we define PUFs as datasets characterized by free and unrestricted access to any user. We derive lessons learned from five major project activities: (i) a review of the statistical and computer science literature on best practices in PUF creation, (ii) interviews with comparative effectiveness researchers to assess their data needs, (iii) case studies of PUF initiatives in the United States, (iv) interviews with stakeholders to identify the most salient issues regarding making microdata publicly available, and (v) the actual process of creating the Medicare claims data BSA PUFs.
Location Tracing and Potential Risks in Interaction Data Sets
Location-aware mobile phone handsets have become increasingly common in recent years, giving rise to a wide variety of location-based services that rely on a person’s mobile phone reporting its current location to a remote service provider. Previous research has demonstrated that services that geo-code status updates may permit the estimation of the rough locations of both users’ homes and their workplaces. The paper investigates the disclosure risks of a priori knowledge of a person’s home and workplace locations, or of their current and previous home locations. Detailed interaction data sets published from censuses or other sources are characterised by the sparsity of the contained data, such that unique combinations of two locations may often be observed. In the most detailed 2011 migration data, 37% of migrants had a unique combination of origin and destination, whilst in the most detailed journey-to-work data, 58% of workers had a unique combination of home and workplace. The amount of additional attribute data that might be disclosed is limited. When coarser geographies are used, there still remains a non-trivial number of persons with unique location combinations, with considerably more attributes potentially disclosable.
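The uniqueness statistic quoted in this abstract (the share of persons whose pair of locations is shared with no one else) can be computed as in the minimal sketch below. The function `share_unique`, the zone labels, and the coarsening lookup are hypothetical illustrations and are not taken from the paper or its data.

```python
from collections import Counter

def share_unique(pairs):
    """Fraction of records whose (origin, destination) pair occurs exactly once."""
    if not pairs:
        return 0.0
    counts = Counter(pairs)
    unique = sum(1 for p in pairs if counts[p] == 1)
    return unique / len(pairs)

# Hypothetical toy flows: one (home zone, workplace zone) pair per worker.
flows = [("A", "X"), ("A", "X"), ("A", "Y"), ("B", "X"), ("C", "Z")]
fine = share_unique(flows)

# Hypothetical lookup that aggregates fine zones into coarser areas.
coarse_map = {"A": "North", "B": "North", "C": "South",
              "X": "East", "Y": "East", "Z": "West"}
coarse = share_unique([(coarse_map[o], coarse_map[d]) for o, d in flows])

print(fine, coarse)
```

In this toy example, coarsening the geography lowers the share of unique pairs but does not eliminate it, which is the pattern the abstract describes for real interaction data.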