Avoiding disclosure of individually identifiable health information: a literature review
Achieving data and information dissemination without harming anyone is a central task of any entity in charge of collecting data. In this article, the authors examine the literature on data and statistical confidentiality. Rather than comparing the theoretical properties of specific methods, they emphasize the main themes that emerge from the ongoing discussion among scientists regarding how best to achieve the appropriate balance between data protection, data utility, and data dissemination. They cover the literature on de-identification and reidentification methods with an emphasis on health care data. The authors also discuss the benefits and limitations of the most common access methods. Although there is abundant theoretical and empirical research, their review reveals a lack of consensus on fundamental questions for empirical practice: how to assess disclosure risk, how to choose among disclosure methods, how to assess reidentification risk, and how to measure utility loss.
Keywords: public use files, disclosure avoidance, reidentification, de-identification, data utility
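To make the last two of those questions concrete, here is a minimal sketch in R (illustrative only, not taken from the article; risk_uniques and utility_loss are invented names) of one common empirical risk proxy, sample uniqueness on a set of quasi-identifiers, and one simple utility-loss measure, correlation distortion between original and masked data.

    # Illustrative only (not from the article): two simple empirical measures.
    # Risk proxy: share of records that are unique on the quasi-identifiers.
    risk_uniques <- function(df, quasi_ids) {
      keys <- do.call(paste, df[quasi_ids])  # one combined key per record
      mean(table(keys)[keys] == 1)           # fraction of sample-unique records
    }

    # Utility loss: mean absolute distortion of pairwise correlations.
    utility_loss <- function(orig, masked, num_vars) {
      mean(abs(cor(orig[num_vars]) - cor(masked[num_vars])))
    }

For example, risk_uniques(d, c("age", "sex", "region")) returns the share of records unique on those three quasi-identifiers; values near zero suggest low reidentification risk under this crude proxy.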
Statistical disclosure control for numeric microdata via sequential joint probability preserving data shuffling
Traditional perturbative statistical disclosure control (SDC) approaches, such as microaggregation, noise addition, and rank swapping, perturb the data in an "ad hoc" way: while they manage to preserve some particular aspects of the data, they end up modifying others. Synthetic data approaches based on the fully conditional specification data synthesis paradigm, on the other hand, aim to generate new datasets that follow the same joint probability distribution as the original data. These synthetic data approaches, however, rely on either parametric statistical models or non-parametric machine learning models, which need to fit the original data well in order to generate credible and useful synthetic data. Another important drawback is that they tend to perform better when the variables are synthesized in the correct causal order (i.e., in the same order as the true data-generating process), which is often unknown in practice. To circumvent these issues, we propose a fully non-parametric and model-free perturbative SDC approach that approximates the joint distribution of the original data via sequential applications of restricted permutations to the numerical microdata, where the restricted permutations are guided by the joint distribution of a discretized version of the data. Empirical comparisons against popular SDC approaches, using both real and simulated datasets, suggest that the proposed approach is competitive in terms of the trade-off between confidentiality and data utility.
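As a rough illustration of the restricted-permutation idea, the following single-pass sketch in R (a simplification, not the paper's sequential algorithm; shuffle_restricted and n_bins are invented names) permutes each numeric variable within cells defined by a quantile discretization of the remaining variables, so the coarse joint distribution is preserved while individual values are shuffled.

    # Simplified single-pass sketch (not the paper's exact algorithm).
    # Each numeric variable is permuted within cells of a quantile
    # discretization of the other variables, so the coarse joint
    # distribution survives while individual values move.
    shuffle_restricted <- function(x, n_bins = 4) {
      x <- as.data.frame(x)
      bins <- lapply(x, function(v)
        cut(v, breaks = unique(quantile(v, 0:n_bins / n_bins)),
            include.lowest = TRUE))
      for (j in seq_along(x)) {
        cell <- interaction(bins[-j], drop = TRUE)  # cells from other variables
        x[[j]] <- ave(x[[j]], cell, FUN = function(v) v[sample.int(length(v))])
      }
      x
    }

    set.seed(1)
    masked <- shuffle_restricted(iris[1:4])  # shuffle the numeric columns of iris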
Statistical Disclosure Control for Micro-Data Using the R Package sdcMicro
The demand for data from surveys, censuses, or registers containing sensitive information on people or enterprises has increased significantly in recent years. However, before data can be provided to the public or to researchers, confidentiality has to be respected for any data set possibly containing sensitive information about individual units. Confidentiality can be achieved by applying statistical disclosure control (SDC) methods to the data in order to decrease the disclosure risk. The R package sdcMicro serves as an easy-to-handle, object-oriented S4 class implementation of SDC methods to evaluate and anonymize confidential micro-data sets. It includes all popular disclosure risk and perturbation methods. The package performs automated recalculation of frequency counts, individual and global risk measures, information loss, and data utility statistics after each anonymization step. All methods are highly optimized in terms of computational cost to be able to work with large data sets. Reporting facilities that summarize the anonymization process can also be easily used by practitioners. We describe the package and demonstrate its functionality with a complex household survey test data set distributed by the International Household Survey Network.
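A minimal usage sketch in R follows; createSdcObj, kAnon, microaggregation, and print are standard sdcMicro functions, while the toy data frame and its variable names are invented for illustration.

    # Toy data; variable names are invented for illustration.
    library(sdcMicro)
    set.seed(1)
    d <- data.frame(
      region = sample(c("N", "S", "E", "W"), 200, replace = TRUE),
      sex    = sample(c("m", "f"), 200, replace = TRUE),
      age    = sample(18:90, 200, replace = TRUE),
      income = rlnorm(200, meanlog = 10, sdlog = 0.5)
    )
    d$agegrp <- cut(d$age, breaks = c(17, 30, 45, 60, 90))

    # Wrap the data in the S4 SDC object, naming the categorical key
    # variables and the numeric variable to protect.
    sdc <- createSdcObj(d, keyVars = c("region", "sex", "agegrp"),
                        numVars = "income")
    sdc <- kAnon(sdc, k = 3)                       # local suppression to 3-anonymity
    sdc <- microaggregation(sdc, method = "mdav")  # perturb the numeric variable
    print(sdc, type = "risk")                      # automatically recalculated risk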
On Utilizing Association and Interaction Concepts for Enhancing Microaggregation in Secure Statistical Databases
This paper presents a possibly pioneering endeavor to tackle microaggregation techniques (MATs) in secure statistical databases by resorting to the principles of associative neural networks (NNs). The prior art has improved the available solutions to the MAT by incorporating proximity information, recursively reducing the size of the data set by excluding points that are farthest from the centroid and points that are closest to these farthest points. Although that method is extremely effective, it arguably uses only the proximity information while ignoring the mutual interaction between the records. In this paper, we argue that interrecord relationships can be quantified in terms of two entities: 1) their "association" and 2) their "interaction." This means that records that are not necessarily close to each other may still be "grouped," because their mutual interaction, quantified by invoking transitive-closure-like operations on the latter entity, could be significant, as suggested by the theoretically sound principles of NNs. By repeatedly invoking the interrecord associations and interactions, the records are grouped into sets of cardinality "k," where k is the security parameter of the algorithm. Our experimental results, obtained on artificial data and benchmark real-life data sets, demonstrate that the newly proposed method is superior to the state of the art not only from the information loss (IL) perspective but also under a criterion that combines IL and the disclosure risk (DR).
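The sketch below in R conveys only the general flavor of this idea and is emphatically not the authors' NN-based algorithm (microagg_assoc and lambda are invented): direct similarity ("association") is blended with a two-step, transitive-closure-like term ("interaction") before records are greedily grouped into sets of size k and replaced by group centroids.

    # Heavily simplified sketch, not the authors' NN-based method.
    # Association: similarity from inter-record distances.
    # Interaction: a two-step, transitive-closure-like term (S %*% S).
    microagg_assoc <- function(x, k = 3, lambda = 0.5) {
      x <- scale(as.matrix(x))
      S <- exp(-as.matrix(dist(x)))                          # association
      A <- lambda * S + (1 - lambda) * (S %*% S) / nrow(S)   # add interaction
      ungrouped <- seq_len(nrow(x)); out <- x
      while (length(ungrouped) >= k) {
        i <- ungrouped[1]
        aff <- A[i, ungrouped]; aff[1] <- -Inf     # never pick the seed itself
        grp <- c(i, ungrouped[order(aff, decreasing = TRUE)[1:(k - 1)]])
        out[grp, ] <- matrix(colMeans(x[grp, , drop = FALSE]),
                             k, ncol(x), byrow = TRUE)
        ungrouped <- setdiff(ungrouped, grp)
      }
      if (length(ungrouped) > 0)                   # leftovers form a final group
        out[ungrouped, ] <- matrix(colMeans(x[ungrouped, , drop = FALSE]),
                                   length(ungrouped), ncol(x), byrow = TRUE)
      out
    }

Information loss can then be scored, for instance, as the within-group sum of squares relative to the total, and weighed against the protection the size-k grouping provides.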