10 research outputs found

    30 Years of Synthetic Data

    The idea of generating synthetic data as a tool for broadening access to sensitive microdata was first proposed three decades ago. While the first applications of the idea emerged around the turn of the century, the approach really gained momentum over the last ten years, stimulated at least in part by recent developments in computer science. We consider the upcoming 30th jubilee of Rubin's seminal paper on synthetic data (Rubin, 1993) as an opportunity to look back at the historical developments, but also to offer a review of the diverse approaches and methodological underpinnings proposed over the years. We also discuss the various strategies that have been suggested to measure the utility and remaining disclosure risk of the generated data. Comment: 42 pages

    MI Double Feature: Multiple Imputation to Address Nonresponse and Rounding Errors in Income Questions

    Obtaining reliable income information in surveys is difficult for two reasons. On the one hand, many survey respondents consider income to be sensitive information and are thus reluctant to answer questions regarding their income. If those survey participants who do not provide information on their income are systematically different from the respondents - and there is ample research indicating that they are - results based only on the observed income values will be misleading. On the other hand, respondents tend to round their income. This second source of error in particular is usually ignored when analyzing the income information. In a recent paper, Drechsler and Kiesl (2014) illustrated that inferences based on the collected information can be biased if the rounding is ignored, and suggested a multiple imputation strategy to account for the rounding in reported income. In this paper we extend their approach to also address the nonresponse problem. We illustrate the approach using the household income variable from the German panel study "Labor Market and Social Security".
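Whatever model generates the imputations for nonresponse and rounding, the analysis step pools results across the m completed data sets with Rubin's combining rules. A minimal sketch (the function name and the toy numbers are illustrative, not from the paper):

```python
def pool_rubin(estimates, variances):
    """Pool a scalar estimate across m imputed data sets via Rubin's rules:
    pooled estimate = mean of the m estimates; total variance = within-
    imputation variance + (1 + 1/m) * between-imputation variance."""
    m = len(estimates)
    q_bar = sum(estimates) / m                                # pooled point estimate
    w = sum(variances) / m                                    # within-imputation variance
    b = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)    # between-imputation variance
    t = w + (1 + 1 / m) * b                                   # total variance
    return q_bar, t

# Three imputed data sets yield three estimates of, say, a mean income ratio
est, var = pool_rubin([1.02, 0.98, 1.05], [0.040, 0.050, 0.045])
```

The (1 + 1/m) factor inflates the between-imputation component to account for using only finitely many imputations.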

    Controlling privacy loss in sampling schemes: an analysis of stratified and cluster sampling

    Sampling schemes are fundamental tools in statistics, survey design, and algorithm design. A fundamental result in differential privacy is that a differentially private mechanism run on a simple random sample of a population provides stronger privacy guarantees than the same algorithm run on the entire population. However, in practice, sampling designs are often more complex than the simple, data-independent sampling schemes that are addressed in prior work. In this work, we extend the study of privacy amplification results to more complex, data-dependent sampling schemes. We find that not only do these sampling schemes often fail to amplify privacy, they can actually result in privacy degradation. We analyze the privacy implications of the pervasive cluster sampling and stratified sampling paradigms, as well as provide some insight into the study of more general sampling designs.
    CB20ADR0160001 - U.S. Census Bureau
    https://drops.dagstuhl.de/opus/volltexte/2022/16524/pdf/LIPIcs-FORC-2022-1.pdf
    Published version
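The baseline result the abstract refers to can be stated concretely: an eps-DP mechanism run on a sample drawn with rate q satisfies eps'-DP with eps' = ln(1 + q(e^eps - 1)) <= eps. This is the standard bound for simple random/Poisson subsampling only; it is exactly what the paper shows does not carry over to cluster or stratified designs. A minimal sketch:

```python
import math

def amplified_epsilon(epsilon, q):
    """Standard privacy amplification by subsampling: an epsilon-DP
    mechanism applied to a random sample drawn with rate q in (0, 1]
    satisfies epsilon'-DP with epsilon' = ln(1 + q * (e^epsilon - 1))."""
    return math.log(1 + q * (math.exp(epsilon) - 1))

# Sampling 10% of the population shrinks an epsilon = 1.0 budget substantially
eps_amplified = amplified_epsilon(1.0, 0.1)
```

Note that with q = 1 (no subsampling) the bound reduces to the original epsilon, and smaller sampling rates give strictly smaller effective privacy loss.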

    Differential Privacy for Government Agencies -- Are We There Yet?

    Government agencies typically need to take potential risks of disclosure into account whenever they publish statistics based on their data or give external researchers access to collected data. In this context, the promise of formal privacy guarantees offered by concepts such as differential privacy seems to be the panacea enabling the agencies to exactly quantify and control the privacy loss incurred by any data release. Nevertheless, despite the excitement in academia and industry, most agencies -- with the prominent exception of the U.S. Census Bureau -- have been reluctant to even consider the concept for their data release strategy. This paper discusses potential reasons for this. We argue that the requirements for implementing differential privacy approaches at government agencies are often fundamentally different from the requirements in industry. This raises many challenges and questions that still need to be addressed before the concept can be used as an overarching principle when sharing data with the public. The paper does not offer any solutions to these challenges. Instead, we hope to stimulate some collaborative research efforts, as we believe that many of the problems can only be addressed by interdisciplinary collaborations. Comment: 45 pages, 0 figures
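As a concrete reference point for the kind of formal guarantee under discussion, here is a minimal sketch of the classic Laplace mechanism for a counting query (the function name and the example query are illustrative, not taken from the paper):

```python
import math
import random

def laplace_mechanism(true_value, sensitivity, epsilon, rng):
    """Release true_value under epsilon-differential privacy by adding
    Laplace noise with scale sensitivity/epsilon (the Laplace mechanism)."""
    scale = sensitivity / epsilon
    u = rng.random() - 0.5                    # uniform on [-0.5, 0.5)
    # Inverse-CDF sample from Laplace(0, scale)
    noise = -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))
    return true_value + noise

# A counting query changes by at most 1 when one person is added or removed,
# so its sensitivity is 1; release a count of 100 with budget epsilon = 1
rng = random.Random(0)
noisy_count = laplace_mechanism(100, 1.0, 1.0, rng)
```

The quantifiable trade-off is visible in the scale parameter: halving epsilon (a stronger guarantee) doubles the typical noise magnitude.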

    The R Package hmi: A Convenient Tool for Hierarchical Multiple Imputation and Beyond

    Applications of multiple imputation have long outgrown the traditional context of dealing with item nonresponse in cross-sectional data sets. Nowadays multiple imputation is also applied to impute missing values in hierarchical data sets, address confidentiality concerns, combine data from different sources, or correct measurement errors in surveys. However, software developments have not kept up with these recent extensions. Most imputation software can only deal with item nonresponse in cross-sectional settings, and extensions for hierarchical data - if available at all - are typically limited in scope. Furthermore, to our knowledge no software is currently available for dealing with measurement error using multiple imputation approaches. The R package hmi tries to close some of these gaps. It offers multiple imputation routines in hierarchical settings for many variable types (for example, nominal, ordinal, or continuous variables). It also provides imputation routines for interval data and handles a common measurement error problem in survey data: biased inferences due to implicit rounding of the reported values. The user-friendly setup, which only requires the data and optionally the specification of the analysis model of interest, makes the package especially attractive for users less familiar with the peculiarities of multiple imputation. The compatibility with the popular mice package (Van Buuren and Groothuis-Oudshoorn 2011) ensures that the rich set of analysis and diagnostic tools and post-imputation functions available in mice can be used easily, once the data have been imputed.

    Sediment-filled karst depressions and riyad - key archaeological environments of south Qatar

    Systematic archaeological exploration of southern Qatar started in the 1950s. However, detailed local and regional data on climatic fluctuations and landscape changes during the Holocene, pivotal for understanding and reconstructing human-environment interactions, are still lacking. This contribution provides an overview of the variability of geomorphic environments of southern Qatar with a focus on depression landforms, which reveal a rich archaeological heritage ranging from Palaeolithic(?) and Early Neolithic times to the Modern era. Based on a detailed geomorphic mapping campaign, sediment cores and optically stimulated luminescence data, the dynamics of riyad (singular rawdha; shallow, small-scale, sediment-filled karst depressions clustering in the central southern peninsula) and the larger-scale Asaila depression near the western coast are studied in order to put archaeological discoveries into a wider environmental context. Geomorphic mapping of the Asaila basin shows a much greater geomorphic variability than documented in the literature so far, with relict signs of surface runoff. An 8 m long sediment core taken in the sabkha-type sand flats of the western basin reveals a continuous dominance of aeolian morphodynamics during the early to mid-Holocene. Mounds preserved by evaporite horizons representing capillarites originally grown in the vadose zone are a clear sign of groundwater-level drop after the sea-level highstand ca. 6000-4500 years ago. Deflation followed the lowering of the Stokes surface, leaving mounds where the relict capillarites were able to fixate and preserve the palaeo-surface. Abundant archaeological evidence of Early and Middle Neolithic occupation - the latter with a clear focus inside the central Asaila basin - indicates more favourable living conditions than today.
In contrast, the sediment record of the investigated riyad in the south is very shallow, younger and controlled by surface discharge, deflation and the constantly diminishing barchan dune cover in Qatar over the Middle and Late Holocene. The young age of the infill (ca. 1500 to 2000 years) explains the absence of findings older than the Late Islamic period. Indicators of current net deflation may relate to a decrease in surface runoff and sediment supply only in recent decades to centuries. In the future, geophysical prospection of the riyad may help to locate thicker sedimentary archives, and the analysis of grain size distribution, micromorphology, phytoliths or even pollen spectra may enhance our understanding of the interplay of regional environmental changes and cultural history.

    Measurements of the Total and Differential Higgs Boson Production Cross Sections Combining the H→γγ and H→ZZ*→4ℓ Decay Channels at √s = 8 TeV with the ATLAS Detector

    Measurements of the total and differential cross sections of Higgs boson production are performed using 20.3 fb⁻¹ of pp collisions produced by the Large Hadron Collider at a center-of-mass energy of √s = 8 TeV and recorded by the ATLAS detector. Cross sections are obtained from measured H→γγ and H→ZZ*→4ℓ event yields, which are combined accounting for detector efficiencies, fiducial acceptances, and branching fractions. Differential cross sections are reported as a function of Higgs boson transverse momentum, Higgs boson rapidity, number of jets in the event, and transverse momentum of the leading jet. The total production cross section is determined to be σ(pp→H) = 33.0 ± 5.3 (stat) ± 1.6 (syst) pb. The measurements are compared to state-of-the-art predictions.
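For a single headline number, the two quoted uncertainty components can be combined in quadrature, assuming the statistical and systematic parts are independent (the abstract quotes them separately and does not state this combination itself):

```python
import math

# Quoted measurement: sigma(pp -> H) = 33.0 +/- 5.3 (stat) +/- 1.6 (syst) pb
sigma = 33.0            # measured cross section, pb
stat, syst = 5.3, 1.6   # statistical and systematic uncertainties, pb

# Quadrature sum of independent uncertainty components
total_unc = math.hypot(stat, syst)   # ≈ 5.54 pb
```

The statistical component clearly dominates, consistent with a measurement limited by event counts rather than by detector systematics.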