5,474 research outputs found

    A New Method for Protecting Interrelated Time Series with Bayesian Prior Distributions and Synthetic Data

    Organizations disseminate statistical summaries of administrative data via the Web for unrestricted public use. They balance the trade-off between confidentiality protection and inference quality. Recent developments in disclosure avoidance techniques include the incorporation of synthetic data, which capture the essential features of underlying data by releasing altered data generated from a posterior predictive distribution. The United States Census Bureau collects millions of interrelated time series micro-data that are hierarchical and contain many zeros and suppressions. Rule-based disclosure avoidance techniques often require the suppression of count data for small magnitudes and the modification of data based on a small number of entities. Motivated by this problem, we use zero-inflated extensions of Bayesian Generalized Linear Mixed Models (BGLMM) with privacy-preserving prior distributions to develop methods for protecting and releasing synthetic data from time series about thousands of small groups of entities without suppression based on magnitudes or the number of entities. We find that as the prior distributions of the variance components in the BGLMM become more precise toward zero, confidentiality protection increases and inference quality deteriorates. We evaluate our methodology using a strict privacy measure, empirical differential privacy, and a newly defined risk measure, Probability of Range Identification (PoRI), which directly measures attribute disclosure risk. We illustrate our results with the U.S. Census Bureau’s Quarterly Workforce Indicators.
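The role of the variance component in the abstract above can be illustrated with a toy simulation. The following is a minimal sketch, not the authors' model: it draws counts from a zero-inflated Poisson distribution with a group-level random effect, and all names and parameters (`synthetic_counts`, `mu`, `sigma`, `zero_prob`) are assumptions introduced for illustration. It samples from fixed parameters instead of a fitted posterior predictive distribution.

```python
import math
import random


def sample_poisson(rate, rng):
    """Draw one Poisson count with Knuth's multiplication method (fine for small rates)."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def synthetic_counts(n_groups, mu, sigma, zero_prob, rng):
    """Hypothetical helper: one zero-inflated Poisson count per group.

    mu        -- shared log-rate (assumed fixed here, not estimated)
    sigma     -- standard deviation of the group-level random effect
    zero_prob -- probability of a structural zero
    """
    counts = []
    for _ in range(n_groups):
        effect = rng.gauss(0.0, sigma)  # group-level deviation from the shared rate
        if rng.random() < zero_prob:
            counts.append(0)  # structural zero
        else:
            counts.append(sample_poisson(math.exp(mu + effect), rng))
    return counts


rng = random.Random(0)
tight = synthetic_counts(1000, mu=1.0, sigma=0.0, zero_prob=0.3, rng=rng)
loose = synthetic_counts(1000, mu=1.0, sigma=1.0, zero_prob=0.3, rng=rng)
print(max(tight), max(loose))
```

With `sigma` at zero every group shares the same rate, so the synthetic counts say nothing group-specific; a larger `sigma` lets groups differ, which in this toy setting mirrors the protection versus inference-quality trade-off the abstract reports.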

    A Multifaceted Benchmarking of Synthetic Electronic Health Record Generation Models

    Synthetic health data have the potential to mitigate privacy concerns when sharing data to support biomedical research and the development of innovative healthcare applications. Modern approaches for data generation based on machine learning, generative adversarial network (GAN) methods in particular, continue to evolve and demonstrate remarkable potential. Yet there is a lack of a systematic assessment framework to benchmark methods as they emerge and determine which methods are most appropriate for which use cases. In this work, we introduce a generalizable benchmarking framework to appraise key characteristics of synthetic health data with respect to utility and privacy metrics. We apply the framework to evaluate synthetic data generation methods for electronic health records (EHRs) data from two large academic medical centers with respect to several use cases. The results illustrate that there is a utility-privacy tradeoff for sharing synthetic EHR data. The results further indicate that no method is unequivocally the best on all criteria in each use case, which makes it evident why synthetic data generation methods need to be assessed in context.

    Generating Synthetic Longitudinal Patient Data with the PrivBayes Method

    In this thesis, the PrivBayes method is used to generate synthetic longitudinal patient data and the quality of the generated data is evaluated. In addition, this thesis briefly discusses the current situation of processing health data in Finland and proposes a simplistic definition of synthetic tabular data as well as presents different methods to evaluate the utility of generated synthetic data. The PrivBayes method is based on approximating the association structure of a data set using a Bayesian network and generating synthetic data from the conditional distributions corresponding to the structure of the network. The method ensures the privacy of the data by applying differential privacy through the addition of noise in the data generation process in a specific way. The method is applied to data collected from the database of Auria Clinical Informatics under permission number T152/2017. The data set consists of 2890 individual patients diagnosed with either type 1 or type 2 diabetes and seven different characteristics collected for each patient: age, body mass index, complications related to diabetes, gender, type of diabetes and two measurements for glycated hemoglobin that represent the repeated measurements in the data. The PrivBayes method is evaluated by generating 27 different synthetic data sets, describing the structures of the Bayesian network of each data set and visually inspecting differences between the original data and each synthetic data set. Differences between data sets are considered in terms of similarity of univariate distributions, differences in Pearson’s sample correlation coefficients and sample Cramer’s V coefficients and the results of a linear mixed-effects model. In conclusion, the PrivBayes method failed to produce synthetic longitudinal patient data of sufficient quality to be applicable as such in practice. 
However, this thesis revealed some shortcomings of the method and potential targets for further research and development.
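One of the utility checks named in the thesis abstract, the sample Cramér's V coefficient, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the thesis's code: the function name `cramers_v` and the toy columns are hypothetical, and the coefficient is computed without bias correction.

```python
import math
from collections import Counter


def cramers_v(xs, ys):
    """Sample Cramér's V for two equally long sequences of categorical values."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("xs and ys must be non-empty and of equal length")
    n = len(xs)
    joint = Counter(zip(xs, ys))
    row_totals = Counter(xs)
    col_totals = Counter(ys)
    min_dim = min(len(row_totals), len(col_totals)) - 1
    if min_dim == 0:
        return 0.0  # a constant column carries no association
    chi2 = 0.0
    for r, r_total in row_totals.items():
        for c, c_total in col_totals.items():
            expected = r_total * c_total / n
            observed = joint.get((r, c), 0)
            chi2 += (observed - expected) ** 2 / expected
    return math.sqrt(chi2 / (n * min_dim))


# Toy columns standing in for original and synthetic data (made-up values).
orig_type = ["T1", "T1", "T2", "T2"]
orig_comp = ["yes", "yes", "no", "no"]
syn_type = ["T1", "T1", "T2", "T2"]
syn_comp = ["yes", "no", "yes", "no"]

gap = abs(cramers_v(orig_type, orig_comp) - cramers_v(syn_type, syn_comp))
print(gap)
```

A large gap between the coefficient on the original pair of columns and on the synthetic pair suggests that the synthetic data lost (or invented) an association between the two variables.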

    When Machine Learning Models Leak: An Exploration of Synthetic Training Data

    We investigate an attack on a machine learning model that predicts whether a person or household will relocate in the next two years, i.e., a propensity-to-move classifier. The attack assumes that the attacker can query the model to obtain predictions and that the marginal distribution of the data on which the model was trained is publicly available. The attack also assumes that the attacker has obtained the values of non-sensitive attributes for a certain number of target individuals. The objective of the attack is to infer the values of sensitive attributes for these target individuals. We explore how replacing the original data with synthetic data when training the model impacts how successfully the attacker can infer sensitive attributes. (Original paper published at PSD 2022. The paper was subsequently updated.)
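The threat model in the abstract above can be made concrete with a small sketch of one common attribute-inference formulation. It is not necessarily the attack used in the paper: the function `infer_sensitive`, the toy classifier, and in particular the assumption that the attacker also knows the target's observed outcome label are all introduced here for illustration.

```python
from typing import Callable, Dict, Hashable


def infer_sensitive(
    query_model: Callable[[dict], float],
    known_attrs: dict,
    sensitive_name: str,
    marginal: Dict[Hashable, float],
    observed_label: int,
) -> Hashable:
    """Guess a sensitive attribute by weighting model responses with the public marginal.

    query_model    -- black-box access; returns P(label = 1) for a full record
    known_attrs    -- the target's non-sensitive attribute values
    marginal       -- public marginal distribution of the sensitive attribute
    observed_label -- ASSUMPTION: the attacker knows the target's true outcome (0 or 1)
    """
    best_value, best_score = None, -1.0
    for value, prior in marginal.items():
        candidate = dict(known_attrs)
        candidate[sensitive_name] = value  # try each possible sensitive value
        p_positive = query_model(candidate)
        likelihood = p_positive if observed_label == 1 else 1.0 - p_positive
        score = prior * likelihood
        if score > best_score:
            best_value, best_score = value, score
    return best_value


def toy_model(record: dict) -> float:
    """Hypothetical propensity-to-move classifier, for demonstration only."""
    return 0.9 if record["income"] == "low" else 0.2


guess = infer_sensitive(
    toy_model, {"age": 30}, "income", {"low": 0.3, "high": 0.7}, observed_label=1
)
print(guess)
```

The more strongly the model's prediction depends on the sensitive attribute, the more the likelihood term dominates the public marginal, which is why the data used to train the model matters for how much it leaks.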

    The economic effects of special purpose entities on corporate tax avoidance

    This study provides the first large‐sample evidence on the economic tax effects of special purpose entities (SPEs). These increasingly common organizational structures facilitate corporate tax savings by enabling sponsor‐firms to increase tax‐advantaged activities and/or enhance their tax efficiency (i.e., relative tax savings of a given activity). Using path analysis, we find that SPEs facilitate greater tax avoidance, such that an economically large amount of cash tax savings from research and development (R&D), depreciable assets, net operating loss carryforwards, intangible assets, foreign operations, and tax havens occur in conjunction with SPE use. We estimate that SPEs help generate over $330 billion of incremental cash tax savings, or roughly 6% of total U.S. federal corporate income tax collections during the sample period. Interaction analyses reveal that SPEs enhance the tax efficiency of intangibles and R&D by 61.5% to 87.5%. Overall, these findings provide economic insight into complex organizational structures supporting corporate tax avoidance. (Accepted manuscript.)

    Synthetic Observational Health Data with GANs: from slow adoption to a boom in medical research and ultimately digital twins?

    After being collected for patient care, Observational Health Data (OHD) can further benefit patient well-being by sustaining the development of health informatics and medical research. Vast potential is unexploited because of the fiercely private nature of patient-related data and regulations to protect it. Generative Adversarial Networks (GANs) have recently emerged as a groundbreaking way to learn generative models that produce realistic synthetic data. They have revolutionized practices in multiple domains such as self-driving cars, fraud detection, digital twin simulations in industrial sectors, and medical imaging. The digital twin concept could readily apply to modelling and quantifying disease progression. In addition, GANs possess many capabilities relevant to common problems in healthcare: lack of data, class imbalance, rare diseases, and preserving privacy. Unlocking open access to privacy-preserving OHD could be transformative for scientific research. In the midst of COVID-19, the healthcare system is facing unprecedented challenges, many of which are data related for the reasons stated above. Considering these facts, publications concerning GANs applied to OHD seemed to be severely lacking. To uncover the reasons for this slow adoption, we broadly reviewed the published literature on the subject. Our findings show that the properties of OHD were initially challenging for the existing GAN algorithms (unlike medical imaging, for which state-of-the-art models were directly transferable) and the evaluation of synthetic data lacked clear metrics. We find more publications on the subject than expected, starting slowly in 2017, and since then at an increasing rate. The difficulties of OHD remain, and we discuss issues relating to evaluation, consistency, benchmarking, data modelling, and reproducibility.
    Comment: 31 pages (10 in previous version), not including references and glossary, 51 in total. Inclusion of a large number of recent publications and expansion of the discussion accordingly.