Synthesizing Accurate Relational Data under Differential Privacy
Abstract
Medical data is sensitive personal data whose use is regulated under GDPR and HIPAA. Anonymizing this data before it is used for research would lower its sensitivity and thereby allow broader access. Privacy-aware data synthesis has been proposed as a solution. However, current algorithms struggle to synthesize medical data while maintaining both privacy and utility. This is due to the structure of medical data, which consists of multiple interlinked tables with high-dimensional columns containing sequential aspects of the patient trajectory. The resulting number of correlations is intractable to model naively, and if relational correlations are not accounted for, the resulting data has poor utility (e.g., it leads to invalid patient trajectories). In this paper, we present MARE, a relational synthesis algorithm that focuses on a set of core correlations found in relational data while pruning others. The resulting lower computational complexity allows MARE to produce accurate relational data. We show that MARE can synthesize multiple medical datasets containing sequential aspects while preserving utility, in the form of inter-table and inter-row correlations, and privacy guarantees.
- contributionToPeriodical
- info:eu-repo/semantics/publishedVersion
- differential privacy
- data privacy
- electronic health record
- synthetic data generation
- SDG 3 - Good Health and Well-being
- SDG 16 - Peace, Justice and Strong Institutions