6,700 research outputs found

    Anonymizing datasets with demographics and diagnosis codes in the presence of utility constraints

    Get PDF
    Publishing data about patients that contain both demographics and diagnosis codes is essential to perform large-scale, low-cost medical studies. However, preserving the privacy and utility of such data is challenging, because it requires: (i) guarding against identity disclosure (re-identification) attacks based on both demographics and diagnosis codes, (ii) ensuring that the anonymized data remain useful in intended analysis tasks, and (iii) minimizing the information loss, incurred by anonymization, to preserve the utility of general analysis tasks that are difficult to determine before data publishing. Existing anonymization approaches are not suitable for being used in this setting, because they cannot satisfy all three requirements. Therefore, in this work, we propose a new approach to deal with this problem. We enforce the requirement (i) by applying (k; k^m)-anonymity, a privacy principle that prevents re-identification from attackers who know the demographics of a patient and up to m of their diagnosis codes, where k and m are tunable parameters. To capture the requirement (ii), we propose the concept of utility constraint for both demographics and diagnosis codes. Utility constraints limit the amount of generalization and are specified by data owners (e.g., the healthcare institution that performs anonymization). We also capture requirement (iii), by employing well-established information loss measures for demographics and for diagnosiscodes. To realize our approach, we develop an algorithm that enforces (k; k^m)-anonymity on a dataset containing both demographics and diagnosis codes, in a way that satisfies the specified utility constraints and with minimal information loss, according to the measures. Our experiments with a large dataset containing more than 200; 000 electronic health recordsshow the effectiveness and efficiency of our algorithm

    Race and “Hotspots” of Preventable Hospitalizations

    Full text link
    Abstract Preventable hospitalizations (PHs) are those for ambulatory care-sensitive conditions that indicate insufficiencies in local primary healthcare. PH rates tend to be higher among African Americans, in urban centers, rural areas and areas with more African American residents. The objective of this study is to determine geographic clusters of high PH rates (“spatial clusters”) by race. Data from Maryland hospitals were utilized to determine the rates of PHs in zip code tabulation areas (ZCTAs) by race in 2010. Geographic clusters of ZCTAs with higher than expected PH rates were identified using Scan Statistic and Anselin’s Local Moran’s I. 10 PH spatial clusters were observed among the total population with an average PH rate of 3,046.6 per 100,000 population. Among whites, the average PH rate was 3,339.9 per 100,000 in 11 PH spatial clusters. Only five PH spatial clusters were observed among African Americans with a higher average PH rate (3,710.8 per 100,000). The locations and other characteristics of PH spatial clusters differed by race. These results can be used to target resources to areas with high PH rates. Because PH spatial clusters are observed in differing locations for African Americans, approaches that include cultural tailoring may need to be specifically targeted

    Algorithms to anonymize structured medical and healthcare data:A systematic review

    Get PDF
    Introduction: With many anonymization algorithms developed for structured medical health data (SMHD) in the last decade, our systematic review provides a comprehensive bird’s eye view of algorithms for SMHD anonymization. Methods: This systematic review was conducted according to the recommendations in the Cochrane Handbook for Reviews of Interventions and reported according to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA). Eligible articles from the PubMed, ACM digital library, Medline, IEEE, Embase, Web of Science Collection, Scopus, ProQuest Dissertation, and Theses Global databases were identified through systematic searches. The following parameters were extracted from the eligible studies: author, year of publication, sample size, and relevant algorithms and/or software applied to anonymize SMHD, along with the summary of outcomes. Results: Among 1,804 initial hits, the present study considered 63 records including research articles, reviews, and books. Seventy five evaluated the anonymization of demographic data, 18 assessed diagnosis codes, and 3 assessed genomic data. One of the most common approaches was k-anonymity, which was utilized mainly for demographic data, often in combination with another algorithm; e.g., l-diversity. No approaches have yet been developed for protection against membership disclosure attacks on diagnosis codes. Conclusion: This study reviewed and categorized different anonymization approaches for MHD according to the anonymized data types (demographics, diagnosis codes, and genomic data). Further research is needed to develop more efficient algorithms for the anonymization of diagnosis codes and genomic data. The risk of reidentification can be minimized with adequate application of the addressed anonymization approaches. Systematic Review Registration: [http://www.crd.york.ac.uk/prospero], identifier [CRD42021228200].</p

    Temporal Subtyping of Alzheimer's Disease Using Medical Conditions Preceding Alzheimer's Disease Onset in Electronic Health Records

    Full text link
    Subtyping of Alzheimer's disease (AD) can facilitate diagnosis, treatment, prognosis and disease management. It can also support the testing of new prevention and treatment strategies through clinical trials. In this study, we employed spectral clustering to cluster 29,922 AD patients in the OneFlorida Data Trust using their longitudinal EHR data of diagnosis and conditions into four subtypes. These subtypes exhibit different patterns of progression of other conditions prior to the first AD diagnosis. In addition, according to the results of various statistical tests, these subtypes are also significantly different with respect to demographics, mortality, and prescription medications after the AD diagnosis. This study could potentially facilitate early detection and personalized treatment of AD as well as data-driven generalizability assessment of clinical trials for AD.Comment: 10 page

    Discovering Patient Phenotypes Using Generalized Low Rank Models

    Get PDF
    The practice of medicine is predicated on discovering commonalities or distinguishing characteristics among patients to inform corresponding treatment. Given a patient grouping (hereafter referred to as a p henotype ), clinicians can implement a treatment pathway accounting for the underlying cause of disease in that phenotype. Traditionally, phenotypes have been discovered by intuition, experience in practice, and advancements in basic science, but these approaches are often heuristic, labor intensive, and can take decades to produce actionable knowledge. Although our understanding of disease has progressed substantially in the past century, there are still important domains in which our phenotypes are murky, such as in behavioral health or in hospital settings. To accelerate phenotype discovery, researchers have used machine learning to find patterns in electronic health records, but have often been thwarted by missing data, sparsity, and data heterogeneity. In this study, we use a flexible framework called Generalized Low Rank Modeling (GLRM) to overcome these barriers and discover phenotypes in two sources of patient data. First, we analyze data from the 2010 Healthcare Cost and Utilization Project National Inpatient Sample (NIS), which contains upwards of 8 million hospitalization records consisting of administrative codes and demographic information. Second, we analyze a small (N=1746), local dataset documenting the clinical progression of autism spectrum disorder patients using granular features from the electronic health record, including text from physician notes. We demonstrate that low rank modeling successfully captures known and putative phenotypes in these vastly different datasets
    • …
    corecore