1,164 research outputs found

    Privacy Preserving Data Publishing

    Recent years have witnessed increasing interest among researchers in protecting individual privacy in the big data era, which involves social media, genomics, and the Internet of Things. Recent studies have revealed numerous privacy threats and privacy protection methodologies that vary across a broad range of applications. To date, however, no powerful methodologies exist for addressing the challenges posed by high-dimensional data, highly correlated data, and powerful attackers. This dissertation investigates two critical problems: the prospects and challenges of elucidating attackers’ capabilities in mining individuals’ private information; and methodologies that can protect against such inference attacks while guaranteeing significant data utility. First, this dissertation proposes a series of works on inference attacks, with an emphasis on protecting against powerful adversaries with auxiliary information. In the context of genomic data, data dimensionality and computational feasibility make data analysis highly challenging. This dissertation proves that the proposed attack can effectively infer the values of unknown SNPs and traits in linear complexity, which dramatically reduces the computation cost compared with traditional methods of exponential cost. Second, providing differential privacy guarantees for high-dimensional, highly correlated data remains a challenging problem because of high sensitivity, output scalability, and signal-to-noise ratio. Given the tens of millions of SNPs in the human genome, it is infeasible for traditional methods to introduce noise to sanitize genomic data. This dissertation proposes a series of works and demonstrates that the proposed method satisfies differential privacy; moreover, data utility is improved compared with the state of the art by largely lowering data sensitivity. 
Third, providing privacy guarantees in social data publishing remains a challenging problem because of the tradeoff between data privacy and utility. This dissertation proposes a series of works and demonstrates that the proposed methods can effectively realize a privacy-utility tradeoff in data publishing. Finally, two future research topics are proposed. The first is Privacy Preserving Data Collection and Processing for the Internet of Things. The second is Privacy Preserving Big Data Aggregation. Both are motivated by newly proposed data mining, artificial intelligence, and cybersecurity methods.

    Sociotechnical Safeguards for Genomic Data Privacy

    Recent developments in a variety of sectors, including health care, research, and the direct-to-consumer industry, have led to a dramatic increase in the amount of genomic data that are collected, used, and shared. This state of affairs raises new and challenging concerns for personal privacy, both legally and technically. This Review appraises existing and emerging threats to genomic data privacy and discusses how well current legal frameworks and technical safeguards mitigate these concerns. It concludes with a discussion of remaining and emerging challenges and illustrates possible solutions that can balance protecting privacy and realizing the benefits that result from the sharing of genetic information.

    Meta-analysis of genome-wide association studies from the CHARGE consortium identifies common variants associated with carotid intima media thickness and plaque

    Carotid intima media thickness (cIMT) and plaque determined by ultrasonography are established measures of subclinical atherosclerosis that each predicts future cardiovascular disease events. We conducted a meta-analysis of genome-wide association data in 31,211 participants of European ancestry from nine large studies in the setting of the Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium. We then sought additional evidence to support our findings among 11,273 individuals using data from seven additional studies. In the combined meta-analysis, we identified three genomic regions associated with common carotid intima media thickness and two different regions associated with the presence of carotid plaque (P < 5 × 10⁻⁸). The associated SNPs mapped in or near genes related to cellular signaling, lipid metabolism and blood pressure homeostasis, and two of the regions were associated with coronary artery disease (P < 0.006) in the Coronary Artery Disease Genome-Wide Replication and Meta-Analysis (CARDIoGRAM) consortium. Our findings may provide new insight into pathways leading to subclinical atherosclerosis and subsequent cardiovascular events.

    Fine mapping of susceptibility loci to malaria clinical episodes in a family-based cohort from Senegal

    The malaria parasite, P. falciparum, kills on the order of a million African children each year, and this is a small fraction of the number of infected individuals worldwide. The clinical outcome of an infection by this parasite depends to some extent on the genetic makeup of the infected individual. The role of genetic factors that regulate the severity of malaria infection has been repeatedly demonstrated in humans and animals. Association studies are conducted with the aim of identifying the causal genes implicated in the outcome of infection. Linkage was previously detected on human chromosome 5p15 controlling the number of Plasmodium falciparum attacks (PFA) in Dielmo, a Senegalese village [48]. Subsequently, and prior to the present study, a fine mapping study using a "GoldenGate" assay from Illumina, with about 1,450 SNPs, was performed in this region of linkage with the PFA phenotype. Analysis was performed with three family-based statistical programs: Merlin, QTDT, and FBAT/PBAT. These programs identified three candidate genes associated with the PFA phenotype: three SNPs (rs4867417, rs7714218, and rs11959398) located in PDZD2, one SNP (rs11134099) in ADAMTS16, and one (rs3777320) in SEMA5A. The aim of the present study was to investigate these associations. 
Novel SNPs in the candidate regions of these genes were selected either by sequencing exons located in these candidate regions or by bioinformatics analysis using HapMap data from the Yoruba population. SNPs were studied using either Pre-design or Custom SNP genotyping assays (Applied Biosystems). Data were entered into an Access database and checked for Mendelian transmission errors. Statistical analyses were performed using two family-based association programs, PBAT and QTDT. We used different models of allele transmission and defined p = 10⁻³ as the significance threshold. The analyses did not confirm the association with SNPs of PDZD2 or ADAMTS16, but did find a significant association with SNPs of SEMA5A. One SNP (rs3777325) was significantly associated with the PFA phenotype using both programs (p = 6.49 × 10⁻⁴ using the PBAT program and p = 2.0 × 10⁻⁴ using the QTDT program). A haplotype analysis of two adjacent SNPs (rs4541632 and rs1018956) also showed a significant association of the haplotype GC (p = 6.82 × 10⁻⁵) using the PBAT program. This work confirms that a susceptibility locus for the PFA phenotype is located inside SEMA5A. Further studies will be necessary to replicate this association and identify the causal polymorphism.

    Privacy in the Genomic Era

    Genome sequencing technology has advanced at a rapid pace and it is now possible to generate highly detailed genotypes inexpensively. The collection and analysis of such data has the potential to support various applications, including personalized medical services. While the benefits of the genomics revolution are trumpeted by the biomedical community, the increased availability of such data has major implications for personal privacy, notably because the genome has certain essential features, which include (but are not limited to) (i) an association with traits and certain diseases, (ii) identification capability (e.g., forensics), and (iii) revelation of family relationships. Moreover, direct-to-consumer DNA testing increases the likelihood that genome data will be made available in less regulated environments, such as the Internet and for-profit companies. The problem of genome data privacy thus resides at the crossroads of computer science, medicine, and public policy. While computer scientists have addressed data privacy for various data types, less attention has been dedicated to genomic data. Thus, the goal of this paper is to provide a systematization of knowledge for the computer science community. In doing so, we address some of the (sometimes erroneous) beliefs of this field and we report on a survey we conducted about genome data privacy with biomedical specialists. Then, after characterizing the genome privacy problem, we review the state of the art regarding privacy attacks on genomic data and strategies for mitigating such attacks, and we contextualize these attacks from the perspective of medicine and public policy. This paper concludes with an enumeration of the challenges for genome data privacy and presents a framework to systematize the analysis of threats and the design of countermeasures as the field moves forward.

    Selection Dynamics in Heliconius Hybrid Zones and the Origin of Adaptive Variation

    There is repeated evidence that hybridization is a major contributor to the production of adaptive diversity; however, the evolutionary fate of hybrids in natural populations remains poorly understood. In Heliconius butterflies, hybridization is common and responsible for generating a variety of warning color patterns across the genus. Predator avoidance of warning colorations appears to largely be learned, which drives strong positive frequency-dependent selection. This creates a paradox for hybrid lineages: how do novel hybrid forms manage to establish and persist under such strong selection? In this dissertation, I present a series of studies centered on the selection dynamics of Heliconius hybrid zones, to elucidate how novel adaptive traits establish in nature. Clines across hybrid zones have often been analyzed to estimate selection on ecologically important loci. Here, warning color clines were characterized and compared across multiple transects along a Heliconius hybrid zone in the Guiana Shield. Furthermore, a mark-resight experiment and communal roost observations were completed near the center of this hybrid zone to determine the survival and likelihood of establishment of native and foreign forms. These studies reveal similar survivorship of hybrid and pure color patterns, and specifically demonstrate that a rare putative hybrid form can survive and establish within a hybrid zone. Both hybrids and pure color patterns showed comparable life expectancies in the mark-resight experiment and similar patterns of presence at nocturnal roosts. These results suggest that selection on warning color pattern is relatively weak within the hybrid zone. Analyses of color pattern clines uncovered strong selection bounding the hybrid zone in bi-race areas, while weaker selection was estimated for a tri-race area. In fact, the tri-race area was three times wider than the bi-race areas. 
Collectively, these studies suggest that the selection dynamics across hybrid zones may play an integral role in the establishment of new adaptive traits, and offer a route by which a reputed hybrid race may have arisen. The investigations within this dissertation also provide a new view of hybrid zone dynamics, and improve our understanding of how hybridization and selection shape the evolution of biodiversity.

    Survival and divergence in a small group: The extraordinary genomic history of the endangered Apennine brown bear stragglers

    About 100 km east of Rome, in the central Apennine Mountains, a critically endangered population of ∼50 brown bears lives in complete isolation. Mating outside this population is prevented by several hundred kilometers of bear-free territory. We exploited this natural experiment to better understand the gene and genomic consequences of surviving at extremely small population size. We found that brown bear populations in Europe have lost connectivity since Neolithic times, when farming communities expanded and forest burning was used for land clearance. In central Italy, this resulted in a 40-fold population decline. The overall genomic impact of this decline included the complete loss of variation in the mitochondrial genome and along long stretches of the nuclear genome. Several private and deleterious amino acid changes were fixed by random drift; predicted effects include energy deficit, muscle weakness, anomalies in cranial and skeletal development, and reduced aggressiveness. Despite this extreme loss of diversity, Apennine bear genomes show nonrandom peaks of high variation, possibly maintained by balancing selection, at genomic regions significantly enriched for genes associated with immune and olfactory systems. Challenging the paradigm of increased extinction risk in small populations, we suggest that random fixation of deleterious alleles (i) can be an important driver of divergence in isolation, (ii) can be tolerated when balancing selection prevents random loss of variation at important genes, and (iii) is followed by or results directly in favorable behavioral changes.

    PrivGenDB: Efficient and privacy-preserving query executions over encrypted SNP-Phenotype database

    Privacy and security concerns, raised by the sensitivity of genomic data, limit the execution of queries over genomic datasets, notably those containing single nucleotide polymorphisms (SNPs). Therefore, it is important to ensure that executing queries on these datasets does not reveal sensitive information, such as the identity of the individuals and their genetic traits, to a data server. In this paper, we propose a novel model, which we call PrivGenDB, to ensure the confidentiality of SNP-phenotype data while executing queries. Confidentiality in PrivGenDB is enabled by its system architecture and the search functionality provided by searchable symmetric encryption (SSE). To the best of our knowledge, the PrivGenDB construction is the first SSE-based approach ensuring the confidentiality of SNP-phenotype data, as current SSE-based approaches for genomic data are limited to substring search and range queries on a sequence of genomic data. In addition, a new data encoding mechanism is proposed and incorporated in the PrivGenDB model. This enables PrivGenDB to handle datasets containing both genotype and phenotype and also to support storing and managing other metadata, such as gender and ethnicity, privately. Furthermore, different queries used for genomic data analysis, namely Count, Boolean, Negation and k′-out-of-k match queries, are supported and executed by PrivGenDB. The execution of these queries on genomic data in PrivGenDB is efficient and scalable for biomedical research and services, as demonstrated by the analytical and empirical analysis presented in this paper. Specifically, our empirical studies on a dataset with 5000 entries (records) containing 1000 SNPs demonstrate that a count/Boolean query and a k′-out-of-k match query over 40 SNPs take approximately 4.3s and 86.4μs, respectively, outperforming the existing schemes.

    Statistical Methods and Privacy Preserving Protocols for Combining Genetic Data with Electronic Health Records

    In recent years, electronic health records (EHR) have been combined with genetic data to uncover disease biology and accelerate generation of hypotheses for drug development and treatment strategies. The goal of this dissertation is to develop novel statistical models that can address the challenges of analyzing ‘imperfect’ EHR data and to propose privacy-preserving methods that enable sensitive individual-level data sharing across EHR studies and other large genetic studies. In Chapter II, we propose a statistical method to address misclassified clinical outcomes, a common challenge in EHR data. One essential step of EHR-based genome-wide association studies is constructing a cohort of cases and controls for a specific disease from billing codes and other clinical or administrative data. Nearly always, a perfect strategy for deriving disease phenotypes from billing codes is not available, resulting in some incorrect case/control labels. Here, we propose a method to estimate the misclassification of case/control status by examining genotype information of dozens of disease-associated loci. Through simulation and application to the Michigan Genomics Initiative data, we demonstrate that the method enables the evaluation of new EHR-based phenotype definition schemes and provides accurate estimates of disease association measures when phenotypes are misclassified. In Chapters III and IV, we focus on identifying overlapping samples between studies, a common challenge when aggregating information across datasets. We particularly focus on identifying duplicate or related samples when sharing the underlying individual-level genetic data is restricted. We propose methods that do not require disclosure of individual identities but that can still identify genetic relatives across datasets. 
In Chapter III, we show that by grouping genotypes into segments and calculating summary statistics within each segment, we are able to obscure and encode individual-level genetic information. Relatives can be inferred from the coded genotypes using a likelihood model. Simulation and application to the Trans-Omics for Precision Medicine (TOPMed) program data demonstrate the utility and security of the method. In Chapter IV, we extend the method further, with a strategy that guarantees stronger encryption and is expected to work across heterogeneous populations. This secure protocol can infer genetic relatives among people of diverse ethnic backgrounds. The method works by combining a cryptographic technique, homomorphic encryption, with the robust relationship inference method previously described by Manichaikul et al. (2010). Through simulations, we show that our method's performance is identical to that of implementations that use the original unencrypted genotypes. Our protocol scales well in computing time and is protected from several possible attacks. The secure protocol was again applied to the TOPMed dataset. Securely identifying related samples will facilitate combination of results across datasets when there are restrictions on sharing the underlying individual-level data. In conclusion, the methods developed here will enhance the use of EHR data and genome data to improve the accuracy of case/control status as well as decrease inclusion of relatives across studies when desired. PhD, Biostatistics, University of Michigan, Horace H. Rackham School of Graduate Studies. https://deepblue.lib.umich.edu/bitstream/2027.42/153415/1/xtzhao_1.pd

    I-GWAS: Privacy-Preserving Interdependent Genome-Wide Association Studies

    Genome-wide Association Studies (GWASes) identify genomic variations that are statistically associated with a trait, such as a disease, in a group of individuals. Unfortunately, careless sharing of GWAS statistics might give rise to privacy attacks. Several works have attempted to reconcile secure processing with privacy-preserving releases of GWASes. However, we highlight that these approaches remain vulnerable if GWASes utilize overlapping sets of individuals and genomic variations. In such conditions, we show that even when relying on state-of-the-art techniques for protecting releases, an adversary could reconstruct the genomic variations of up to 28.6% of participants, and that the released statistics of up to 92.3% of the genomic variations would enable membership inference attacks. We introduce I-GWAS, a novel framework that securely computes and releases the results of multiple possibly interdependent GWASes. I-GWAS continuously releases privacy-preserving and noise-free GWAS results as new genomes become available.