40 research outputs found

    Doctor of Philosophy

    Genotype Phenotype Association (GPA) is a means to identify candidate genes and genetic variants that may contribute to phenotypic variation. Technological advances in DNA sequencing continue to improve the efficiency and accuracy of GPA. Currently, High Throughput Sequencing (HTS) is the preferred method for GPA as it is fast and economical. HTS allows for the population-level characterization of genetic variation that GPA studies require. Despite the potential power of using HTS in GPA studies, there are technical hurdles that must be overcome. For instance, the high error rate in HTS data and the sheer size of population-level data can hinder GPA studies. To overcome these challenges, I have written two software toolkits for HTS GPA. The first toolkit, GPAT++, is designed to detect GPA using small genetic variants. Unlike previous software, GPAT++'s association test models the inherent errors in HTS, preventing many spurious associations. The second toolkit, Whole Genome Alignment Metrics (WHAM), was designed for GPA using large genetic variants (structural variants). By integrating both structural variant identification and association testing, WHAM can identify shared structural variants associated with a phenotype. Both GPAT++ and WHAM have been successfully applied to real-world GPA studies

    Statistical Population Genomics

    This open access volume presents state-of-the-art inference methods in population genomics, focusing on data analysis based on rigorous statistical techniques. After introducing general concepts related to the biology of genomes and their evolution, the book covers state-of-the-art methods for the analysis of genomes in populations, including demography inference, population structure analysis and detection of selection, using both model-based inference and simulation procedures. Last but not least, it offers an overview of the current knowledge acquired by applying such methods to a large variety of eukaryotic organisms. Written in the highly successful Methods in Molecular Biology series format, chapters include introductions to their respective topics, pointers to the relevant literature, step-by-step, readily reproducible laboratory protocols, and tips on troubleshooting and avoiding known pitfalls. Authoritative and cutting-edge, Statistical Population Genomics aims to promote and ensure successful applications of population genomic methods to an increasing number of model systems and biological questions

    Analysis of association studies and inference of haplotypic phase using hidden Markov models

    In this thesis I focus on the development and application of hidden Markov models (HMMs) to solve problems in statistical genetics. Our method, based on an HMM, models the joint haplotype structure in the samples, where the parameters in the model are estimated by the Baum-Welch EM algorithm. The model does not require pre-defined blocks or a sliding window scheme to define haplotype boundaries. Thus our method is computationally efficient and applicable to either whole genome sequences or candidate gene sequences. The first application of this model is disease association testing using inferred ancestral haplotypes. We employed an HMM to cluster haplotypes into groups of predicted common ancestral haplotypes from diploid genotypes. The results from simulation studies show that our method greatly outperforms single-SNP analyses and has greater power than a haplotype-based method, CLADHC, in most simulation scenarios. The second application is inferring haplotypic phase and predicting missing genotypes in polyploid organisms. Using a simulation study we demonstrate that the method provides accurate estimates of haplotypic phase and missing genotypes for diploids, triploids and tetraploids. The third application is joint CNV/SNP haplotype and missing data inference. The results are very encouraging for this application. With the increasing availability of genotype data in both diploid and polyploid organisms, we believe that our programs can facilitate the investigation of genetic variation in genome-wide studies
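For readers unfamiliar with HMMs, the following is a minimal sketch of the forward algorithm, the likelihood computation that Baum-Welch builds on. It is not the thesis's model: the two hidden states (standing in for ancestral haplotype clusters), the binary allele observations, and every probability below are hypothetical values chosen only for illustration.

```python
def forward_likelihood(init, trans, emit, observations):
    """Return P(observations) under a discrete HMM using the forward algorithm.

    init[k]     : probability of starting in hidden state k
    trans[j][k] : probability of moving from state j to state k
    emit[k][o]  : probability that state k emits observation o
    """
    n_states = len(init)
    # alpha[k] = P(observations so far, current hidden state = k)
    alpha = [init[k] * emit[k][observations[0]] for k in range(n_states)]
    for obs in observations[1:]:
        alpha = [
            emit[k][obs] * sum(alpha[j] * trans[j][k] for j in range(n_states))
            for k in range(n_states)
        ]
    return sum(alpha)


# Hypothetical toy parameters: two ancestral clusters, alleles coded 0/1.
init = [0.5, 0.5]
trans = [[0.9, 0.1], [0.1, 0.9]]      # clusters rarely switch between markers
emit = [[0.95, 0.05], [0.05, 0.95]]   # each cluster favours one allele

likelihood = forward_likelihood(init, trans, emit, [0, 0, 1, 1])
print(likelihood)
```

Baum-Welch would combine these forward quantities with a matching backward pass to re-estimate `init`, `trans` and `emit`; that step is omitted here.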

    Privacy-Enhancing Technologies for Medical and Genomic Data: From Theory to Practice

    The impressive technological advances in genomic analysis and the significant drop in the cost of genome sequencing are paving the way to a variety of revolutionary applications in modern healthcare. In particular, the increasing understanding of the human genome, and of its relation to diseases, health and to responses to treatments brings promise of improvements in better preventive and personalized medicine. Unfortunately, the impact on privacy and security is unprecedented. The genome is our ultimate identifier and, if leaked, it can unveil sensitive and personal information such as our genetic diseases, our propensity to develop certain conditions (e.g., cancer or Alzheimer's) or the health issues of our family. Even though legislation, such as the EU General Data Protection Regulation (GDPR) or the US Health Insurance Portability and Accountability Act (HIPAA), aims at mitigating abuses based on genomic and medical data, it is clear that this information also needs to be protected by technical means. In this thesis, we investigate the problem of developing new and practical privacy-enhancing technologies (PETs) for the protection of medical and genomic data. Our goal is to accelerate the adoption of PETs in the medical field in order to address the privacy and security concerns that prevent personalized medicine from reaching its full potential. We focus on two main areas of personalized medicine: clinical care and medical research. For clinical care, we first propose a system for securely storing and selectively retrieving raw genomic data that is indispensable for in-depth diagnoses and treatments of complex genetic diseases such as cancer. Then, we focus on genetic variants and devise a new model based on additively-homomorphic encryption for privacy-preserving genetic testing in clinics. Our model, implemented in the context of HIV treatment, is the first to be tested and evaluated by practitioners in a real operational setting. 
For medical research, we first propose a method that combines somewhat-homomorphic encryption with differential privacy to enable secure feasibility studies on genetic data stored at an untrusted central repository. Second, we address the problem of sharing genomic and medical data when the data is distributed across multiple mistrustful institutions. We begin by analyzing the risks that threaten patients' privacy in systems for the discovery of genetic variants, and we propose practical mitigations to the re-identification risk. Then, for clinical sites to be able to share the data without worrying about the risk of data breaches, we develop a new system based on collective homomorphic encryption: it achieves trust decentralization and enables researchers to securely find eligible patients for clinical studies. Finally, we design a new framework, complementary to the previous ones, for quantifying the risk of unintended disclosure caused by potential inference attacks that are combined by a malicious adversary, when exact genomic data is shared. In summary, in this thesis we demonstrate that PETs, still believed impractical and immature, can be made practical and can become real enablers for overcoming the privacy and security concerns blocking the advancement of personalized medicine. Addressing privacy issues in healthcare remains a great challenge that will increasingly require long-term collaboration among geneticists, healthcare providers, ethicists, lawmakers, and computer scientists
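To make "additively homomorphic" concrete, here is a toy sketch using the Paillier cryptosystem. The abstract does not say which scheme the thesis uses, so Paillier is an assumption made for illustration, and the tiny primes below offer no security at all. The point is only that multiplying two ciphertexts yields an encryption of the sum of the plaintexts, so a server can aggregate values it cannot read.

```python
import math
import random


def generate_keys(p, q):
    """Build a toy Paillier key pair from two distinct primes (insecure sizes)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # modular inverse (Python 3.8+); valid for g = n + 1
    return n, (lam, mu)


def encrypt(n, message):
    """Encrypt an integer 0 <= message < n with fresh randomness r."""
    n_sq = n * n
    while True:
        r = random.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, message, n_sq) * pow(r, n, n_sq)) % n_sq


def decrypt(n, private_key, ciphertext):
    """Recover the plaintext using L(x) = (x - 1) // n."""
    lam, mu = private_key
    x = pow(ciphertext, lam, n * n)
    return ((x - 1) // n * mu) % n


def add_encrypted(n, c1, c2):
    """Homomorphic addition: the product of ciphertexts encrypts the sum."""
    return (c1 * c2) % (n * n)


# Hypothetical example: add two encrypted counts without decrypting them.
n, private_key = generate_keys(293, 433)
encrypted_sum = add_encrypted(n, encrypt(n, 20), encrypt(n, 22))
print(decrypt(n, private_key, encrypted_sum))
```

A real deployment would rely on a vetted cryptographic library and primes of at least 1024 bits each.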

    Manycore Algorithms for Genetic Linkage Analysis

    Exact algorithms to perform linkage analysis scale exponentially with the size of the input. Beyond a critical point, the amount of work that needs to be done exceeds both available time and memory. In these circumstances, we are forced to either abbreviate the input in some manner or else use an approximation. Approximate methods, like Markov chain Monte Carlo (MCMC), though they make the problem tractable, can take an immense amount of time to converge. The problem of high convergence time is compounded by software that is single-threaded and therefore not designed to take advantage of the increasing numbers of physical processing cores in modern computer processors. In this thesis, we describe our program SwiftLink, which embodies our work adapting existing Gibbs samplers to modern computer processor architectures. The processor architectures we target are multicore processors, which currently feature 4–8 processor cores, and computer graphics cards (GPUs), which already feature hundreds of processor cores. We implemented parallel versions of the meiosis sampler, which mixes well with tightly linked markers but suffers from irreducibility issues, and the locus sampler, which is guaranteed to be irreducible but mixes slowly with tightly linked markers. We evaluate SwiftLink’s performance on real-world datasets of large consanguineous families. We demonstrate that using four processor cores for a single analysis is 3–3.2x faster than the single-threaded implementation of SwiftLink. With respect to the existing MCMC-based programs, it achieves a 6.6–8.7x speedup compared to Morgan and a 66.4–72.3x speedup compared to Simwalk. Utilising both a multicore processor and a GPU is 7–7.9x faster than the single-threaded implementation, a 17.6–19x speedup compared to Morgan and a 145.5–192.3x speedup compared to Simwalk
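The four-core figures quoted above can be put in context with a short calculation. Parallel efficiency is simply speedup divided by core count. The second function applies Amdahl's law to back out the fraction of the workload that would have to be parallelisable to produce a given speedup; using Amdahl's model here is an assumption of this sketch, not an analysis taken from the thesis.

```python
def parallel_efficiency(speedup, cores):
    """Fraction of ideal linear scaling achieved (1.0 means perfect scaling)."""
    return speedup / cores


def amdahl_parallel_fraction(speedup, cores):
    """Parallelisable fraction f implied by Amdahl's law.

    Amdahl's law: speedup = 1 / ((1 - f) + f / cores); solved here for f.
    """
    return (1 - 1 / speedup) / (1 - 1 / cores)


# Reported range for four cores versus the single-threaded implementation.
for speedup in (3.0, 3.2):
    efficiency = parallel_efficiency(speedup, 4)
    fraction = amdahl_parallel_fraction(speedup, 4)
    print(f"{speedup}x on 4 cores: efficiency {efficiency:.0%}, "
          f"implied parallel fraction {fraction:.1%}")
```

Under these assumptions the reported 3–3.2x range corresponds to 75–80% parallel efficiency.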

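    Investigation of the distribution of HLA alleles in healthy and diseased populations through the application of new sequencing methodologies (Investigación de la distribución de los alelos HLA en poblaciones sanas y enfermas mediante la aplicación de nuevas metodologías de secuenciación)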

    Unpublished thesis of the Universidad Complutense de Madrid, Facultad de Medicina, Departamento de Inmunología, Oftalmología y ORL, defended on 09/03/2021. Increasing our knowledge of the HLA system, including both the complete sequence description and the assessment of its diversity at the worldwide human population level, is of great importance for elucidating the molecular functional mechanisms of the immune system and its regulation in health and disease. Furthermore, assessment of the HLA allelic and haplotypic diversity of each human population is essential in the clinical histocompatibility and transplantation setting as well as in the pharmacogenetics, immunotherapy and anthropology fields. Nevertheless, the inherent vast polymorphism and high complexity of the HLA system have been an important challenge for its unambiguous and in-depth (high-resolution) characterization by previously available legacy molecular HLA genotyping methods (e.g. SSP, SSO and even SBT). The recent application of novel next-generation sequencing (NGS) technology to high-resolution molecular HLA genotyping has made it possible to obtain, in a high-throughput mode and at larger scale, full-length and/or extended sequences and genotypes of all major HLA genes, thus overcoming most of these previous limitations. Objectives: I) Characterization of HLA allele and haplotype diversity of all major classical HLA genes (HLA-A, -B, -C, -DPA1, -DPB1, -DQA1, -DQB1, -DRB1 and -DRB3/4/5) by application of NGS to a first representative cohort of the Spanish population that could also serve as a healthy control reference group. The corresponding statistical analyses were performed on this immunogenetic population data.
II) Characterization of HLA allele and haplotype diversity of all major classical HLA genes (HLA-A, -B, -C, -DPA1, -DPB1, -DQA1, -DQB1, -DRB1 and -DRB3/4/5) by application of NGS to a corresponding cohort of multiple sclerosis (MS) patients in the Spanish population (recruited at the Department of Neurology, Hospital Clínic, Barcelona, Catalonia, Spain). A first case-control study was carried out to examine HLA-disease associations with MS in these Spanish population cohorts as well as to attempt a fine-mapping of these allele and haplotype associations at full gene resolution via NGS. In addition, a second analysis exercise (i.e. test case) of this case-control study was carried out using an alternative healthy control group dataset, in this second case exclusively from the northeastern Spanish region of Catalonia, to evaluate possible differences in the findings of HLA-disease association with MS due to plausible regional HLA genetic variation within mainland Spain (i.e. as a statistical way to try to control for any existing population stratification)... The study of the HLA system, including the complete description of its sequence and of the diversity of this HLA complex at the population level, is of great importance for understanding the molecular mechanisms and functions of the immune system as well as its regulation in healthy and diseased individuals. In addition, exhaustive characterization of the diversity of HLA alleles and haplotypes of each human population is essential in the field of transplantation immunology and histocompatibility, as well as in the areas of pharmacogenetics and immunotherapy. The immense polymorphism and great complexity of the HLA system have until now been major barriers to characterizing it in great detail (at high resolution) and without ambiguity using the traditional HLA genotyping methods available (such as SSP, SSO or even SBT).
The recent application of novel NGS massively parallel sequencing technology to high-resolution molecular HLA genotyping has made it possible to obtain complete or much more extended sequences for genotypes of the main HLA genes, thus overcoming these previous limitations. Objectives: I) Characterization of the allelic and haplotypic diversity of the main HLA genes (HLA-A, -B, -C, -DPA1, -DPB1, -DQA1, -DQB1, -DRB1 and -DRB3/4/5) by applying NGS to a first representative cohort of the Spanish population, which will also serve as a reference control population for studies of association between HLA and disease. The corresponding statistical analyses were also performed on these HLA genotyping results. II) Characterization of the allelic and haplotypic diversity of the main HLA genes (HLA-A, -B, -C, -DPA1, -DPB1, -DQA1, -DQB1, -DRB1 and -DRB3/4/5) by applying NGS to a corresponding cohort of multiple sclerosis (MS) patients from the Spanish population (recruited from the Department of Neurology of Hospital Clínic (Barcelona, Catalonia)). A first HLA association study comparing cases (MS patients) with healthy controls was carried out to examine the association between HLA genes and MS in the aforementioned Spanish population cohorts. The aim was to fine-map the respective HLA allelic and haplotypic associations using the high allelic resolution provided by this massively parallel sequencing methodology.
In addition, as a second analysis exercise in this HLA association study, an alternative healthy control group was used, in this case consisting exclusively of individuals from the region of Catalonia (in northeastern Spain), in order to evaluate possible differences in the association of HLA with MS due to the probable regional HLA genetic variation within Spain...

    High-depth African genomes inform human migration and health

    The African continent is regarded as the cradle of modern humans and African genomes contain more genetic variation than those from any other continent, yet only a fraction of the genetic diversity among African individuals has been surveyed [1]. Here we performed whole-genome sequencing analyses of 426 individuals (comprising 50 ethnolinguistic groups, including previously unsampled populations) to explore the breadth of genomic diversity across Africa. We uncovered more than 3 million previously undescribed variants, most of which were found among individuals from newly sampled ethnolinguistic groups, as well as 62 previously unreported loci that are under strong selection, which were predominantly found in genes that are involved in viral immunity, DNA repair and metabolism. We observed complex patterns of ancestral admixture and putative-damaging and novel variation, both within and between populations, alongside evidence that Zambia was a likely intermediate site along the routes of expansion of Bantu-speaking populations. Pathogenic variants in genes that are currently characterized as medically relevant were uncommon, but in other genes, variants denoted as ‘likely pathogenic’ in the ClinVar database were commonly observed. Collectively, these findings refine our current understanding of continental migration, identify gene flow and the response to human disease as strong drivers of genome-level population variation, and underscore the scientific imperative for a broader characterization of the genomic diversity of African individuals to understand human ancestry and improve health