346 research outputs found

    Haplotype frequency inference from pooled genetic data with a latent multinomial model

    In genetic studies, haplotype data provide more refined information than data about separate genetic markers. However, large-scale studies that genotype hundreds to thousands of individuals may only provide results of pooled data, where only the total allele counts of each marker in each pool are reported. Methods for inferring haplotype frequencies from pooled genetic data that scale well with pool size rely on a normal approximation, which we observe to produce unreliable inference when applied to real data. We illustrate cases where the approximation breaks down because the normal covariance matrix is near-singular. As an alternative to approximate methods, in this paper we propose exact methods to infer haplotype frequencies from pooled genetic data based on a latent multinomial model, where the observed allele counts are considered integer combinations of latent, unobserved haplotype counts. One of our methods, latent count sampling via Markov bases, achieves approximately linear runtime with respect to pool size. Our exact methods produce more accurate inference than existing approximate methods for synthetic data and for data based on haplotype information from the 1000 Genomes Project. We also demonstrate how our methods can be applied to time series of pooled genetic data, as a proof of concept of how our methods are relevant to more complex hierarchical settings, such as spatiotemporal models.
    Comment: 35 pages, 16 figures, 3 algorithms, submitted to Biometrics journal
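    The latent multinomial structure described in the abstract can be illustrated with a small simulation. This is a minimal sketch, not the paper's implementation: the function names, the two-marker setup, and the haplotype frequencies are all assumptions made for illustration. Latent haplotype counts z are drawn from a multinomial distribution, and only the per-marker allele counts y = A z are treated as observed.

```python
import itertools
import numpy as np

def haplotype_matrix(n_markers):
    """Return an (n_markers x 2**n_markers) 0/1 matrix whose columns are all haplotypes."""
    haps = list(itertools.product([0, 1], repeat=n_markers))
    return np.array(haps, dtype=int).T

def simulate_pool(A, freqs, pool_size, rng):
    """Draw latent haplotype counts z and return them with the pooled allele counts A @ z."""
    z = rng.multinomial(pool_size, freqs)  # latent, unobserved in a real pooled study
    return z, A @ z                        # y: total count of allele 1 at each marker

A = haplotype_matrix(2)                    # columns are haplotypes 00, 01, 10, 11
freqs = np.array([0.4, 0.1, 0.2, 0.3])     # hypothetical haplotype frequencies
rng = np.random.default_rng(0)
z, y = simulate_pool(A, freqs, pool_size=100, rng=rng)

# Different latent counts can give the same observed data: adding this vector to z
# changes neither the allele counts nor the pool size (where the result stays non-negative).
move = np.array([1, -1, -1, 1])
print("latent z:", z, "observed y:", y, "A @ move:", A @ move)
```

    For two markers, the vector `move` is the kind of step that lets a sampler walk between latent count vectors consistent with the same pooled observation; the Markov-basis sampler named in the abstract is presumably more general than this toy case.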

    Statistical and Computational Methods for Analyzing and Visualizing Large-Scale Genomic Datasets

    Advances in large-scale genomic data production have led to a need for better methods to process, interpret, and organize this data. Starting with raw sequencing data, generating results requires many complex data processing steps, from quality control, alignment, and variant calling to genome-wide association studies (GWAS) and characterization of expression quantitative trait loci (eQTL). In this dissertation, I present methods to address issues faced when working with large-scale genomic datasets. In Chapter 2, I present an analysis of 4,787 whole genomes sequenced for the study of age-related macular degeneration (AMD) as a follow-up fine-mapping study to previous work from the International AMD Genomics Consortium (IAMDGC). Through whole genome sequencing, we comprehensively characterized genetic variants associated with AMD in known loci to provide additional insights on the variants potentially responsible for the disease by leveraging 60,706 additional controls. Our study improved the understanding of loci associated with AMD and demonstrated the advantages and disadvantages of different approaches for fine-mapping studies with sequence-based genotypes. In Chapter 3, I describe a novel method and a software tool to perform Hardy-Weinberg equilibrium (HWE) tests for structured populations. In sequence-based genetic studies, HWE test statistics are important quality metrics to distinguish true genetic variants from artifactual ones, but they become much less informative when applied to a heterogeneous and/or structured population. As next generation sequencing studies contain samples from increasingly diverse ancestries, we developed a new HWE test which addresses both the statistical and computational challenges of modern large-scale sequencing data and implemented the method in a publicly available software tool. Moreover, we extensively evaluated our proposed method against alternative methods to test HWE in both simulated and real datasets. Our method has been successfully applied to the latest variant calling QC pipeline in the TOPMed project. In Chapter 4, I describe PheGET, a web application to interactively visualize Expression Quantitative Trait Loci (eQTLs) across tissues, genes, and regions to aid functional interpretations of regulatory variants. Tissue-specific expression has become increasingly important for understanding the links between genetic variation and disease. To address this need, the Genotype-Tissue Expression (GTEx) project collected and analyzed a treasure trove of expression data. However, effectively navigating this wealth of data to find signals relevant to researchers has become a major challenge. I demonstrate the functionalities of PheGET using the newest GTEx data on our eQTL browser website at https://eqtl.pheweb.org/, allowing the user to 1) view all cis-eQTLs for a single variant; 2) view and compare single-tissue, single-gene associations within any genomic region; 3) find the best eQTL signal in any given genomic region or gene; and 4) customize the plotted data in real time. PheGET is designed to handle and display the kind of complex multidimensional data often seen in our post-GWAS era, such as multi-tissue expression data, in an intuitive and convenient interface, giving researchers an additional tool to better understand the links between genetics and disease.
    PhD, Biostatistics, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/162918/1/amkwong_1.pd
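    For context on the HWE tests mentioned in Chapter 3, the classical chi-square test for a single, unstructured population can be sketched as follows. This is the textbook test only, not the structured-population method the dissertation proposes; the function names and genotype counts are hypothetical.

```python
import math

def hwe_chi_square(n_aa, n_ab, n_bb):
    """Classical 1-df chi-square statistic for Hardy-Weinberg equilibrium at a biallelic site."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotypes observed")
    p = (2 * n_aa + n_ab) / (2 * n)          # frequency of allele A
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    # Skip zero-expectation cells so a monomorphic site yields a statistic of 0
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)

def chi_square_1df_pvalue(stat):
    """Upper-tail probability of a chi-square variable with one degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))

stat = hwe_chi_square(50, 0, 50)             # hypothetical site with no heterozygotes
print(stat, chi_square_1df_pvalue(stat))
```

    A site with no heterozygotes at intermediate allele frequency, as in the example, is the pattern often produced by genotyping artifacts. The same pattern also arises when two subpopulations with different allele frequencies are pooled, which is why the unadjusted test loses meaning in structured samples.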

    Analysis of the population genetics of the Han and Hui of Liaoning province, Peoples Republic of China

    Throughout recorded Chinese history, regions of the country populated by persons of non-Han ancestry often fluctuated significantly in population numbers and in their political and commercial influence. However, at all times they were considered as important contributors to the nation. Many of these peoples had moved from their homelands, settled in China and had intermarried with Han Chinese. Over the generations they became accepted as fully-fledged Chinese citizens although, in many instances, they retained their traditional customs and religious practices, and frequently their own language. The Hui Muslims are a good example of this process of integration, and today they comprise some 8.6 million individuals, thus forming approximately half of the total Muslim population of PR China. The purpose of this study is to investigate the genetic structure of two populations, the Han and Hui of Liaoning, Northeast PR China. The study seeks to provide a better understanding of the effect of population subdivision on the genetic diversity of human populations, by comparing genome-based investigations using short tandem repeat markers with historical and anthropological information. As the Hui of Liaoning are endogamous, and they are known to contract consanguineous marriages, the study also attempts to assess the effect of consanguinity on overall genetic diversity in the Hui. Genetic analysis of the Han and Hui was undertaken by surveying the allele distribution patterns at ten autosomal and seven Y-chromosome microsatellite loci in both study populations. Various population genetic techniques were applied, based either on the Infinite Allele Mutation Model or the Stepwise Mutation Model. It was found that both the Han and the Hui exhibited appreciable heterogeneity at autosomal and Y-chromosome loci, indicative of the presence of population substructure, and that the AMOVA test best defined the genetic relationship between the two populations. It was concluded that further detailed anthropological and demographic information was needed to provide a more detailed account of population structure and for the creation of a detailed phylogeny tracing male Hui gene flow. It was also found that consanguinity seemed to have a negligible effect on the genetic diversity of the Hui population of Liaoning. It was concluded that either the practice of consanguinity had not occurred over a sufficiently long time period to alter overall genetic diversity or that heterozygote advantage may be operating at various loci.
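    Genetic diversity at a microsatellite locus is commonly summarized by expected heterozygosity, and inbreeding by the deficit of observed heterozygotes relative to that expectation. The sketch below shows both quantities under stated assumptions: the allele counts and the observed heterozygosity are invented, the function names are hypothetical, and the abstract does not say which diversity statistics the thesis actually used.

```python
def expected_heterozygosity(allele_counts):
    """Unbiased gene diversity: n / (n - 1) * (1 - sum of squared allele frequencies)."""
    n = sum(allele_counts)
    if n < 2:
        raise ValueError("need at least two sampled alleles")
    homozygosity = sum((c / n) ** 2 for c in allele_counts)
    return n / (n - 1) * (1.0 - homozygosity)

def inbreeding_coefficient(observed_het, expected_het):
    """F = 1 - Ho / He; values above zero indicate a deficit of heterozygotes."""
    return 1.0 - observed_het / expected_het

hypothetical_counts = [12, 30, 41, 17]       # copies of four alleles at one locus
h_exp = expected_heterozygosity(hypothetical_counts)
print(round(h_exp, 4), round(inbreeding_coefficient(0.65, h_exp), 4))
```

    Under sustained consanguinity one would expect F to rise above zero across loci, so a negligible effect, as the abstract reports, corresponds to observed heterozygosity staying close to its expectation.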

    Statistical aspects of haplotype-based association studies

    A decade ago, genomewide association studies were proposed as a tool to unravel the genetic basis of complex diseases. It is only now that they are becoming practical realities due to improved technology and reduced genotyping costs. For such studies, the issues of power and efficiency are crucial due to the quantity of markers genotyped and the moderate effect sizes involved. Haplotype-based analysis incorporates information from multiple markers, and so is potentially more powerful than single-SNP analysis. Unfortunately, it is computationally more intensive and, because haplotypes are not directly observed, it also poses a major analytical challenge. Several methods are available to infer individual haplotypes from unphased genotype data, but using the inferred haplotypes in the ensuing association analysis can result in biased estimates and reduced power. We investigate the situations for which the disadvantages of the imputation process may outweigh its convenience. In addition, we describe alternatives to imputation which result in efficient haplotype association analysis. For case-control studies, we develop methods for use in genomewide studies which account for the correlation between SNPs in multiple-testing correction. Simulation studies based on the HapMap data showed that the proposed method performs well in realistic situations. We applied it to a case-control dataset of 2,300 SNPs to test for association with rheumatoid arthritis. For quantitative trait loci, we focus on gains in power which may be made via selective genotyping designs, where only those individuals with extreme phenotypes are genotyped. Because selection depends on the phenotype, the resulting data cannot be properly analyzed by standard statistical methods. We provide appropriate likelihoods for assessing the effects of genotypes and haplotypes on quantitative traits under such designs. We demonstrate that the likelihood-based methods are highly effective in identifying causal variants, and are substantially more powerful than existing methods. We initially consider two practical designs, then extend the methods to a two-phase sampling design. Additionally, we provide methods to test for haplotype-disease association in the presence of covariates. Simulations demonstrate the effectiveness of these likelihood-based methods.
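    One standard way to account for correlation between SNPs when correcting for multiple tests is a permutation-based max-statistic adjustment (single-step maxT). The sketch below uses that technique as a stand-in; the abstract does not describe the thesis's own method, and the simulated genotypes, sample sizes, and function names here are assumptions. It also assumes no SNP is monomorphic in the sample.

```python
import numpy as np

def trend_stats(G, y):
    """Per-SNP trend statistic n * r^2 between genotype dosage (0/1/2) and case status."""
    n = len(y)
    Gc = G - G.mean(axis=0)
    yc = y - y.mean()
    num = (Gc * yc[:, None]).sum(axis=0) ** 2
    den = (Gc ** 2).sum(axis=0) * (yc ** 2).sum()
    return n * num / den

def max_t_adjusted_pvalues(G, y, n_perm, rng):
    """Adjusted p-values from the permutation distribution of the maximum statistic."""
    obs = trend_stats(G, y)
    # Permuting case labels keeps the correlation between SNP columns intact
    max_null = np.array([trend_stats(G, rng.permutation(y)).max() for _ in range(n_perm)])
    exceed = (max_null[None, :] >= obs[:, None]).sum(axis=1)
    return (1 + exceed) / (n_perm + 1)

rng = np.random.default_rng(1)
G = rng.binomial(2, 0.3, size=(200, 5))      # hypothetical genotypes: 200 people, 5 SNPs
y = np.repeat([0, 1], 100)                   # 100 controls followed by 100 cases
adj_p = max_t_adjusted_pvalues(G, y, n_perm=200, rng=rng)
print(adj_p)
```

    Because each permutation recomputes all SNP statistics on the same genotype matrix, strongly correlated SNPs are penalized less than they would be under a Bonferroni correction, which treats every test as independent.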

    Estimation Based on Pooled Data in Human Biomonitoring and Statistical Genetics

    Ph.D., Doctor of Philosophy