346 research outputs found
An EM algorithm based on an internal list for estimating haplotype distributions of rare variants from pooled genotype data
doi:10.1186/1471-2156-14-82, BMC Genetics, 14-BGME
Haplotype frequency inference from pooled genetic data with a latent multinomial model
In genetic studies, haplotype data provide more refined information than data
about separate genetic markers. However, large-scale studies that genotype
hundreds to thousands of individuals may only provide results of pooled data,
where only the total allele counts of each marker in each pool are reported.
Methods for inferring haplotype frequencies from pooled genetic data that scale
well with pool size rely on a normal approximation, which we observe to produce
unreliable inference when applied to real data. We illustrate cases where the
approximation breaks down, due to the normal covariance matrix being
near-singular. As an alternative to approximate methods, in this paper we
propose exact methods to infer haplotype frequencies from pooled genetic data
based on a latent multinomial model, where the observed allele counts are
considered integer combinations of latent, unobserved haplotype counts. One of
our methods, latent count sampling via Markov bases, achieves approximately
linear runtime with respect to pool size. Our exact methods produce more
accurate inference than existing approximate methods for synthetic data and for
data based on haplotype information from the 1000 Genomes Project. We also
demonstrate how our methods can be applied to time-series of pooled genetic
data, as a proof of concept of how our methods are relevant to more complex
hierarchical settings, such as spatiotemporal models.
Comment: 35 pages, 16 figures, 3 algorithms, submitted to Biometrics journal
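The latent multinomial idea in the abstract above (observed allele counts as integer combinations of unobserved haplotype counts) can be illustrated with a small brute-force sketch. This is not the Markov-basis sampler the authors describe; it simply enumerates every latent count vector consistent with the observed counts, which is only feasible for tiny pools. All function names here are hypothetical and chosen for illustration.

```python
from itertools import product
from math import factorial, prod


def haplotypes(n_markers):
    """All 0/1 haplotypes over n_markers biallelic markers (1 = counted allele)."""
    return list(product((0, 1), repeat=n_markers))


def compositions(total, parts):
    """Yield every tuple of `parts` non-negative integers that sums to `total`."""
    if parts == 1:
        yield (total,)
        return
    for first in range(total + 1):
        for rest in compositions(total - first, parts - 1):
            yield (first,) + rest


def allele_counts(z, haps):
    """Per-marker allele counts y = A z implied by latent haplotype counts z."""
    n_markers = len(haps[0])
    return tuple(
        sum(z_h * hap[m] for z_h, hap in zip(z, haps)) for m in range(n_markers)
    )


def consistent_latent_counts(y, pool_size, haps):
    """All latent count vectors z with sum(z) == pool_size and A z == y."""
    target = tuple(y)
    return [
        z for z in compositions(pool_size, len(haps))
        if allele_counts(z, haps) == target
    ]


def multinomial_pmf(z, p):
    """Multinomial probability of counts z under haplotype frequencies p."""
    coef = factorial(sum(z))
    for z_h in z:
        coef //= factorial(z_h)  # each intermediate quotient is an exact integer
    return coef * prod(p_h ** z_h for p_h, z_h in zip(p, z))


def pooled_likelihood(y, pool_size, p, haps):
    """Exact P(y | p): sum the multinomial pmf over all consistent latent counts."""
    return sum(
        multinomial_pmf(z, p) for z in consistent_latent_counts(y, pool_size, haps)
    )
```

For two markers and a pool of two haplotypes with observed counts (1, 1), the enumeration finds two latent configurations: one copy each of haplotypes 00 and 11, or one copy each of 01 and 10. The number of configurations grows quickly with pool size, which is why a sampling approach is needed in practice.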
Statistical and Computational Methods for Analyzing and Visualizing Large-Scale Genomic Datasets
Advances in large-scale genomic data production have led to a need for better methods to process, interpret, and organize this data. Starting with raw sequencing data, generating results requires many complex data processing steps, from quality control, alignment, and variant calling to genome-wide association studies (GWAS) and characterization of expression quantitative trait loci (eQTL). In this dissertation, I present methods to address issues faced when working with large-scale genomic datasets. In Chapter 2, I present an analysis of 4,787 whole genomes sequenced for the study of age-related macular degeneration (AMD) as a follow-up fine-mapping study to previous work from the International AMD Genomics Consortium (IAMDGC). Through whole genome sequencing, we comprehensively characterized genetic variants associated with AMD in known loci to provide additional insights on the variants potentially responsible for the disease by leveraging 60,706 additional controls. Our study improved the understanding of loci associated with AMD and demonstrated the advantages and disadvantages of different approaches for fine-mapping studies with sequence-based genotypes. In Chapter 3, I describe a novel method and a software tool to perform Hardy-Weinberg equilibrium (HWE) tests for structured populations. In sequence-based genetic studies, HWE test statistics are important quality metrics to distinguish true genetic variants from artifactual ones, but they become much less informative when applied to a heterogeneous and/or structured population. As next generation sequencing studies contain samples from increasingly diverse ancestries, we developed a new HWE test which addresses both the statistical and computational challenges of modern large-scale sequencing data and implemented the method in a publicly available software tool. Moreover, we extensively evaluated our proposed method against alternative methods to test HWE in both simulated and real datasets. 
Our method has been successfully applied to the latest variant calling QC pipeline in the TOPMed project. In Chapter 4, I describe PheGET, a web application to interactively visualize Expression Quantitative Trait Loci (eQTLs) across tissues, genes, and regions to aid functional interpretations of regulatory variants. Tissue-specific expression has become increasingly important for understanding the links between genetic variation and disease. To address this need, the Genotype-Tissue Expression (GTEx) project collected and analyzed a treasure trove of expression data. However, effectively navigating this wealth of data to find signals relevant to researchers has become a major challenge. I demonstrate the functionalities of PheGET using the newest GTEx data on our eQTL browser website at https://eqtl.pheweb.org/, allowing the user to 1) view all cis-eQTLs for a single variant; 2) view and compare single-tissue, single-gene associations within any genomic region; 3) find the best eQTL signal in any given genomic region or gene; and 4) customize the plotted data in real time. PheGET is designed to handle and display the kind of complex multidimensional data often seen in our post-GWAS era, such as multi-tissue expression data, in an intuitive and convenient interface, giving researchers an additional tool to better understand the links between genetics and disease.
PhD, Biostatistics, University of Michigan, Horace H. Rackham School of Graduate Studies
http://deepblue.lib.umich.edu/bitstream/2027.42/162918/1/amkwong_1.pd
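For context on the HWE quality metric mentioned in the abstract above, here is a minimal sketch of the classical one-degree-of-freedom chi-square HWE test for a single homogeneous population. It is the textbook test, not the structured-population method the dissertation develops, and the function name is hypothetical.

```python
from math import erfc, sqrt


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Classical chi-square test of Hardy-Weinberg equilibrium at one biallelic site.

    Takes the observed genotype counts (AA, AB, BB) and returns (chi2, p_value).
    Assumes a single unstructured population.
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("at least one genotyped individual is required")

    # Allele frequency of A estimated from the genotype counts.
    p = (2 * n_aa + n_ab) / (2 * n)
    q = 1.0 - p

    # Expected genotype counts under HWE: n*p^2, 2*n*p*q, n*q^2.
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)

    # Skip cells with zero expectation (monomorphic sites give chi2 = 0).
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)

    # Survival function of a chi-square with 1 degree of freedom.
    p_value = erfc(sqrt(chi2 / 2.0))
    return chi2, p_value
```

When samples from populations with different allele frequencies are pooled, this test tends to reject HWE even for genuine variants because of the resulting heterozygote deficit, which is the loss of informativeness the abstract refers to.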
Analysis of the population genetics of the Han and Hui of Liaoning province, Peoples Republic of China
Throughout recorded Chinese history, regions of the country populated by persons of non-Han ancestry often fluctuated significantly in population numbers and in their political and commercial influence. However, at all times they were considered as important contributors to the nation. Many of these peoples had moved from their homelands, settled in China and had intermarried with Han Chinese. Over the generations they became accepted as fully-fledged Chinese citizens although, in many instances, they retained their traditional customs and religious practices, and frequently their own language. The Hui Muslims are a good example of this process of integration, and today they comprise some 8.6 million individuals thus forming approximately half of the total Muslim population of PR China. The purpose of this study is to investigate the genetic structure of two populations, the Han and Hui of Liaoning, Northeast PR China. The study seeks to provide a better understanding of the effect of population subdivision on the genetic diversity of human populations, by comparing genome-based investigations using single tandem repeat markers with historical and anthropological information. As the Hui of Liaoning are endogamous, and they are known to contract consanguineous marriages, the study also attempts to assess the effect of consanguinity on overall genetic diversity in the Hui. Genetic analysis of the Han and Hui was undertaken by surveying the allele distribution patterns at ten autosomal and seven Y-chromosome microsatellite loci in both study populations. Various population genetic techniques were applied, based either on the Infinite Allele Mutation model or the Stepwise Mutation Model. It was found that both the Han and the Hui exhibited appreciable heterogeneity at autosomal and Y-chromosome loci, indicative of the presence of population substructure, and that the AMOVA test best defined the genetic relationship between the two populations. 
It was concluded that further detailed anthropological and demographic information was needed to provide a more detailed account of population structure and for the creation of a detailed phylogeny tracing male Hui gene flow. It was also found that consanguinity seemed to have a negligible effect on the genetic diversity of the Hui population of Liaoning. It was concluded that either the practice of consanguinity had not occurred over a sufficiently long time period to alter overall genetic diversity or that heterozygote advantage may be operating at various loci.
Genomic patterns of selection and differentiation in African populations and implications for mapping disease association
The main objective of this thesis is to gain a better understanding of genomic patterns of natural selection and population differentiation in Africa, where there is great genetic diversity, and of the implications for genetic mapping of complex diseases.
I began by studying two neighbouring villages in eastern Sudan that are of different ethnicity, Hausa and Masalit, and that appear to have different susceptibility to malaria and visceral leishmaniasis (VL). Specifically, I investigated patterns of linkage disequilibrium (LD) and haplotypic signals of positive selection in the 5q31 genomic region which contains immune genes that have been implicated in susceptibility to malaria and VL.
In my first analysis, by genotyping 34 single nucleotide polymorphisms (SNPs) in the 5q31 region, I did not find signals of selection or population differentiation between the Hausa and Masalit using available statistical methods. I conceived the idea that patterns of LD might provide a more sensitive test of population differentiation, and I developed an approach for this using permutation analysis. This method revealed differentiation between the Hausa, the Masalit and other African ethnic groups.
To better understand signals of selection, I next studied a region of the genome associated with a known malaria resistance factor, the haemoglobin S (HbS) variant of the HBB gene. By genotyping 26 SNPs in the region of the HBB gene, I observed a haplotype that extended in excess of 1 Mb, despite being at high frequency and spanning several recombinational hotspots. This long haplotype carried the HbS allele but, importantly, it could be readily detected without typing the HbS variant.
Building on this observation, I designed a new method to screen the whole genome for long haplotypes that might be signals of selection, and developed a software programme to implement this method. I validated this method using haplotypic data for the Yoruba generated by the HapMap project and complemented by additional SNP data that I generated on HapMap cell lines, and found that the HbS allele resides on a haplotype that extends to 1.2 Mb, and is at strikingly high frequency compared to other haplotypes of similar length on the same chromosome.
Next I applied this method to a large family-based association study of severe malaria in The Gambia, and identified several novel genomic regions with unusually long haplotypes of high frequency. These included a number of regions that may be associated with resistance to severe malaria, and which merit further investigation.
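The abstract above mentions testing population differentiation through patterns of LD using permutation analysis but does not give the details. The following is a generic sketch of that idea, not the author's method: it assumes phased two-SNP haplotypes coded as (0/1, 0/1) tuples, uses the absolute difference in r² between two populations as the test statistic, and builds the null distribution by shuffling population labels. All names are hypothetical.

```python
import random


def r_squared(haps):
    """LD measure r^2 between two biallelic SNPs from phased haplotypes (a, b)."""
    n = len(haps)
    p_a = sum(a for a, _ in haps) / n
    p_b = sum(b for _, b in haps) / n
    p_ab = sum(1 for a, b in haps if a == 1 and b == 1) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return 0.0  # r^2 is undefined at a monomorphic SNP; treated as no LD here
    d = p_ab - p_a * p_b
    return d * d / denom


def ld_permutation_test(pop1, pop2, n_perm=2000, seed=0):
    """Permutation test for a difference in r^2 between two populations.

    Returns (observed_statistic, p_value). The p-value counts permutations whose
    statistic is at least as large as the observed one, with the usual +1
    correction so that it is never exactly zero.
    """
    rng = random.Random(seed)
    observed = abs(r_squared(pop1) - r_squared(pop2))
    pooled = list(pop1) + list(pop2)
    n1 = len(pop1)

    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # reassign haplotypes to populations at random
        stat = abs(r_squared(pooled[:n1]) - r_squared(pooled[n1:]))
        if stat >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)
```

A real analysis would work from unphased genotypes and many SNP pairs across the region, so a multi-locus summary statistic would replace the single pairwise r² used here.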
Statistical aspects of haplotype-based association studies
A decade ago, genomewide association studies were proposed as a tool to unravel the genetic basis of complex diseases. It is only now that they are becoming practical realities due to improved technology and reduced genotyping costs. For such studies, the issues of power and efficiency are crucial due to the quantity of markers genotyped and the moderate effect sizes involved. Haplotype-based analysis incorporates information from multiple markers, and so is potentially more powerful than single-SNP analysis. Unfortunately, not only is it computationally more intensive, but since haplotypes are not directly observed, there exists a major analytical challenge with haplotype association analysis. Several methods are available to infer individual haplotypes from unphased genotype data, but using the inferred haplotypes in the ensuing association analysis can result in biased estimates and reduced power. We investigate the situations for which the disadvantages of the imputation process may outweigh its convenience. In addition, we describe alternatives to imputation which result in efficient haplotype association analysis. For case-control studies, we develop methods for use in genomewide studies which account for the correlation between SNPs in multiple test correction. Simulation studies based on the HapMap data showed that the proposed method performs well in realistic situations. We applied it to a case-control dataset of 2,300 SNPs to test for association with rheumatoid arthritis. For quantitative trait loci, we focus on gains in power which may be made via selective genotyping designs, where only those individuals with extreme phenotypes are genotyped. Because selection depends on the phenotype, the resulting data cannot be properly analyzed by standard statistical methods. We provide appropriate likelihoods for assessing the effects of genotypes and haplotypes on quantitative traits under such designs. 
We demonstrate that the likelihood-based methods are highly effective in identifying causal variants, and are substantially more powerful than existing methods. We initially consider two practical designs, then extend the methods to a two-phase sampling design. Additionally, we provide methods to test for haplotype-disease association in the presence of covariates. Simulations demonstrate the effectiveness of these likelihood-based methods.
Estimation Based on Pooled Data in Human Biomonitoring and Statistical Genetics
Ph.D, DOCTOR OF PHILOSOPHY