
    Deep-coverage whole genome sequences and blood lipids among 16,324 individuals.

    Large-scale deep-coverage whole-genome sequencing (WGS) is now feasible and offers potential advantages for locus discovery. We perform WGS in 16,324 participants from four ancestries at mean depth >29X and analyze genotypes with four quantitative traits: plasma total cholesterol, low-density lipoprotein cholesterol (LDL-C), high-density lipoprotein cholesterol, and triglycerides. Common variant association yields known loci except for a few variants that were previously poorly imputed. Rare coding variant association yields known Mendelian dyslipidemia genes, but rare non-coding variant association detects no signals. A high 2M-SNP LDL-C polygenic score (top 5th percentile) confers a similar effect size to a monogenic mutation (~30 mg/dl higher for each); however, among those with severe hypercholesterolemia, 23% have a high polygenic score and only 2% carry a monogenic mutation. At these sample sizes and for these phenotypes, the incremental value of WGS for discovery is limited, but WGS permits simultaneous assessment of monogenic and polygenic models of severe hypercholesterolemia.

    Statistical and Computational Methods for Genome-Wide Association Analysis

    Technological and scientific advances in recent years have revolutionized genomics. For example, decreases in whole genome sequencing (WGS) costs have enabled larger WGS studies as well as larger imputation reference panels, which in turn provide more comprehensive genomic coverage from lower-cost genotyping methods. In addition, new technologies and large collaborative efforts such as ENCODE and GTEx have shed new light on regulatory genomics and the function of non-coding variation, and produced expansive publicly available data sets. These advances have introduced data of unprecedented size and dimension, unique statistical and computational challenges, and numerous opportunities for innovation. In this dissertation, we develop methods to leverage functional genomics data in post-GWAS analysis, to expedite routine computations with increasingly large genetic data sets, and to address limitations of current imputation reference panels for understudied populations. In Chapter 2, we propose strategies to improve imputation and increase power in GWAS of understudied populations. Genotype imputation is instrumental in GWAS, providing increased genomic coverage from low-cost genotyping arrays. Imputation quality depends crucially on reference panel size and the genetic distance between reference and target haplotypes. Current reference panels provide excellent imputation quality in many European populations, but lower quality in non-European, admixed, and isolate populations. We consider a GWAS strategy in which a subset of participants is sequenced and the rest are imputed using a reference panel that comprises the sequenced participants together with individuals from an external reference panel. Using empirical data from the HRC and TOPMed WGS Project, simulations, and asymptotic analysis, we identify powerful and cost-effective study designs for GWAS of non-European, admixed, and isolated populations. 
    In Chapter 3, we develop efficient methods to estimate linkage disequilibrium (LD) with large data sets. Motivated by practical and logistical constraints, a variety of statistical methods and tools have been developed for analysis of GWAS summary statistics rather than individual-level data. These methods often rely on LD estimates from an external reference panel, which are ideally calculated on-the-fly rather than precomputed and stored. We develop efficient algorithms to estimate LD exploiting sparsity and haplotype structure and implement our methods in an open-source C++ tool, emeraLD. We benchmark performance using genotype data from the 1KGP, HRC, and UK Biobank, and find that emeraLD is up to two orders of magnitude faster than existing tools while using comparable or less memory. In Chapter 4, we develop methods to identify causative genes and biological mechanisms underlying associations in post-GWAS analysis by leveraging regulatory and functional genomics databases. Many gene-based association tests can be viewed as instrumental variable methods in which intermediate phenotypes, e.g. tissue-specific expression or protein alteration, are hypothesized to mediate the association between genotype and GWAS trait. However, LD and pleiotropy can confound these statistics, which complicates their mechanistic interpretation. We develop a hierarchical Bayesian model that accounts for multiple potential mechanisms underlying associations using functional genomic annotations derived from GTEx, Roadmap/ENCODE, and other sources. We apply our method to analyze twenty-five complex traits using GWAS summary statistics from UK Biobank, and provide an open-source implementation of our methods. In Chapter 5, we review our work, discuss its relevance and prospects as new resources emerge, and suggest directions for future research.
    PhD, Biostatistics, University of Michigan, Horace H. Rackham School of Graduate Studies. https://deepblue.lib.umich.edu/bitstream/2027.42/147697/1/corbinq_1.pd
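
    The idea of exploiting sparsity when estimating LD can be illustrated with a minimal sketch. This is not emeraLD's algorithm or interface; it only assumes that each biallelic variant is stored as the set of haplotype indices carrying the minor allele, so that the pairwise r² needs just three counts. The function name `ld_r2` and its parameters are hypothetical.

```python
import math


def ld_r2(carriers_a, carriers_b, n_haplotypes):
    """Squared correlation (r^2) between two biallelic variants.

    carriers_a, carriers_b: sets of haplotype indices carrying the minor
    allele at each variant (a sparse representation, assumed here).
    n_haplotypes: total number of haplotypes in the panel.
    Returns NaN if either variant is monomorphic.
    """
    n_a = len(carriers_a)
    n_b = len(carriers_b)
    if n_a in (0, n_haplotypes) or n_b in (0, n_haplotypes):
        return math.nan

    # Only the carrier sets are touched, so the cost scales with the
    # minor allele counts rather than with the panel size.
    n_ab = len(carriers_a & carriers_b)

    p_a = n_a / n_haplotypes
    p_b = n_b / n_haplotypes
    p_ab = n_ab / n_haplotypes

    d = p_ab - p_a * p_b  # LD coefficient D
    return d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
```

    For rare variants the carrier sets are small, which is why a sparse representation of this kind can be much cheaper than scanning every haplotype.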

    Leveraging auxiliary data from arbitrary distributions to boost GWAS discovery with Flexible cFDR.

    Funder: GlaxoSmithKline. Genome-wide association studies (GWAS) have identified thousands of genetic variants that are associated with complex traits. However, a stringent significance threshold is required to identify robust genetic associations. Leveraging relevant auxiliary covariates has the potential to boost statistical power to exceed the significance threshold. In particular, abundant pleiotropy and the non-random distribution of SNPs across various functional categories suggest that leveraging GWAS test statistics from related traits and/or functional genomic data may boost GWAS discovery. While type 1 error rate control has become standard in GWAS, control of the false discovery rate can be a more powerful approach. The conditional false discovery rate (cFDR) extends the standard FDR framework by conditioning on auxiliary data to call significant associations, but current implementations are restricted to auxiliary data satisfying specific parametric distributions, typically GWAS p-values for related traits. We relax these distributional assumptions, enabling an extension of the cFDR framework that supports auxiliary covariates from arbitrary continuous distributions ("Flexible cFDR"). Our method can be applied iteratively, thereby supporting multi-dimensional covariate data. Through simulations we show that Flexible cFDR increases sensitivity whilst controlling FDR after one or several iterations. We further demonstrate its practical potential through application to an asthma GWAS, leveraging various functional genomic data to find additional genetic associations for asthma, which we validate in the larger, independent, UK Biobank data resource.

    Understanding the assumptions underlying Mendelian randomization

    With the rapidly increasing availability of large genetic data sets in recent years, Mendelian Randomization (MR) has quickly gained popularity as a novel secondary analysis method. Leveraging genetic variants as instrumental variables, MR can be used to estimate the causal effects of one phenotype on another even when experimental research is not feasible, and therefore has the potential to be highly informative. However, it depends on strong assumptions and often produces biased results if these are not met. It is therefore imperative that researchers aiming to use MR understand these assumptions well, so that they can evaluate their validity in the context of their analyses and data. The aim of this perspective is to further elucidate these assumptions and the role they play in MR, as well as how different kinds of data can be used to support them.

    Genome-wide Association Studies in Ancestrally Diverse Populations: Opportunities, Methods, Pitfalls, and Recommendations

    Genome-wide association studies (GWASs) have focused primarily on populations of European descent, but it is essential that diverse populations become better represented. Increasing diversity among study participants will advance our understanding of genetic architecture in all populations and ensure that genetic research is broadly applicable. To facilitate and promote research in multi-ancestry and admixed cohorts, we outline key methodological considerations and highlight opportunities, challenges, solutions, and areas in need of development. Despite the perception that analyzing genetic data from diverse populations is difficult, it is scientifically and ethically imperative, and there is an expanding analytical toolbox to do it well

    Methods for large-scale genome-wide association studies

    Genome-wide association studies (GWAS) have led to the identification of thousands of associations between genetic polymorphisms and complex traits or diseases, facilitating several downstream applications such as genetic risk prediction and drug target prioritisation. Biobanks containing extensive genetic and phenotypic data continue to grow, creating new opportunities for the study of complex traits, such as the analysis of rare genomic variation across multiple populations. These opportunities are coupled with computational challenges, creating the need for the development of novel methodology. This thesis develops computational tools to facilitate large-scale association studies of rare and common variation. First, we develop methods to improve the analysis of ultra-rare variants, leveraging the sharing of identical-by-descent (IBD) genomic regions within large biobanks. We compare ~400k genotyped UK Biobank (UKBB) samples with 50k exome-sequenced samples and devise a score that quantifies the extent to which a genotyped individual shares IBD segments with carriers of rare loss-of-function mutations. Our approach detects several associations and replicates 11/14 loci of a pilot exome sequencing study. Second, we develop a linear mixed model framework, FMA, that builds on previous techniques and is suitable for scalable and robust association testing. We benchmark FMA and several state-of-the-art approaches using synthetic and UKBB data, evaluating computational performance, statistical power, and robustness to known confounders, such as cryptic relatedness and population stratification. Finally, we integrate FMA with recently developed methods for genealogical analysis of complex traits, enabling it to perform scalable genealogy-based estimation of narrow-sense heritability and association.

    A reference haplotype panel for genome-wide imputation of short tandem repeats.

    Short tandem repeats (STRs) are involved in dozens of Mendelian disorders and have been implicated in complex traits. However, genotyping arrays used in genome-wide association studies focus on single nucleotide polymorphisms (SNPs) and do not readily allow identification of STR associations. We leverage next-generation sequencing (NGS) from 479 families to create a SNP + STR reference haplotype panel. Our panel enables imputing STR genotypes into SNP array data when NGS is not available for directly genotyping STRs. Imputed genotypes achieve mean concordance of 97% with observed genotypes in an external dataset compared to 71% expected under a naive model. Performance varies widely across STRs, with near perfect concordance at bi-allelic STRs vs. 70% at highly polymorphic repeats. Imputation increases power over individual SNPs to detect STR associations with gene expression. Imputing STRs into existing SNP datasets will enable the first large-scale STR association studies across a range of complex traits

    Efficient network-guided multi-locus association mapping with graph cuts

    As an increasing number of genome-wide association studies reveal the limitations of attempting to explain phenotypic heritability by single genetic loci, there is growing interest in associating complex phenotypes with sets of genetic loci. While several methods for multi-locus mapping have been proposed, it is often unclear how to relate the detected loci to the growing knowledge about gene pathways and networks. The few methods that take biological pathways or networks into account are either restricted to investigating a limited number of predetermined sets of loci, or do not scale to genome-wide settings. We present SConES, a new efficient method to discover sets of genetic loci that are maximally associated with a phenotype, while being connected in an underlying network. Our approach is based on a minimum cut reformulation of the problem of selecting features under sparsity and connectivity constraints that can be solved exactly and rapidly. SConES outperforms state-of-the-art competitors in terms of runtime, scales to hundreds of thousands of genetic loci, and exhibits higher power in detecting causal SNPs in simulation studies than existing methods. On flowering time phenotypes and genotypes from Arabidopsis thaliana, SConES detects loci that enable accurate phenotype prediction and that are supported by the literature. Matlab code for SConES is available at http://webdav.tuebingen.mpg.de/u/karsten/Forschung/scones/
    Comment: 20 pages, 6 figures, accepted at ISMB (International Conference on Intelligent Systems for Molecular Biology) 201