33 research outputs found

    Optimal strategies for learning multi-ancestry polygenic scores vary across traits

    Polygenic scores (PGSs) are individual-level measures that aggregate the genome-wide genetic predisposition to a given trait. As PGS have predominantly been developed using European-ancestry samples, trait prediction using such European ancestry-derived PGS is less accurate in non-European ancestry individuals. Although there has been recent progress in combining multiple PGS trained on distinct populations, the problem of how to maximize performance given a multiple-ancestry cohort is largely unexplored. Here, we investigate the effect of sample size and ancestry composition on PGS performance for fifteen traits in UK Biobank. For some traits, PGS estimated using a relatively small African-ancestry training set outperformed, on an African-ancestry test set, PGS estimated using a much larger European-ancestry only training set. We observe similar, but not identical, results when considering other minority-ancestry groups within UK Biobank. Our results emphasise the importance of targeted data collection from underrepresented groups in order to address existing disparities in PGS performance
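    The aggregation described in this abstract is commonly written as a weighted sum of allele counts. The sketch below illustrates that idea only; the function name, the weights, and the genotypes are hypothetical and are not taken from the paper, which does not specify its estimation method here.

```python
# Minimal sketch: a polygenic score as a weighted sum of allele counts.
# All names and numbers are illustrative assumptions, not data from the paper.

def polygenic_score(dosages, weights):
    """Return sum_j weights[j] * dosages[j] for one individual."""
    if len(dosages) != len(weights):
        raise ValueError("dosages and weights must have the same length")
    return sum(w * g for g, w in zip(dosages, weights))

# Hypothetical per-variant effect estimates, e.g. learned on a training set.
weights = [0.5, -0.25, 1.0]

# Allele counts (0, 1, or 2 copies of the effect allele) for two individuals.
individuals = {"A": [2, 0, 1], "B": [0, 2, 0]}

scores = {name: polygenic_score(g, weights) for name, g in individuals.items()}
```

    In this picture, the ancestry question raised in the abstract concerns which training samples are used to estimate the weights; the scoring step itself stays the same.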

    The statistical analysis of genetic sequencing and rare variant association studies

    Understanding the role of genetic variability in complex traits is a central goal of modern human genetics research. So far, genome-wide association tests have not been able to discover SNPs that explain a large proportion of the heritability of disease. It is hoped that, with the advent of accessible DNA sequencing data, investigators can uncover more of the so-called missing heritability. The added information contained in sequencing data includes rare variants, that is, minor alleles whose population frequency is low. We examine several existing region-based rare variant association tests, including burden-based tests and similarity-based tests, and show that each is most powerful under a certain set of conditions that is unknown to the investigator. While some have proposed tests that combine the features of several existing tests, none has yet provided a test that combines the features of all existing tests. Here, we propose one such test under the framework of the SKAT test, and show that it is nearly as powerful as the most appropriately chosen test under a range of scenarios. Existing methods do not allow for missing values in the covariates. Standard use of complete case analysis may yield misleading results, including false positives and biased parameter estimates. To address this problem, we extend an existing maximum likelihood strategy for accommodating partially missing covariates to the SKAT framework for rare variant association testing. This results in a test with high power to identify genetic regions associated with quantitative traits while still providing unbiased estimation and correct control of type I error when covariates are missing at random. Since the framework is generic, we also consider the application of this approach to epigenetic data. A wide range of variable selection approaches can be applied to isolate individual rare variants within a region, yet there has been little evaluation of these approaches. 
We examine key methods for prioritizing individual variants and evaluate how these procedures perform with respect to false positives and power via application to simulated data and real data.
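    The contrast between burden-based and SKAT-type tests can be illustrated with unscaled test statistics. This is a simplified sketch under stated assumptions: no covariates, a quantitative trait, flat weights, and one common convention for the variance-component statistic. It does not compute p-values, and it is not the combined test or the missing-covariate extension proposed in the thesis. All function names and data are hypothetical.

```python
# Sketch: burden vs. SKAT-type statistics built from per-variant scores.
# Assumptions: no covariates, so residuals are the mean-centered phenotype.

def center(values):
    mean = sum(values) / len(values)
    return [v - mean for v in values]

def variant_scores(genotypes, phenotype):
    """Per-variant score S_j = sum_i g_ij * (y_i - mean(y))."""
    resid = center(phenotype)
    n_variants = len(genotypes[0])
    return [
        sum(row[j] * r for row, r in zip(genotypes, resid))
        for j in range(n_variants)
    ]

def burden_statistic(scores, weights):
    """Collapse first, then square: (sum_j w_j * S_j)^2."""
    return sum(w * s for w, s in zip(weights, scores)) ** 2

def skat_type_statistic(scores, weights):
    """Square first, then sum: sum_j (w_j * S_j)^2."""
    return sum((w * s) ** 2 for w, s in zip(weights, scores))

# Toy data: variant 0 raises the trait, variant 1 lowers it.
genotypes = [[1, 0], [0, 1], [0, 0], [0, 0]]  # rows are individuals
phenotype = [1.0, -1.0, 0.0, 0.0]
weights = [1.0, 1.0]  # flat weights, an assumption for illustration

scores = variant_scores(genotypes, phenotype)
burden = burden_statistic(scores, weights)
skat = skat_type_statistic(scores, weights)
```

    In this toy example the opposite-direction effects cancel in the burden statistic but not in the SKAT-type statistic, which is one instance of the abstract's point that each test is most powerful under different conditions.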

    Identification of genomic factors using family-based association studies

    Genome-wide association studies have become increasingly popular and important for detecting genetic associations of complex traits. However, it is well known that spurious associations can arise from statistical analyses that do not properly account for genetic relatedness among samples. Many methods have been proposed to guard against these spurious associations. Here we focus on multi-locus association studies of quantitative traits and case-control status, and propose algorithms that take genetically related samples into consideration to address possible confounding. As supervised dimension reduction methods, these algorithms perform well in association studies with a large number of biomarkers but a relatively small number of samples. Recently, linear mixed models have demonstrated their efficiency in GWAS of quantitative traits with multiple levels of sample structure. Most current mixed-model-based methods, such as EMMA, EMMAX, and GEMMA, can be viewed as single-locus methods because they test each SNP separately. Complex traits, however, are known to be controlled by multiple loci, so including multiple loci in the statistical model seems more appropriate. In the first part of my dissertation, we propose an algorithm that extends penalized orthogonal-components regression to family-based association studies (fPOCRE) of continuous traits. While multiple loci can be investigated at the same time, sample relatedness is modeled through the kinship matrix, and the shared confounding effects are included as random effects in the linear mixed model. Our proposed algorithm simultaneously selects biomarkers and constructs their linear combinations as components that optimally account for variation in traits. We compare fPOCRE with EMMAX, one of the most frequently used single-locus approaches, and with MLMM, a recently developed multi-locus approach. 
Our simulation study demonstrates that fPOCRE shows promising performance relative to both EMMAX and MLMM, with higher power and fewer false positives, when causal effects come from clusters of correlated SNPs. Real data are analyzed to illustrate the proposed approach and provide further comparisons. The case-control association study is a widely used design in genetic epidemiology and pharmacology, and it is also susceptible to potential confounding by sample structure. In the second part of my dissertation, we employ a multi-locus generalized estimating equation (GEE) model to study genetic associations of binary traits, capturing multiple levels of the sample structure with a working correlation matrix. The kinship matrix is used to model the working correlation matrix, and the penalized orthogonal-components regression method is developed to build such a multi-locus GEE model (GEE-POCRE). GEE-POCRE is compared with gPOCRE, a multi-locus method that does not consider pedigree information, and with TDT, FBAT, and ROADTRIPS, which are single-locus methods that consider sample structure. In our simulation studies, GEE-POCRE performs well both in protecting against spurious associations caused by sample structure and in achieving increased power.

    Variable selection in varying coefficient models for mapping quantitative trait loci

    The Collaborative Cross (CC), a renewable mouse resource that mimics the genetic diversity in humans, provides great data sources for mapping Quantitative Trait Loci (QTL). The recombinant inbred intercrosses (RIX) generated from CC recombinant inbred (RI) lines have several attractive features and can be produced repeatedly. Many quantitative traits are inherently complex and change with other covariates. To map such complex traits, phenotypes are measured across multiple values of covariates on each subject. In the first topic, we propose a more flexible nonparametric varying coefficient QTL mapping method for RIX data. This model lets the QTL effects evolve with certain covariates, and naturally extends classical parametric QTL mapping methods. Simulation results indicate that the varying coefficient QTL mapping has substantially higher power and higher mapping precision compared to parametric models when the assumption of constant genetic effects fails. We model the time-varying genetic effects with functional approximation using B-spline basis. We apply a nested permutation method to obtain threshold values for QTL detection. In the second topic, we extend the single marker QTL mapping to multiple QTL mapping. We treat multiple QTL mapping as a model/variable selection problem and propose a penalized mixed effects model. We apply a penalty function for the group selection of coefficients associated with each gene. We propose new selection procedures for tuning parameters. Simulations showed that the new mapping method performs better than the single marker analysis when multiple QTL exist. Last, in the third topic, we extend the multiple QTL mapping method to longitudinal data. We pay special attention to modeling the covariance structure of repeated measurements. Popular stationary assumptions on variance and covariance structures may not be realistic for many longitudinal traits. 
The structured antedependence (SAD) model is a parsimonious covariance model that allows for both nonstationary variance and correlation. We propose a penalized likelihood method for multiple QTL mapping using the SAD model. Simulation results showed the model selection method outperforms the single marker analysis. Furthermore, the performance of multiple QTL mapping will be affected if the covariance model is misspecified

    Research Review: A guide to computing and implementing polygenic scores in developmental research

    The increasing availability of genotype data in longitudinal population- and family-based samples provides opportunities for using polygenic scores (PGS) to study developmental questions in child and adolescent psychology and psychiatry. Here, we aim to provide a comprehensive overview of how PGS can be generated and implemented in developmental psycho(patho)logy, with a focus on longitudinal designs. As such, the paper is organized into three parts: First, we provide a formal definition of polygenic scores and related concepts, focusing on assumptions and limitations. Second, we give a general overview of the methods used to compute polygenic scores, ranging from the classic approach to more advanced methods. We include recommendations and reference resources available to researchers aiming to conduct PGS analyses. Finally, we focus on the practical applications of PGS in the analysis of longitudinal data. We describe how PGS have been used to research developmental outcomes, and how they can be applied to longitudinal data to address developmental questions

    Statistical analysis of large-scale genomic data using pathway information

    Thesis (Ph.D.) -- Seoul National University Graduate School: College of Natural Sciences, Interdisciplinary Program in Bioinformatics, February 2018. Taesung Park (박태성).
    In the past two decades, rapid advances in DNA sequencing technology have enabled extensive investigations into human genetic architecture, especially for the identification of genetic variants associated with complex traits. In particular, genome-wide association studies (GWAS) have played a key role in identifying genetic associations between single nucleotide variants (SNVs) and many complex biological pathologies. However, the genetic variants identified by many successful GWAS have explained only a modest part of the heritability of most phenotypes, and many hypotheses have been proposed to address the so-called missing heritability issue, such as rare variant association, gene-gene interaction, or multi-omics integration. Methods for rare variant analysis arose from extending individual variant-level approaches to the gene level, and then extending gene-level approaches to multiple phenotypes. In line with this trend, as the number of publicly available biological resources increases, recent methods for analyzing rare variants use pathway knowledge as a priori information. In this respect, many statistical methods for pathway-based analysis of rare variants have been proposed to analyze pathways individually. However, neglecting correlations between multiple pathways can produce misleading results, and pathway-based analyses of large-scale genetic datasets impose a massive computational burden. Moreover, while a number of methods for pathway-based rare-variant analysis of multiple phenotypes have been proposed, none considers a unified model that incorporates multiple pathways. In this thesis, we propose novel statistical methods for analyzing large-scale genetic datasets using pathway information: Pathway-based approach using HierArchical components of collapsed RAre variants Of High-throughput sequencing data (PHARAOH) and PHARAOH-multi. PHARAOH extends generalized structured component analysis and implements the method within the framework of generalized linear models, to accommodate phenotype data arising from a variety of exponential family distributions. PHARAOH constructs a single hierarchical model that consists of collapsed gene-level summaries and pathways, and analyzes entire pathways simultaneously by imposing ridge-type penalties on both gene and pathway coefficient estimates; hence our method considers the correlation of pathways and handles an entire dataset in a single model. In addition, PHARAOH-multi extends the original model to multivariate analysis while keeping the advantages of our previous approach. We extend PHARAOH to enable analysis of multiple traits using hierarchical components of genetic variants. PHARAOH-multi can also identify associations between multiple phenotypes and multiple pathways with a single model in which genes are nested within pathways as a hierarchy. Through simulation studies, PHARAOH was shown to have higher statistical power than the existing pathway-based methods. In addition, a detailed simulation study for PHARAOH-multi demonstrated the advantages of multivariate analysis over univariate analysis, and comparison studies showed that the proposed approach outperforms existing multivariate pathway-based methods. Finally, we conducted an analysis of whole-exome sequencing data from a Korean population study to compare the performance of the proposed methods with that of previous pathway-based methods, using validated pathway databases. As a result, PHARAOH successfully discovered 13 pathways for the liver enzymes, and PHARAOH-multi identified 8 pathways for multiple metabolic traits. Through a replication study using an independent, large-scale exome chip dataset, we replicated many pathways that were discovered by the proposed methods and showed their biological relationship to the target traits.
    Contents: Introduction: 1.1. The background on genetic association studies; 1.1.1. Genome-wide association studies and the missing heritability; 1.1.2. Rare variant analyses; 1.2. The purpose of this study; 1.3. Outline of the thesis. An overview of existing methods: 2.1. Review of pathway-based methods; 2.2.1. Competitive and self-contained tests: WKS and DRB; 2.2.2. Self-contained test: aSPU; 2.2.3. Self-contained test: MARV; 2.3. Generalized structured component analysis; 2.3.1. The model; 2.3.2. Parameter estimation. Pathway-based approach using rare variants: 3.1. Introduction; 3.2. Methods; 3.2.1. Notations and the model; 3.2.2. An exemplary structure; 3.3.3. Parameter estimation; 3.4. Simulation study; 3.4.1. The simulation dataset; 3.4.2. Comparison of methods using simulation dataset; 3.5. Application to analysis of liver enzymes; 3.5.1. Whole exome sequencing dataset for pathway discovery; 3.5.2. Replication study using exome chip dataset; 3.6. Discussion. Multivariate pathway-based approach using rare variants: 4.1. Introduction; 4.2. Methods; 4.2.1. Notations and the model; 4.2.2. An exemplary structure; 4.2.3. Parameter estimation; 4.2.4. Significance testing; 4.2.5. Multiple testing correction; 4.3. Simulation study; 4.3.1. The simulation model; 4.3.2. Evaluation with simulated data; 4.4. Application to the real datasets; 4.4.1. Real data discovery from whole-exome sequencing dataset; 4.4.2. Replication study using independent exome chip dataset; 4.5. Discussion. Summary & Conclusions. Bibliography. Abstract (in Korean).

    FINEMAP : a statistical method for identifying causal genetic variants

    The explosion of genomic data during the last ten years and the advent of Genome-Wide Association Studies (GWAS) have led to robust statistical associations between thousands of genomic regions and hundreds of phenotypes. However, any one associated genomic region can harbor thousands of correlated genetic variants, complicating the understanding of the underlying biological mechanisms that led to these associations. To address this problem, this doctoral thesis presents the development of the FINEMAP software for fine-mapping causal variants in these regions. In 2016, we solved the existing issue with the computationally expensive exhaustive search strategy of existing fine-mapping methods by implementing a Bayesian regression model and an ultrafast stochastic search algorithm in the FINEMAP software. We demonstrated that FINEMAP opens up completely new opportunities by fine-mapping the High Density Lipoprotein (HDL) cholesterol association to the LIPC locus with 20,000 variants in less than 90 seconds, while exhaustive search would require many years. With extensive simulations we further showed that FINEMAP is as accurate as exhaustive search when the latter can be completed and achieves even higher accuracy when the latter must be restricted due to computational reasons. Thus, FINEMAP is a promising tool for future fine-mapping analyses. Fine-mapping methods that use GWAS results also require Linkage Disequilibrium (LD) information as input in the form of estimates of pairwise correlations between variants. Motivated by feedback from FINEMAP users, we investigated in 2017 the consequences of misspecification of LD that could happen when publicly available reference genomes are used. We demonstrated both empirically and theoretically that the size of the reference panel needs to scale with the GWAS sample size to produce accurate results and we provided the LDstore software to help share LD estimates. 
This finding has important consequences for the application of all fine-mapping methods that use GWAS results from GWAS consortia, in which accurate LD estimates from each participating study are typically not available. In 2018, we implemented in FINEMAP an approach for estimating how much phenotypic variation can be explained by the causal variants. To demonstrate this, we applied FINEMAP to 110 regions across 51 biomarkers in 5,265 Finnish samples. We compared regional heritability estimation using FINEMAP with both the variance component model BOLT and the fixed-effect model HESS in biomarker-associated regions, showing good concordance among all methods. Through simulations with biobank-scale projects, we also illustrated how violations of model assumptions about polygenicity or unspecified genetic architecture induce inaccuracy in existing heritability estimates, an inaccuracy that becomes more pronounced as statistical power to identify causal variants increases. Ever-increasing GWAS sample sizes, soon reaching millions of samples, provide unprecedented statistical power to decompose heritability estimates from polygenic models into heritability contributions from causal variants. In conclusion, this doctoral thesis shows that (1) the computational efficiency and accuracy of FINEMAP make it a promising fine-mapping tool, (2) LD estimates need to be chosen more carefully than previously thought to avoid bias, and (3) large-scale data sets provide new opportunities for fine-mapping to deduce a variant-level picture of regional genetic architecture
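
    The abstract describes LD input as estimates of pairwise correlations between variants. A minimal sketch of that computation from a genotype matrix follows. It is not the LDstore or FINEMAP implementation; the function names and the toy genotypes are hypothetical, and real pipelines also handle missing data, phasing, and file formats that are ignored here.

```python
import math

# Sketch: LD as pairwise Pearson correlations between variant columns.
# Assumption: genotypes are allele counts with no missing values, and every
# variant is polymorphic (a constant column would cause division by zero).

def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ld_matrix(genotypes):
    """Return the variant-by-variant correlation matrix.

    genotypes: list of rows, one row of allele counts per individual.
    """
    columns = list(zip(*genotypes))  # one tuple per variant
    return [[pearson(a, b) for b in columns] for a in columns]

# Toy reference panel: 4 individuals, 3 variants.
panel = [
    [0, 0, 2],
    [1, 1, 1],
    [2, 2, 0],
    [1, 1, 1],
]
ld = ld_matrix(panel)
```

    With only four individuals, as in this toy panel, the correlations are very noisy estimates, which loosely mirrors the abstract's point that the reference panel size needs to scale with the GWAS sample size.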