17 research outputs found

    Identification of causal genes for complex traits.

    Motivation: Although genome-wide association studies (GWAS) have identified thousands of variants associated with common diseases and complex traits, only a handful of these variants are validated to be causal. We consider 'causal variants' as variants which are responsible for the association signal at a locus. As opposed to association studies that benefit from linkage disequilibrium (LD), the main challenge in identifying causal variants at associated loci lies in distinguishing among the many closely correlated variants due to LD. This is particularly important for model organisms such as inbred mice, where LD extends much further than in human populations, resulting in large stretches of the genome with significantly associated variants. Furthermore, these model organisms are highly structured and require correction for population structure to remove potential spurious associations. Results: In this work, we propose CAVIAR-Gene (CAusal Variants Identification in Associated Regions), a novel method that is able to operate across large LD regions of the genome while also correcting for population structure. A key feature of our approach is that it provides as output a minimally sized set of genes that captures the genes which harbor causal variants with probability ρ. Through extensive simulations, we demonstrate that our method not only speeds up computation, but also has an average of 10% higher recall rate compared with the existing approaches. We validate our method using real mouse high-density lipoprotein (HDL) data and show that CAVIAR-Gene is able to identify Apoa2 (a gene known to harbor causal variants for HDL), while reducing the number of genes that need to be tested for functionality by a factor of 2. Availability and implementation: Software is freely available for download at genetics.cs.ucla.edu/caviar.

    Using genomic annotations increases statistical power to detect eGenes.

    Motivation: Expression quantitative trait loci (eQTLs) are genetic variants that affect gene expression. In eQTL studies, one important task is to find eGenes, or genes whose expression is associated with at least one eQTL. The standard statistical method to determine whether a gene is an eGene requires association testing at all nearby variants and a permutation test to correct for multiple testing. The standard method, however, does not consider genomic annotation of the variants. In practice, variants near gene transcription start sites (TSSs) or certain histone modifications are likely to regulate gene expression. In this article, we introduce a novel eGene detection method that considers this empirical evidence and thereby increases the statistical power. Results: We applied our method to the liver Genotype-Tissue Expression (GTEx) data using distance from TSSs, DNase hypersensitivity sites, and six histone modifications as the genomic annotations for the variants. Each of these annotations helped us detect more candidate eGenes. Distance from TSS appears to be the most important annotation; specifically, using this annotation, our method discovered 50% more candidate eGenes than the standard permutation method.

    Accurate estimation of SNP-heritability from biobank-scale data irrespective of genetic architecture.

    SNP-heritability is a fundamental quantity in the study of complex traits. Recent studies have shown that existing methods to estimate genome-wide SNP-heritability can yield biases when their assumptions are violated. While various approaches have been proposed to account for frequency- and linkage disequilibrium (LD)-dependent genetic architectures, it remains unclear which estimates reported in the literature are reliable. Here we show that genome-wide SNP-heritability can be accurately estimated from biobank-scale data irrespective of genetic architecture, without specifying a heritability model or partitioning SNPs by allele frequency and/or LD. We show analytically and through extensive simulations starting from real genotypes (UK Biobank, N = 337 K) that, unlike existing methods, our closed-form estimator is robust across a wide range of architectures. We provide estimates of SNP-heritability for 22 complex traits in the UK Biobank and show that, consistent with our results in simulations, existing biobank-scale methods yield estimates up to 30% different from our theoretically justified approach.

    Multiple testing correction in linear mixed models.

    Background: Multiple hypothesis testing is a major issue in genome-wide association studies (GWAS), which often analyze millions of markers. The permutation test is considered to be the gold standard in multiple testing correction as it accurately takes into account the correlation structure of the genome. Recently, the linear mixed model (LMM) has become the standard practice in GWAS, addressing issues of population structure and insufficient power. However, none of the current multiple testing approaches are applicable to LMM. Results: We were able to estimate per-marker thresholds as accurately as the gold standard approach in real and simulated datasets, while reducing the time required from months to hours. We applied our approach to mouse, yeast, and human datasets to demonstrate the accuracy and efficiency of our approach. Conclusions: We provide an efficient and accurate multiple testing correction approach for linear mixed models. We further provide an intuition about the relationships between per-marker threshold, genetic relatedness, and heritability, based on our observations in real data.

    Multiple testing correction in linear mixed models

    BACKGROUND: Multiple hypothesis testing is a major issue in genome-wide association studies (GWAS), which often analyze millions of markers. The permutation test is considered to be the gold standard in multiple testing correction as it accurately takes into account the correlation structure of the genome. Recently, the linear mixed model (LMM) has become the standard practice in GWAS, addressing issues of population structure and insufficient power. However, none of the current multiple testing approaches are applicable to LMM. RESULTS: We were able to estimate per-marker thresholds as accurately as the gold standard approach in real and simulated datasets, while reducing the time required from months to hours. We applied our approach to mouse, yeast, and human datasets to demonstrate the accuracy and efficiency of our approach. CONCLUSIONS: We provide an efficient and accurate multiple testing correction approach for linear mixed models. We further provide an intuition about the relationships between per-marker threshold, genetic relatedness, and heritability, based on our observations in real data. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s13059-016-0903-6) contains supplementary material, which is available to authorized users

    Characterization of adiposity and inflammation genetic pleiotropy underlying cardiovascular risk factors in Hispanics.

    The observed overlap between genetic variants associated with both adiposity and inflammatory markers suggests that changes in both adiposity and inflammation could be partially mediated by common pathways. The pervasive but sparsely characterized “pleiotropic” genetic variants associated with both adiposity and inflammation have been hypothesized to provide insight into the shared biology. This study explored and characterized the genetic pleiotropy underpinning adiposity and inflammation using genetic and phenotypic observations from the Cameron County Hispanic Cohort (CCHC). A total of 3,313 samples and >9 million single nucleotide polymorphisms (SNPs) were examined in this study. Mixed model genome-wide association studies (GWAS) were performed for 9 phenotypes including C-reactive protein (CRP), Interleukin (IL)-6, IL-8, fibrinogen, body mass index (BMI), waist circumference (WC) in males and females, and waist to hip ratio (WHR) in males and females (separately). GWAS for WHR and WC were meta-analyzed to obtain sex-combined results. Pleiotropy assessment was completed using the adaptive Sum of Powered Score (aSPU) test. Three genetic loci with evidence of pleiotropy on chromosomes 3, 12 and 18 were fine-mapped to distinguish the set of likely causal variants. Causal mediation analysis was used to assess whether likely causal variants were independently associated with both inflammation and adiposity. At least 3 signals, on chromosomes 3, 12, and 12, were identified that suggested the presence of SNPs with strong pleiotropic p-values (< 5 × 10⁻⁶). The fine-mapping of these three suspected pleiotropic regions distinguished 22 variants with posterior causality probabilities greater than 50%. The mediation analysis indicated that rs60505812, on chromosome 3, was independently associated with both an inflammatory marker (IL-6) and an adiposity measure (BMI). For the variant rs73093474, on chromosome 12, results indicated both a direct association with CRP and an indirect association (via WHR). The identification of likely pleiotropic variants indicated that 1) a considerable degree of overlapping genetic pleiotropy exists between adiposity and inflammation, and 2) evidence exists to support both direct and indirect pleiotropy. The results showed the potential of these genetic variants to provide biological insight, intended to improve the cardiovascular health of Hispanics and, by extension, all populations.

    Non-parametric machine learning for biological sequence data

    In the past decade there has been a massive increase in the volume of biological sequence data, driven by massively parallel sequencing technologies. This has enabled data-driven statistical analyses using non-parametric predictive models (including those from machine learning) to complement more traditional, hypothesis-driven approaches. This thesis addresses several challenges that arise when applying non-parametric predictive models to biological sequence data. Some of these challenges arise due to the nature of the biological system of interest. For example, in the study of the human microbiome the phylogenetic relationships between microorganisms are often ignored in statistical analyses. This thesis outlines a novel approach to modelling phylogenetic similarity using string kernels and demonstrates its utility in the two-sample test and host-trait prediction. Other challenges arise from limitations in our understanding of the models themselves. For example, calculating variable importance (a key task in biomedical applications) is not possible for many models. This thesis describes a novel extension of an existing approach to compute importance scores for grouped variables in a Bayesian neural network. It also explores the behaviour of random forest classifiers when applied to microbial datasets, with a focus on the robustness of the biological findings under different modelling assumptions.

    Disruption of neural crest enhancer landscapes as an etiological mechanism for human neurocristopathies

    The embryonic development of the human facial features is a highly complex process which requires very exact spatial and temporal regulation of gene expression during neural crest (NC) development. NC cells (NCC) are a transient embryonic cell type with wide differentiation potential that contributes to the formation and morphogenesis of multiple tissues and organs, including many parts of the face. Just like any other cell type, NCC possess a characteristic set of enhancers that, by controlling the expression of specific genes, define cellular identity. Impairment of this regulation can lead to craniofacial malformations, such as orofacial cleft (OFC), which are frequently referred to as neurocristopathies and which represent a heavy burden on both the affected individuals and society. Understanding how genetic or structural disruption of enhancer activity during NC development can lead to human neurocristopathies is the central goal of this work. In the long term, the knowledge gained should enable early detection and point to potential therapeutic approaches. Here we investigate the pathomechanism of both syndromic (i.e. Branchiooculofacial Syndrome (BOFS)) and non-syndromic (i.e. OFC) neurocristopathies, by combining in vitro and in vivo NC developmental models with genetic engineering approaches and multiple genomic methods. First, we describe a unique patient with BOFS, who, in contrast to previously reported cases, does not present heterozygous mutations within TFAP2A, a NC master regulator. Instead, the patient carries a de novo heterozygous 89 Mb inversion in which one of the breakpoints is located 40 kb downstream of TFAP2A. We first showed that this inversion separates TFAP2A from enhancers that are located within the same large topologically associating domain (TAD) and that are essential for TFAP2A expression in NCC. Importantly, using patient-specific human induced pluripotent stem cells (hiPSC) and a robust in vitro differentiation system towards NCC, we then showed that the inversion causes a loss of physical interactions between the inverted TFAP2A allele and its cognate enhancers, leading to TFAP2A monoallelic and haploinsufficient expression in human NCC. Overall, this first part provides a powerful approach to investigate the pathological mechanisms of structural variants that are predicted to disrupt the 3D genome organization of gene regulatory landscapes and that, for various reasons (i.e. limited access to relevant patient material, differences in gene dosage sensitivity between mice and humans, difficulties in recapitulating certain structural variants), cannot be properly evaluated in vivo. Second, we combined previously generated hNCC enhancer maps with OFC risk loci identified through genome-wide association studies (GWAS) and, as a result, we revealed a highly conserved enhancer (i.e. Enh2p24.2) as a potential candidate harboring genetic variants involved in OFC. GWAS link common single nucleotide polymorphisms (SNPs) with quantitative traits and complex disorders. However, most disease-associated SNPs occur in non-coding regions of the human genome and, consequently, the etiological relevance of these genetic variants cannot be easily connected to a gene. Nevertheless, accumulating evidence suggests that these disease-associated SNPs may contribute to human disease susceptibility by altering enhancers. Interestingly, SNPs associated with OFC are overrepresented in NCC enhancers. Therefore, we hypothesize that SNPs associated with OFC contribute to the etiology of the disorder by altering NCC enhancers and, consequently, the expression of relevant genes. Using Enh2p24.2 as a bait in circularized chromosome conformation capture sequencing (4C-seq) experiments, we identified two distally located genes, MYCN and DDX1, as its potential targets. Using in vitro and in vivo NCC developmental models, we then demonstrated that both genes are essential for normal facial development. While MYCN was not a surprising candidate to be involved in the etiology of OFC, the identification of DDX1 as a novel regulator of facial development might provide new insights into the molecular processes (e.g. transcription-coupled DNA repair) implicated in OFC and, potentially, other human neurocristopathies (e.g. neuroblastoma).