45,581 research outputs found

    Identification of causal genes for complex traits.

    Motivation: Although genome-wide association studies (GWAS) have identified thousands of variants associated with common diseases and complex traits, only a handful of these variants have been validated as causal. We define 'causal variants' as variants that are responsible for the association signal at a locus. Whereas association studies benefit from linkage disequilibrium (LD), the main challenge in identifying causal variants at associated loci lies in distinguishing among the many closely correlated variants due to LD. This is particularly important for model organisms such as inbred mice, where LD extends much further than in human populations, resulting in large stretches of the genome with significantly associated variants. Furthermore, these model organisms are highly structured and require correction for population structure to remove potential spurious associations. Results: In this work, we propose CAVIAR-Gene (CAusal Variants Identification in Associated Regions), a novel method that is able to operate across large LD regions of the genome while also correcting for population structure. A key feature of our approach is that it provides as output a minimally sized set of genes that captures the genes which harbor causal variants with probability ρ. Through extensive simulations, we demonstrate that our method not only speeds up computation but also has an average of 10% higher recall rate compared with existing approaches. We validate our method using real mouse high-density lipoprotein (HDL) data and show that CAVIAR-Gene is able to identify Apoa2 (a gene known to harbor causal variants for HDL), while reducing the number of genes that need to be tested for functionality by a factor of 2. Availability and implementation: Software is freely available for download at genetics.cs.ucla.edu/caviar
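The idea of a minimally sized gene set that captures the causal genes with probability ρ can be illustrated with a simplified sketch. This is not the CAVIAR-Gene algorithm itself: it assumes each gene already has a posterior probability of harboring a causal variant and that those probabilities can be summed (a single-causal-gene simplification). The function name, the gene labels and the probability values are all hypothetical.

```python
from typing import Dict, List


def credible_gene_set(posteriors: Dict[str, float], rho: float = 0.95) -> List[str]:
    """Return the smallest set of genes whose normalized posterior mass reaches rho.

    Assumption (not from the abstract): per-gene posteriors are additive,
    i.e. a single-causal-gene simplification.
    """
    if not 0.0 < rho <= 1.0:
        raise ValueError("rho must lie in (0, 1]")
    total = sum(posteriors.values())
    if total <= 0.0:
        raise ValueError("posteriors must have positive total mass")

    # Adding genes in decreasing order of posterior gives the smallest set
    # that reaches the target mass.
    ranked = sorted(posteriors.items(), key=lambda kv: kv[1], reverse=True)
    chosen: List[str] = []
    cumulative = 0.0
    for gene, prob in ranked:
        chosen.append(gene)
        cumulative += prob / total
        if cumulative >= rho - 1e-12:  # small tolerance for float rounding
            break
    return chosen


# Hypothetical posteriors for four genes at one associated locus.
example = {"gene_A": 0.60, "gene_B": 0.25, "gene_C": 0.10, "gene_D": 0.05}
selected = credible_gene_set(example, rho=0.8)
```

With these made-up values, two genes are enough to reach ρ = 0.8, so only half of the locus would need functional follow-up.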

    Haiguspõhjuslike geenide tuvastamine statistiliste meetoditega (Identification of disease-causing genes using statistical methods)

    The electronic version of the dissertation does not include the publications. A prerequisite for understanding and curing disease is the identification of genes active in disease processes: drugs could be developed to target the proteins encoded by such causal genes. The main standard in discovering causal relationships between traits is provided by lab experiments and randomized clinical trials, but these can be time-consuming and expensive to undertake. In this dissertation, we show that functionally relevant genes in the development of diseases and other complex traits can be identified more effectively using statistical methods. Causal statistical analysis in genetics has only recently gained momentum by taking advantage of the vast amount of data collected by national biobanks. Due to the novelty and projected impact of the field, the corresponding mathematical theory is still evolving, and rapidly so. We devote considerable attention to introducing this theory systematically and then expand on it in practical applications. We apply the principles of causal analysis to develop methodology for identifying causal genes in small samples (n ≈ 500), ascertaining the function of the inflammatory biomarker C-reactive protein in immune response. By utilizing domain knowledge, we create an algorithm, robust to the assumptions of causal models, for hypothesis-free identification of causal genes for arbitrary complex traits over the entire genome. Furthermore, we take an in-depth look at a specific disease-associated genomic region (16p11.2) and are able to pinpoint genes responsible for reproductive health. With respect to the personalized medicine movement, we study whether the causal genes differ between sexes. Finally, we ask whether the popular association studies between gene expression and complex traits identify causal genes, disease-induced changes in gene expression or simply random noise. We validate our primary research results with lab experiments.
    https://www.ester.ee/record=b541721

    Embracing polygenicity: a review of methods and tools for psychiatric genetics research.

    The availability of genome-wide genetic data on hundreds of thousands of people has led to an equally rapid growth in methodologies available to analyse these data. While the motivation for undertaking genome-wide association studies (GWAS) is identification of genetic markers associated with complex traits, once generated these data can be used for many other analyses. GWAS have demonstrated that complex traits exhibit a highly polygenic genetic architecture, often with shared genetic risk factors across traits. New methods to analyse data from GWAS are increasingly being used to address a diverse set of questions about the aetiology of complex traits and diseases, including psychiatric disorders. Here, we give an overview of some of these methods and present examples of how they have contributed to our understanding of psychiatric disorders. We consider: (i) estimation of the extent of genetic influence on traits, (ii) uncovering of shared genetic control between traits, (iii) prediction of genetic risk for individuals, (iv) uncovering of causal relationships between traits, (v) identification of causal single-nucleotide polymorphisms and genes, and (vi) detection of genetic heterogeneity. This classification helps organise the large number of recently developed methods, although some could be placed in more than one category. While some methods require GWAS data on individual people, others simply use GWAS summary statistics data, allowing novel well-powered analyses to be conducted at a low computational burden.

    Moving toward a system genetics view of disease

    Testing hundreds of thousands of DNA markers in human, mouse, and other species for association to complex traits like disease is now a reality. However, information on how variations in DNA impact complex physiologic processes flows through transcriptional and other molecular networks. In other words, DNA variations impact complex diseases through the perturbations they cause to transcriptional and other biological networks, and these molecular phenotypes are intermediate to clinically defined disease. Because it is also now possible to monitor transcript levels in a comprehensive fashion, integrating DNA variation, transcription, and phenotypic data has the potential to enhance identification of the associations between DNA variation and diseases like obesity and diabetes, as well as characterize those parts of the molecular networks that drive these diseases. Toward that end, we review methods for integrating expression quantitative trait loci (eQTLs), gene expression, and clinical data to infer causal relationships among gene expression traits and between expression and clinical traits. We further describe methods to integrate these data in a more comprehensive manner by constructing coexpression gene networks that leverage pairwise gene interaction data to represent more general relationships. To infer gene networks that capture causal information, we describe a Bayesian algorithm that further integrates eQTLs, expression, and clinical phenotype data to reconstruct whole-gene networks capable of representing causal relationships among genes and traits in the network. These emerging network approaches, aimed at processing high-dimensional biological data by integrating data from multiple sources, represent some of the first steps in statistical genetics to identify multiple genetic perturbations that alter the states of molecular networks and that in turn push systems into disease states. 
Evolving statistical procedures that operate on networks will be critical to extracting information related to complex phenotypes like disease, as research goes beyond a single-gene focus. The early successes achieved with the methods described herein suggest that these more integrative genomics approaches to dissecting disease traits will significantly enhance the identification of key drivers of disease beyond what could be achieved by genetic association studies alone.

    Statistical Methods for Identifying Genetic Risk Factors of Lung Diseases

    Great efforts have been made to understand the mechanisms of complex diseases. Besides studying environmental factors and lifestyles, it is imperative to find disease-causing genes. With the development of sequencing technology and the rapid accumulation of diverse types of high-throughput biological data, a promising direction for identifying disease genes is data integration. Genes affect diseases through different biological activities. Integrating different biological data can improve the power of gene discovery and the understanding of pathogenic mechanisms. In recent years, a great number of omics databases have become available. Methods for multi-omics data integration have successfully improved the statistical power of gene identification and of mapping genetic risk factors to specific cell types or epigenomic functions. Many traits have been found to share common genetic factors. Researchers have discovered complex relationships between multiple traits and multiple genes. With the growth of large biobank studies and genome-wide association studies (GWAS), many multi-trait modeling methods have been proposed to test for the existence of shared genetic factors between diseases, to quantify them, and to improve the statistical power of GWAS. In this dissertation, I propose new statistical models and methods for gene identification through data integration. I aimed to combine different biological data and integrate information shared between traits. More specifically, in the first chapter, a comprehensive and powerful pipeline for integrative data analysis is proposed to identify genes associated with idiopathic pulmonary fibrosis (IPF). By integrating GWAS with transcriptome data and leveraging shared genetic factors between traits, 24 novel genes were identified for IPF susceptibility, which expands the understanding of the complex genetic architecture of IPF.
In the second chapter, building on the success of multi-trait analysis in the first chapter, I propose a novel statistical model called MAGAL: Multi-trait analysis of GWAS summary statistics using local genetic correlation. The goal was to leverage the local pleiotropic effect to increase the statistical power of identifying trait-gene associations and to dissect disease heterogeneity using shared genetic factors across traits. MAGAL identified 144 candidate genes associated with chronic obstructive pulmonary disease (COPD) and showed improved power compared to previous methods. Integrative analysis of lung eQTL, bulk, and single-cell expression data prioritized 22 genes and suggested novel disease-related pathways. Genetic risk scores constructed from shared genetic factors between COPD and eosinophil percentage identified subgroups with heterogeneous phenotype characteristics and indicated new COPD subtypes. In the third chapter, building on the success of combining GWAS and transcriptomics data in the first chapter, I propose a new statistical framework called INSECT: Integrative analysis of exomics and single-cell transcriptomics for gene prioritization. The aim was to prioritize disease-causing genes using exome-wide association study and single-cell expression data. Thirty-one variants in coding regions and one gene were found to be significantly associated with COPD. INSECT identified five significantly associated cell types and prioritized 1047 genes with the highest probability of being disease-causing genes. Our results highlight the importance of mitochondrial dysfunction in COPD and the shared mechanisms between COPD and cancers.

    Novel integrative genomics strategies to identify genes for complex traits

    Forward genetics is a common approach to dissecting complex traits like common human diseases. The ultimate aim of this approach is the identification of genes that are causal for disease or other phenotypes of interest. However, the forward genetics approach is by definition restricted to the identification of genes that have incurred mutations over the course of evolution or as a result of chemical mutagenesis, and that as a result lead to disease or to variations in other phenotypes of interest. Genes that harbour no such mutations, but that play key roles in parts of the biological network that lead to disease, are systematically missed by this class of approaches. Recently, a class of novel integrative genomics approaches has been devised to elucidate the complexity of common human diseases by intersecting genotypic, molecular profiling, and clinical data in segregating populations. These novel approaches take a more holistic view of biological systems and leverage the vast network of gene–gene interactions, in combination with DNA variation data, to establish causal relationships among molecular profiling traits and between molecular profiling and disease (or other classic phenotypes). A number of novel genes for disease phenotypes have been identified as a result of these approaches, highlighting the utility of integrating orthogonal sources of data to get at the underlying causes of disease.

    Differentially expressed genes reflect disease-induced rather than disease-causing changes in the transcriptome.

    Comparing transcript levels between healthy and diseased individuals allows the identification of differentially expressed genes, which may be causes, consequences or mere correlates of the disease under scrutiny. We propose a method to decompose the observational correlation between gene expression and phenotypes into components driven by confounders and by forward and reverse causal effects. The bi-directional causal effects between gene expression and complex traits are obtained by Mendelian Randomization integrating summary-level data from GWAS and whole-blood eQTLs. Applying this approach to complex traits reveals that forward effects make a negligible contribution. For example, BMI- and triglycerides-gene expression correlation coefficients robustly correlate with trait-to-expression causal effects (r_BMI = 0.11, P_BMI = 2.0 × 10^-51 and r_TG = 0.13, P_TG = 1.1 × 10^-68), but not detectably with expression-to-trait effects. Our results demonstrate that studies comparing the transcriptome of diseased and healthy subjects are more prone to reveal disease-induced gene expression changes than disease-causing ones.
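A causal effect can be estimated from summary-level data by Mendelian Randomization in several ways. The sketch below uses the fixed-effect inverse-variance weighted (IVW) estimator, a common choice. The abstract does not say which estimator the authors used, so treat this as an illustration only. The function name and the input numbers are hypothetical, and the instruments are assumed to be independent.

```python
import math
from typing import Sequence, Tuple


def ivw_mr(
    beta_exposure: Sequence[float],
    beta_outcome: Sequence[float],
    se_outcome: Sequence[float],
) -> Tuple[float, float]:
    """Fixed-effect IVW estimate of the exposure-to-outcome causal effect.

    Each position is one genetic instrument: its effect on the exposure,
    its effect on the outcome, and the standard error of the outcome effect.
    Returns (estimate, standard_error).
    """
    n = len(beta_exposure)
    if n == 0 or len(beta_outcome) != n or len(se_outcome) != n:
        raise ValueError("inputs must be non-empty and of equal length")

    numerator = 0.0
    denominator = 0.0
    for bx, by, se in zip(beta_exposure, beta_outcome, se_outcome):
        weight = 1.0 / (se * se)      # inverse variance of the outcome effect
        numerator += weight * bx * by
        denominator += weight * bx * bx
    if denominator == 0.0:
        raise ValueError("all exposure effects are zero")

    return numerator / denominator, math.sqrt(1.0 / denominator)


# Hypothetical summary statistics for two instruments.
estimate, std_err = ivw_mr([0.1, 0.2], [0.05, 0.10], [0.01, 0.01])
```

For the bi-directional setting described above, the same function could be called twice: once with eQTLs as instruments (expression as exposure, trait as outcome) and once with trait-associated variants as instruments (trait as exposure, expression as outcome).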

    Graph pangenome captures missing heritability and empowers tomato breeding

    Missing heritability in genome-wide association studies defines a major problem in genetic analyses of complex biological traits (1,2). The solution to this problem is to identify all causal genetic variants and to measure their individual contributions (3,4). Here we report a graph pangenome of tomato constructed by precisely cataloguing more than 19 million variants from 838 genomes, including 32 new reference-level genome assemblies. This graph pangenome was used for genome-wide association study analyses and heritability estimation of 20,323 gene-expression and metabolite traits. The average estimated trait heritability is 0.41 compared with 0.33 when using the single linear reference genome. This 24% increase in estimated heritability is largely due to resolving incomplete linkage disequilibrium through the inclusion of additional causal structural variants identified using the graph pangenome. Moreover, by resolving allelic and locus heterogeneity, structural variants improve the power to identify genetic factors underlying agronomically important traits, leading to, for example, the identification of two new genes potentially contributing to soluble solid content. The newly identified structural variants will facilitate genetic improvement of tomato through both marker-assisted selection and genomic selection. Our study advances the understanding of the heritability of complex traits and demonstrates the power of the graph pangenome in crop breeding.

    Combining genome-wide association mapping and transcriptional networks to identify novel genes controlling glucosinolates in Arabidopsis thaliana.

    Background: Genome-wide association (GWA) is gaining popularity as a means to study the architecture of complex quantitative traits, partially due to the improvement of high-throughput low-cost genotyping and phenotyping technologies. Glucosinolate (GSL) secondary metabolites within Arabidopsis spp. can serve as a model system to understand the genomic architecture of adaptive quantitative traits. GSL are key anti-herbivory defenses that impart adaptive advantages within field trials. While little is known about how variation in the external or internal environment of an organism may influence the efficiency of GWA, GSL variation is known to be highly dependent upon the external stresses and developmental processes of the plant, making it an excellent model for studying conditional GWA. Methodology/principal findings: To understand how development and environment can influence GWA, we conducted a study using 96 Arabidopsis thaliana accessions, >40 GSL phenotypes across three conditions (one developmental comparison and one environmental comparison) and ~230,000 SNPs. Developmental stage had dramatic effects on the outcome of GWA, with each stage identifying different loci associated with GSL traits. Further, while the molecular bases of numerous quantitative trait loci (QTL) controlling GSL traits have been identified, there is currently no estimate of how many additional genes may control natural variation in these traits. We developed a novel co-expression network approach to prioritize the thousands of GWA candidates and successfully validated a large number of these genes as influencing GSL accumulation within A. thaliana using single gene isogenic lines. Conclusions/significance: Together, these results suggest that complex traits imparting environmentally contingent adaptive advantages are likely influenced by up to thousands of loci that are sensitive to fluctuations in the environment or developmental state of the organism. Additionally, while GWA is highly conditional upon genetics, the use of additional genomic information can rapidly identify causal loci en masse.