2,538 research outputs found

    A broad overview of genotype imputation: Standard guidelines, approaches, and future investigations in genomic association studies

    The advent of genomic big data and the statistical need to reach significant results have made genome-wide association studies hungry for a huge number of genetic markers scattered along the whole genome. From its very beginning, genotype imputation has served this purpose: this statistical, inferential procedure, based on a known reference panel, made it theoretically possible to extend association analyses to a greater number of polymorphic sites that were not assayed by the genotyping technology used. In this review, we present a broad overview of the genotype imputation process, covering the best-known methods and the main areas of interest, with a closer look at the most up-to-date approaches, a deeper discussion of its use in the present-day genomic landscape, and an outlook on its future developments and areas of investigation.

    Genomisk prediksjon ved bruk av høy tetthets- og hel-genom sekvens genotyper [Genomic prediction using high-density and whole-genome sequence genotypes]

    The main objective of this thesis was to investigate genomic prediction methods for high-density and whole-genome sequence genotypes, with emphasis on traits for which pedigree-based predictions may struggle to reach a high accuracy, such as disease resistance and maternal traits. A Bayesian variable selection method that combines a polygenic term through a G-matrix with a BayesC term (BayesGC) was compared with Genomic Best Linear Unbiased Prediction (GBLUP) and, in Papers I and II, also with BayesC. Paper I investigated genomic prediction accuracy for host resistance to salmon lice in Atlantic salmon (Salmo salar). Three genomic prediction methods (GBLUP, BayesC and BayesGC) were compared using 215K and 750K SNP genotypes in both within-family and across-family prediction scenarios. The data consisted of 1385 fish with both phenotypes and genotypes, and prediction accuracy was determined through five-fold cross-validation. The results showed accuracies of ~0.6 and ~0.61 for across-family prediction with the 215K and 750K genotypes, respectively, and ~0.67 for within-family prediction with both genotype densities. BayesGC showed a slightly higher prediction accuracy than GBLUP and BayesC, especially for the across-family predictions, but the differences were not significant. Paper II investigated the prediction accuracy of GBLUP, BayesC and BayesGC for six maternal traits in Landrace sows. The data consisted of between 10,000 and 15,000 sows, all genotyped and imputed to a density of 660K SNPs. The effects of different priors for the Bayesian variable selection methods were also investigated. The ~1,000 youngest sows were used as validation animals. Genomic prediction accuracy ranged from 0.31 to 0.61 across the traits. Within traits, the accuracy did not vary much between methods and priors. 
BayesGC had 9.8% and 3% higher accuracy than GBLUP for the traits M3W and BCS, respectively; for the other traits, the differences were minor. For within-breed prediction, marker density and reference population size are often sufficient. However, when predicting across breeds, one might need a higher density, such as whole-genome sequence (WGS), or one could benefit from functional markers derived from WGS. Paper III investigated prediction accuracy for four maternal traits in two pig populations, a pure-bred Landrace (L) line and a Synthetic (S) Yorkshire/Large White line. Prediction accuracy was tested with three marker data sets: high-density (HD) genotypes, WGS, and markers selected from WGS based on their pig Combined Annotation Dependent Depletion (pCADD) score. Two genomic prediction methods (GBLUP and BayesGC) were investigated for across-, within-, and multi-line prediction. For across- and within-line prediction, reference populations of between 1K and 30K animals were analysed. In addition, multi-line reference populations consisting of 1K, 3K or 6K animals from each line in different ratios were tested. The results showed that a reference population of 3K-6K animals was usually sufficient to achieve a high within-line prediction accuracy. However, increasing the reference population to 30K animals further increased prediction accuracy for two of the traits. A reference population of 30K across-line animals achieved a similar accuracy to 1K within-line animals. For multi-line prediction, the accuracy depended most on the number of within-line animals in the reference data. The S-line generally provided a higher prediction accuracy than the L-line. Using pCADD scores to reduce the number of markers from WGS data in combination with the GBLUP method generally reduced prediction accuracies relative to GBLUP_HD analyses. 
When using BayesGC, prediction accuracies were generally similar for HD, pCADD, and WGS marker data, suggesting that the Bayesian method selects a suitable set of markers irrespective of the markers provided. Overall, these three studies showed that BayesGC seemed to have a slight advantage over GBLUP, especially with large datasets, high-density genotypes, and when relationships between the reference and validation animals were weaker. They also showed that the relationship between the animals in the reference and validation populations, and the size of the reference population, had a greater impact on prediction accuracy than the prediction method.

    Computational and Statistical Approaches for Large-Scale Genome-Wide Association Studies

    Over the past decade, genome-wide association studies (GWAS) have proven successful at shedding light on the underlying genetic variations that affect the risk of human complex diseases, which can be translated into novel preventative and therapeutic strategies. My research aims at identifying novel disease-associated genetic variants through large-scale GWAS and at developing computational and statistical pipelines and methods to improve the power and accuracy of GWAS. Bicuspid aortic valve (BAV) is a congenital heart defect characterized by fusion of two of the normal three leaflets of the aortic valve. As the most common cardiovascular malformation in humans, BAV is moderately heritable and is an important risk factor for valvulopathy and aortopathy, but its genetic origins remain elusive. In Chapter 2, we present the first large-scale GWAS to identify novel genetic variants associated with BAV. We report association with a non-coding variant 151kb from the gene encoding the cardiac-specific transcription factor GATA4, and near-significance for p.Ser377Gly in GATA4. We used multiple bioinformatics approaches to demonstrate that the GATA4 gene is a plausible biological candidate. In the subsequent functional follow-up, GATA4 was disrupted using CRISPR-Cas9 in induced pluripotent stem cells from healthy donors. The disruption of GATA4 significantly impaired the transition from endothelial cells into mesenchymal cells, a critical step in heart valve development. Genotype imputation is widely used in GWAS to perform in silico genotyping, leading to higher power to identify novel genetic signals. When consent does not permit multiple reference panels to be merged, it is unclear how to combine the imputation results to optimize the power of genetic association tests. 
In Chapter 3, we compared the accuracy of 9,265 Norwegian genomes imputed from three reference panels: 1000 Genomes Phase 3 (1000G), the Haplotype Reference Consortium (HRC), and a reference panel containing 2,201 Norwegian participants from the HUNT study with low-pass genome sequencing. We observed that the overall imputation accuracy from the population-specific panel was substantially higher than that from 1000G and was comparable with that from HRC, despite HRC being 15-fold larger. We also evaluated different strategies for using multiple sets of imputed genotypes to increase the power of association studies. We propose that testing association for all variants imputed from any panel results in higher power to detect association than the alternative strategy of testing only the version of each genetic variant with the highest imputation quality metric. In phenome-wide GWAS by large biobanks, most binary traits have substantially fewer cases than controls. Both of the widely used approaches, the linear mixed model and the recently proposed logistic mixed model, perform poorly, producing large type I error rates, in the analysis of phenotypes with unbalanced case-control ratios. In Chapter 4, we propose a scalable and accurate generalized mixed model association test that uses the saddlepoint approximation (SPA) to calibrate the distribution of score test statistics. This method, SAIGE, provides accurate p-values even when case-control ratios are extremely unbalanced. It uses state-of-the-art optimization strategies to reduce the computational time and memory cost of fitting the generalized mixed model. The computational cost scales linearly with sample size, so the method is applicable to GWAS of thousands of phenotypes in large biobanks. 
Through the analysis of UK Biobank data on 408,961 white British European-ancestry samples for 1,403 dichotomous phenotypes, we show that SAIGE can efficiently analyze large-sample data while controlling for unbalanced case-control ratios and sample relatedness. PhD, Bioinformatics, University of Michigan, Horace H. Rackham School of Graduate Studies. https://deepblue.lib.umich.edu/bitstream/2027.42/144097/1/zhowei_1.pd
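
The saddlepoint idea mentioned in the abstract above can be illustrated with a minimal sketch. This is not SAIGE's implementation: it ignores the mixed model, relatedness and covariate adjustment, reports a one-sided tail only, and assumes the null case probabilities `mu` are already known. The function names and the toy numbers are hypothetical and chosen only for illustration. The score statistic is taken to be S = sum(g_i * Y_i) with Y_i ~ Bernoulli(mu_i), and the tail probability uses the standard normal-based saddlepoint formula with w and v computed from the cumulant generating function.

```python
import math


def normal_tail(z):
    """Upper-tail probability P(Z > z) of a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def score_null_moments(genotypes, mu):
    """Null mean and variance of S = sum(g_i * Y_i), with Y_i ~ Bernoulli(mu_i)."""
    mean = sum(g * m for g, m in zip(genotypes, mu))
    var = sum(g * g * m * (1.0 - m) for g, m in zip(genotypes, mu))
    return mean, var


def spa_upper_tail(genotypes, mu, observed):
    """One-sided saddlepoint approximation to the upper tail of S at `observed`."""
    mean, _ = score_null_moments(genotypes, mu)
    s = observed - mean  # centred score statistic

    def tilted(t):
        # Case probabilities after exponential tilting by t.
        return [m * math.exp(g * t) / (1.0 - m + m * math.exp(g * t))
                for g, m in zip(genotypes, mu)]

    def k0(t):  # cumulant generating function of the centred score
        return sum(math.log(1.0 - m + m * math.exp(g * t))
                   for g, m in zip(genotypes, mu)) - t * mean

    def k1(t):  # first derivative of k0
        return sum(g * p for g, p in zip(genotypes, tilted(t))) - mean

    def k2(t):  # second derivative of k0
        return sum(g * g * p * (1.0 - p) for g, p in zip(genotypes, tilted(t)))

    # Solve k1(t) = s by bisection; k1 is increasing in t.
    lo, hi = -30.0, 30.0
    if not k1(lo) < s < k1(hi):
        raise ValueError("observed score is outside the range handled by this sketch")
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if k1(mid) < s:
            lo = mid
        else:
            hi = mid
    t_hat = 0.5 * (lo + hi)
    if abs(t_hat) < 1e-4:
        return 0.5  # observed score is essentially at the null mean

    w = math.copysign(math.sqrt(2.0 * (t_hat * s - k0(t_hat))), t_hat)
    v = t_hat * math.sqrt(k2(t_hat))
    return normal_tail(w + math.log(v / w) / w)


if __name__ == "__main__":
    # Toy data (hypothetical): 200 carriers, each with a 1% null case probability,
    # of whom 8 are observed to be cases.
    genotypes = [1] * 200
    mu = [0.01] * 200
    mean, var = score_null_moments(genotypes, mu)
    print("normal approximation:", normal_tail((8 - mean) / math.sqrt(var)))
    print("saddlepoint approximation:", spa_upper_tail(genotypes, mu, 8))
```

With rare cases the null distribution of the score is strongly right-skewed, so the normal approximation gives a much smaller tail probability than the saddlepoint approximation. This is the kind of miscalibration the abstract attributes to unbalanced case-control ratios.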

    Enhancing genetic discoveries with population-specific reference panels

    The approach known as the genome-wide association study (GWAS) opened a new era in genetics research around ten years ago, shedding light on the genetic components underlying complex traits and diseases, which were previously largely unknown. Statistical inferential methods were key ingredients for success, allowing researchers to incorporate external data in their studies and hence maximize information at no additional experimental cost. Technology has continued to improve: while initially <1 million points of the DNA (genetic variants) were assessable in a person, nowadays the entire genome (3 billion points) can be characterized with next-generation sequencing machines. The cost of sequencing is still impractical for GWAS, because several thousand individuals are needed to ensure reproducible findings. With statistical methods, however, full genomes can be inferred if a reduced number of genetic variants is characterized in the study's volunteers and a reference set of independent genomes is available. An international effort, the 1000 Genomes Project, has generated public reference sets by sequencing ~2,500 representatives of the world's populations. In this thesis, we evaluated the benefits of a population-specific reference set for Sardinians by sequencing 2,120 volunteers and subsequently incorporating the set in GWAS. We show how the accuracy of inferred genomes is improved compared to using the 1000 Genomes set, and we identified novel genetic components for several complex traits that could not have been discovered otherwise. Similar efforts are ongoing in other populations, including the Dutch, and we discuss their design and results in this thesis.

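    Arvutuslikud ja statistilised meetodid DNA sekveneerimisandmete analüüsimiseks ja rakendused TÜ Eesti Geenivaramu andmetel [Computational and statistical methods for the analysis of DNA sequencing data and applications to data of the Estonian Genome Center, University of Tartu]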

    The electronic version of the dissertation does not include the publications. Today, methods based on next-generation sequencing (NGS) make it possible to determine human genome sequences in large cohorts. This produces very large data volumes, which pose several challenges in both informatics and statistics. The Estonian Genome Center of the University of Tartu (TÜ EGV) collected gene samples from more than 50,000 people in 2002-2011, and a further 100,000 will be added this year. To date, the DNA of more than 5,500 gene donors has been analysed with various NGS methods. This doctoral thesis proposes a general framework for processing the NGS data produced at TÜ EGV and examines how best to account for the genetic distinctiveness of individuals of Estonian origin. One common NGS method is sequencing of the exome, that is, all protein-coding gene regions, which makes it possible to find rare and de novo gene variants efficiently and is therefore used in medical genetics to identify gene mutations underlying Mendelian diseases. In the first part of the thesis, data from three Estonian families were analysed and a potentially pathogenic mutation was identified in all three cases, which may allow better treatments to be developed in the future. Genome sequencing data were also analysed against clinical blood measurements. This analysis highlighted the advantages of a population-based biobank, which contains valuable information on various diseases and traits in addition to rich genomic data. The study identified significant associations between CEBPA gene variants and basophil count, the latter playing a role in the symptoms of several autoimmune diseases. To increase the power of genome-wide association studies, missing gene variants are predicted, that is, imputed. To make data analysis more efficient specifically for individuals of Estonian origin, genome sequencing data were used to create an Estonian-specific imputation panel. Missing gene variants were then imputed in three ways, using the Estonian-specific panel and two multi-ethnic panels. The comparison showed that the Estonian-specific panel yields more variants of higher quality and that its advantage is especially pronounced for rare variants.
Next-generation sequencing (NGS) technology enables routine sequencing in large cohorts. This thesis demonstrated that the analysis of NGS data has huge potential in several fields but also requires massive computational power. In addition, as data volumes increase, there is a constant need to develop computational and statistical methods. By covering the whole spectrum of protein-coding regions in a cost-effective way, exome sequencing opens new opportunities for quick and exact large-scale screening. In the first part of the thesis, we analysed three Estonian families with Mendelian diseases and detected potentially causative gene variants in each case. These projects highlighted that close collaboration between data scientists and medical geneticists can lead to findings with considerable impact on research into rare genetic disorders and has the potential to lead to successful therapies in the future. Population-based biobanks provide numerous opportunities for expanding phenotypic datasets. We used additional blood cell measurements from electronic medical records, and our genome-wide scan detected a previously undiscovered association with basophil counts near the CEBPA gene and highlighted their role in autoimmune regulation. This example opens new dimensions for scanning the underlying genetic basis of a variety of traits and diseases. To increase the resolution of genome-wide scans, imputation is routinely used to incorporate variants that are not directly genotyped. 
We had the opportunity to construct an imputation reference panel for Estonians based on genome sequencing data. We showed that using a population-specific reference panel provided significantly higher imputation confidence for rare variants than larger, multi-ethnic panels. In the downstream analysis, we observed a large gain in gene-based rare variant testing. As one of the main results of this thesis, the Estonian-specific imputation reference panel has been created and tested and is ready for long-term use. This includes data processing within the ongoing initiative to invite 100,000 Estonians to join the Biobank cohort, with the aim of developing efficient disease prevention and treatment guidelines for the implementation of personalized medicine.

    High-coverage whole-genome sequencing of the expanded 1000 Genomes Project cohort including 602 trios

    The 1000 Genomes Project (1kGP) is the largest fully open resource of whole-genome sequencing (WGS) data consented for public distribution without access or use restrictions. The final, phase 3 release of the 1kGP included 2,504 unrelated samples from 26 populations and was based primarily on low-coverage WGS. Here, we present a high-coverage 3,202-sample WGS 1kGP resource, which now includes 602 complete trios, sequenced to a depth of 30X using Illumina. We performed single-nucleotide variant (SNV) and short insertion and deletion (INDEL) discovery and generated a comprehensive set of structural variants (SVs) by integrating multiple analytic methods through a machine learning model. We show gains in sensitivity and precision of variant calls compared to phase 3, especially among rare SNVs as well as INDELs and SVs across the frequency spectrum. We also generated an improved reference imputation panel, making the variants discovered here accessible for association studies.

    Sequence data and association statistics from 12,940 type 2 diabetes cases and controls

    To investigate the genetic basis of type 2 diabetes (T2D) to high resolution, the GoT2D and T2D-GENES consortia catalogued variation from whole-genome sequencing of 2,657 European individuals and exome sequencing of 12,940 individuals of multiple ancestries. Over 27M SNPs, indels, and structural variants were identified, including 99% of low-frequency (minor allele frequency [MAF] 0.1–5%) non-coding variants in the whole-genome sequenced individuals and 99.7% of low-frequency coding variants in the whole-exome sequenced individuals. Each variant was tested for association with T2D in the sequenced individuals, and, to increase power, most were tested in larger numbers of individuals (>80% of low-frequency coding variants in ~82 K Europeans via the exome chip, and ~90% of low-frequency non-coding variants in ~44 K Europeans via genotype imputation). The variants, genotypes, and association statistics from these analyses provide the largest reference to date of human genetic information relevant to T2D, for use in activities such as T2D-focused genotype imputation, functional characterization of variants or genes, and other novel analyses to detect associations between sequence variation and T2D.