
    Haplotype estimation in polyploids using DNA sequence data

    Polyploid organisms possess more than two copies of their core genome and therefore contain k>2 haplotypes for each set of ordered genomic variants. Polyploidy is common within the plant kingdom, including important crops such as potato (k=4) and wheat (k=6). Current sequencing technologies enable us to read the DNA and detect genomic variants, but cannot distinguish between the copies of the genome, each inherited from one of the parents. To detect inheritance patterns in populations, it is necessary to know the haplotypes, as alleles that are linked on the same chromosome tend to be inherited together. In this work, we develop mathematical optimisation algorithms that estimate haplotypes indirectly by examining overlaps between the sequence reads of an individual, as well as the expected inheritance of alleles in a population. These algorithms deal with sequencing errors and random variation in the counts of reads observed from each haplotype, and are therefore of high importance for studying the genetics of polyploid crops.
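
    The core idea of read-based haplotype estimation can be illustrated with a small sketch. The Python fragment below is a deliberately naive illustration, not the optimisation algorithms developed in this work: it greedily assigns aligned reads (represented as hypothetical {variant position: allele} dictionaries) to k candidate haplotype clusters by allele agreement, and it ignores the error modelling and population-level inheritance constraints the actual methods rely on.

        # Naive sketch only: greedily group aligned reads into k candidate
        # haplotype clusters by allele agreement at shared variant positions.
        # Sequencing errors and inheritance constraints are ignored here.

        def agreement(read, consensus):
            """Matching minus mismatching alleles at positions both cover."""
            score = 0
            for pos, allele in read.items():
                if pos in consensus:
                    score += 1 if consensus[pos] == allele else -1
            return score

        def cluster_reads(reads, k):
            """reads: list of {variant_position: allele} dicts."""
            clusters = [dict() for _ in range(k)]    # running consensus per cluster
            members = [[] for _ in range(k)]
            for read in reads:
                scores = [agreement(read, c) for c in clusters]
                best = max(range(k), key=lambda i: scores[i])
                members[best].append(read)
                clusters[best].update(read)          # naive consensus update
            return members, clusters

        # Tetraploid (k=4) toy example with reads covering two or three SNPs
        reads = [{0: "A", 1: "T"}, {0: "A", 1: "T"}, {0: "C", 1: "G"}, {1: "G", 2: "A"}]
        members, haplotypes = cluster_reads(reads, k=4)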

    Shotgun haplotyping: a novel method for surveying allelic sequence variation

    Haplotypic sequences contain significantly more information than genotypes of genetic markers and are critical for studying disease association and genome evolution. Current methods for obtaining haplotypic sequences require the physical separation of alleles before sequencing, are time consuming and are not scalable for large surveys of genetic variation. We have developed a novel method for acquiring haplotypic sequences from long PCR products using simple, high-throughput techniques. This method applies modified shotgun sequencing protocols to sequence both alleles concurrently, with read-pair information allowing the two alleles to be separated during sequence assembly. Although the haplotypic sequences can be assembled manually from the resultant data using pre-existing sequence assembly software, we have devised a novel heuristic algorithm to automate assembly and remove human error. We validated the approach on two long PCR products amplified from the human genome and confirmed the accuracy of our sequences against full-length clones of the same alleles. This method presents a simple, high-throughput means to obtain full haplotypic sequences potentially up to 20 kb in length and is suitable for surveying genetic variation even in poorly characterized genomes, as it requires no prior information on sequence variation.
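
    As a rough sketch of how read-pair information can separate two alleles, the following Python fragment (an illustration under simplifying assumptions, not the paper's assembly algorithm) joins reads into allele groups with a union-find structure: mate pairs are forced into the same group, and reads that agree at shared heterozygous sites are merged as well. The identifiers (split_alleles, het_sites) are hypothetical.

        from collections import defaultdict

        def find(parent, x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]        # path halving
                x = parent[x]
            return x

        def union(parent, a, b):
            parent[find(parent, a)] = find(parent, b)

        def split_alleles(reads, mate_pairs, het_sites):
            """reads: {read_id: {position: base}}; mate_pairs: (id, id) tuples;
            het_sites: positions known to be heterozygous."""
            parent = {r: r for r in reads}
            for a, b in mate_pairs:                  # mates come from one template, i.e. one allele
                union(parent, a, b)
            ids = list(reads)
            for i, a in enumerate(ids):              # merge reads that agree at het sites
                for b in ids[i + 1:]:
                    shared = set(reads[a]) & set(reads[b]) & set(het_sites)
                    if shared and all(reads[a][p] == reads[b][p] for p in shared):
                        union(parent, a, b)
            groups = defaultdict(list)
            for r in reads:
                groups[find(parent, r)].append(r)
            return list(groups.values())             # ideally two groups, one per allele

        reads = {"r1": {10: "A", 50: "C"}, "r2": {50: "C", 90: "T"},
                 "r3": {10: "G", 50: "A"}, "r4": {90: "G"}}
        print(split_alleles(reads, mate_pairs=[("r3", "r4")], het_sites={10, 50, 90}))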

    Whole Genome Amplification in Preimplantation Genetic Testing in the Era of Massively Parallel Sequencing

    Successful whole genome amplification (WGA) is a cornerstone of contemporary preimplantation genetic testing (PGT). Choosing the most suitable WGA technique for PGT can be particularly challenging because each WGA technique performs differently in combination with different downstream processing and detection methods. The aim of this review is to provide insight into the performance and drawbacks of DOP-PCR, MDA and MALBAC, as well as the hybrid WGA techniques most widely used in PGT. As the field of PGT is moving towards wide adoption of comprehensive massively parallel sequencing (MPS)-based approaches, we especially focus our review on MPS parameters and detection opportunities of WGA-amplified material, i.e., mappability of reads, uniformity of coverage and its influence on copy number variation analysis, and genomic coverage and its influence on single nucleotide variation calling. The ability of MDA-based WGA solutions to better cover the targeted genome and the ability of PCR-based solutions to provide better uniformity of coverage are highlighted. While numerous comprehensive PGT solutions exploiting different WGA types and adjusted bioinformatic pipelines to detect copy number and single nucleotide changes are available, the ones exploiting MDA appear more advantageous. The opportunity to fully analyse the targeted genome is influenced by the MPS parameters themselves rather than solely by the chosen WGA.
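
    The MPS parameters discussed here, breadth of genomic coverage and uniformity of coverage, can be made concrete with a small sketch. The snippet below is an illustrative computation, not taken from the review: breadth is the fraction of targeted positions reaching a minimum depth, and uniformity is summarised as the coefficient of variation of the depth profile, where a lower value means more uniform coverage.

        import statistics

        def coverage_metrics(depths, min_depth=1):
            """depths: read depth per base (or per bin) along the targeted region."""
            breadth = sum(d >= min_depth for d in depths) / len(depths)
            mean = statistics.mean(depths)
            cv = statistics.pstdev(depths) / mean if mean else float("inf")
            return breadth, cv                       # lower CV = more uniform coverage

        # Toy depth profile of a WGA-amplified sample over eight bins
        breadth, cv = coverage_metrics([12, 9, 0, 15, 11, 0, 14, 10])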

    Genome-wide Genotype Imputation: Aspects of Quality, Performance and Practical Implementation

    Finding a relation between a particular phenotype and genotype is one of the central themes in medical genetics. Single-nucleotide polymorphisms are easily assessable markers that enable genome-wide association (GWA) studies and meta-analyses, and hundreds of such analyses have been performed in the last decades. Even though several tools for such analyses are available, an efficient SNP-data transformation tool was still needed. We developed the data management tool fcGENE, which allows easy transformation of genetic data into the different formats required by different GWA tools. Genotype imputation, a common technique in GWA studies, allows us to study the relationship between a phenotype and markers that are missing or even completely untyped; moreover, it helps us to infer both common and rare variants that are not directly typed. We studied different aspects of the imputation process, focusing especially on its accuracy. More specifically, we examined the impact of pre-imputation filtering on the accuracy of imputation results. To measure imputation accuracy, we defined two new statistical scores that allow a direct comparison between imputed and true genotypes. This direct comparison showed that strict quality filtering of SNPs prior to imputation may be detrimental. We further studied the impact of reference panels selected from publicly available resources such as the HapMap and 1000 Genomes projects on imputation quality. More specifically, we analysed the relationship between the genetic distance of the reference panel and the resulting imputation quality; for this purpose, we considered different summary statistics of population differentiation (e.g. Reich's, Nei's and other modified scores) between the study data set and the reference panel used in imputation. In the third analysis, we compared two basic strategies for using reference panels in imputation: (1) use of a genetically best-matched reference panel, and (2) use of an admixed reference panel that combines reference panels from all available populations and lets the software itself select the optimal references, either piece-wise or as complete sequences of SNPs for each individual separately. We analysed in detail the performance of different imputation software and the accuracy of the imputation process in both cases, and found that the current trend of using software with an admixed reference panel is not always the best strategy. Prior to imputation, phasing of the study data set with an external reference panel is also common, especially for large datasets. We studied the performance of different imputation frameworks with and without pre-phasing; it turned out that pre-phasing clearly reduces the imputation quality for medium-sized data sets.
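
    The thesis defines its own accuracy scores, which are not reproduced here; as a hedged stand-in, the sketch below shows two standard ways to compare imputed genotypes directly against true ones: genotype concordance and the squared correlation between imputed dosages and true allele counts.

        def concordance(true_genotypes, imputed_genotypes):
            """Genotypes coded as 0/1/2 copies of the alternative allele."""
            matches = sum(t == i for t, i in zip(true_genotypes, imputed_genotypes))
            return matches / len(true_genotypes)

        def dosage_r2(true_genotypes, imputed_dosages):
            """Squared Pearson correlation between true counts and imputed dosages."""
            n = len(true_genotypes)
            mt = sum(true_genotypes) / n
            md = sum(imputed_dosages) / n
            cov = sum((t - mt) * (d - md) for t, d in zip(true_genotypes, imputed_dosages))
            vt = sum((t - mt) ** 2 for t in true_genotypes)
            vd = sum((d - md) ** 2 for d in imputed_dosages)
            return cov * cov / (vt * vd) if vt and vd else 0.0

        true_g = [0, 1, 2, 1, 0, 2]
        imputed = [0.1, 1.2, 1.8, 0.9, 0.2, 2.0]
        print(concordance(true_g, [round(d) for d in imputed]), dosage_r2(true_g, imputed))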

    Algorithms and Data Structures for Sequence Analysis in the Pan-Genomic Era

    This thesis is motivated by two important processes in bioinformatics, namely variation calling and haplotyping. The contributions range from basic algorithms for sequence analysis to the implementation of pipelines that deal with real data. Variation calling characterizes an individual's genome by identifying how it differs from a reference genome. It uses reads -- small DNA fragments -- extracted from a biological sample and aligns them to the reference to identify the genetic variants present in the donor's genome. A related procedure is haplotype phasing. Sexually reproducing organisms have their genome organized in two sets of chromosomes with equivalent functions; one set is inherited from the mother and the other from the father, and the elements of each set are called haplotypes. The haplotype phasing problem is, once genetic variants are discovered, to attribute them to either of the haplotypes. The first problem we consider is how to efficiently index large collections of genomes. Lempel-Ziv compression is a useful tool for this. We focus on two of its variants, namely the RLZ and LZ77 algorithms. We analyze the first, propose modifications to both, and finally develop a scalable index for large and repetitive collections. Then, using that index, we propose a novel pipeline for variation calling that replaces the single reference with thousands of them. We test our variation-calling pipeline on a mutation-rich subsequence of a Finnish population genome. Our approach consistently outperforms the single-reference approach to variation calling. The second part of this thesis revolves around the haplotype phasing problem. First, we propose a generalization of sequence alignment for diploid genomes. Next, we extend this model to offer a solution for the haplotype phasing problem in the family-trio setting (that is, when we know the variants present in an individual and in her parents). Finally, in the context of an existing read-based approach to haplotyping, we return to basic algorithms and model the problem of pruning a set of reads aligned to a reference as an interval scheduling problem. We propose an exact solution that runs in subquadratic time and a 2-approximation algorithm that runs in linearithmic time.
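
    The interval-scheduling view of read pruning can be illustrated with the classic earliest-finish-time greedy, shown below. This is only a sketch of the formulation, not the exact subquadratic algorithm or the 2-approximation proposed in the thesis: each aligned read is treated as an interval on the reference and a maximum set of pairwise non-overlapping reads is kept.

        def prune_reads(intervals):
            """intervals: (start, end) read alignments; keeps a maximum set of
            pairwise non-overlapping reads in O(n log n) time."""
            kept, last_end = [], float("-inf")
            for start, end in sorted(intervals, key=lambda iv: iv[1]):  # sort by end point
                if start >= last_end:                # no overlap with the last kept read
                    kept.append((start, end))
                    last_end = end
            return kept

        print(prune_reads([(0, 100), (50, 150), (120, 220), (200, 300)]))
        # -> [(0, 100), (120, 220)]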

    Why High-Performance Modelling and Simulation for Big Data Applications Matters

    Modelling and Simulation (M&S) offer adequate abstractions to manage the complexity of analysing big data in scientific and engineering domains. Unfortunately, big data problems are often not easily amenable to efficient and effective use of High Performance Computing (HPC) facilities and technologies. Furthermore, M&S communities typically lack the detailed expertise required to exploit the full potential of HPC solutions, while HPC specialists may not be fully aware of specific modelling and simulation requirements and applications. The COST Action IC1406 High-Performance Modelling and Simulation for Big Data Applications has created a strategic framework to foster interaction between M&S experts from various application domains on the one hand and HPC experts on the other, in order to develop effective solutions for big data applications. One of the tangible outcomes of the COST Action is a collection of case studies from various computing domains. Each case study brought together both HPC and M&S experts, demonstrating the effective cross-pollination facilitated by the COST Action. In this introductory article we argue why joining forces between the M&S and HPC communities is both timely in the big data era and crucial for success in many application domains. Moreover, we provide an overview of the state of the art in the various research areas concerned.

    HAPRAP: a haplotype-based iterative method for statistical fine mapping using GWAS summary statistics

    Motivation: Fine mapping is a widely used approach for identifying the causal variant(s) at disease-associated loci. Standard methods (e.g. multiple regression) require individual-level genotypes. Recent fine-mapping methods using summary-level data require the pairwise correlation coefficients (r²) of the variants. However, haplotypes, rather than pairwise r², are the true biological representation of linkage disequilibrium (LD) among multiple loci. In this article, we present an empirical iterative method, HAPlotype Regional Association analysis Program (HAPRAP), that enables fine mapping using summary statistics and haplotype information from an individual-level reference panel. Results: Simulations with individual-level genotypes show that the results of HAPRAP and multiple regression are highly consistent. In simulations with summary-level data, we demonstrate that HAPRAP is less sensitive to poor LD estimates. In a parametric simulation using Genetic Investigation of ANthropometric Traits height data, HAPRAP performs well with a small training sample size (N < 2000) while other methods become suboptimal. Moreover, HAPRAP's performance is not substantially affected by single nucleotide polymorphisms (SNPs) with low minor allele frequencies. We applied the method to existing quantitative-trait and binary-outcome meta-analyses (human height, QTc interval and gallbladder disease); all previously reported association signals were replicated and two additional variants were independently associated with human height. Given the growing availability of summary-level data, the value of HAPRAP is likely to increase markedly for future analyses (e.g. functional prediction and identification of instruments for Mendelian randomization).
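
    The relationship between pairwise r² and the underlying haplotypes can be made explicit with a worked example. The snippet below (an illustration, not part of HAPRAP) computes r² for two biallelic SNPs from the four haplotype frequencies, using D = f(AB) - pA*pB and r² = D² / (pA*pa*pB*pb); the point is that r² is only a summary of the haplotype distribution that HAPRAP works with directly.

        def pairwise_r2(hap_freqs):
            """hap_freqs: frequencies of the four haplotypes 'AB', 'Ab', 'aB', 'ab'."""
            pA = hap_freqs["AB"] + hap_freqs["Ab"]   # frequency of allele A
            pB = hap_freqs["AB"] + hap_freqs["aB"]   # frequency of allele B
            D = hap_freqs["AB"] - pA * pB            # linkage disequilibrium coefficient
            return D * D / (pA * (1 - pA) * pB * (1 - pB))

        # Different haplotype configurations can share the same r² value,
        # which is why haplotype-aware fine mapping retains more information.
        print(pairwise_r2({"AB": 0.4, "Ab": 0.1, "aB": 0.1, "ab": 0.4}))  # 0.36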

    Simulation and graph mining tools for improving gene mapping efficiency

    Gene mapping is a systematic search for genes that affect observable characteristics of an organism. In this thesis we offer computational tools to improve the efficiency of (disease) gene-mapping efforts. In the first part of the thesis we propose an efficient simulation procedure for generating realistic genetic data from isolated populations. Simulated data are useful for evaluating hypothesised gene-mapping study designs and computational analysis tools. As an example of such an evaluation, we demonstrate how a population-based study design can be a powerful alternative to traditional family-based designs in association-based gene-mapping projects. In the second part of the thesis we consider the prioritisation of a (typically large) set of putative disease-associated genes acquired from an initial gene-mapping analysis. Prioritisation is necessary in order to focus on the most promising candidates. We show how to harness current biomedical knowledge for the prioritisation task by integrating various publicly available biological databases into a weighted biological graph. We then demonstrate how to find and evaluate connections between entities, such as genes and diseases, in this unified schema using graph mining techniques. Finally, in the last part of the thesis, we define the concept of a reliable subgraph and the corresponding subgraph extraction problem. Reliable subgraphs concisely describe strong and independent connections between two given vertices in a random graph, and hence they are especially useful for visualising such connections. We propose novel algorithms for extracting reliable subgraphs from large random graphs. The efficiency and scalability of the proposed graph mining methods are backed by extensive experiments on real data. While our application focus is in genetics, the concepts and algorithms can be applied to other domains as well; we demonstrate this generality by considering coauthor graphs in addition to biological graphs in the experiments.
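
    The notion of connection strength in a random graph can be sketched as follows. In this hypothetical illustration (not the thesis's reliable-subgraph algorithms), each edge of a weighted biological graph exists independently with a given probability, and the reliability of the connection between two vertices, say a gene and a disease, is estimated by Monte Carlo sampling of edge realisations.

        import random
        from collections import defaultdict

        def connected(edges, source, target):
            """Depth-first search over one realised edge set."""
            adj = defaultdict(list)
            for u, v in edges:
                adj[u].append(v)
                adj[v].append(u)
            seen, stack = {source}, [source]
            while stack:
                u = stack.pop()
                if u == target:
                    return True
                for v in adj[u]:
                    if v not in seen:
                        seen.add(v)
                        stack.append(v)
            return False

        def connection_reliability(prob_edges, source, target, samples=10000):
            """prob_edges: {(u, v): probability that the edge exists}."""
            hits = 0
            for _ in range(samples):
                realised = [e for e, p in prob_edges.items() if random.random() < p]
                hits += connected(realised, source, target)
            return hits / samples

        # Hypothetical gene-to-disease connections through intermediate entities
        graph = {("gene", "protein"): 0.9, ("protein", "pathway"): 0.8,
                 ("pathway", "disease"): 0.7, ("gene", "article"): 0.5,
                 ("article", "disease"): 0.6}
        print(connection_reliability(graph, "gene", "disease"))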