329 research outputs found

    Genomic diversity associated with polymorphic inversions in humans and their close relatives

    Individuals of one species share the bulk of their genetic material, yet no two genomes are the same. Aside from displaying classical variation such as deletions, insertions, or substitutions of base pairs, two DNA segments can also differ in their orientation relative to the rest of their chromosomes. Such inversions are known for a range of biological implications and contribute critically to genome evolution and disease. However, inversions are notoriously challenging to detect, a fact which still impedes comprehensive analysis of their specific properties. This thesis describes several highly interconnected projects aimed at identifying and functionally characterizing inversions present in the human population and related great ape species. First, inversions between human and four great ape species were assessed for their potential to disrupt topologically associating domains (TADs), potentially prompting gene misregulation. TAD boundaries co-located with breakpoints of long inversions, and while disrupted TADs displayed elevated rates of differentially expressed genes, this effect could be attributed to the vicinity to inversion breakpoints, suggesting overall robustness of gene expression in response to TAD disruption. The second part of this thesis describes contributions to a collaborative project aimed at characterizing the full spectrum of inversions in 43 humans. In this study, I co-developed a novel inversion genotyping algorithm based on Strand-specific DNA sequencing and contributed to the description of 398 inversion polymorphisms. Inversions exhibited various underlying formation mechanisms, promotion of gene dysregulation, widespread recurrence, and association with genomic disease. These results suggest that long inversions are much more prominent in humans than previously thought, with at least 0.6% of the genome subject to inversion recurrence and, sometimes, the associated risk of subsequent deleterious mutation. With a focus on the link between inversions and disease-causing copy number variations (CNVs), the last project describes a novel algorithm to identify loci hit sequentially by several overlapping mutation events. This algorithm enabled the description of detailed mutation sequences in 20 highly dynamic regions in the human genome, and additional complex variants on chromosome Y. Six complex loci associate directly with a genomic disease, thereby highlighting in detail the intrinsic link between inversions and CNVs. In summary, these projects provide novel insights into the landscape of inversions in humans and primates, which are much more frequent, and often more complex, than previously thought. These findings provide a basis for future inversion studies and highlight the crucial contribution of this class of mutation to genome variation.

    FPGAs in Bioinformatics: Implementation and Evaluation of Common Bioinformatics Algorithms in Reconfigurable Logic

    Life. Much effort is taken to grant humanity a little insight into this fascinating and complex but fundamental topic. In order to understand the relations and to derive consequences, humans have begun to sequence their genomes, i.e. to determine their DNA sequences to infer information, e.g. related to genetic diseases. The process of DNA sequencing as well as subsequent analysis presents a computational challenge for current computing systems due to the large amounts of data alone. Runtimes of more than one day for the analysis of simple datasets are common, even if the process is already run on a CPU cluster. This thesis shows how this general problem in the area of bioinformatics can be tackled with reconfigurable hardware, especially FPGAs. Three compute-intensive problems are highlighted: sequence alignment, SNP interaction analysis and genotype imputation. In the area of sequence alignment, the software BLASTp for protein database searches is presented as an example, implemented and evaluated. SNP interaction analysis is presented with three applications performing an exhaustive search for interactions including the corresponding statistical tests: BOOST, iLOCi and the mutual information measurement. All applications are implemented in FPGA hardware and evaluated, resulting in an impressive speedup of more than three orders of magnitude when compared to standard computers. The last topic, genotype imputation, is a two-step process composed of the phasing step and the actual imputation step. The focus lies on the phasing step, which is targeted by the SHAPEIT2 application. SHAPEIT2 is discussed in detail along with its underlying mathematical methods, and finally implemented and evaluated. A remarkable 46-fold speedup is reached here as well.

    Theory and Algorithms for the Haplotype Assembly Problem


    Haplotype estimation in polyploids using DNA sequence data

    Polyploid organisms possess more than two copies of their core genome and therefore contain k>2 haplotypes for each set of ordered genomic variants. Polyploidy occurs often within the plant kingdom, among others in important crops such as potato (k=4) and wheat (k=6). Current sequencing technologies enable us to read the DNA and detect genomic variants, but cannot distinguish between the copies of the genome, each inherited from one of the parents. To detect inheritance patterns in populations, it is necessary to know the haplotypes, as alleles that are in linkage over the same chromosome tend to be inherited together. In this work, we develop mathematical optimisation algorithms to indirectly estimate haplotypes by looking into overlaps between the sequence reads of an individual, as well as into the expected inheritance of the alleles in a population. These algorithms deal with sequencing errors and random variations in the counts of reads observed from each haplotype. These methods are therefore of high importance for studying the genetics of polyploid crops.

    Computational pan-genomics: status, promises and challenges

    Many disciplines, from human genetics and oncology to plant breeding, microbiology and virology, commonly face the challenge of analyzing rapidly increasing numbers of genomes. In the case of Homo sapiens, the number of sequenced genomes will approach hundreds of thousands in the next few years. Simply scaling up established bioinformatics pipelines will not be sufficient for leveraging the full potential of such rich genomic data sets. Instead, novel, qualitatively different computational methods and paradigms are needed. We will witness the rapid extension of computational pan-genomics, a new sub-area of research in computational biology. In this article, we generalize existing definitions and understand a pan-genome as any collection of genomic sequences to be analyzed jointly or to be used as a reference. We examine already available approaches to construct and use pan-genomes, discuss the potential benefits of future technologies and methodologies and review open challenges from the vantage point of the above-mentioned biological disciplines. As a prominent example of a computational paradigm shift, we particularly highlight the transition from the representation of reference genomes as strings to representations as graphs. We outline how this and other challenges from different application domains translate into common computational problems, point out relevant bioinformatics techniques and identify open problems in computer science. With this review, we aim to increase awareness that a joint approach to computational pan-genomics can help address many of the problems currently faced in various domains.

    FPGAs in der Bioinformatik: Implementierung und Evaluierung bekannter bioinformatischer Algorithmen in rekonfigurierbarer Logik

    Life. Much effort is taken to grant humanity a little insight into this fascinating and complex but fundamental topic. In order to understand the relations and to derive consequences, humans have begun to sequence their genomes, i.e. to determine their DNA sequences to infer information, e.g. related to genetic diseases. The process of DNA sequencing as well as subsequent analysis presents a computational challenge for current computing systems due to the large amounts of data alone. Runtimes of more than one day for the analysis of simple datasets are common, even if the process is already run on a CPU cluster. This thesis shows how this general problem in the area of bioinformatics can be tackled with reconfigurable hardware, especially FPGAs. Three compute-intensive problems are highlighted: sequence alignment, SNP interaction analysis and genotype imputation. In the area of sequence alignment, the software BLASTp for protein database searches is presented as an example, implemented and evaluated. SNP interaction analysis is presented with three applications performing an exhaustive search for interactions including the corresponding statistical tests: BOOST, iLOCi and the mutual information measurement. All applications are implemented in FPGA hardware and evaluated, resulting in an impressive speedup of more than three orders of magnitude when compared to standard computers. The last topic, genotype imputation, is a two-step process composed of the phasing step and the actual imputation step. The focus lies on the phasing step, which is targeted by the SHAPEIT2 application. SHAPEIT2 is discussed in detail along with its underlying mathematical methods, and finally implemented and evaluated. A remarkable 46-fold speedup is reached here as well.

    Complex genetic approaches to neurodegenerative diseases.

    Neurodegenerative diseases are fatal disorders in which disease pathogenesis results in the progressive degeneration of the central and/or the peripheral nervous systems. These diseases currently affect approximately 2% of the population but are expected to increase in prevalence as average life expectancy increases. The majority of these diseases have a complex genetic basis. The work presented in this thesis aimed to investigate the genetic basis of two neurodegenerative diseases, amyotrophic lateral sclerosis (ALS) and the human prion diseases kuru and sporadic Creutzfeldt-Jakob disease (sCJD), using novel complex genetic approaches. ALS is a fatal neurodegenerative disease in which motor neurons are seen to degenerate. It is a complex disease with 10% of individuals having a family history and the remaining 90% of non-familial cases having some genetic component. The gene DYNC1H1 is involved in retrograde axonal transport and is a good candidate for ALS. In this thesis the genetic architecture of DYNC1H1 was elucidated and a mutation screen of exons 8, 13 and 14 was undertaken in familial forms of ALS and other motor neuron diseases. No mutations were found. A linkage disequilibrium (LD) based association study was conducted using two tagging single nucleotide polymorphisms (tSNPs) which were identified as sufficient to represent genetic variation across DYNC1H1. These tSNPs were tested for an association with sporadic ALS (SALS) in 261 cases and 225 matched controls but no association was identified. Kuru is a devastating epidemic prion disease which affected a highly geographically restricted area of the Papua New Guinea highlands and predominantly affected adult women and children. Its incidence has steadily declined since the cessation of its route of transmission, endocannibalism, in the late 1950s. Kuru imposed strong balancing selection on codon 129 of the prion gene (PRNP). Analysis of kuru-exposed and unexposed populations showed significant deviations from Hardy-Weinberg equilibrium (HWE) consistent with the known protective effect of codon 129 heterozygosity. Signatures of selection were investigated in the surviving populations, such as deviations from HWE and an increasing cline in codon 129 valine allele frequency, which covaried with disease exposure. A novel PRNP G127V polymorphism was detected which, while common in the area of highest kuru incidence, was absent from kuru patients and unexposed population groups. Genealogical analysis revealed that the heterozygous PRNP G127V genotype confers strong prion disease resistance, which has been selected by the kuru epidemic. Finally, PRNP copy number was investigated as a possible genetic mechanism for susceptibility to kuru and sCJD. No conclusive copy number changes were identified.