329 research outputs found
Genomic diversity associated with polymorphic inversions in humans and their close relatives
Individuals of one species share the bulk of their genetic material, yet no two genomes are the same. Aside from displaying classical variation such as deletions, insertions, or substitutions of base pairs, two DNA segments can also differ in their orientation relative to the rest of their chromosomes. Such inversions are known for a range of biological implications and contribute critically to genome evolution and disease. However, inversions are notoriously challenging to detect, a fact which still impedes comprehensive analysis of their specific properties. This thesis describes several highly inter-connected projects aimed at identifying and functionally characterizing inversions present in the human population and related great ape species.
First, inversions between human and four great ape species were assessed for their potential to disrupt topologically associating domains (TADs), potentially prompting gene misregulation. TAD boundaries co-located with breakpoints of long inversions, and while disrupted TADs displayed elevated rates of differentially expressed genes, this effect could be attributed to the vicinity of inversion breakpoints, suggesting overall robustness of gene expression in response to TAD disruption.
The second part of this thesis describes contributions to a collaborative project aimed at characterizing the full spectrum of inversions in 43 humans. In this study, I co-developed a novel inversion genotyping algorithm based on strand-specific DNA sequencing and contributed to the description of 398 inversion polymorphisms. Inversions exhibited various underlying formation mechanisms, promotion of gene dysregulation, widespread recurrence, and association with genomic disease. These results suggest that long inversions are much more prominent in humans than previously thought, with at least 0.6% of the genome subject to inversion recurrence and, sometimes, the associated risk of subsequent deleterious mutation.
With a focus on the link between inversions and disease-causing copy number variations (CNVs), the last project describes a novel algorithm to identify loci hit sequentially by several overlapping mutation events. This algorithm enabled the description of detailed mutation sequences in 20 highly dynamic regions in the human genome, and additional complex variants on chromosome Y. Six complex loci associate directly with a genomic disease, thereby highlighting in detail the intrinsic link between inversions and CNVs.
In summary, these projects provide novel insights into the landscape of inversions in humans and primates, which are much more frequent, and often more complex, than previously thought. These findings provide a basis for future inversion studies and highlight the crucial contribution of this class of mutation to genome variation.
FPGAs in Bioinformatics: Implementation and Evaluation of Common Bioinformatics Algorithms in Reconfigurable Logic
Life. Much effort is made to grant humanity a little insight into this fascinating, complex, and fundamental topic. To understand its relations and derive consequences, humans have begun to sequence their genomes, i.e. to determine their DNA sequences in order to infer information, e.g. about genetic diseases. DNA sequencing and the subsequent analysis present a computational challenge for current computing systems because of the large amounts of data alone. Runtimes of more than one day for the analysis of simple datasets are common, even when the process already runs on a CPU cluster. This thesis shows how this general problem in bioinformatics can be tackled with reconfigurable hardware, especially FPGAs. Three compute-intensive problems are highlighted: sequence alignment, SNP interaction analysis and genotype imputation. For sequence alignment, the software BLASTp for protein database searches is presented as an example, implemented and evaluated. SNP interaction analysis is presented with three applications that perform an exhaustive search for interactions, including the corresponding statistical tests: BOOST, iLOCi and the mutual information measurement. All applications are implemented in FPGA hardware and evaluated, resulting in an impressive speedup of more than three orders of magnitude compared to standard computers. Genotype imputation, the last topic, is a two-step process composed of the phasing step and the actual imputation step. The focus lies on the phasing step, which is targeted by the SHAPEIT2 application. SHAPEIT2 is discussed in detail together with its underlying mathematical methods, and finally implemented and evaluated. A remarkable speedup by a factor of 46 is reached here as well.
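To make the idea of an exhaustive, mutual-information-based SNP interaction search concrete, the following is a minimal software sketch. It is not the FPGA design from the thesis. The function names, the genotype encoding (small integers per sample) and the toy data are assumptions chosen for illustration, and the score shown is the plain mutual information between a SNP pair's joint genotype and a binary phenotype.

```python
import math
from collections import Counter
from itertools import combinations


def mutual_information(xs, ys):
    """Mutual information I(X;Y) in bits between two discrete sequences."""
    n = len(xs)
    if n == 0 or n != len(ys):
        raise ValueError("sequences must be non-empty and of equal length")
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x,y) * log2(p(x,y) / (p(x) * p(y))), written with raw counts
        mi += (c / n) * math.log2(c * n / (px[x] * py[y]))
    return mi


def pairwise_snp_mi(genotypes, phenotype):
    """Score every SNP pair (i, j) by I((G_i, G_j); phenotype).

    genotypes: list of per-SNP genotype lists (hypothetical encoding).
    phenotype: list of case/control labels, one per sample.
    """
    scores = {}
    for i, j in combinations(range(len(genotypes)), 2):
        pair = list(zip(genotypes[i], genotypes[j]))
        scores[(i, j)] = mutual_information(pair, phenotype)
    return scores


# Toy data (illustrative only): the phenotype is the XOR of SNP 0 and SNP 1,
# so neither SNP is informative alone, but the pair is fully informative.
toy_genotypes = [
    [0, 0, 1, 1],
    [0, 1, 0, 1],
    [0, 0, 0, 0],
]
toy_phenotype = [0, 1, 1, 0]
toy_scores = pairwise_snp_mi(toy_genotypes, toy_phenotype)
```

The number of pairs grows quadratically with the number of SNPs, which is why an exhaustive search of this kind is a natural candidate for hardware acceleration.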
Computational methods for understanding genetic variations from next generation sequencing data
Studies of human genetic variation reveal critical information about genetic and complex diseases such as cancer, diabetes and heart disease, ultimately leading towards improvements in health and quality of life. Moreover, understanding genetic variation in viral populations is of utmost importance to virologists and helps in the search for vaccines. Next-generation sequencing technology is capable of acquiring massive amounts of data that can provide insight into the structure of diverse sets of genomic sequences. However, reconstructing heterogeneous sequences is computationally challenging due to the large dimension of the problem and limitations of the sequencing technology. This dissertation is focused on algorithms and analysis for two problems in which we seek to characterize genetic variations: (1) haplotype reconstruction for a single individual, the so-called single individual haplotyping (SIH) or haplotype assembly problem, and (2) reconstruction of a viral population, the so-called quasispecies reconstruction (QSR) problem. For the SIH problem, we have developed a method that relies on a probabilistic model of the data and employs the sequential Monte Carlo (SMC) algorithm to jointly determine the type of variation (i.e., perform genotype calling) and assemble haplotypes. For the QSR problem, we have developed two algorithms. The first algorithm combines agglomerative hierarchical clustering and Bayesian inference to reconstruct quasispecies characterized by low diversity. The second algorithm utilizes a tensor factorization framework with successive data removal to reconstruct quasispecies characterized by highly uneven frequencies of their components. Both algorithms outperform existing methods on both benchmarking tests and real data. Electrical and Computer Engineerin
Haplotype estimation in polyploids using DNA sequence data
Polyploid organisms possess more than two copies of their core genome and therefore contain k>2 haplotypes for each set of ordered genomic variants. Polyploidy occurs often within the plant kingdom, among others in important crops such as potato (k=4) and wheat (k=6). Current sequencing technologies enable us to read the DNA and detect genomic variants, but cannot distinguish between the copies of the genome, each inherited from one of the parents. To detect inheritance patterns in populations, it is necessary to know the haplotypes, as alleles that are in linkage over the same chromosome tend to be inherited together. In this work, we develop mathematical optimisation algorithms to indirectly estimate haplotypes by looking into overlaps between the sequence reads of an individual, as well as into the expected inheritance of the alleles in a population. These algorithms deal with sequencing errors and random variations in the counts of reads observed from each haplotype. These methods are therefore of high importance for studying the genetics of polyploid crops.
Computational pan-genomics: status, promises and challenges
Many disciplines, from human genetics and oncology to plant breeding, microbiology and virology, commonly face the challenge of analyzing rapidly increasing numbers of genomes. In case of Homo sapiens, the number of sequenced genomes will approach hundreds of thousands in the next few years. Simply scaling up established bioinformatics pipelines will not be sufficient for leveraging the full potential of such rich genomic data sets. Instead, novel, qualitatively different computational methods and paradigms are needed. We will witness the rapid extension of computational pan-genomics, a new sub-area of research in computational biology. In this article, we generalize existing definitions and understand a pan-genome as any collection of genomic sequences to be analyzed jointly or to be used as a reference. We examine already available approaches to construct and use pan-genomes, discuss the potential benefits of future technologies and methodologies and review open challenges from the vantage point of the above-mentioned biological disciplines. As a prominent example for a computational paradigm shift, we particularly highlight the transition from the representation of reference genomes as strings to representations as graphs. We outline how this and other challenges from different application domains translate into common computational problems, point out relevant bioinformatics techniques and identify open problems in computer science. With this review, we aim to increase awareness that a joint approach to computational pan-genomics can help address many of the problems currently faced in various domains.
Efficient analysis and storage of large-scale genomic data
The impending advent of population-scale sequencing cohorts involving tens of millions of individuals with matched phenotypic measurements will produce unprecedented volumes of genetic data. Storing and analysing such gargantuan datasets places computational performance at a pivotal position in medical genomics. In this thesis, I explore the potential for accelerating and parallelizing standard genetics workflows, file formats, and algorithms using hardware-accelerated vectorization, parallel and distributed algorithms, and heterogeneous computing.
First, I describe a novel bit-counting operation termed the positional population-count, which can be used together with succinct representations and standard efficient operations to accelerate many genetic calculations. To enable the use of this new operator and the canonical population count on any target machine, I developed a unified low-level library that uses CPU dispatching to select the optimal method at run-time, contingent on the available instruction set architecture and the given input size. As a proof-of-principle application, I apply the positional population-count operator to computing quality control-related summary statistics for terabyte-scale sequencing readsets with >3,800-fold speed improvements. As another application, I describe a framework for efficiently computing the cardinality of set intersection using these operators and apply this framework to efficiently compute genome-wide linkage-disequilibrium in datasets with up to 67 million samples, resulting in up to >60-fold improvements in speed for dense genotypic vectors and up to >250,000-fold savings in memory and >100,000-fold improvement in speed for sparse genotypic vectors. I next describe a framework for handling the terabytes of compressed output data and describe graphical routines for visualizing long-range linkage-disequilibrium blocks as seen over many human centromeres. Finally, I describe efficient algorithms for storing and querying very large genetic datasets and specialized algorithms for the genotype component of such datasets with >10,000-fold savings in memory compared to the current interchange format. Wellcome Trus
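The two bit-level operations named in this abstract can be illustrated with a scalar reference sketch. This is plain Python for clarity, not the vectorized library described in the thesis. The function names, the 16-bit default word width, and the encoding of each variant as an integer bitmask over haplotypes are assumptions made for the example. The r^2 formula is the standard linkage-disequilibrium statistic computed from carrier counts.

```python
def positional_popcount(words, width=16):
    """Count, for each bit position, how many words have that bit set.

    The ordinary population count returns one total per word. The positional
    variant returns `width` totals across the whole stream.
    """
    counts = [0] * width
    for w in words:
        for bit in range(width):
            counts[bit] += (w >> bit) & 1
    return counts


def intersection_cardinality(a, b):
    """|A intersect B| for two sets stored as integer bitmasks."""
    return bin(a & b).count("1")


def ld_r2(a, b, n):
    """Standard r^2 between two biallelic variants.

    a, b: bitmasks with bit i set if haplotype i carries the alternate allele
          (hypothetical encoding for this sketch).
    n:    total number of haplotypes.
    """
    p_a = bin(a).count("1") / n
    p_b = bin(b).count("1") / n
    p_ab = intersection_cardinality(a, b) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return float("nan")  # undefined when either variant is monomorphic
    d = p_ab - p_a * p_b
    return d * d / denom


# Illustrative values only.
flag_counts = positional_popcount([0b0001, 0b0011, 0b0101], width=4)
shared = intersection_cardinality(0b1011, 0b0110)
r2_identical = ld_r2(0b0011, 0b0011, 4)
r2_independent = ld_r2(0b0011, 0b0101, 4)
```

Because r^2 depends only on three counts, the whole computation reduces to bitwise AND plus population counts, which is the part that SIMD instructions can accelerate.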
FPGAs in der Bioinformatik: Implementierung und Evaluierung bekannter bioinformatischer Algorithmen in rekonfigurierbarer Logik
Life. Much effort is made to grant humanity a little insight into this fascinating, complex, and fundamental topic. To understand its relations and derive consequences, humans have begun to sequence their genomes, i.e. to determine their DNA sequences in order to infer information, e.g. about genetic diseases. DNA sequencing and the subsequent analysis present a computational challenge for current computing systems because of the large amounts of data alone. Runtimes of more than one day for the analysis of simple datasets are common, even when the process already runs on a CPU cluster. This thesis shows how this general problem in bioinformatics can be tackled with reconfigurable hardware, especially FPGAs. Three compute-intensive problems are highlighted: sequence alignment, SNP interaction analysis and genotype imputation. For sequence alignment, the software BLASTp for protein database searches is presented as an example, implemented and evaluated. SNP interaction analysis is presented with three applications that perform an exhaustive search for interactions, including the corresponding statistical tests: BOOST, iLOCi and the mutual information measurement. All applications are implemented in FPGA hardware and evaluated, resulting in an impressive speedup of more than three orders of magnitude compared to standard computers. Genotype imputation, the last topic, is a two-step process composed of the phasing step and the actual imputation step. The focus lies on the phasing step, which is targeted by the SHAPEIT2 application. SHAPEIT2 is discussed in detail together with its underlying mathematical methods, and finally implemented and evaluated. A remarkable speedup by a factor of 46 is reached here as well.
Complex genetic approaches to neurodegenerative diseases.
Neurodegenerative diseases are fatal disorders in which disease pathogenesis results in the progressive degeneration of the central and/or the peripheral nervous systems. These diseases currently affect ~2% of the population but are expected to increase in prevalence as average life expectancy increases. The majority of these diseases have a complex genetic basis. The work presented in this thesis aimed to investigate the genetic basis of two neurodegenerative diseases, amyotrophic lateral sclerosis (ALS) and the human prion diseases kuru and sporadic Creutzfeldt-Jakob disease (sCJD), using novel complex genetic approaches. ALS is a fatal neurodegenerative disease in which motor neurons degenerate. It is a complex disease with 10% of individuals having a family history and the remaining 90% of non-familial cases having some genetic component. The gene DYNC1H1 is involved in retrograde axonal transport and is a good candidate gene for ALS. In this thesis the genetic architecture of DYNC1H1 was elucidated and a mutation screen of exons 8, 13 and 14 was undertaken in familial forms of ALS and other motor neuron diseases. No mutations were found. A linkage disequilibrium (LD) based association study was conducted using two tagging single nucleotide polymorphisms (tSNPs) which were identified as sufficient to represent genetic variation across DYNC1H1. These tSNPs were tested for an association with sporadic ALS (SALS) in 261 cases and 225 matched controls but no association was identified. Kuru is a devastating epidemic prion disease which affected a highly geographically restricted area of the Papua New Guinea highlands, predominantly affecting adult women and children. Its incidence has steadily declined since the cessation of its route of transmission, endocannibalism, in the late 1950s. Kuru imposed strong balancing selection on codon 129 of the prion gene (PRNP).
Analysis of kuru-exposed and unexposed populations showed significant deviations from Hardy-Weinberg equilibrium (HWE) consistent with the known protective effect of codon 129 heterozygosity. Signatures of selection were investigated in the surviving populations, such as deviations from HWE and an increasing cline in codon 129 valine allele frequency, which covaried with disease exposure. A novel PRNP G127V polymorphism was detected which, while common in the area of highest kuru incidence, was absent from kuru patients and unexposed population groups. Genealogical analysis revealed that the heterozygous PRNP G127V genotype confers strong prion disease resistance, which has been selected for by the kuru epidemic. Finally, PRNP copy number was investigated as a possible genetic mechanism for susceptibility to kuru and sCJD. No conclusive copy number changes were identified.
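A deviation from Hardy-Weinberg equilibrium of the kind mentioned above is commonly checked with a one-degree-of-freedom chi-square test on genotype counts. The sketch below shows that textbook test. It is not necessarily the procedure used in the thesis, the function name is hypothetical, and the genotype counts are invented for illustration rather than taken from the study.

```python
import math


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square test for Hardy-Weinberg equilibrium at a biallelic locus.

    Returns (chi2, p_value, heterozygote_excess), where heterozygote_excess
    is the observed minus the expected heterozygote count.
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("at least one genotype must be observed")
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    observed = (n_aa, n_ab, n_bb)
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of the chi-square distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value, n_ab - expected[1]


# Invented counts: a population with a marked excess of heterozygotes,
# and one that matches the Hardy-Weinberg expectation exactly.
chi2_excess, p_excess, het_excess = hwe_chi_square(10, 80, 10)
chi2_equil, p_equil, het_equil = hwe_chi_square(25, 50, 25)
```

A positive heterozygote excess together with a small p-value is the pattern that would be expected if heterozygotes survive preferentially. For small samples or rare alleles an exact test is usually preferred over the chi-square approximation.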