213 research outputs found

    Sequence Searching Allowing for Non-Overlapping Adjacent Unbalanced Translocations

    Comprehensive identification and characterisation of germline structural variation within the Iberian population

    One of the central aims of biology and biomedicine has been to characterise and understand genetic variation across humans, both to answer important evolutionary questions and to explain phenotypic variability in disease. Understanding genetic variability is key to studying this relationship (through imputation and GWAS) and to translating the results into improved clinical protocols. Different initiatives have emerged around the world to systematically characterise the genetic variability of specific human populations from whole-genome sequences, usually by selecting geographical regions. Projects such as 1000 Genomes (1000G) [1], GoNL [2], HRC, UK10K [3] and the Estonian population [4] have already identified and characterised millions of genetic variants across different populations. In combination with imputation analysis, these sequence-based projects increase the statistical power and resolution of Genome-Wide Association Studies (GWAS), helping to discover new disease-associated variants [5]. Additionally, genetic variability among population groups is associated with geographic ancestry and can affect disease risk or treatment efficacy differently [6,7]. For this reason, population-specific reference panels are necessary to characterise genetic diversity and to assess its effect on human phenotypes, improving GWAS as one of the cornerstones of precision medicine [7]. Existing genetic variability panels include Single Nucleotide Variants (SNVs) and indels (<50 bp) but are limited in large Structural Variants (SVs) (≥50 bp). Technical and methodological limitations have hindered the discovery of SVs using Next-Generation Sequencing (NGS) technologies, which produced false-discovery rates of 9-89% and recall of 10-70%, depending on the SV type and size [8]. On average, the genomic variation between two human genomes is around 0.1%, but this difference increases to 1.5% when SVs are included [8]. 
SVs also affect 3-10 times more nucleotides than SNVs [9] (4M SNVs per genome [10]), showing their potential effect on human phenotypes. For this reason, including a complete catalogue of SVs in reference panels will increase the power of GWAS and provide opportunities to find new disease-associated variants. To overcome these limitations, in this thesis we have generated the first genome-wide Iberian haplotype reference panel, mainly focused on Structural Variants, using 785 samples whole-genome sequenced (WGS) at high coverage (30X) from the GCAT-Genomics for life project. We designed a complete strategy that includes extensive benchmarking of multiple variant-calling programs, specific Logistic Regression Models (LRMs) for each SV type, and phasing strategies, in order to produce a high-quality and comprehensive genetic variability panel. This strategy was benchmarked using different controlled sets of variants, showing high precision and recall across all variant types and sizes. Applying this strategy to our GCAT whole-genome samples identified 35,431,441 genetic variants, classified as 30,325,064 SNPs, 5,017,19 small indels (<50 bp), and 89,178 larger SVs (≥50 bp). The latter group was further subclassified into 33,244 deletions, 6,269 duplications, 12,782 insertions, 10,115 inversions, 18,779 transposons and 7,989 translocations, covering all ranges of frequencies and sizes. In addition, 60% of the discovered SVs were not catalogued in any repository, extending current knowledge of SVs in humans. Moreover, 52.44% of common and 71.63% of low-frequency SVs were not included in any haplotype reference panel. These new SVs could therefore be used in GWAS, adding further value to the Iberian-GCAT catalogue. The predicted functional impact of the SVs suggests that these variants might have a central role in several diseases. 
Of all SVs included in the Iberian-GCAT catalogue, 46% overlapped genes (both protein-coding and non-protein-coding), highlighting their potential impact on human traits. In addition, 92.7% of protein-coding genes were located outside low-complexity (repeated) genomic regions, so short reads from NGS are expected to capture the most interpretable SVs in humans [11]. Moreover, 32.93% of SVs affected protein-coding genes with a predicted loss-of-function intolerance (pLI) effect, further supporting the potential implication of these variants in complex diseases and therefore enabling a better explanation of missing heritability. Importantly, taking advantage of high coverage (30X), we accurately determined the genotypes of SVs, which allowed them to be phased together with SNVs and indels and increased the SV phasing accuracy, in contrast to 1000G and GoNL. High coverage also allowed the use of Phasing Informative Reads (PIRs), further increasing phasing performance. The overall strategy enables the community to expand and improve the imputation possibilities within GWAS. The Iberian-GCAT haplotype reference panel created in this thesis accurately imputes common SVs, with nearly 100% agreement with sequencing results. Although the Iberian-GCAT haplotype reference panel can be used in populations from all continental groups, imputation performance is highest in European and Latin American populations because of their closer ancestries, as reflected in the number of low-frequency (1% ≤ MAF) variants imputed at high info scores. These results demonstrate the versatility of our resource, whose performance increases with closer ancestry. In general, we observed that imputation accuracy drops as allele frequency decreases, highlighting the need to include more samples in reference panels in order to efficiently impute low-frequency and rare variants, which are normally expected to have more functional impact on diseases. 
Finally, we compared the imputation possibilities of the 1000G and GoNL reference panels with those of our Iberian-GCAT reference panel. We observed that the Iberian-GCAT reference panel outperformed 1000G and GoNL in the imputation of high-quality SVs by 2.7-fold and 1.6-fold, respectively. The overall imputation quality is also higher, showing the value of this new resource in GWAS, as it includes more SVs than previous reference panels. Combining different reference panels will improve the resolution and statistical power of GWAS, increasing the chances of finding more risk variants in complex diseases and, ultimately, of translating these insights into precision medicine.
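The abstract above quotes false-discovery rates, precision and recall for SV calling. As a reading aid, the following minimal Python sketch shows the standard definitions of these metrics from counts of true positives, false positives and false negatives. The function name `detection_metrics` and the example counts are hypothetical; how the thesis matches called variants to a truth set is not stated in the abstract and is not modelled here.

```python
def detection_metrics(tp: int, fp: int, fn: int) -> dict:
    """Standard call-set metrics from true/false positive and false negative counts."""
    called = tp + fp   # variants reported by the caller
    truth = tp + fn    # variants present in the truth set
    return {
        "precision": tp / called if called else 0.0,  # share of calls that are real
        "recall": tp / truth if truth else 0.0,       # share of real variants found
        "fdr": fp / called if called else 0.0,        # share of calls that are false
    }


# Hypothetical counts, for illustration only (not taken from the thesis).
metrics = detection_metrics(tp=80, fp=20, fn=120)
print(metrics)
```

With these definitions the false-discovery rate is the complement of precision, so a caller with high recall can still have a high false-discovery rate if it reports many spurious calls.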

    A large-scale computational framework for comparative analyses in population genetics and metagenomics

    Population genetics is the study of the spatio-temporal distribution of genetic variants among individuals. Its purpose is to understand evolution: the change in frequency of alleles over time. The effects of these alleles are expressed on different levels of biological organization, from molecular complexes to entire organisms. Eventually, they will affect traits that can influence the survival and reproduction of organisms. Fitness is the probability of transferring alleles to subsequent generations through successful survival and reproduction. Due to differential fitness, any phenotypic properties that confer beneficial effects on survival and reproduction may become prevalent in a population. Random mutations introduce new alleles in a population. The underlying changes in DNA sequences can be caused by replication errors, failures in DNA repair processes, or insertion and deletion of transposable elements. For sexual organisms, genetic recombination randomly mixes up the alleles in chromosomes, yielding a new composition of alleles, though it does not change the allele frequencies. On the molecular level, mutations at a set of loci may cause a gain or loss of function resulting in totally different phenotypes, thereby influencing the survival of an organism. Despite the dominance of neutral mutations, the accumulation of small changes over time may affect fitness and further contribute to evolution. The goal of this study is to provide a framework for comparative analysis of large-scale genomic datasets, especially of a population within a species, such as the 1001 Genomes Project of Arabidopsis thaliana, the 1000 Genomes Project of humans, or metagenomics datasets. 
Algorithms have been developed to provide the following features: 1) denoising and improving the effective coverage of raw genomic datasets (Trowel), 2) performing multiple whole genome alignments (WGAs) and detecting small variations in a population (Kairos), 3) identifying structural variants (SVs) (Apollo), and 4) classifying microorganisms in metagenomics datasets (Poseidon). The algorithms do not furnish any interpretation of raw genomic data but provide analyses as a basis for biological hypotheses. With the advances in distributed and parallel computing, many modern bioinformatics algorithms have come to utilize multi-core processing on CPUs or GPUs. Increased computational capacity allows us to solve bigger and more complex problems. However, such hardware advances do not spontaneously give rise to improved utilization of large-size datasets and do not by themselves bring insights to biological questions. Smart data structures and algorithms are required in order to exploit the enhanced computing power and to extract high quality information. For population genetics, an efficient representation for a pan genome and relevant formulas should be established. On top of such a representation, sequence alignments play pivotal roles in solving biological problems, so that one may calculate allele frequencies, detect rare variants, associate genotypes to phenotypes, and infer causality of certain diseases. To detect mutations in a population, the conventional alignment method is enhanced so that multiple genomes are aligned simultaneously. The number of complete genome sequences has steadily increased, but the analysis of large, complex datasets remains challenging. Next Generation Sequencing (NGS) technology is considered one of the great advances in modern biology, and has led to a dramatically more precise and detailed understanding of genomes and their activities. 
The contiguity and accuracy of sequencing reads have been improving, so that a complete genome sequence of a single cell may become obtainable from a sequencing library in the future. Though chemical and optical engineering are the main drivers of advances in sequencing technology, informatics and computer engineering have significantly influenced the quality of sequences. Genomic sequencing data contain errors in the form of substitution, insertion, and deletion of nucleotides. The read length is far shorter than a given genome. These problems can be alleviated by means of error correction and genome assembly, leading to more accurate downstream analyses. Short read aligners have been the key ingredient for measuring and observing genetic mutations using Illumina sequencing technology, the dominant technology in the last decade. As long reads from newer methods or assembled contigs become accessible, mapping schemes that capture long-range context without lingering in local matches should be devised. Parameters for short read aligners, such as the number of mismatches and the gap-opening and -extending penalties, are not directly applicable to long read alignments. At the other end of the spectrum, whole genome aligners (WGA) attempt to solve the alignment problem in a much longer context, providing essential data for comparative studies. However, available WGA algorithms are not yet optimized for practical use in population genetics due to high computing demands. Moreover, too little attention has been paid to defining an ideal data format for applications in comparative genomics. To deal with datasets representing a large population of diverse individuals, multiple sequence alignment (MSA) algorithms should be combined with WGA methods, an approach known as multiple whole genome alignment (MWGA). Though several MWGA algorithms have been proposed, their accuracy has not been clearly measured. 
In fact, known quality assessment tools have yielded highly fluctuating results depending on the selection of organisms and sequencing profiles. Of even more serious concern, experiments to measure the performance of MWGA methods have been only ambiguously described. In turn, it has been difficult to interpret the multiple alignment results. With known precise locations of variants from simulations and standardized statistics, I present a far more comprehensive method to measure the accuracy of an MWGA algorithm. Metagenomics is the study of the genetic composition of a given community (often predominantly microbial). It overcomes the limitation of having to culture each organism for genome sequencing and also provides quantitative information on the composition of a community. Though an environmental sample provides more natural genetic material, the complexity of analyses is greatly increased. The number of species can be very large and only small portions of a genome may be sampled. I provide an algorithm, Poseidon, that classifies sequencing reads to taxonomy identifiers at species resolution and helps to quantify their relative abundances in the samples. The interactions among individual bacteria in a population can result in both conflict and cooperation. Thus, a mixture of diverse bacterial species shows a set of functional adaptations to a particular environment. The composition of species can be changed by distinct biotic or abiotic factors, which may lead to a successive alteration in the susceptibility of a host to a certain disease. In turn, the basic concerns of a metagenomics study are an accurate quantification of species and the deciphering of their functional role in a given environment. In summary, this work presents advanced bioinformatics methods: Trowel, Kairos, Apollo, and Poseidon. Trowel corrects sequencing errors in reads by utilizing high-quality k-mer information. 
Kairos aligns query sequences against multiple genomes in a population of a single species. Apollo characterizes genome-wide genetic variants, from point mutations to large structural variants, on top of the Kairos alignments. Poseidon classifies metagenomics datasets to taxonomy identifiers. Though the work does not directly address any specific biological questions, it provides preliminary material for further downstream analyses.
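The abstract above says only that Trowel corrects reads using high-quality k-mer information. The sketch below is not Trowel's algorithm; it is a minimal Python illustration of the general idea of collecting "trusted" k-mers whose bases all have high quality scores. The function name `trusted_kmers`, the thresholds `min_qual` and `min_count`, and the toy reads are all assumptions made for illustration.

```python
from collections import Counter


def trusted_kmers(reads, k=5, min_qual=30, min_count=2):
    """Collect k-mers whose bases all reach min_qual and that occur at least min_count times.

    reads: iterable of (sequence, qualities) pairs, with qualities as Phred integers.
    """
    counts = Counter()
    for seq, quals in reads:
        for i in range(len(seq) - k + 1):
            # Count a k-mer only if every base in its window is high quality.
            if min(quals[i:i + k]) >= min_qual:
                counts[seq[i:i + k]] += 1
    return {kmer for kmer, n in counts.items() if n >= min_count}


# Toy reads (hypothetical): the third read carries one low-quality base.
reads = [
    ("ACGTAC", [40, 40, 40, 40, 40, 40]),
    ("ACGTAC", [40, 40, 40, 40, 40, 40]),
    ("ACGTTC", [40, 40, 40, 40, 10, 40]),
]
print(sorted(trusted_kmers(reads)))
```

In a k-mer based corrector, a set like this would typically serve as the reference against which low-quality positions in other reads are checked; that correction step is left out here.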

    M-GCAT: interactively and efficiently constructing large-scale multiple genome comparison frameworks in closely related species

    BACKGROUND: Due to recent advances in whole genome shotgun sequencing and assembly technologies, the financial cost of decoding an organism's DNA has been drastically reduced, resulting in a recent explosion of genomic sequencing projects. This increase in related genomic data will allow for in-depth studies of evolution in closely related species through multiple whole genome comparisons. RESULTS: To facilitate such comparisons, we present an interactive multiple genome comparison and alignment tool, M-GCAT, that can efficiently construct multiple genome comparison frameworks in closely related species. M-GCAT is able to compare and identify highly conserved regions in up to 20 closely related bacterial species in minutes on a standard computer, and in as many as 90 genomes (containing 75 cloned genomes from a set of 15 published enterobacterial genomes) in an hour. M-GCAT also incorporates a novel comparative genomics data visualization interface that allows the user to examine the conserved regions and gene annotations both globally and locally. CONCLUSION: M-GCAT is an interactive comparative genomics tool well suited for quickly generating multiple genome comparison frameworks and alignments among closely related species. M-GCAT is freely available for download for academic and non-commercial use at: