123 research outputs found

    Generalizations of the genomic rank distance to indels

    MOTIVATION: The rank distance model represents genome rearrangements in multi-chromosomal genomes as matrix operations, which allows the reconstruction of parsimonious histories of evolution by rearrangements. We seek to generalize this model by allowing for genomes with different gene content, to accommodate a broader range of biological contexts. We approach this generalization by using a matrix representation of genomes. This leads to simple distance formulas and sorting algorithms for genomes with different gene contents, but without duplications. RESULTS: We generalize the rank distance to genomes with different gene content in two different ways. The first approach adds insertions, deletions and the substitution of a single extremity to the basic operations. We show how to efficiently compute this distance. To avoid genomes with incomplete markers, our alternative distance, the rank-indel distance, only uses insertions and deletions of entire chromosomes. We construct phylogenetic trees with our distances and the DCJ-Indel distance for simulated data and real prokaryotic genomes, and compare them against reference trees. For simulated data, our distances outperform the DCJ-Indel distance using the Quartet metric as baseline. This suggests that rank distances are more robust for comparing distantly related species. For real prokaryotic genomes, all rearrangement-based distances yield phylogenetic trees that are topologically distant from the reference (65% similarity with Quartet metric), but are able to cluster related species within their respective clades and distinguish the Shigella strains as the farthest relative of the Escherichia coli strains, a feature not seen in the reference tree. AVAILABILITY AND IMPLEMENTATION: Code and instructions are available at https://github.com/meidanis-lab/rank-indel. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online
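    As a rough illustration of the matrix representation mentioned in this abstract, the sketch below computes the basic rank distance between two genomes with the same gene content. It assumes the usual formulation in which a genome over n gene extremities is a symmetric 0/1 matrix (off-diagonal 1s for adjacencies, diagonal 1s for telomeres) and the distance is the rank of the difference of the two matrices. The function names and the toy genomes are hypothetical, and the indel generalizations described in the abstract are not covered here.

```python
from fractions import Fraction


def genome_matrix(n_ext, adjacencies):
    """Build the 0/1 genome matrix over n_ext gene extremities (assumed encoding)."""
    m = [[0] * n_ext for _ in range(n_ext)]
    paired = set()
    for i, j in adjacencies:
        m[i][j] = m[j][i] = 1      # adjacency between extremities i and j
        paired.update((i, j))
    for i in range(n_ext):
        if i not in paired:
            m[i][i] = 1            # unpaired extremity is a telomere
    return m


def matrix_rank(rows):
    """Rank by exact Gaussian elimination over the rationals."""
    rows = [[Fraction(x) for x in r] for r in rows]
    rank = 0
    n_cols = len(rows[0]) if rows else 0
    for col in range(n_cols):
        pivot = next((r for r in range(rank, len(rows)) if rows[r][col] != 0), None)
        if pivot is None:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        for r in range(rank + 1, len(rows)):
            factor = rows[r][col] / rows[rank][col]
            if factor:
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[rank])]
        rank += 1
    return rank


def rank_distance(a, b):
    """Rank of the entrywise difference of two genome matrices."""
    diff = [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]
    return matrix_rank(diff)


# Toy example with two genes; extremities are indexed 0=1t, 1=1h, 2=2t, 3=2h.
A = genome_matrix(4, [(1, 2)])   # linear chromosome +1 +2
B = genome_matrix(4, [(1, 3)])   # linear chromosome +1 -2 (gene 2 inverted)
print(rank_distance(A, B))       # 2: the single inversion costs two rank units
```

    Exact fractions are used instead of floating point so that the rank is not affected by rounding.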

    Inference of Ancestral Recombination Graphs through Topological Data Analysis

    The recent explosion of genomic data has underscored the need for interpretable and comprehensive analyses that can capture complex phylogenetic relationships within and across species. Recombination, reassortment and horizontal gene transfer constitute examples of pervasive biological phenomena that cannot be captured by tree-like representations. Starting from hundreds of genomes, we are interested in the reconstruction of potential evolutionary histories leading to the observed data. Ancestral recombination graphs represent potential histories that explicitly accommodate recombination and mutation events across orthologous genomes. However, they are computationally costly to reconstruct, usually being infeasible for more than a few tens of genomes. Recently, Topological Data Analysis (TDA) methods have been proposed as robust and scalable methods that can capture the genetic scale and frequency of recombination. We build upon previous TDA developments for detecting and quantifying recombination, and present a novel framework that can be applied to hundreds of genomes and can be interpreted in terms of minimal histories of mutation and recombination events, quantifying the scales and identifying the genomic locations of recombinations. We implement this framework in a software package, called TARGet, and apply it to several examples, including small migration between different populations, human recombination, and horizontal evolution in finches inhabiting the Galápagos Islands. Comment: 33 pages, 12 figures. The accompanying software, instructions and example files used in the manuscript can be obtained from https://github.com/RabadanLab/TARGe

    Identifying Gene Clusters by Discovering Common Intervals in Indeterminate Strings

    Dörr D, Stoye J, Böcker S, Jahn K. Identifying Gene Clusters by Discovering Common Intervals in Indeterminate Strings. BMC Genomics. 2014;15(Suppl. 6: Proc. of RECOMB-CG 2014): S2. Background: Comparative analyses of chromosomal gene orders are successfully used to predict gene clusters in bacterial and fungal genomes. Present models for detecting sets of co-localized genes in chromosomal sequences require prior knowledge of gene family assignments of genes in the dataset of interest. These families are often computationally predicted on the basis of sequence similarity or higher order features of gene products. Errors introduced in this process amplify in subsequent gene order analyses and thus may deteriorate gene cluster prediction. Results: In this work, we present a new dynamic model and efficient computational approaches for gene cluster prediction suitable in scenarios ranging from traditional gene family-based gene cluster prediction, via multiple conflicting gene family annotations, to gene family-free analysis, in which gene clusters are predicted solely on the basis of a pairwise similarity measure of the genes of different genomes. We evaluate our gene family-free model against a gene family-based model on a dataset of 93 bacterial genomes. Conclusions: Our model is able to detect gene clusters that would be also detected with well-established gene family-based approaches. Moreover, we show that it is able to detect conserved regions which are missed by gene family-based methods due to wrong or deficient gene family assignments.

    Adaptive Change Inferred from Genomic Population Analysis of the ST93 Epidemic Clone of Community-Associated Methicillin-Resistant Staphylococcus aureus

    Community-associated methicillin-resistant Staphylococcus aureus (CA-MRSA) has emerged as a major public health problem around the world. In Australia, ST93-IV[2B] is the dominant CA-MRSA clone and displays significantly greater virulence than other S. aureus. Here, we have examined the evolution of ST93 via genomic analysis of 12 MSSA and 44 MRSA ST93 isolates, collected from around Australia over a 17-year period. Comparative analysis revealed a core genome of 2.6 Mb, sharing greater than 99.7% nucleotide identity. The accessory genome was 0.45 Mb and comprised additional mobile DNA elements, harboring resistance to erythromycin, trimethoprim, and tetracycline. Phylogenetic inference revealed a molecular clock and suggested that a single clone of methicillin susceptible, Panton-Valentine leukocidin (PVL) positive, ST93 S. aureus likely spread from North Western Australia in the early 1970s, acquiring methicillin resistance at least twice in the mid 1990s. We also explored associations between genotype and important MRSA phenotypes including oxacillin MIC and production of exotoxins (α-hemolysin [Hla], δ-hemolysin [Hld], PSMα3, and PVL). High-level expression of Hla is a signature feature of ST93 and reduced expression in eight isolates was readily explained by mutations in the agr locus. However, subtle but significant decreases in Hld were also noted over time that coincided with decreasing oxacillin resistance and were independent of agr mutations. The evolution of ST93 S. aureus is thus associated with a reduction in both exotoxin expression and oxacillin MIC, suggesting MRSA ST93 isolates are under pressure for adaptive change.

    Phylogenetic relationships of the north-eastern Atlantic and Mediterranean blenniids

    The phylogenetic relationships of 27 north-eastern Atlantic and Mediterranean blennioids are analysed based on a total of 1001 bp from a combined fragment of the 12S and 16S mitochondrial rDNA. The most relevant results with implications in current blenniid taxonomy are: (1) Lipophrys pholis and Lipophrys (= Paralipophrys) trigloides are included in a well-supported clade that by the rule of precedence must be named Lipophrys; (2) the sister species of this clade are not the remaining species of the genus Lipophrys but instead a monotypic genus comprising Coryphoblennius galerita; (3) the smaller species of Lipophrys were recovered in another well-supported and independent clade, which we propose to be recognized as Microlipophrys; (4) although some authors included the genera Salaria and Lipophrys in a single group we have never recovered such a relationship. Instead, Salaria is more closely related to the genera Scartella and Parablennius; (5) the genus Parablennius, which was never recovered as a monophyletic clade, is very diverse and may include several distinct lineages; (6) the relative position of Aidablennius sphynx casts some doubts on the currently recognized relationships between the different blenniid tribes. Meristic, morphological, behavioural and ecological characters support our results and are also discussed. The possible roles of the tropical West African coast and the Mediterranean in the diversification of blenniids are discussed. (c) 2005 The Linnean Society of London.

    ALFALFA: fast and accurate mapping of long next generation sequencing reads

    Mutational dynamics and phylogenetic utility of plastid introns and spacers in early branching eudicots

    Major progress has been made during the last twenty years towards a better understanding of the evolution of angiosperms. Early molecular-phylogenetic analyses revealed three major groups, with eudicots as well as monocots being monophyletic, arisen from a paraphyletic group of dicotyledonous angiosperms (= basal angiosperms). Consistently, numerous phylogenetic studies based on sequence data have recovered the eudicot clade and increased confidence in its existence. Furthermore, this clade, which contains about 75% of angiosperm species diversity, is characterized by the possession of tricolpate and tricolpate-derived pollen and has thus also been called the tricolpate clade. Based on molecular-phylogenetic investigations, several lineages, such as Ranunculales, Proteales (= Proteaceae, Nelumbonaceae, Platanaceae), Sabiaceae, Buxaceae plus Didymelaceae, and Trochodendraceae plus Tetracentraceae, were shown as belonging to an early-diverging grade (early-diverging or “basal” eudicots), while larger groups like asterids, Caryophyllales, rosids, Santalales, and Saxifragales were identified as being members of a highly supported core clade, the so-called “core eudicots”. Nevertheless, phylogenetic relationships among several lineages of the eudicots remained difficult to resolve. This thesis concentrates mainly on fully resolving the branching order among the different clades of the early-diverging eudicots as well as on clarifying phylogenetic and systematic conditions within several lineages, based on phylogenetic reconstructions using sequence data of rapidly-evolving and non-coding molecular regions, such as spacers and introns. Traditionally, fast-evolving and non-coding DNA was used only to infer relationships among species and genera, as practised in chapter 3, because it was assumed to be inapplicable at deeper levels owing to putative high levels of homoplasy through multiple substitutions and frequent microstructural changes resulting in non-alignability. 
However, during the last few years numerous molecular-phylogenetic studies were able to present well resolved angiosperm trees on the basis of rapidly-evolving and non-coding regions from the large single copy region of the chloroplast genome, comparable to multi-gene analyses concerning topology and statistical support. Mutational dynamics in spacers and introns was revealed to follow complex patterns related to structural constraints like the intron's secondary structure. Therefore extreme sequence variability was always confined to mutational hotspots that could be excluded from calculations. Moreover, it became clear that combining these non-coding regions with the fast-evolving matK gene can lead to further resolved and statistically supported trees. Chapter 1 deals with the placement of Sabiales inside the early-diverging eudicot grade, while investigating mutational dynamics as well as the utility of different kinds of non-coding and rapidly-evolving DNA within deep-level phylogenetics. It was done by analyzing a combination of nine regions from the large single copy region of the chloroplast genome, including spacers, the sole group I intron, three group II introns and the coding matK for a sampling of 56 taxa. The presented topology is mainly congruent with the hypothesis on phylogenetic relationships among early-branching eudicots that was gained through the application of a reduced set of five non-coding and fast-evolving molecular markers, including the plastid petD (petB-petD spacer, petD group II intron) plus the trnL-F (trnL group I intron, trnL-F spacer) region and the matK gene. It showed a grade of Ranunculales, Sabiales, Proteales, Trochodendrales and Buxales. The current study differs in showing Sabiales as sister to Proteales in all phylogenetic analyses, in contrast to a second-branching position inside early-diverging eudicots and a Bayesian tree displaying Sabiales branching after Proteales. 
All three hypotheses were tested concerning their likelihood. None of them could be significantly rejected. Thus, although the number of characters and informative sites was doubled in comparison to the five-region investigation, the exact position of the Sabiales remained to be resolved with confidence. However, the advanced analyses of the phylogenetic structure of the three different non-coding partitions in comparison to coding genes resulted in the recognition of a significantly higher mean phylogenetic signal per informative character within spacers and introns than in the frequently applied slowly-evolving rbcL gene. The fast-evolving and well performing matK gene is shown to be nested within the non-coding partitions in this respect. Interestingly, the least constrained spacers displayed considerably less phylogenetic structure than both the group I intron and the group II introns. Molecular evolution is again shown to follow certain patterns in angiosperms, as indicated by the occurrence of mutational hotspots and their connection to structural and functional constraints. This is especially shown for the group II introns studied, where highly dynamic sequence parts were found in loops rather than stems. The aim of chapter 2 was to present a comprehensive reconstruction of the phylogenetic relationships inside the order of Ranunculales, the first-branching clade of the early-diverging eudicots, with an emphasis on the evolution of growth forms within the group. Currently, the order comprises seven families (Ranunculaceae, Berberidaceae, Menispermaceae, Lardizabalaceae, Circaeasteraceae – not included due to lacking plant material, Eupteleaceae, Papaveraceae) containing predominantly herbaceous groups as well as trees and lianescent/shrubby forms. A surprising result that emerged due to the increased use of molecular data within systematics during the last twenty years is the inclusion of the woody Eupteleaceae into Ranunculales. 
Because of its adaptation to wind pollination it was previously placed next to Hamamelididae. Although phylogenetic hypotheses agreed in the exclusion of Eupteleaceae and the predominantly herbaceous Papaveraceae from a core clade, the branching order within early-diverging Ranunculales remained a question to be answered. Thus phylogenetic reconstructions based on molecular data of 50 taxa (including outgroup), applying the well-performing non-coding petD and trnL-F as well as the trnK/matK-psbA region including the coding matK, were carried out. The comprehensive sampling resulted in fully resolved and highly supported phylogenies in both maximum parsimony and model-based approaches, with family relations within the core clade being identical and Euptelea appearing as first branching lineage. However, the relationships among the early-diverging Ranunculales could not be resolved with confidence, a result in line with the finding made in chapter 1. The topology was further resolved as Lardizabalaceae being sister to the remaining members of the order, followed by Menispermaceae, Berberidaceae and Ranunculaceae, the latter two sharing a sistergroup relationship. Inside the mainly lianescent Lardizabalaceae the shrubby Decaisnea was clearly depicted as first-branching. The systematically controversial Glaucidium and Hydrastis are shown to be early-diverging members of the Ranunculaceae. A central goal of chapter 3 was to test phylogenetic relationships among the members of the ranunculaceous tribe Anemoneae. Currently it consists of the subtribes Anemoninae including Anemone, Hepatica, Pulsatilla and Knowltonia, and Clematidinae, consisting of Archiclematis, Clematis and Naravelia. Furthermore, the position and taxonomic rank of several lineages inside the subtribe Anemoninae were examined. 
Since recent comprehensive molecular-phylogenetic investigations have been carried out for the members of Clematidinae or Anemoninae, 63 species representing all major lineages of the two subtribes were included into analyses. Calculations were carried out on the basis of molecular data of the nuclear ribosomal ITS1&2 and the plastid atpB-rbcL intergenic spacer region. Phylogenetic reconstructions resulted in the recognition of two distinct clades within the tribe, thus corroborating the formation of the two subtribes. Within the subtribe Anemoninae the traditional genera Knowltonia, Pulsatilla and Hepatica are confidently shown to be nested within the genus Anemone. The preliminary classification of the genus, currently consisting of the two subgenera Anemone and Anemonidium, is complemented by the subgenus Hepatica

    Novel computational techniques for mapping and classifying Next-Generation Sequencing data

    Since their emergence around 2006, Next-Generation Sequencing technologies have been revolutionizing biological and medical research. Quickly obtaining an extensive amount of short or long reads of DNA sequence from almost any biological sample enables detecting genomic variants, revealing the composition of species in a metagenome, deciphering cancer biology, decoding the evolution of living or extinct species, or understanding human migration patterns and human history in general. The pace at which the throughput of sequencing technologies is increasing surpasses the growth of storage and computer capacities, which creates new computational challenges in NGS data processing. In this thesis, we present novel computational techniques for read mapping and taxonomic classification. With more than a hundred published mappers, read mapping might be considered fully solved. However, the vast majority of mappers follow the same paradigm and only little attention has been paid to non-standard mapping approaches. Here, we propound the so-called dynamic mapping that we show to significantly improve the resulting alignments compared to traditional mapping approaches. Dynamic mapping is based on exploiting the information from previously computed alignments, helping to improve the mapping of subsequent reads. We provide the first comprehensive overview of this method and demonstrate its qualities using Dynamic Mapping Simulator, a pipeline that compares various dynamic mapping scenarios to static mapping and iterative referencing. An important component of a dynamic mapper is an online consensus caller, i.e., a program collecting alignment statistics and guiding updates of the reference in an online fashion. We provide Ococo, the first online consensus caller that implements smart statistics for individual genomic positions using compact bit counters. 
Beyond its application to dynamic mapping, Ococo can be employed as an online SNP caller in various analysis pipelines, enabling SNP calling from a stream without saving the alignments on disk. Metagenomic classification of NGS reads is another major topic studied in the thesis. Having a database with thousands of reference genomes placed on a taxonomic tree, the task is to rapidly assign a huge amount of NGS reads to tree nodes, and possibly estimate the relative abundance of involved species. In this thesis, we propose improved computational techniques for this task. In a series of experiments, we show that spaced seeds consistently improve the classification accuracy. We provide Seed-Kraken, a spaced seed extension of Kraken, the most popular classifier at present. Furthermore, we suggest ProPhyle, a new indexing strategy based on a BWT-index, obtaining a much smaller and more informative index compared to Kraken. We provide a modified version of BWA that improves the BWT-index for a quick k-mer look-up
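    To make the spaced-seed idea from this abstract concrete, here is a minimal sketch of how spaced k-mers can be extracted from a read. The mask notation ("1" for a position that is kept, "0" for a position that is ignored) and the function name are assumptions chosen for illustration; this is not the Seed-Kraken implementation.

```python
def spaced_kmers(seq, mask):
    """Return the spaced k-mers of seq for a mask of '1' (keep) and '0' (ignore)."""
    span = len(mask)
    keep = [i for i, c in enumerate(mask) if c == "1"]
    # Slide a window of the mask's span over the read and keep the '1' positions.
    return [
        "".join(seq[start + i] for i in keep)
        for start in range(len(seq) - span + 1)
    ]


print(spaced_kmers("ACGTAC", "1101"))  # ['ACT', 'CGA', 'GTC']

# A mismatch at an ignored position leaves the spaced k-mer unchanged,
# which is why spaced seeds tolerate isolated substitutions.
print(spaced_kmers("ACGT", "1101") == spaced_kmers("ACTT", "1101"))  # True
```

    A contiguous k-mer of the same span would differ between the two reads in the last example, so only the spaced seed still yields a database hit.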

    Learning mutational graphs of individual tumour evolution from single-cell and multi-region sequencing data

    Background. A large number of algorithms is being developed to reconstruct evolutionary models of individual tumours from genome sequencing data. Most methods can analyze multiple samples collected either through bulk multi-region sequencing experiments or the sequencing of individual cancer cells. However, the same method can rarely support both data types. Results. We introduce TRaIT, a computational framework to infer mutational graphs that model the accumulation of multiple types of somatic alterations driving tumour evolution. Compared to other tools, TRaIT supports multi-region and single-cell sequencing data within the same statistical framework, and delivers expressive models that capture many complex evolutionary phenomena. TRaIT improves accuracy, robustness to data-specific errors and computational complexity compared to competing methods. Conclusions. We show that the application of TRaIT to single-cell and multi-region cancer datasets can produce accurate and reliable models of single-tumour evolution, quantify the extent of intra-tumour heterogeneity and generate new testable experimental hypotheses.