675 research outputs found

    Inference of Ancestral Recombination Graphs through Topological Data Analysis

    The recent explosion of genomic data has underscored the need for interpretable and comprehensive analyses that can capture complex phylogenetic relationships within and across species. Recombination, reassortment and horizontal gene transfer constitute examples of pervasive biological phenomena that cannot be captured by tree-like representations. Starting from hundreds of genomes, we are interested in the reconstruction of potential evolutionary histories leading to the observed data. Ancestral recombination graphs represent potential histories that explicitly accommodate recombination and mutation events across orthologous genomes. However, they are computationally costly to reconstruct, usually being infeasible for more than a few tens of genomes. Recently, Topological Data Analysis (TDA) methods have been proposed as robust and scalable methods that can capture the genetic scale and frequency of recombination. We build upon previous TDA developments for detecting and quantifying recombination, and present a novel framework that can be applied to hundreds of genomes and can be interpreted in terms of minimal histories of mutation and recombination events, quantifying the scales and identifying the genomic locations of recombinations. We implement this framework in a software package, called TARGet, and apply it to several examples, including small migration between different populations, human recombination, and horizontal evolution in finches inhabiting the Galápagos Islands. Comment: 33 pages, 12 figures. The accompanying software, instructions and example files used in the manuscript can be obtained from https://github.com/RabadanLab/TARGe

    The tree-child network problem and the shortest common supersequences for permutations are NP-hard

    Reconstructing phylogenetic networks presents a significant and complex challenge within the fields of phylogenetics and genome evolution. One strategy for reconstruction of phylogenetic networks is to solve the phylogenetic network problem, which involves inferring phylogenetic trees first and subsequently computing the smallest phylogenetic network that displays all the trees. This approach capitalizes on exceptional tools available for inferring phylogenetic trees from biomolecular sequences. Since the vast space of phylogenetic networks poses difficulties in obtaining comprehensive sampling, researchers have switched their attention to inferring tree-child networks from multiple phylogenetic trees, where in a tree-child network each non-leaf node must have at least one child that is a tree node (i.e. an indegree-one node). We prove that the tree-child network problem for multiple trees remains NP-hard by a reduction from the shortest common supersequence problem for permutations and by proving that the latter is NP-hard. Comment: 3 figures and 11 pages
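The tree-child condition stated in this abstract can be checked mechanically. The following is a minimal sketch, not taken from the paper: it assumes the network is given as a list of (parent, child) edges of a directed acyclic graph, and the function name `is_tree_child` is hypothetical. Leaves are counted as tree nodes, as is conventional.

```python
from collections import defaultdict


def is_tree_child(edges):
    """Return True if the network given by (parent, child) edges is tree-child.

    Assumption: the edges describe a rooted directed acyclic network.
    """
    children = defaultdict(list)
    indegree = defaultdict(int)
    for parent, child in edges:
        children[parent].append(child)
        indegree[child] += 1
    # Only nodes with outgoing edges (non-leaf nodes) appear as keys here.
    # Each of them needs at least one child of indegree exactly one.
    return all(
        any(indegree[c] == 1 for c in kids)
        for kids in children.values()
    )


# A network with one reticulation h whose parents each keep a tree child.
ok_edges = [("r", "a"), ("r", "b"), ("a", "h"), ("b", "h"),
            ("a", "x"), ("b", "y"), ("h", "z")]
# Here both children of a (and of b) are reticulations, violating the condition.
bad_edges = [("r", "a"), ("r", "b"), ("a", "h1"), ("a", "h2"),
             ("b", "h1"), ("b", "h2"), ("h1", "x"), ("h2", "y")]
```

With these inputs, `is_tree_child(ok_edges)` holds and `is_tree_child(bad_edges)` does not.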

    Allopolyploid speciation and ongoing backcrossing between diploid progenitor and tetraploid progeny lineages in the Achillea millefolium species complex: analyses of single-copy nuclear genes and genomic AFLP

    Background: In the flowering plants, many polyploid species complexes display evolutionary radiation. This could be facilitated by gene flow between otherwise separate evolutionary lineages in contact zones. Achillea collina is a widespread tetraploid species within the Achillea millefolium polyploid complex (Asteraceae-Anthemideae). It is morphologically intermediate between the relic diploids, A. setacea-2x in xeric and A. asplenifolia-2x in humid habitats, and often grows in close contact with either of them. By analyzing DNA sequences of two single-copy nuclear genes and the genomic AFLP data, we assess the allopolyploid origin of A. collina-4x from ancestors corresponding to A. setacea-2x and A. asplenifolia-2x, and the ongoing backcross introgression between these diploid progenitor and tetraploid progeny lineages. Results: In both the ncpGS and the PgiC gene tree, haplotype sequences of the diploid A. setacea-2x and A. asplenifolia-2x group into two clades corresponding to the two species, though lineage sorting seems incomplete for the PgiC gene. In contrast, A. collina-4x and its suspected backcross plants show homeologous gene copies: sequences from the same tetraploid individual plant are placed in both diploid clades. Semi-congruent splits of an AFLP Neighbor Net link not only A. collina-4x to both diploid species, but some 4x individuals in a polymorphic population with mixed ploidy levels to A. setacea-2x on one hand and to A. collina-4x on the other, indicating allopolyploid speciation as well as hybridization across ploidal levels. Conclusions: The findings of this study clearly demonstrate the hybrid origin of Achillea collina-4x and the ongoing backcrossing between the diploid progenitor and their tetraploid progeny lineages. Such repeated hybridizations are likely the cause of the great genetic and phenotypic variation and ecological differentiation of the polyploid taxa in Achillea millefolium agg.

    Consensus Clusters in Robinson-Foulds Reticulation Networks

    Inference of phylogenetic networks - the evolutionary histories of species involving speciation as well as reticulation events - has proved to be an extremely challenging problem even for smaller datasets easily tackled by supertree inference methods. An effective way to boost the scalability of distance-based supertree methods originates from the Pareto (for clusters) property, which is a highly desirable property for phylogenetic consensus methods. In particular, one can employ strict consensus merger algorithms to boost the scalability and accuracy of supertree methods satisfying Pareto; cf. SuperFine. In this work, we establish a Pareto-like property for phylogenetic networks. Then we consider the recently introduced RF-Net method that heuristically solves the so-called RF-Network problem and which was demonstrated to be an efficient and effective tool for the inference of hybridization and reassortment networks. As our main result, we provide a constructive proof (entailing an explicit refinement algorithm) that the Pareto property applies to the RF-Network problem when the solution space is restricted to the popular class of tree-child networks. This result implies that strict consensus merger strategies, similar to SuperFine, can be directly applied to boost both accuracy and scalability of RF-Net significantly. Finally, we further investigate the optimum solutions to the RF-Network problem; in particular, we describe structural properties of all optimum (tree-child) RF-networks in relation to strict consensus clusters of the input trees
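The strict consensus clusters of the input trees, referred to at the end of this abstract, are the leaf sets that occur as a cluster in every input tree. The sketch below illustrates only that notion and is not the RF-Net method or the paper's refinement algorithm; it assumes trees are encoded as nested tuples of leaf-name strings, and the names `clusters` and `strict_consensus_clusters` are hypothetical.

```python
def clusters(tree):
    """Return (leaf_set, clusters) for a tree given as nested tuples of leaf names."""
    if isinstance(tree, str):
        leaves = frozenset([tree])
        return leaves, {leaves}
    leaves = frozenset()
    found = set()
    for subtree in tree:
        sub_leaves, sub_clusters = clusters(subtree)
        leaves |= sub_leaves
        found |= sub_clusters
    # The leaf set below this internal node is itself a cluster.
    found.add(leaves)
    return leaves, found


def strict_consensus_clusters(trees):
    """Return the clusters shared by every tree in a non-empty list of trees."""
    cluster_sets = [clusters(t)[1] for t in trees]
    return set.intersection(*cluster_sets)


# Two trees on leaves a, b, c, d that agree on the cherry (a, b) only.
t1 = ((("a", "b"), "c"), "d")
t2 = ((("a", "b"), "d"), "c")
shared = strict_consensus_clusters([t1, t2])
```

Here `shared` contains the four singleton clusters, the cluster {a, b}, and the full leaf set; the conflicting clusters {a, b, c} and {a, b, d} are excluded.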

    Computational Molecular Biology

    Computational Biology is a fairly new subject that arose in response to the computational problems posed by the analysis and the processing of biomolecular sequence and structure data. The field was initiated in the late 60's and early 70's largely by pioneers working in the life sciences. Physicists and mathematicians entered the field in the 70's and 80's, while Computer Science became involved with the new biological problems in the late 1980's. Computational problems have gained further importance in molecular biology through the various genome projects which produce enormous amounts of data. For this bibliography we focus on those areas of computational molecular biology that involve discrete algorithms or discrete optimization. We thus neglect several other areas of computational molecular biology, like most of the literature on the protein folding problem, as well as databases for molecular and genetic data, and genetic mapping algorithms. Due to the availability of review papers and a bibliography this bibliography

    Reticulated origin of domesticated emmer wheat supports a dynamic model for the emergence of agriculture in the fertile crescent

    We used supernetworks with datasets of nuclear gene sequences and novel markers detecting retrotransposon insertions in ribosomal DNA loci to reassess the evolutionary relationships among tetraploid wheats. We show that domesticated emmer has a reticulated genetic ancestry, sharing phylogenetic signals with wild populations from all parts of the wild range. The extent of the genetic reticulation cannot be explained by post-domestication gene flow between cultivated emmer and wild plants, and the phylogenetic relationships among tetraploid wheats are incompatible with simple linear descent of the domesticates from a single wild population. A more parsimonious explanation of the data is that domesticated emmer originates from a hybridized population of different wild lineages. The observed diversity and reticulation patterns indicate that wild emmer evolved in the southern Levant, and that the wild emmer populations in south-eastern Turkey and the Zagros Mountains are relatively recent reticulate descendants of a subset of the Levantine wild populations. Based on our results we propose a new model for the emergence of domesticated emmer. During a pre-domestication period, diverse wild populations were collected from a large area west of the Euphrates and cultivated in mixed stands. Within these cultivated stands, hybridization gave rise to lineages displaying reticulated genealogical relationships with their ancestral populations. Gradual movement of early farmers out of the Levant introduced the pre-domesticated reticulated lineages to the northern and eastern parts of the Fertile Crescent, giving rise to the local wild populations but also facilitating fixation of domestication traits. Our model is consistent with the protracted and dispersed transition to agriculture indicated by the archaeobotanical evidence, and also with previous genetic data affiliating domesticated emmer with the wild populations in southeast Turkey. 
Unlike other protracted models, we assume that humans played an intuitive role throughout the process. Natural Environment Research Council [NE/E015948/1]; Slovak Research and Development Agency [APVV-0661-10, APVV-0197-10]

    Genomics-based re-examination of the taxonomy and phylogeny of human and simian Mastadenoviruses: an evolving whole genomes approach, revealing putative zoonosis, anthroponosis, and amphizoonosis

    With the advent of high-resolution and cost-effective genomics and bioinformatics tools and methods contributing to a large database of both human (HAdV) and simian (SAdV) adenoviruses, a genomics-based re-evaluation of their taxonomy is warranted. Interest in these particular adenoviruses is growing in part due to the applications of both in gene transfer protocols, including gene therapy and vaccines, as well as in oncolytic protocols. In particular, the re-evaluation of SAdVs as appropriate vectors in humans is important as zoonosis precludes the assumption that the human immune system may be naïve to these vectors. Additionally, as important pathogens, adenoviruses are a model organism system for understanding viral pathogen emergence through zoonosis and anthroponosis, particularly among the primate species, along with recombination, host adaptation, and selection, as evidenced by one long-standing human respiratory pathogen HAdV-4 and a recent re-evaluation of another, HAdV-76. The latter reflects the insights on amphizoonosis, defined as infections in both directions among host species including “other than human”, that are possible with the growing database of nonhuman adenovirus genomes. HAdV-76 is a recombinant that has been isolated from human, chimpanzee, and bonobo hosts. On-going and potential impacts of adenoviruses on public health and translational medicine drive this evaluation of 174 whole genome sequences from HAdVs and SAdVs archived in GenBank. The conclusion is that rather than separate HAdV and SAdV phylogenetic lineages, a single, intertwined tree is observed with all HAdVs and SAdVs forming mixed clades. 
Therefore, a single designation of “primate adenovirus” (PrAdV) superseding either HAdV and SAdV is proposed, or alternatively, keeping HAdV for human adenovirus but expanding the SAdV nomenclature officially to include host species identification as in ChAdV for chimpanzee adenovirus, GoAdV for gorilla adenovirus, BoAdV for bonobo adenovirus, and ad libitum

    Reconstructing Phylogenetic Networks via Cherry Picking and Machine Learning

    Combining a set of phylogenetic trees into a single phylogenetic network that explains all of them is a fundamental challenge in evolutionary studies. In this paper, we apply the recently-introduced theoretical framework of cherry picking to design a class of heuristics that are guaranteed to produce a network containing each of the input trees, for practical-size datasets. The main contribution of this paper is the design and training of a machine learning model that captures essential information on the structure of the input trees and guides the algorithms towards better solutions. This is one of the first applications of machine learning to phylogenetic studies, and we show its promise with a proof-of-concept experimental study conducted on both simulated and real data consisting of binary trees with no missing taxa